George Marchment authored
e44eea40

Github-Crawler

License: GPL v3

This branch contains the results of the crawler for the study associated with BioFlow-Insight.

Results of the crawler

The crawler gathered 677 open-license Nextflow workflows. A static version of this corpus is available at https://zenodo.org/records/10817606.

Description

This repository contains the code for a crawler designed to search for Nextflow workflows on GitHub.

The GitHub Search API has a custom rate limit. User-to-server requests are limited to 5,000 requests per hour per authenticated user. The crawler issues its 5,000 requests sequentially, without any delay between them; once the 5,000 requests are used up, it waits for an hour before resuming.
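The "spend the budget, then wait an hour" behaviour can be sketched as a small counter. This is only an illustration under assumptions: the class name `RequestBudget`, its parameters, and the injectable `sleep` function are hypothetical and are not taken from the crawler's code.

```python
import time


class RequestBudget:
    """Counts requests and pauses once the hourly budget is spent (hypothetical helper)."""

    def __init__(self, limit=5000, pause_seconds=3600, sleep=time.sleep):
        self.limit = limit                  # requests allowed per window
        self.pause_seconds = pause_seconds  # how long to wait once the budget is spent
        self.sleep = sleep                  # injectable so the pause can be replaced in tests
        self.used = 0

    def acquire(self):
        """Call before each request; waits when the budget is exhausted."""
        if self.used >= self.limit:
            self.sleep(self.pause_seconds)
            self.used = 0
        self.used += 1
```

A caller would invoke `budget.acquire()` immediately before each API request, so no delay is introduced until the budget runs out.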

The GitHub Search API returns at most 1,000 results per search, which is a problem when compiling a large corpus such as a collection of workflows. To work around this limit, the global request for Nextflow repositories is split into multiple requests, each restricted to a specific time frame (for example, Nextflow repositories between date 1 and date 2). The crawler increments the months and years automatically so that the whole period of interest is covered.
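As a rough sketch of this splitting, the functions below generate one window per calendar month and turn each window into a search query. The function names, the one-month granularity, and the exact query (`language:nextflow` combined with GitHub's `created:` date-range qualifier) are assumptions made for illustration; the crawler's actual query may differ.

```python
import calendar
from datetime import date


def month_windows(start, end):
    """Yield (first_day, last_day) for every calendar month from start to end inclusive."""
    year, month = start.year, start.month
    while (year, month) <= (end.year, end.month):
        last_day = calendar.monthrange(year, month)[1]
        yield date(year, month, 1), date(year, month, last_day)
        month += 1
        if month > 12:  # roll over into the next year
            year, month = year + 1, 1


def build_query(first, last, base="language:nextflow"):
    """Build a search query restricted to repositories created inside one window (assumed form)."""
    return f"{base} created:{first.isoformat()}..{last.isoformat()}"
```

If a single month still returned more than 1,000 results, the same idea could be applied with shorter windows.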

While the crawler is running, the data is saved in a JSON file.
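One way to save results during a long run is to rewrite the JSON file after each batch, going through a temporary file so that an interruption does not leave a half-written file behind. This is a sketch of that idea only: `save_results` and the layout of `results` are hypothetical, and the crawler may write its file differently.

```python
import json
import os
import tempfile


def save_results(results, path):
    """Write results to path as JSON, replacing any previous file in a single step."""
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temporary file in the same directory, then swap it into place.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(results, handle, indent=2)
        os.replace(tmp_path, path)
    except BaseException:
        # Clean up the temporary file if anything went wrong before the swap.
        if os.path.exists(tmp_path):
            os.remove(tmp_path)
        raise
```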

Note that GitHub search results are not guaranteed to be stable from one run to the next, so the crawler's results may not be exactly reproducible.

The current version of the crawler targets Nextflow workflows, that is, projects with at least one Nextflow file at the root of the project. However, the crawler can easily be adapted for more generic use.
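The root-level filter can be pictured as a simple check over a repository's file paths. The sketch below assumes that Nextflow files are recognised by the `.nf` extension and that paths use `/` as separator; `has_root_file` is a hypothetical name, and its `extensions` parameter shows how the same check could be made more generic.

```python
def has_root_file(paths, extensions=(".nf",)):
    """Return True if any path sits at the repository root and has one of the extensions."""
    wanted = tuple(ext.lower() for ext in extensions)
    return any(
        "/" not in path and path.lower().endswith(wanted)
        for path in paths
    )
```

For example, `has_root_file(["main.nf", "README.md"])` holds, while a repository whose only Nextflow file is `modules/align.nf` is rejected.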

Installation

The Python dependencies are listed in the requirements.txt file.

License

This project is licensed under the GNU Affero General Public License.