Github-Crawler
This branch contains the crawler results for the study associated with BioFlow-Insight.
Results of the crawler
The crawler gathered 677 open-license Nextflow workflows. A static version of this corpus is available here: https://zenodo.org/records/10817606.
Description
This repository contains the code for a crawler designed to search GitHub for Nextflow workflows.
The GitHub Search API enforces a rate limit: user-to-server requests are limited to 5,000 requests per hour per authenticated user. The crawler issues its requests sequentially, without pacing itself; once the 5,000-request quota is exhausted, it waits an hour before continuing.
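The wait-an-hour behaviour described above can be sketched with a small helper. This is an illustrative function, not code taken from the crawler: it assumes the caller reads the `X-RateLimit-Remaining` and `X-RateLimit-Reset` values from the GitHub API response headers and decides how long to sleep before the next request.

```python
import time
from typing import Optional

def seconds_to_wait(remaining: int, reset_epoch: float,
                    now: Optional[float] = None) -> float:
    """Return 0 while the quota still has requests left; otherwise the
    number of seconds until the rate-limit window resets (GitHub reports
    the reset time as a Unix timestamp)."""
    if remaining > 0:
        return 0.0
    now = time.time() if now is None else now
    # Never return a negative sleep if the reset time is already past.
    return max(0.0, reset_epoch - now)
```

For example, with the quota exhausted and a reset timestamp one hour ahead, the helper returns 3600 seconds, matching the one-hour wait described above.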
The GitHub Search API returns at most 1,000 results per search, which is problematic when compiling a large corpus such as a collection of workflows. To circumvent this limitation, the global request to retrieve Nextflow repositories is subdivided into multiple requests, each covering a specific time frame (e.g. "retrieve Nextflow repositories created between date 1 and date 2"). The crawler increments the months and years automatically, ensuring comprehensive coverage of the desired repositories.
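The month-by-month subdivision can be sketched as follows. This is a minimal illustration, not the crawler's actual code; the `language:nextflow` qualifier in the query string is an assumption about what the search terms might look like.

```python
from datetime import date, timedelta

def month_windows(start: date, end: date):
    """Yield (first_day, last_day) pairs, one per calendar month,
    covering the interval from start to end."""
    year, month = start.year, start.month
    while date(year, month, 1) <= end:
        first = date(year, month, 1)
        # The last day of a month is the day before the 1st of the next one.
        nxt = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
        yield first, nxt - timedelta(days=1)
        year, month = nxt.year, nxt.month

def search_queries(start: date, end: date):
    """Build one GitHub search query per month, using the created:A..B
    date-range qualifier to stay under the 1,000-result cap."""
    return [f"language:nextflow created:{a}..{b}"
            for a, b in month_windows(start, end)]
```

For instance, `search_queries(date(2020, 1, 15), date(2020, 3, 1))` produces three queries, the first being `language:nextflow created:2020-01-01..2020-01-31`.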
While the crawler is running, the data is saved in a JSON file.
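Saving partial results to JSON during a long crawl can be done atomically, so that an interrupted run leaves the previous snapshot intact. The sketch below is an assumption about how this could be implemented, not the crawler's actual code; the file name is illustrative.

```python
import json
import os
import tempfile

def save_progress(results: dict, path: str) -> None:
    """Write crawl results to a JSON file atomically: dump to a temporary
    file in the same directory, then rename it over the target, so a crash
    mid-write never corrupts the existing snapshot."""
    fd, tmp = tempfile.mkstemp(dir=os.path.dirname(path) or ".",
                               suffix=".tmp")
    with os.fdopen(fd, "w") as f:
        json.dump(results, f, indent=2)
    os.replace(tmp, path)  # atomic on POSIX and Windows
```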
It is important to note that, due to the behaviour of GitHub's search functionality, the results of a search may vary between runs; consequently, the crawler's results may not be exactly reproducible.
The current version of the crawler is tailored to Nextflow workflows, specifically repositories with at least one Nextflow file at the root of the project. However, it is easy to adapt the crawler to more generic searches.
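The root-level filter described above amounts to checking a repository's root file listing for at least one `.nf` file. A minimal sketch, assuming the listing is the list of entry names returned by the GitHub contents API for the repository root (the function name is hypothetical):

```python
def has_root_nextflow_file(root_entries: list) -> bool:
    """Return True if the repository's root listing contains at least
    one Nextflow (.nf) file."""
    return any(name.endswith(".nf") for name in root_entries)
```

Swapping this predicate for another (e.g. matching a different extension) is one way the crawler could be adapted to more generic searches.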
Table of Contents
Installation
License
The Python dependencies are listed in the requirements.txt file.
License
This project is licensed under the GNU Affero General Public License.