# Named entity recognition on Workflow data

## Description 

This directory contains the information and scripts needed to reproduce the results presented in:

```
@misc{sebe2024extractinginformationlowresourcesetting,
      title={Extracting Information in a Low-resource Setting: Case Study on Bioinformatics Workflows}, 
      author={Clémence Sebe and Sarah Cohen-Boulakia and Olivier Ferret and Aurélie Névéol},
      year={2024},
      eprint={2411.19295},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2411.19295}, 
}
```

This paper has been accepted at IDA 2025.

## Before You Start 

Before running the experiments, you need to:

* Download the dataset: https://doi.org/10.5281/zenodo.14879025
* Clone this Git repository for the experiments with Nlstruct: https://github.com/ClemenceS/nlstruct
* Clone this Git repository for the auto-regressive experiments: https://github.com/ClemenceS/autoregressive_ner
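
The two clone steps can also be scripted. The following is a minimal sketch, assuming `git` is installed and available on the `PATH`. The `deps` target directory and the helper names `clone_commands` and `clone_dependencies` are hypothetical and not part of this repository. The dataset is not handled here and still has to be downloaded from Zenodo.

```python
import subprocess
from pathlib import Path

# Repository URLs taken from the list above.
REPOS = [
    "https://github.com/ClemenceS/nlstruct",
    "https://github.com/ClemenceS/autoregressive_ner",
]


def clone_commands(dest="deps"):
    """Build one `git clone <url> <target>` command per repository."""
    commands = []
    for url in REPOS:
        # Use the last URL component as the local directory name.
        target = Path(dest) / url.rstrip("/").rsplit("/", 1)[-1]
        commands.append(["git", "clone", url, str(target)])
    return commands


def clone_dependencies(dest="deps"):
    """Clone each repository into `dest`, skipping ones already present."""
    Path(dest).mkdir(parents=True, exist_ok=True)
    for command in clone_commands(dest):
        if Path(command[-1]).exists():
            continue  # already cloned (hypothetical convention: keep as-is)
        subprocess.run(command, check=True)
```

Calling `clone_dependencies()` from the directory where you want the checkouts runs both clones. `clone_commands()` only builds the commands, which makes it easy to inspect them first.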

## Contents

This repository includes:

* A Python script, `run_nlstruct.py`, to launch the NER experiments. Before running it, edit the settings at the top of the script (the location of the data and the model to train).
* A Jupyter notebook, `add_voc_bioinfo.ipynb`, to add the names of bioinformatics tools and binaries to the models' vocabulary.
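
For readers unfamiliar with vocabulary extension, here is a minimal sketch of one common approach. It assumes a Hugging Face `transformers`-style tokenizer and model, that is, objects exposing `add_tokens` and `resize_token_embeddings`. Whether `add_voc_bioinfo.ipynb` proceeds this way is an assumption, and the helper name `extend_vocabulary` is hypothetical.

```python
def extend_vocabulary(tokenizer, model, new_terms):
    """Add domain terms to a tokenizer and resize the model embeddings.

    Assumes `tokenizer.add_tokens(list)` returns the number of tokens
    actually added and `model.resize_token_embeddings(n)` resizes the
    embedding matrix, as in Hugging Face transformers.
    """
    # Drop empty entries and duplicates while preserving order.
    cleaned = (term.strip() for term in new_terms)
    unique_terms = list(dict.fromkeys(term for term in cleaned if term))

    added = tokenizer.add_tokens(unique_terms)
    if added:
        # New tokens need embedding rows; these start untrained and are
        # learned during fine-tuning.
        model.resize_token_embeddings(len(tokenizer))
    return added


# Illustrative usage (tool names are examples only):
# extend_vocabulary(tokenizer, model, ["samtools", "bwa"])
```

The function returns the number of tokens that were actually new, so a return value of 0 means the tokenizer already contained every term.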

## Licence 

This project is licensed under the [Apache License, Version 2.0](https://www.apache.org/licenses/LICENSE-2.0).

## Funding 

This work received support from the National Research Agency under the France 2030 program, with reference to ANR-22-PESN-0007.