@@ -96,8 +96,8 @@ The data preparation is divided into three steps. First, we retrieve required da
1. First, download the Wikipedia corpus from which you want to extract co-occurrences: [English Wikipedia Corpus](https://dumps.wikimedia.org/enwiki/20200201/enwiki-20200201-pages-articles.xml.bz2)
2. Parse the corpus with the Gensim segmentation script using the following command: `python3 -m gensim.scripts.segment_wiki -i -f <wikicorpus> -o <1stoutputname>.json.gz`
3. Build a page of interest file that contains a list of Wikipedia pages, using the script `extract_pages_of_interest.py`. A ready-made page of interest file with places that appear in either the FR or the EN Wikipedia is available [here](https://projet.liris.cnrs.fr/hextgeo/files/pages_of_interest/place_en_fr_page_clean.csv).
4. Then, using the page of interest file, run the command: `python3 script/get_cooccurrence.py <page_of_interest_file> <final_output_name> -c <1stoutputname>.json.gz` (a worked example of all four steps follows this list).
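Taken together, a run of these four steps might look like the sketch below. The dump date, the output file names, and the choice of the ready-made page of interest file are illustrative assumptions, not values required by the scripts:

```bash
# Hypothetical end-to-end run of the four preparation steps;
# all file names below are placeholders.

# Step 1: download the English Wikipedia dump
wget https://dumps.wikimedia.org/enwiki/20200201/enwiki-20200201-pages-articles.xml.bz2

# Step 2: segment the dump into a gzipped JSON corpus with Gensim
# (-i includes the inter-article links in the output)
python3 -m gensim.scripts.segment_wiki -i \
  -f enwiki-20200201-pages-articles.xml.bz2 \
  -o enwiki-20200201.json.gz

# Step 3: fetch the ready-made page of interest file
# (or build your own with extract_pages_of_interest.py)
wget https://projet.liris.cnrs.fr/hextgeo/files/pages_of_interest/place_en_fr_page_clean.csv

# Step 4: extract co-occurrences restricted to the pages of interest
python3 script/get_cooccurrence.py place_en_fr_page_clean.csv cooccurrences_en \
  -c enwiki-20200201.json.gz
```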
### Generate dataset
@@ -113,6 +113,7 @@ Use the following command to generate the datasets for training your model.
| --adj-nside | Healpix resolution at which places within the same cell are considered adjacent |
| --split-nside | Healpix resolution of the zones over which the train/test split is done |
| --split-method | [per_pair\|per_entity] Split each dataset by place (a place cannot appear in both train and test) or by pair (a place can appear in both train and test) |
| --no-sampling | Disable sampling when generating pairs |
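For illustration, a dataset-generation call combining these options could look like the sketch below. The generation script and its positional arguments are written as placeholders in the repository's `<...>` style (the actual command is given earlier in the README), and the nside values are arbitrary examples:

```bash
# Hypothetical illustration of the options above; <generate_dataset_script>,
# <cooccurrence_file> and <dataset_output> are placeholders for the actual
# command documented earlier in the README, and the nside values are examples.
python3 <generate_dataset_script> <cooccurrence_file> <dataset_output> \
  --adj-nside 128 \
  --split-nside 32 \
  --split-method per_entity \
  --no-sampling
```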
### If you're in a hurry
@@ -123,7 +124,7 @@ French (also GB,US) Geonames, French (also GB,US) Wikipedia co-occurrence data,
To train the first model, use the following command: