@@ -94,10 +94,10 @@ The data preparation is divided into three steps. First, we retrieve required da
...
### Co-occurrence data
1. First, download the Wikipedia corpus from which you want to extract co-occurrences: [English Wikipedia Corpus](https://dumps.wikimedia.org/enwiki/20200201/enwiki-20200201-pages-articles.xml.bz2)
2. Parse the corpus with the Gensim segmentation script using the following command: `python3 -m gensim.scripts.segment_wiki -i -f <wikicorpus> -o <1stoutputname>.json.gz`
3. Build a pages-of-interest file that contains a list of Wikipedia pages, using the script `extract_pages_of_interest.py`. The file must be a CSV with the columns title,latitude,longitude. A ready-made pages-of-interest file containing places that appear in both the French and English Wikipedia is available [here](https://projet.liris.cnrs.fr/hextgeo/files/pages_of_interest/place_en_fr_page_clean.csv).
4. Then, using an index that contains the pages of interest, run the command: `python3 script/get_cooccurrence.py <page_of_interest_file> <2ndoutputname> -c <1stoutputname>.json.gz` (see the end-to-end sketch after this list).
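
For reference, here is a sketch of the four steps end to end. The output names (`enwiki-20200201.json.gz`, `cooccurrences_en`) are illustrative placeholders, not prescribed by the scripts, and the ready-made pages-of-interest file stands in for a custom one:

```bash
# 1. Download the English Wikipedia dump (large download, several GB compressed).
wget https://dumps.wikimedia.org/enwiki/20200201/enwiki-20200201-pages-articles.xml.bz2

# 2. Segment the dump into per-article JSON records, keeping interlinks (-i).
python3 -m gensim.scripts.segment_wiki -i \
  -f enwiki-20200201-pages-articles.xml.bz2 \
  -o enwiki-20200201.json.gz

# 3. Fetch the ready-made pages-of-interest file (CSV: title,latitude,longitude).
wget https://projet.liris.cnrs.fr/hextgeo/files/pages_of_interest/place_en_fr_page_clean.csv

# 4. Extract co-occurrences for those pages from the segmented corpus.
python3 script/get_cooccurrence.py place_en_fr_page_clean.csv cooccurrences_en \
  -c enwiki-20200201.json.gz
```
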
### Generate dataset
...
@@ -116,7 +116,7 @@ Use the following command to generate the datasets for training your model.
...
### If you're in a hurry
French (also GB, US) Geonames data, French (also GB, US) Wikipedia co-occurrence data, and the corresponding train/test split datasets can be found here: [https://projet.liris.cnrs.fr/hextgeo/files/](https://projet.liris.cnrs.fr/hextgeo/files/)
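
If you want to grab everything at once and the server allows directory listing, a standard recursive wget should work (a sketch, not an official download script):

```bash
# Mirror the pre-built files (recursive, no parent dirs, skip index pages).
wget -r -np -nH -R "index.html*" https://projet.liris.cnrs.fr/hextgeo/files/
```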