Commit 33651e79 authored by Jacques Fize

Change README

parent 5a46b9f9
For Anaconda users

    while read requirement; do conda install --yes $requirement; done < requirements.txt
<hr>

## Prepare required data

### Geonames data
1. Download the Geonames data used to train the network [here](http://download.geonames.org/export/dump/)
2. Download the hierarchy data [here](http://download.geonames.org/export/dump/hierarchy.zip)
3. Unzip both files in the directory of your choice
4. Run the script `train_test_split_geonames.py <geoname_filename>` (a full sketch of these steps follows this list)
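Taken together, these four steps can be scripted as the sketch below; the French dump `FR.zip` (which extracts to `FR.txt`) is just an example, and any country file from the dump directory works the same way:

    # Steps 1-2: fetch a Geonames country dump (FR as an example) and the hierarchy data
    wget http://download.geonames.org/export/dump/FR.zip
    wget http://download.geonames.org/export/dump/hierarchy.zip
    # Step 3: unzip both archives
    unzip FR.zip
    unzip hierarchy.zip
    # Step 4: build the train/test split from the extracted file
    python3 train_test_split_geonames.py FR.txt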
### Cooccurrence data
5. First, download the Wikipedia corpus from which you want to extract co-occurrences: [English Wikipedia Corpus](https://dumps.wikimedia.org/enwiki/20200201/enwiki-20200201-pages-articles.xml.bz2)
6. Parse the corpus with the Gensim script using the following command: `python3 -m gensim.scripts.segment_wiki -i -f <wikicorpus> -o <1stoutputname>.json.gz`
7. Build a page of interest file that contains a list of Wikipedia pages. The file must be a CSV with the following columns: title,latitude,longitude.<br> You can find [here](https://projet.liris.cnrs.fr/hextgeo/files/place_en_fr_page_clean.csv) a page of interest file that contains places that appear in both the French and English Wikipedia.
8. Then, using an index that contains the pages of interest, run the command: `python3 script/get_cooccurrence.py <page_of_interest_file> <2ndoutputname> -c <1stoutputname>.json.gz`
9. Finally, split the resulting dataset with the script `train_test_split_cooccurrence_data.py <2ndoutputname>` (the sketch below chains all five steps)
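End to end, steps 5 through 9 chain together as in the sketch below; the intermediate names `enwiki.json.gz` and `cooccurrences.csv` stand in for `<1stoutputname>` and `<2ndoutputname>`, and the page of interest file is the prebuilt one from step 7:

    # Step 5: download the Wikipedia dump
    wget https://dumps.wikimedia.org/enwiki/20200201/enwiki-20200201-pages-articles.xml.bz2
    # Step 6: segment the dump into one JSON article per line
    python3 -m gensim.scripts.segment_wiki -i -f enwiki-20200201-pages-articles.xml.bz2 -o enwiki.json.gz
    # Step 7: use the prebuilt page of interest file (columns: title,latitude,longitude)
    wget https://projet.liris.cnrs.fr/hextgeo/files/place_en_fr_page_clean.csv
    # Step 8: extract co-occurrences for the pages of interest
    python3 script/get_cooccurrence.py place_en_fr_page_clean.csv cooccurrences.csv -c enwiki.json.gz
    # Step 9: split the result into train and test sets
    python3 train_test_split_cooccurrence_data.py cooccurrences.csv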
### If you're in a hurry

French Geonames data, French Wikipedia cooccurrence data, and their train/test splits can be found here: [https://projet.liris.cnrs.fr/hextgeo/files/](https://projet.liris.cnrs.fr/hextgeo/files/)
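One way to fetch the whole directory at once is a recursive `wget`, sketched below; adjust it to the files you actually need:

    # Mirror the prebuilt data files into the current directory
    wget -r -np -nH --cut-dirs=2 -R "index.html*" https://projet.liris.cnrs.fr/hextgeo/files/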
## Train the network

The script `combination_embeddings.py` is responsible for training the neural network.

To train the network with default parameters, use the following command:

    python3 combination_embeddings.py -i <geoname data filename> <hierarchy geonames data filename>
### Available parameters
| Parameter | Description |
|---|---|
| -t,--tolerance-value | K-value in the computation of the accuracy@k |
| -e,--epochs | number of epochs |
| -d,--dimension | size of the ngram embeddings |
| --admin_code_1 | (Optional) If you wish to train the network on a specific region |
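As an illustration, a run overriding some of these defaults might look like the following; the values are placeholders, not recommended settings:

    # illustrative values only: 100 epochs, 256-dimensional ngram embeddings, accuracy@100
    python3 combination_embeddings.py -e 100 -d 256 -t 100 -i <geoname data filename> <hierarchy geonames data filename>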