while read requirement; do conda install --yes $requirement; done < requirements.txt
<hr>
## Prepare required data
### Geonames data
1. Download the Geonames data used to train the network [here](http://download.geonames.org/export/dump/)
2. Download the hierarchy data [here](http://download.geonames.org/export/dump/hierarchy.zip)
3. Unzip both files in the directory of your choice
4. Run the script `train_test_split_geonames.py <geoname_filename>`
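The exact logic of `train_test_split_geonames.py` is not shown here, but conceptually step 4 performs a random train/test split over the tab-separated Geonames dump. A minimal sketch, assuming a shuffled 80/20 split (the ratio, seed, and column layout shown are illustrative assumptions, not necessarily what the script uses):

```python
import random

def train_test_split_rows(rows, test_ratio=0.2, seed=42):
    """Shuffle the rows and split them into train/test lists.

    The 80/20 ratio and fixed seed are illustrative assumptions,
    not necessarily what train_test_split_geonames.py hard-codes.
    """
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(len(rows) * (1 - test_ratio))
    return rows[:cut], rows[cut:]

# Geonames dump lines are tab-separated: geonameid, name, latitude, longitude, ...
sample = [f"{i}\tplace_{i}\t45.0\t4.8" for i in range(10)]
train, test = train_test_split_rows(sample)
print(len(train), len(test))  # 8 2
```

The split is done on whole lines so each record keeps all of its Geonames columns intact.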
### Cooccurrence data
5. First, download the Wikipedia corpus from which you want to extract co-occurrences: [English Wikipedia Corpus](https://dumps.wikimedia.org/enwiki/20200201/enwiki-20200201-pages-articles.xml.bz2)
6. Parse the corpus with the Gensim script using the following command: `python3 -m gensim.scripts.segment_wiki -i -f <wikicorpus> -o <1stoutputname>.json.gz`
7. Build a page-of-interest file that contains a list of Wikipedia pages. The file must be a CSV with the following columns: title, latitude, longitude.<br> You can find [here](https://projet.liris.cnrs.fr/hextgeo/files/place_en_fr_page_clean.csv) a page-of-interest file that contains places that appear in both the French and English Wikipedia.
8. Then, using an index that contains the pages of interest, run the command: `python3 script/get_cooccurrence.py <page_of_interest_file> <2ndoutputname> -c <1stoutputname>.json.gz`
9. Finally, split the resulting dataset with the script `train_test_split_cooccurrence_data.py <2ndoutputname>`
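The internals of `script/get_cooccurrence.py` are not reproduced here; conceptually, extracting co-occurrences means keeping, for each page of interest, the other pages of interest that its article mentions. A toy illustration of that idea (the data, function name, and link-based notion of co-occurrence are hypothetical stand-ins, not the script's actual implementation):

```python
def extract_cooccurrences(pages, of_interest):
    """For each page of interest, keep only the linked titles that are
    themselves pages of interest (a toy stand-in for the filtering that
    a co-occurrence extraction over the parsed dump performs)."""
    cooc = {}
    for title, links in pages.items():
        if title in of_interest:
            cooc[title] = sorted((set(links) & of_interest) - {title})
    return cooc

# Hypothetical parsed articles: title -> outgoing link titles
pages = {
    "Paris": ["Lyon", "France", "Seine"],
    "Lyon": ["Paris", "Rhône"],
    "France": ["Paris", "Lyon"],
}
of_interest = {"Paris", "Lyon"}
print(extract_cooccurrences(pages, of_interest))
# {'Paris': ['Lyon'], 'Lyon': ['Paris']}
```

In the real pipeline the `pages` mapping comes from the `<1stoutputname>.json.gz` file produced by the Gensim parsing step, and `of_interest` from the page-of-interest CSV.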
### If you're in a hurry
French Geonames, French Wikipedia cooccurrence data, and their train/test split datasets can be found here: [https://projet.liris.cnrs.fr/hextgeo/files/](https://projet.liris.cnrs.fr/hextgeo/files/)
...
## Train the network
The script `combination_embeddings.py` is responsible for training the neural network.
To train the network with default parameters, use the following command:
`python3 combination_embeddings.py -i <geoname data filename> <hierarchy geonames data filename>`
### Available parameters
...
| -t,--tolerance-value | K-value in the computation of the accuracy@k |
| -e,--epochs | number of epochs |
| -d,--dimension | size of the ngram embeddings |
| --admin_code_1 | (Optional) If you wish to train the network on a specific region |
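The flags in the table above could be wired with `argparse` roughly as follows. This is a minimal sketch: the default values shown are illustrative assumptions, not the ones hard-coded in `combination_embeddings.py`.

```python
import argparse

# Sketch of an argument parser matching the flags listed in the table;
# defaults here are assumptions, not the script's actual defaults.
parser = argparse.ArgumentParser(description="Train the embedding network")
parser.add_argument("geoname_input")
parser.add_argument("geonames_hierarchy_input")
parser.add_argument("-t", "--tolerance-value", type=float, default=0.002,
                    help="K-value in the computation of the accuracy@k")
parser.add_argument("-e", "--epochs", type=int, default=100,
                    help="number of epochs")
parser.add_argument("-d", "--dimension", type=int, default=256,
                    help="size of the ngram embeddings")
parser.add_argument("--admin_code_1", default=None,
                    help="(Optional) restrict training to a specific region")

# Example invocation with placeholder filenames
args = parser.parse_args(["FR.txt", "hierarchy.txt", "-e", "50"])
print(args.epochs, args.dimension)  # 50 256
```

Note that `argparse` maps the hyphenated `--tolerance-value` flag to the attribute `args.tolerance_value`.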