From 33651e79d5ec6c03199db5d1d8989dd1842ad00c Mon Sep 17 00:00:00 2001
From: Jacques Fize <jacques.fize@insa-lyon.fr>
Date: Fri, 14 Feb 2020 16:16:52 +0100
Subject: [PATCH] Change README

---
 README.md | 26 ++++++++++++--------------
 1 file changed, 12 insertions(+), 14 deletions(-)

diff --git a/README.md b/README.md
index 6bdd557..b43d729 100644
--- a/README.md
+++ b/README.md
@@ -26,24 +26,24 @@ For Anaconda users
 
     while read requirement; do conda install --yes $requirement; done < requirements.txt
 
- <hr>
 ## Prepare required data
 
 ### Geonames data
 
- * download the Geonames data use to train the network [here](download.geonames.org/export/dump/)
- * download the hierarchy data [here](http://download.geonames.org/export/dump/hierarchy.zip)
- * unzip both file in the directory of your choice
- * run the script `train_test_split_geonames.py <geoname_filename>`
+ 1. Download the Geonames data used to train the network [here](download.geonames.org/export/dump/)
+ 2. Download the hierarchy data [here](http://download.geonames.org/export/dump/hierarchy.zip)
+ 3. Unzip both files in the directory of your choice
+ 4. Run the script `train_test_split_geonames.py <geoname_filename>`
 
 ### Cooccurence data
 
- * First, you must download the Wikipedia corpus from which you want to extract co-occurrences : [English Wikipedia Corpus](https://dumps.wikimedia.org/enwiki/20200201/enwiki-20200201-pages-articles.xml.bz2)
- * Parse the corpus with Gensim script using the following command : `python3 -m gensim.scripts.segment_wiki -i -f <wikicorpus> -o <1stoutputname>.json.gz`
- * Build a page of interest file that contains a list of Wikipedia pages. The file must be a csv with the following column : title,latitude,longitude.<br> You can find [here](https://projet.liris.cnrs.fr/hextgeo/files/place_en_fr_page_clean.csv) a page of interest file that contains places that appears in both FR and EN wikipedia.
- * Then using and index that contains pages of interest run the command : `python3 script/get_cooccurrence.py <page_of_interest_file> <2noutputname> -c <1stoutputname>.json.gz`
- * Finally, split the resulting dataset with the script `train_test_split_cooccurrence_data.py <2ndoutputname>`
+ 5. First, download the Wikipedia corpus from which you want to extract co-occurrences: [English Wikipedia Corpus](https://dumps.wikimedia.org/enwiki/20200201/enwiki-20200201-pages-articles.xml.bz2)
+ 6. Parse the corpus with the Gensim script using the following command: `python3 -m gensim.scripts.segment_wiki -i -f <wikicorpus> -o <1stoutputname>.json.gz`
+ 7. Build a page of interest file that contains a list of Wikipedia pages. The file must be a CSV with the following columns: title,latitude,longitude.<br> You can find [here](https://projet.liris.cnrs.fr/hextgeo/files/place_en_fr_page_clean.csv) a page of interest file that contains places that appear in both the French and English Wikipedia.
+ 8. Then, using an index that contains the pages of interest, run the command: `python3 script/get_cooccurrence.py <page_of_interest_file> <2ndoutputname> -c <1stoutputname>.json.gz`
+ 9. Finally, split the resulting dataset with the script `train_test_split_cooccurrence_data.py <2ndoutputname>`
+
 ### If you're in a hurry
 French Geonames, French Wikipedia cooccurence data, and their train/test splits datasets can be found here : [https://projet.liris.cnrs.fr/hextgeo/files/](https://projet.liris.cnrs.fr/hextgeo/files/)
@@ -52,11 +52,9 @@ French Geonames, French Wikipedia cooccurence data, and their train/test splits
 
 ## Train the network
 
-The script `combination_embeddings.py` is the one responsible of the neural network training
-
 To train the network with default parameter use the following command :
 
-    python3 combination_embeddings.py -a -i <geoname data filename> <hierarchy geonames data filename>
+    python3 combination_embeddings.py -i <geoname data filename> <hierarchy geonames data filename>
 
 ### Available parameters
 
@@ -73,4 +71,4 @@ To train the network with default parameter use the following command :
 | -t,--tolerance-value | K-value in the computation of the accuracy@k |
 | -e,--epochs | number of epochs |
 | -d,--dimension | size of the ngram embeddings |
-| --admin_code_1 | (Optional) If you wish to train the network on a specificate region |
+| --admin_code_1 | (Optional) If you wish to train the network on a specific region |
--
GitLab
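Putting the documented commands together, the sketch below strings the preparation and training steps of this README into a single run. The concrete file names (`FR.txt`, `hierarchy.txt`, `pages_of_interest.csv`, `wiki_pages.json.gz`, `cooc_data.csv`, and the Wikipedia dump name) are illustrative placeholders, not names required by the scripts; the commands themselves are reproduced from the instructions above.

    # Minimal end-to-end sketch of the documented pipeline; all file names are placeholders.
    set -e

    # Geonames data (steps 1-4): the Geonames dump and the hierarchy file are assumed
    # to be downloaded and unzipped beforehand.
    python3 train_test_split_geonames.py FR.txt

    # Co-occurrence data (steps 5-9): segment the Wikipedia dump, extract
    # co-occurrences for the pages of interest, then split into train/test.
    python3 -m gensim.scripts.segment_wiki -i -f enwiki-20200201-pages-articles.xml.bz2 -o wiki_pages.json.gz
    python3 script/get_cooccurrence.py pages_of_interest.csv cooc_data.csv -c wiki_pages.json.gz
    python3 train_test_split_cooccurrence_data.py cooc_data.csv

    # Train the network with default parameters (see the parameter table above
    # for options such as -e/--epochs or -d/--dimension).
    python3 combination_embeddings.py -i FR.txt hierarchy.txt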