Toponym Geocoding

Use of n-gram representations and the co-location of toponyms in geographic space and text for geocoding

Figure 1: General workflow


Setup environment

  • Python 3.6+
  • OS-independent (all dependencies should work on Windows!)

It is strongly advised to use Anaconda in a Windows environment!

Install dependencies

pip3 install -r requirements.txt

For Anaconda users

while read requirement; do conda install --yes $requirement; done < requirements.txt

Prepare required data

Geonames data

  1. Download the Geonames data used to train the network here
  2. Download the hierarchy data here
  3. Unzip both files in the directory of your choice
  4. Run the script train_test_split_geonames.py <geoname_filename> (a sketch of this kind of split is shown after this list)
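
The following is only a hypothetical sketch of the kind of train/test split performed in step 4; it is not the repository's train_test_split_geonames.py. It assumes a standard tab-separated Geonames dump with no header row, and the output file names are made up for illustration.

# Hypothetical sketch only -- NOT the repository's train_test_split_geonames.py.
import pandas as pd
from sklearn.model_selection import train_test_split

def split_geonames(geoname_filename, test_size=0.2, seed=42):
    # Load the dump as strings to avoid type-guessing on mixed columns.
    df = pd.read_csv(geoname_filename, sep="\t", header=None, dtype=str)

    # Random 80/20 split with a fixed seed for reproducibility.
    train, test = train_test_split(df, test_size=test_size, random_state=seed)

    train.to_csv(geoname_filename + "_train.csv", sep="\t", header=False, index=False)
    test.to_csv(geoname_filename + "_test.csv", sep="\t", header=False, index=False)

if __name__ == "__main__":
    split_geonames("FR.txt")  # e.g. a country-level Geonames dump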

Co-occurrence data

  1. First, download the Wikipedia corpus from which you want to extract co-occurrences: English Wikipedia Corpus
  2. Parse the corpus with the Gensim script using the following command: python3 -m gensim.scripts.segment_wiki -i -f <wikicorpus> -o <1stoutputname>.json.gz
  3. Build a page-of-interest file that contains a list of Wikipedia pages. The file must be a CSV with the following columns: title,latitude,longitude (a sketch of this format is shown after this list).
    You can find here a page-of-interest file that contains places that appear in both the French and English Wikipedia.
  4. Then, using an index that contains the pages of interest, run the command: python3 script/get_cooccurrence.py <page_of_interest_file> <2ndoutputname> -c <1stoutputname>.json.gz
  5. Finally, split the resulting dataset with the script train_test_split_cooccurrence_data.py <2ndoutputname>
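
To make the page-of-interest format from step 3 concrete, here is a hypothetical helper that writes such a CSV. It is not part of the repository; the header row, the function name, and the example coordinates are assumptions.

# Hypothetical helper -- only illustrates the title,latitude,longitude layout.
import csv

def write_page_of_interest(places, output_path):
    """places: iterable of (title, latitude, longitude) tuples."""
    with open(output_path, "w", newline="", encoding="utf-8") as f:
        writer = csv.writer(f)
        writer.writerow(["title", "latitude", "longitude"])  # header row assumed
        for title, lat, lon in places:
            writer.writerow([title, lat, lon])

if __name__ == "__main__":
    write_page_of_interest(
        [("Paris", 48.8566, 2.3522), ("Lyon", 45.7640, 4.8357)],
        "page_of_interest.csv",
    )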

If you're in a hurry

French Geonames data, French Wikipedia co-occurrence data, and their train/test splits can be found here: https://projet.liris.cnrs.fr/hextgeo/files/


Train the network

To train the network with the default parameters, use the following command:

python3 combination_embeddings.py -i <geoname data filename> <hierarchy geonames data filename>
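
For example, using the Geonames and hierarchy files that appear in the grid-search example below (the paths are only illustrative), a run with inclusion relationships would look like:

python3 combination_embeddings.py -i ../data/geonamesData/US_FR.txt ../data/geonamesData/hierarchy.txt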

Train the network with different parameters

We built a small module that lets you run the network training with different sets of parameters. To do so, use the GridSearchModel class in lib.run. You can find an example in the following code:

from lib.run import GridSearchModel
from collections import OrderedDict

grid = GridSearchModel(
    "python3 combination_embeddings.py",
    # We use an OrderedDict since the order of the parameters is important.
    **OrderedDict({
        "rel": ["-i", "-a", "-c"],   # relation flags to vary across runs
        "-n": [4],                   # n-gram size
        "geoname_fn": "../data/geonamesData/US_FR.txt".split(),
        "hierarchy_fn": "../data/geonamesData/hierarchy.txt".split(),
        "store_true": ["rel"]        # the "rel" values are boolean flags (no value attached)
    }.items()))
grid.run()
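
Assuming the grid expands every combination of the listed values, the example above launches combination_embeddings.py three times, once per relation flag in rel (-i, -a, -c), each time with -n 4 and the same Geonames and hierarchy files; parameters listed under store_true are passed as flags without an attached value.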

Available parameters

Parameter                Description
-i, --inclusion          Use inclusion relationships to train the network
-a, --adjacency          Use adjacency relationships to train the network
-w, --wikipedia-coo      Use Wikipedia place co-occurrences to train the network
--wikipedia-cooc-fn      File that contains the co-occurrence data
--cooc-sample-size       Number of co-occurrence relations selected for each location in the co-occurrence data
--adjacency-iteration    Number of iterations in the adjacency extraction process
-n, --ngram-size         n-gram size
-t, --tolerance-value    k value used when computing accuracy@k
-e, --epochs             Number of epochs
-d, --dimension          Size of the n-gram embeddings
--admin_code_1           (Optional) Train the network on a specific region only