
Toponym Geocoding

Geocoding toponyms using n-gram representations and the co-occurrence of toponyms in geographic space and in text

Figure 1 : General workflow


Environment setup

  • Python 3.6+
  • OS-independent (all dependencies should work on Windows!)

It is strongly advised to use Anaconda in a Windows environment!

Install dependencies

pip3 install -r requirements.txt

For Anaconda users

while read requirement; do conda install --yes $requirement; done < requirements.txt

Prepare required data

Geonames data

  1. Download the Geonames data used to train the network here
  2. Download the hierarchy data here
  3. Unzip both files in the directory of your choice
  4. Run the script train_test_split_geonames.py <geoname_filename> (a sketch of the split step is shown after this list)
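For reference, here is a minimal sketch of what the train/test split step could look like. The actual logic lives in train_test_split_geonames.py; the tab separator, header-less layout, 80/20 ratio and filenames below are assumptions for illustration only.

# Illustrative only: split a Geonames dump into train/test parts.
# The separator, column layout and ratio are assumptions; use
# train_test_split_geonames.py for the real preprocessing.
import pandas as pd
from sklearn.model_selection import train_test_split

geoname_fn = "FR.txt"  # hypothetical Geonames country file
df = pd.read_csv(geoname_fn, sep="\t", header=None, dtype=str)

train, test = train_test_split(df, test_size=0.2, random_state=42)
train.to_csv(geoname_fn + "_train.csv", sep="\t", header=False, index=False)
test.to_csv(geoname_fn + "_test.csv", sep="\t", header=False, index=False)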

Co-occurrence data

  1. First, download the Wikipedia corpus from which you want to extract co-occurrences: English Wikipedia Corpus
  2. Parse the corpus with the Gensim script using the following command: python3 -m gensim.scripts.segment_wiki -i -f <wikicorpus> -o <1stoutputname>.json.gz
  3. Build a page-of-interest file that contains a list of Wikipedia pages. The file must be a CSV with the following columns: title, latitude, longitude (see the sketch after this list).
    You can find here a page-of-interest file that contains places that appear in both the FR and EN Wikipedia.
  4. Then, using an index that contains the pages of interest, run the command: python3 script/get_cooccurrence.py <page_of_interest_file> <2ndoutputname> -c <1stoutputname>.json.gz
  5. Finally, split the resulting dataset with the script train_test_split_cooccurrence_data.py <2ndoutputname>
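A minimal, hypothetical illustration of the page-of-interest format expected in step 3; the titles and coordinates below are placeholders, not data shipped with the project.

# Hypothetical page-of-interest file (step 3): a CSV with the
# columns title, latitude, longitude. Values are placeholders.
import pandas as pd

pages = pd.DataFrame([
    {"title": "Paris", "latitude": 48.8566, "longitude": 2.3522},
    {"title": "Lyon", "latitude": 45.7640, "longitude": 4.8357},
])
pages.to_csv("page_of_interest.csv", index=False, columns=["title", "latitude", "longitude"])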

If you're in a hurry

French Geonames data, French Wikipedia co-occurrence data, and their train/test splits can be found here: https://projet.liris.cnrs.fr/hextgeo/files/


Train the network

To train the network with the default parameters, use the following command:

python3 combination_embeddings.py -i <geoname data filename> <hierarchy geonames data filename>

Train the network with different parameters

We built a tiny module that allows running the network training with different parameter combinations. To do so, use the GridSearchModel class in lib.run. You can find an example in the following code:

from lib.run import GridSearchModel
from collections import OrderedDict

grid = GridSearchModel(
    "python3 combination_embeddings.py",
    # An OrderedDict is used because the order of the parameters is important
    **OrderedDict({
        "rel": ["-i", "-a", "-c"],
        "-n": [4],
        "geoname_fn": "../data/geonamesData/US_FR.txt".split(),
        "hierarchy_fn": "../data/geonamesData/hierarchy.txt".split(),
        "store_true": ["rel"]
    }.items())
)
grid.run()
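Assuming GridSearchModel expands the Cartesian product of the listed values (with the parameters named in store_true passed as bare flags), the grid above would launch one run per value of rel, e.g. python3 combination_embeddings.py -n 4 ../data/geonamesData/US_FR.txt ../data/geonamesData/hierarchy.txt -i; the exact argument order depends on the module's implementation.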

Available parameters

| Parameter | Description |
|---|---|
| -i, --inclusion | Use inclusion relationships to train the network |
| -a, --adjacency | Use adjacency relationships to train the network |
| -w, --wikipedia-coo | Use Wikipedia place co-occurrences to train the network |
| --wikipedia-cooc-fn | File that contains the co-occurrence data |
| --cooc-sample-size | Number of co-occurrence relations selected for each location in the co-occurrence data |
| --adjacency-iteration | Number of iterations in the adjacency extraction process |
| -n, --ngram-size | n-gram size |
| -t, --tolerance-value | K value in the computation of the accuracy@k |
| -e, --epochs | Number of epochs |
| -d, --dimension | Size of the n-gram embeddings |
| --admin_code_1 | (Optional) If you wish to train the network on a specific region |
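For example, a run combining inclusion and adjacency relationships with 4-grams could look like the following; the flag order and the epoch count are illustrative, not prescribed:

python3 combination_embeddings.py -i -a -n 4 -e 100 <geoname data filename> <hierarchy geonames data filename>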

New model based on BERT embeddings

In recent years, the BERT architecture proposed by Google researchers has outperformed state-of-the-art methods on various NLP tasks (POS tagging, NER, classification). To check whether BERT embeddings would increase the performance of our approach, we wrote a script to use BERT with our data. Our previous model returned two values, each in [0,1]. With BERT, the task shifts to classification (softmax), where each class corresponds to a cell on the globe. We use the hierarchical projection model Healpix. Other projection models such as S2 Geometry could be considered: https://s2geometry.io/about/overview.
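To make the classification target concrete, the sketch below maps a latitude/longitude pair to a HEALPix cell index with the healpy library; the nside resolution is an arbitrary choice for illustration, not necessarily the one used in bert.py.

# Minimal sketch: turn a coordinate into a HEALPix cell index usable as a
# classification label. nside=32 is an arbitrary resolution for illustration.
import numpy as np
import healpy as hp

def latlon_to_healpix(lat, lon, nside=32):
    theta = np.radians(90.0 - lat)   # colatitude in radians
    phi = np.radians(lon % 360.0)    # longitude in radians
    return hp.ang2pix(nside, theta, phi, nest=True)  # nested (hierarchical) scheme

print(latlon_to_healpix(48.8566, 2.3522))  # cell index containing Paris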

To run the training of this model, run the bert.py script:

python3 bert.py <train_dataset> <test_dataset>

The train and test datasets are tabular data composed of two columns: sentence and label.
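A hypothetical illustration of that two-column format; the sentences and label values below are placeholders (labels standing in for cell indices), and the CSV serialization is itself an assumption about how the dataset is stored.

# Hypothetical rows in the expected sentence/label format; values are
# placeholders, not real training data.
import pandas as pd

df = pd.DataFrame([
    {"sentence": "Paris is the capital of France", "label": 4532},
    {"sentence": "Lyon lies at the confluence of the Rhone and the Saone", "label": 4518},
])
df.to_csv("train_dataset.csv", index=False)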