This repository contains the code for "Using a deep neural network for toponym geocoding based on co-occurrences and spatial relations". In a nutshell, we propose to geocode place names using as little information as possible (two place names: one to geocode and a second used as context) and rely on a deep neural network architecture.


Model architecture

The model is a neural network. The first model is illustrated in Figure 1. In a nutshell, it aims to predict coordinates (output) from two place names: the first is the one we want to geocode and the second is used as context.

In an experiment (presented here), we found, and assume, that specific toponym affixes (suffixes or prefixes, for example) are bound to certain geographic areas. Based on this assumption, we decided to use an n-gram sequence representation of the input toponyms. For example, with 3-grams, Paris is transformed into Par, ari, ris.
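
This decomposition is easy to reproduce in a few lines of Python (a minimal sketch; the project's own tokenizer may differ in details such as padding):

def ngrams(toponym, n=3):
    """Decompose a toponym into its overlapping character n-grams."""
    return [toponym[i:i + n] for i in range(len(toponym) - n + 1)]

print(ngrams("Paris"))  # ['Par', 'ari', 'ris']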

Figure 1: General workflow
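
For readers who prefer code to diagrams, here is a minimal Keras sketch of this two-input design. It assumes shared n-gram embeddings, LSTM encoders, and a two-unit sigmoid output for the normalised (longitude, latitude) pair (the BERT section below confirms the first model outputs two values in [0,1]); layer types and sizes in the released models may differ.

from tensorflow.keras import layers, Model

# Hypothetical sizes; the released models may use different values.
NUM_NGRAMS, SEQ_LEN, EMB_DIM = 10000, 15, 100

inp_target = layers.Input(shape=(SEQ_LEN,))   # toponym to geocode
inp_context = layers.Input(shape=(SEQ_LEN,))  # context toponym

embed = layers.Embedding(NUM_NGRAMS, EMB_DIM)  # shared n-gram embeddings
encode = layers.LSTM(EMB_DIM)                  # shared sequence encoder

merged = layers.Concatenate()(
    [encode(embed(inp_target)), encode(embed(inp_context))])
coords = layers.Dense(2, activation="sigmoid")(merged)  # scaled (lon, lat)

model = Model([inp_target, inp_context], coords)
model.compile(optimizer="adam", loss="mean_squared_error")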


Setup environment

  • Python 3.6+
  • OS independent**

**It is strongly advised to use Anaconda in a Windows environment!

Install dependencies

pip3 install -r requirements.txt

For Anaconda users

while read requirement; do conda install --yes $requirement; done < requirements.txt

Get Started

Get pre-trained model

Pre-trained models are available:

Geographic Area Description URL
FR Model trained on populated places and areas in France https://projet.liris.cnrs.fr/hextgeo/files/trained_models/FR_MODEL_2.zip
GB Model trained on populated places and areas in Great Britain https://projet.liris.cnrs.fr/hextgeo/files/trained_models/GB_MODEL_2.zip
US Model trained on populated places and areas in the United States https://projet.liris.cnrs.fr/hextgeo/files/trained_models/US_MODEL_2.zip
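
Each archive can be fetched and unpacked by hand, or with a short standard-library Python snippet such as this sketch (extraction into the current directory is an assumption; adjust the path so it matches the model paths used below):

import urllib.request
import zipfile

url = "https://projet.liris.cnrs.fr/hextgeo/files/trained_models/FR_MODEL_2.zip"
urllib.request.urlretrieve(url, "FR_MODEL_2.zip")
with zipfile.ZipFile("FR_MODEL_2.zip") as archive:
    archive.extractall()  # should yield the FR_MODEL_2/ files used below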

Load and use the model

The first step is to import the dedicated module and load a pre-trained model file. Here, we use the France model.

from lib.geocoder.our_geocoder import Geocoder
g = Geocoder("FR_MODEL_2/FR.txt_100_4_100__A_C.h5","FR_MODEL_2/FR.txt_100_4_100__A_C_index")

To geocode a pair of toponyms, use the get_coord method:

print(g.get_coord("Paris","France"))
#(2.7003836631774902, 41.24913454055786) #lon,lat

To reduce computation time, use the get_coords method to geocode multiple pairs of toponyms:

print(g.get_coords(["Paris","Paris"],["Cherbourg","Montpellier"]))
#(array([2.6039734, 3.480011 ], dtype=float32),
# array([48.27507 , 48.075943], dtype=float32))

Train your own model

We propose an implementation of the model illustrated in Figure 1, and a second model based on the same input but using a pre-trained BERT model.

Prepare data

The data preparation is divided into three steps. First, we retrieve the required data from Geonames. Second, we retrieve place name co-occurrences from Wikipedia. Finally, we generate the datasets used to train the model.

Geonames data

  1. Download the Geonames data used to train the network here
  2. Download the hierarchy data here
  3. Unzip both files in the directory of your choice (a quick way to inspect the data follows below)
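
Geonames dumps are plain tab-separated files without a header row. A quick pandas look is enough to sanity-check a download (a sketch; FR.txt stands for whichever country file you fetched):

import pandas as pd

# Geonames columns are documented in the dump's readme;
# keep only id, name, latitude and longitude (positions 0, 1, 4, 5).
places = pd.read_csv("FR.txt", sep="\t", header=None, usecols=[0, 1, 4, 5])
places.columns = ["geonameid", "name", "latitude", "longitude"]
print(places.head())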

Co-occurrence data

  1. First, download the Wikipedia corpus from which you want to extract co-occurrences: English Wikipedia Corpus
  2. Parse the corpus with the Gensim script using the following command: python3 -m gensim.scripts.segment_wiki -i -f <wikicorpus> -o <1stoutputname>.json.gz
  3. Build a pages-of-interest file that contains a list of Wikipedia pages. Use the script extract_pages_of_interest.py for that. You can find here a pages-of-interest file that contains places that appear in the FR or EN Wikipedia.
  4. Then, using the pages-of-interest file, run the following command (see the example after this list): python3 script/get_cooccurrence.py <page_of_interest_file> <final_output_name> -c <1stoutputname>.json.gz
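
Put together, and with purely illustrative file names, the two extraction steps chain like this:

python3 -m gensim.scripts.segment_wiki -i -f enwiki-latest-pages-articles.xml.bz2 -o enwiki.json.gz
python3 script/get_cooccurrence.py pages_of_interest.txt cooccurrences.txt -c enwiki.json.gz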

Generate dataset

Use the following command to generate the datasets for training your model.

python3 generate_dataset.py <geonames_dataset> <wikipedia_dataset> <geonames_hierarchy_data>
Parameter Description
--cooc-sampling Number of co-occurrences sampled for each place in the co-occurrence dataset
--adj-sampling Number of adjacency relations extracted for each place in a Healpix cell
--adj-nside Healpix resolution within which places are considered adjacent
--split-nside Size of the zones where the train/test split is done
--split-method [per_pair|per_entity] Split each dataset based on places (a place cannot exist in both train and test) or on pairs (a place can appear in both train and test)
--no-sampling Avoid sampling in the generated pairs
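
An illustrative invocation (file names and sampling values are placeholders, not recommended settings):

python3 generate_dataset.py FR.txt cooccurrences.txt hierarchy.txt --cooc-sampling 4 --adj-sampling 4 --adj-nside 128 --split-nside 32 --split-method per_pair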

If you're in a hurry

French (also GB, US) Geonames data, French (also GB, US) Wikipedia co-occurrence data, and the corresponding train/test split datasets can be found here: https://projet.liris.cnrs.fr/hextgeo/files/

Our model

To train the first model, use the following command:

python3 train_geocoder.py <dataset_name> <inclusion_dataset> <adjacent_dataset> <cooccurrence_dataset> [-i | -a | -w ]+ [optional args]
Parameter Description
-i,--inclusion Use inclusion relationships to train the network
-a,--adjacency Use adjacency relationships to train the network
-w,--wikipedia-coo Use Wikipedia place co-occurrences to train the network
-n,--ngram-size n-gram size
-t,--tolerance-value K value used in the computation of accuracy@k (in kilometers)
-e,--epochs number of epochs
-d,--dimension size of the n-gram embeddings
--admin_code_1 (Optional) Train the network on a specific region only
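
For example, to train on all three relation types with 4-grams (dataset names and hyperparameter values below are illustrative):

python3 train_geocoder.py FR.txt FR_inclusion.txt FR_adjacent.txt FR_cooc.txt -i -a -w -n 4 -t 100 -e 100 -d 100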

[In Progress] BERT model

In recent years, the BERT architecture proposed by Google researchers has outperformed state-of-the-art methods on various NLP tasks (POS tagging, NER, classification). To verify whether BERT embeddings would increase the performance of our approach, we wrote a script to use BERT with our data. Our previous model returned two values, each in [0,1]. With BERT, the task shifts to classification (softmax), where each class corresponds to a cell on the globe. We use the hierarchical projection model Healpix; other projection models such as S2 Geometry could also be considered: https://s2geometry.io/about/overview.
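
To make the label space concrete: the healpy package maps any (lon, lat) pair to a Healpix cell index, which can then serve as a class label (a sketch; the nside resolution here is illustrative):

import healpy as hp

nside = 128  # illustrative resolution; larger nside means smaller cells
lon, lat = 2.35, 48.85  # Paris
cell = hp.ang2pix(nside, lon, lat, lonlat=True)
print(cell)  # Healpix cell index used as the classification label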

To train this model, run the train_bert_geocoder.py script:

python3 train_bert_geocoder.py \
<train_dataset>\
<test_dataset>\
<output_dir>\
[--batch_size BATCH_SIZE | --epochs EPOCHS]

The train and test datasets are tabular data composed of two columns: sentence and label.
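
As a toy illustration of that layout (the exact sentence format expected by the scripts is an assumption here):

import pandas as pd

# Hypothetical rows: the sentence pairs a toponym with its context toponym,
# and the label is assumed to be the Healpix cell index of the target.
train = pd.DataFrame({
    "sentence": ["Paris France", "Montpellier France"],
    "label": [103421, 98317],
})
train.to_csv("train_dataset.csv", index=False)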

Pretrained models

Pretrained models can be found here

Use the BERT model

from lib.geocoder.bert_geocoder import BertGeocoder
geocoder = BertGeocoder(<bert_model_dir>,<label_healpix_file>)
geocoder.geocode(<toponyms>,<context_toponyms>)
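
Judging from the batch interface of the first model, <toponyms> and <context_toponyms> appear to be parallel lists, so a hypothetical call would be:

geocoder.geocode(["Paris"], ["France"])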

Train multiple models with different parameters

We built a tiny module that runs the network training with different parameter combinations. To do so, use the GridSearchModel class in lib.run. You can find an example in the following code:

from lib.run import GridSearchModel
from collections import OrderedDict

grid = GridSearchModel(
    "python3 train_geocoder_v2.py",
    **OrderedDict({  # we use an OrderedDict since the order of the parameters matters
        "rel": ["-i", "-a", "-c"],
        "-n": [4],
        "geoname_fn": "../data/geonamesData/US_FR.txt".split(),
        "hierarchy_fn": "../data/geonamesData/hierarchy.txt".split(),
        "store_true": ["rel"],
    }.items()))
grid.run()
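
Each combination of the listed values is expanded into one command line, so the grid above presumably launches one train_geocoder_v2.py run per relation flag in rel (-i, -a and -c), all with -n 4 and the same Geonames files.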

Authors and Acknowledgment

Proposed by Jacques Fize, Ludovic Moncla and Bruno Martins

This research is supported by an IDEXLYON project of the University of Lyon within the framework of the Investments for the Future Program (ANR-16-IDEX-0005). Bruno Martins was supported by the Fundação para a Ciência e a Tecnologia (FCT), through the project grants PTDC/CCI-CIF/32607/2017 (CMIMU) and UIDB/50021/2020 (INESC-ID multi-annual funding).