This repository contains the code for "Using a deep neural network for toponym geocoding based on co-occurrences and spatial relations". In a nutshell, we propose to geocode place names using the least information available (two place names: one to geocode and a second used as context) and rely on a deep neural network architecture.
Model architecture
The model is a neural network. The first model is illustrated in Figure 1. In a nutshell, the model aims to predict coordinates (output) from two place names. The first place name is the one we want to geocode and the second is used as context.
In an experiment (presented here), we found that specific toponym affixes (suffixes or prefixes, for example) are bound to certain geographic areas. Based on this assumption, we decided to use an n-gram sequence representation of the input toponyms. For example, Paris is transformed to Par, ari, ris.
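To make this concrete, here is a minimal sketch of such an n-gram split (illustrative code, not the repository's exact tokenizer):

def ngram_split(toponym, n=3):
    # Split a toponym into overlapping character n-grams.
    return [toponym[i:i + n] for i in range(len(toponym) - n + 1)]

print(ngram_split("Paris"))  # ['Par', 'ari', 'ris']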
Setup environment
- Python 3.6+
- OS independent**

**It is strongly advised to use Anaconda in a Windows environment!
Install dependencies
pip3 install -r requirements.txt
For Anaconda users
while read requirement; do conda install --yes $requirement; done < requirements.txt
Get Started
Get pre-trained model
Pre-trained models are available:
Geographic Area | Description | URL |
---|---|---|
FR | Model trained on populated places and areas of France | https://projet.liris.cnrs.fr/hextgeo/files/trained_models/FR_MODEL_2.zip |
GB | Model trained on populated places and areas of England | https://projet.liris.cnrs.fr/hextgeo/files/trained_models/GB_MODEL_2.zip |
US | Model trained on populated places and areas of the United States of America | https://projet.liris.cnrs.fr/hextgeo/files/trained_models/US_MODEL_2.zip |
Load and use the model
The first step is to import the dedicated module and load the pre-trained model files. Here, we'll be using the France model.
from lib.geocoder.our_geocoder import Geocoder
g = Geocoder("FR_MODEL_2/FR.txt_100_4_100__A_C.h5","FR_MODEL_2/FR.txt_100_4_100__A_C_index")
To geocode a pair of toponyms, use the model.get_coord method:
print(g.get_coord("Paris","France"))
#(2.7003836631774902, 41.24913454055786) #lon,lat
To reduce computation time, use model.get_coords to geocode multiple pairs of toponyms:
print(g.get_coords(["Paris","Paris"],["Cherbourg","Montpellier"]))
#(array([2.6039734, 3.480011 ], dtype=float32),
# array([48.27507 , 48.075943], dtype=float32))
Train your own model
We propose an implementation of the model illustrated in Figure 1 and a second one based on the same input but using a pre-trained BERT model.
Prepare data
The data preparation is divided into three steps. First, we retrieve the required data from Geonames. Second, we retrieve place name co-occurrences from Wikipedia. Finally, we generate the datasets used to train the model.
Geonames data
- Download the Geonames data used to train the network here
- Download the hierarchy data here
- Unzip both files in the directory of your choice
Co-occurrence data
- First, download the Wikipedia corpus from which you want to extract co-occurrences: English Wikipedia Corpus
- Parse the corpus with the Gensim script using the following command:
python3 -m gensim.scripts.segment_wiki -i -f <wikicorpus> -o <1stoutputname>.json.gz
- Build a page of interest file that contains a list of Wikipedia pages. Use the script extract_pages_of_interest.py for that. You can find here a page of interest file that contains places that appear in the FR or EN Wikipedia.
- Then, using the page of interest file, run the following command:
python3 script/get_cooccurrence.py <page_of_interest_file> <final_output_name> -c <1stoutputname>.json.gz
Generate dataset
Use the following command to generate the datasets for training your model.
python3 generate_dataset.py <geonames_dataset> <wikipedia_dataset> <geonames_hierarchy_data>
Parameter | Description |
---|---|
--cooc-sampling | Number of co-occurrences sampled for each place in the co-occurrence dataset |
--adj-sampling | Number of adjacency relations extracted for each place in a Healpix cell |
--adj-nside | Healpix resolution at which places within the same cell are considered adjacent |
--split-nside | Size of the zones where the train/test split is done |
--split-method | [per_pair|per_entity] Split each dataset based on places (a place cannot appear in both train and test) or on pairs (a place can appear in both train and test); see the sketch below |
--no-sampling | Avoid sampling in the generated pairs |
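To make the two split strategies concrete, here is a minimal sketch of a per_entity split (hypothetical code, not the repository's implementation), which guarantees that no place appears on both sides:

import random

def per_entity_split(pairs, test_ratio=0.5, seed=42):
    # Hypothetical per_entity split: no place appears in both train and test.
    # Pairs mixing a train place and a test place are simply dropped in this sketch.
    random.seed(seed)
    places = sorted({p for pair in pairs for p in pair})
    test_places = set(random.sample(places, int(len(places) * test_ratio)))
    train = [pair for pair in pairs if not set(pair) & test_places]
    test = [pair for pair in pairs if set(pair) <= test_places]
    return train, test

train, test = per_entity_split([("Paris", "France"), ("Lyon", "France"), ("Paris", "Seine")])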
If you're in a hurry
Geonames data, Wikipedia co-occurrence data, and train/test split datasets for France (as well as GB and US) can be found here: https://projet.liris.cnrs.fr/hextgeo/files/
Our model
To train the first model, use the following command:
python3 train_geocoder.py <dataset_name> <inclusion_dataset> <adjacent_dataset> <cooccurrence_dataset> [-i | -a | -w ]+ [optional args]
Parameter | Description |
---|---|
-i,--inclusion | Use inclusion relationships to train the network |
-a,--adjacency | Use adjacency relationships to train the network |
-w,--wikipedia-coo | Use Wikipedia place co-occurrences to train the network |
-n,--ngram-size | N-gram size |
-t,--tolerance-value | K value in the computation of accuracy@k (K is expressed in kilometers) |
-e,--epochs | Number of epochs |
-d,--dimension | Size of the n-gram embeddings |
--admin_code_1 | (Optional) If you wish to train the network on a specific region |
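For example, a hypothetical invocation using all three relation types, 4-grams, and 100 epochs (the dataset file names below are placeholders) could look like:

python3 train_geocoder.py FR.txt FR_inclusion.csv FR_adjacent.csv FR_cooc.csv -i -a -w -n 4 -e 100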
[In Progress] BERT model
In recent years, the BERT architecture proposed by Google researchers has outperformed state-of-the-art methods on various NLP tasks (POS tagging, NER, classification). To check whether BERT embeddings would increase the performance of our approach, we wrote a script to use BERT with our data. Our previous model returned two values, each within [0,1]. With BERT, the task shifts to classification (softmax), where each class corresponds to a cell on the globe. We use the hierarchical projection model HEALPix. Other projection models such as S2 Geometry can be considered: https://s2geometry.io/about/overview.
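As an illustration of this label space, geographic coordinates can be mapped to a HEALPix cell identifier with the healpy package (a sketch; the nside resolution chosen here is arbitrary):

import healpy as hp

nside = 32  # arbitrary HEALPix resolution for illustration; larger nside means smaller cells
lon, lat = 2.35, 48.85  # longitude/latitude of Paris, in degrees
cell_id = hp.ang2pix(nside, lon, lat, lonlat=True)
print(cell_id)  # the cell index that would serve as the classification label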
To run this model training, run the train_bert_geocoder.py script:
python3 train_bert_geocoder.py \
<train_dataset>\
<test_dataset>\
<output_dir>\
[--batch_size BATCH_SIZE | --epochs EPOCHS]
The train and test datasets are tabular data composed of two columns: sentence and label.
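For reference, such a file could be produced as follows (hypothetical contents, assuming the sentence concatenates the toponym pair and the label is a HEALPix cell id):

import pandas as pd

# Hypothetical rows: "sentence" pairs a toponym with its context toponym,
# "label" is the HEALPix cell the pair falls into (made-up ids here).
df = pd.DataFrame({
    "sentence": ["Paris France", "Lyon France"],
    "label": [4123, 4087],
})
df.to_csv("train_dataset.csv", index=False)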
Pre-trained model
A pre-trained model can be found here
Use BERT model
from lib.geocoder.bert_geocoder import BertGeocoder
geocoder = BertGeocoder(<bert_model_dir>,<label_healpix_file>)
geocoder.geocode(<toponyms>,<context_toponyms>)
Train multiple models with different parameters
We built a tiny module that allows you to run the network training with different sets of parameters. To do so, use the GridSearchModel class in lib.run. You can find an example in the following code:
from lib.run import GridSearchModel
from collections import OrderedDict

grid = GridSearchModel(
    "python3 train_geocoder_v2.py",
    **OrderedDict({  # We use an OrderedDict since the order of the parameters is important
        "rel": ["-i", "-a", "-c"],
        "-n": [4],
        "geoname_fn": "../data/geonamesData/US_FR.txt".split(),
        "hierarchy_fn": "../data/geonamesData/hierarchy.txt".split(),
        "store_true": ["rel"]
    }.items())
)
grid.run()
Authors and Acknowledgment
Proposed by Jacques Fize, Ludovic Moncla and Bruno Martins
This research is supported by an IDEXLYON project of the University of Lyon within the framework of the Investments for the Future Program (ANR-16-IDEX-0005). Bruno Martins was supported by the Fundação para a Ciência e a Tecnologia (FCT), through the project grants PTDC/CCI-CIF/32607/2017 (MIMU) and UIDB/50021/2020 (INESC-ID multi-annual funding).