# Toponym Geocoding

Use of n-gram representations and co-occurrences of toponyms in geography and text for geocoding.
## Setup environment

- Python 3.6+
- OS independent (all dependencies should work on Windows!)

It is strongly advised to use Anaconda in a Windows environment!
### Install dependencies

```
pip3 install -r requirements.txt
```

For Anaconda users:

```
while read requirement; do conda install --yes $requirement; done < requirements.txt
```
## Prepare required data

### Geonames data

- Download the Geonames data used to train the network here
- Download the hierarchy data here
- Unzip both files in the directory of your choice
- Run the script:

```
python3 train_test_split_geonames.py <geoname_filename>
```
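For example, assuming you downloaded the French Geonames dump as `FR.txt` (the filename is illustrative):

```
python3 train_test_split_geonames.py FR.txt
```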
### Co-occurrence data

- First, you must download the Wikipedia corpus from which you want to extract co-occurrences: English Wikipedia Corpus
- Parse the corpus with the Gensim script using the following command:

```
python3 -m gensim.scripts.segment_wiki -i -f <wikicorpus> -o <1stoutputname>.json.gz
```

- Build a page-of-interest file that contains a list of Wikipedia pages. The file must be a CSV with the following columns: title, latitude, longitude (see the example after this list). You can find here a page-of-interest file that contains places that appear in both the French and English Wikipedia.
- Then, using an index that contains the pages of interest, run the command:

```
python3 script/get_cooccurrence.py <page_of_interest_file> <2ndoutputname> -c <1stoutputname>.json.gz
```

- Finally, split the resulting dataset with the script:

```
python3 train_test_split_cooccurrence_data.py <2ndoutputname>
```
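As an illustration, the page-of-interest file could look like this (the rows are hypothetical, only the column layout is prescribed):

```
title,latitude,longitude
Paris,48.8566,2.3522
Lyon,45.7640,4.8357
```

Putting the steps together with illustrative filenames, a full run of the pipeline might look like:

```
python3 -m gensim.scripts.segment_wiki -i -f enwiki-latest-pages-articles.xml.bz2 -o enwiki.json.gz
python3 script/get_cooccurrence.py pages_of_interest.csv cooccurrences.csv -c enwiki.json.gz
python3 train_test_split_cooccurrence_data.py cooccurrences.csv
```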
### If you're in a hurry

French Geonames data, French Wikipedia co-occurrence data, and their train/test splits can be found here: https://projet.liris.cnrs.fr/hextgeo/files/
## Train the network

To train the network with the default parameters, use the following command:

```
python3 combination_embeddings.py -i <geoname data filename> <hierarchy geonames data filename>
```
### Train the network with different parameters
We built a tiny module that allows you to run the network training with different combinations of parameters. To do so, use the `GridSearchModel` class in `lib.run`. You can find an example in the following code:
```python
from lib.run import GridSearchModel
from collections import OrderedDict

grid = GridSearchModel(
    "python3 combination_embeddings.py",
    # We use an OrderedDict since the order of the parameters is important
    **OrderedDict({
        "rel": ["-i", "-a", "-c"],  # relation flags to test
        "-n": [4],                  # ngram size
        "geoname_fn": ["../data/geonamesData/US_FR.txt"],
        "hierarchy_fn": ["../data/geonamesData/hierarchy.txt"],
        "store_true": ["rel"]       # "rel" values are passed as flags (no value)
    }.items())
)
grid.run()
```
### Available parameters

| Parameter | Description |
|---|---|
| -i, --inclusion | Use inclusion relationships to train the network |
| -a, --adjacency | Use adjacency relationships to train the network |
| -w, --wikipedia-cooc | Use Wikipedia place co-occurrences to train the network |
| --wikipedia-cooc-fn | File that contains the co-occurrence data |
| --cooc-sample-size | Number of co-occurrence relations selected for each location in the co-occurrence data |
| --adjacency-iteration | Number of iterations in the adjacency extraction process |
| -n, --ngram-size | ngram size |
| -t, --tolerance-value | k value used in the computation of accuracy@k |
| -e, --epochs | Number of epochs |
| -d, --dimension | Size of the ngram embeddings |
| --admin_code_1 | (Optional) Train the network on a specific region |
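For example, a run that combines inclusion and adjacency relationships with 4-grams over 100 epochs might look like this (the file paths are illustrative):

```
python3 combination_embeddings.py -i -a -n 4 -e 100 ../data/geonamesData/FR.txt ../data/geonamesData/hierarchy.txt
```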
## New model based on BERT embeddings

In recent years, the BERT architecture proposed by Google researchers has outperformed state-of-the-art methods on various NLP tasks (POS tagging, NER, classification). To check whether BERT embeddings can increase the performance of our approach, we wrote a script that uses BERT with our data. In our previous model, the network returned two values, each in [0,1]. With BERT, the task shifts to classification (softmax), where each class corresponds to a cell on the globe. We use the hierarchical projection model Healpix. Other projection models such as S2 Geometry could be considered: https://s2geometry.io/about/overview.
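To illustrate the class construction, here is a minimal sketch (not the repository's code) of how a latitude/longitude pair can be mapped to a Healpix cell id that serves as a classification label. It assumes the healpy package and a hypothetical `nside` resolution:

```python
import healpy as hp

def latlon_to_label(lat, lon, nside=32):
    """Return the id of the Healpix cell containing the given point.

    nside (32 here, purely illustrative) controls the resolution:
    the globe is divided into 12 * nside**2 cells.
    """
    # With lonlat=True, ang2pix takes longitude and latitude in degrees.
    return hp.ang2pix(nside, lon, lat, lonlat=True)

print(latlon_to_label(48.85, 2.35))  # id of the cell containing Paris
```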
To run this model training, run the bert.py script:

```
python3 bert.py <train_dataset> <test_dataset>
```
The train and test datasets are tabular data composed of two columns: sentence and label.
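For instance, assuming a CSV layout (the delimiter and label values are illustrative; the label would be the id of the cell containing the place), a dataset might start like this:

```
sentence,label
"Dijon is a city in eastern France.",4242
"Lyon sits at the confluence of the Rhône and the Saône.",4243
```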