- Work on Place-embedding
- Setup environment
- Install dependencies
- First approach: Embedding using places' Wikipedia pages
- Step 1: Parse Wikipedia data
- Step 2: Select and filter entities from Wikidata
- Step 3: Extract data from Wikipedia pages
- Step 4: Run embedding extraction
- Available Parameters
- Output
- Second approach: Geonames place embedding
- Generate the embedding
- Available Parameters
- Output files
- Embedding: train using concatenation of close places
- Prepare required data
- Train the network
- Available parameters
Work on Place-embedding
This repo contains various approaches to geographic place embedding, and more precisely to its use for geocoding. So far, we have designed three approaches:
- Use of geographic places' Wikipedia pages to learn an embedding for toponyms
- Use of the Geonames place topology to produce an embedding using graph-embedding techniques
- Use of toponym co-occurrences based on spatial relationships (inclusion, adjacency) for geocoding
Setup environment
- Python 3.6+
- OS-independent (all dependencies work on Windows!)
It is strongly advised to use Anaconda in a Windows environment!
Install dependencies
pip3 install -r requirements.txt
For Anaconda users
while read requirement; do conda install --yes $requirement; done < requirements.txt
First approach: Embedding using places' Wikipedia pages
In this first approach, the goal is to produce embeddings for place names. To do this, we designed a neural network that takes:
- Input: a text sequence (phrase)
- Output: latitude, longitude, and the place type
Input texts are selected using Wikidata to filter Wikipedia pages about geographic places. Then, the filtered pages are retrieved from the Wikipedia corpus file. For each page, we extract (an illustrative record is shown after this list):
- Title
- Introduction text
- Coordinates of the place (latitude, longitude)
- Place type (using a mapping between Wikidata and DBpedia Place subclasses)
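To make this concrete, here is an illustrative example of one extracted record; the field names are hypothetical and may differ from those used by the actual scripts:

```python
# Hypothetical training record for one Wikipedia page (field names are illustrative):
record = {
    "title": "Lyon",                                          # page title (toponym)
    "text": "Lyon is the third-largest city of France ...",   # introduction text
    "latitude": 45.76,                                        # place coordinates
    "longitude": 4.83,
    "place_type": "City",                                     # DBpedia Place subclass
}
```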
Step 1: Parse Wikipedia data
First, download the Wikipedia corpus in the desired language, e.g. enwiki-latest-pages-articles.xml.bz2
Then, use the gensim Wikipedia parser (see the gensim documentation) with the following command:
python3 -m gensim.scripts.segment_wiki -i -f <wikipedia_dump_file> -o <output>
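The parser writes one JSON article per line in a gzip-compressed file. The snippet below is a minimal sketch for inspecting that output (the filename is only an example; use whatever you passed to -o):

```python
# Peek at the segment_wiki output: one JSON article per line, gzip-compressed.
import json
from smart_open import open as sopen  # smart_open is installed together with gensim

with sopen("enwiki-latest.json.gz", "rb") as f:
    for line in f:
        article = json.loads(line)
        # each article exposes its title, section titles and section texts
        print(article["title"], article["section_titles"][:3])
        break
```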
Step 2: Select and filter entities from Wikidata
We use Wikidata to identify which Wikipedia pages concern a place. Simply run the following command:
python3 1_extractDataFromWikidata.py <Wikidata Dump (.gz)> <output_filename>
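For intuition, the filtering boils down to keeping Wikidata entities that carry a coordinate location (P625) and a sitelink to the target Wikipedia edition. The sketch below only illustrates this idea; it is not the actual 1_extractDataFromWikidata.py logic:

```python
# Simplified illustration: stream a Wikidata JSON dump (.gz) and yield entities that
# have a coordinate location (P625) and a sitelink to the chosen Wikipedia edition.
import gzip
import json

def iter_place_entities(dump_path, wiki="enwiki"):
    with gzip.open(dump_path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip().rstrip(",")
            if not line or line in ("[", "]"):
                continue  # skip the enclosing JSON array brackets
            entity = json.loads(line)
            if "P625" in entity.get("claims", {}) and wiki in entity.get("sitelinks", {}):
                yield entity["id"], entity["sitelinks"][wiki]["title"]
```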
Step 3: Extract data from Wikipedia pages
Using the previous outputs, we extract text data from the selected Wikipedia pages with the following command:
python3 2_extractLearningDataset.py <wikipedia_filename (output from step 1)> <wikidata_extract(output from step2)> <output_filename>
Step 4: Run embedding extraction
To train the model and extract the place embeddings, use the embeddings_lat_lon_type.py script (an example invocation follows the parameter table below).
Available Parameters
Parameter | Description (default value) |
---|---|
--max_sequence_length | Maximum sequence length (15) |
--embedding_dimension | Embedding vector size (100) |
--batch_size | Batch size used during training (100) |
--epochs | Number of epochs (100) |
-v | Enable Keras verbose output |
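A hypothetical invocation (the positional dataset argument is an assumption; pass the file produced in step 3, and the flag values are illustrative only):

python3 embeddings_lat_lon_type.py <dataset (output from step 3)> --embedding_dimension 100 --batch_size 1000 --epochs 20 -v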
Output
The different outputs (one for each neural network architecture) are stored in the outputs directory:
- outputs/Bi-GRU_100dim_20epoch_1000batch__coord.png : coordinates accuracy plot
- outputs/Bi-GRU_100dim_20epoch_1000batch__place_type.png : place type accuracy plot
- outputs/Bi-GRU_100dim_20epoch_1000batch.csv : training history
- outputs/Bi-GRU_100dim_20epoch_1000batch.txt : embeddings
Second approach: Geonames place embedding
From this point, we change our vantage point and focus our models on heavily spatial/geographical data, in this case a gazetteer. In this second approach, we propose to generate an embedding for places (not place toponyms) based on their topology.
To do that, we use Geonames data to build a topology graph. This graph is generated from the intersections found between place buffers.
(image here)
Then, using this topology network, we apply node-embedding techniques to generate an embedding for each vertex (place).
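The sketch below illustrates this pipeline under stated assumptions (it is not the repo's geonames_embedding.py): build an adjacency graph from buffer intersections, generate random walks over it, and train a gensim Word2Vec model on the walks, DeepWalk-style. The coordinates, buffer size, and hyper-parameters are illustrative only.

```python
# DeepWalk-style sketch: buffer-intersection graph -> random walks -> Word2Vec (gensim >= 4).
import random
import networkx as nx
from shapely.geometry import Point
from gensim.models import Word2Vec

# toy input: {geonameid: (longitude, latitude)}
places = {"3017382": (2.00, 46.00), "2968815": (2.35, 48.85), "2972315": (1.44, 43.60)}

# 1. buffer each place and connect places whose buffers intersect (adjacency)
buffers = {pid: Point(lon, lat).buffer(2.0) for pid, (lon, lat) in places.items()}
graph = nx.Graph()
graph.add_nodes_from(places)
ids = list(places)
for i, a in enumerate(ids):
    for b in ids[i + 1:]:
        if buffers[a].intersects(buffers[b]):
            graph.add_edge(a, b)

# 2. generate fixed-length random walks starting from every vertex
def random_walks(g, num_walks=10, walk_length=20):
    walks = []
    for _ in range(num_walks):
        for node in g.nodes():
            walk = [node]
            while len(walk) < walk_length:
                neighbours = list(g.neighbors(walk[-1]))
                if not neighbours:
                    break
                walk.append(random.choice(neighbours))
            walks.append(walk)
    return walks

# 3. learn one vector per place (vertex) from the walks
model = Word2Vec(random_walks(graph), vector_size=64, window=5, min_count=0, sg=1)
print(model.wv["2968815"])
```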
Generate the embedding
First, download the Geonames dump here.
N.B. We advise you to use only the data from one country! The topology network can become really dense and large!
python3 geonames_embedding.py <geonames dump(*.txt)>
Available Parameters
Parameter | Description (default) |
---|---|
--nbcpu | Number of CPUs used during the learning phase |
--vector-size | Embedding size |
--walk-length | Generated walk length |
--num-walks | Number of walks for each vertex (place) |
--word2vec-window-size | Window size used in Word2vec |
--buffer-size | Buffer size used to detect adjacency relationships between places |
-d | Integrate distances between places in the topology graph |
--dist | Distance metric used when '-d' is set |
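For instance, a hypothetical run on the French Geonames extract (FR.txt), with distances integrated in the topology graph (parameter values are illustrative):

python3 geonames_embedding.py FR.txt --vector-size 64 --walk-length 20 --num-walks 10 -d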
Output files
The embedding is saved in gensim word2vec format in the execution directory.
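A minimal sketch for reading the result back with gensim (the filename below is hypothetical; use the file actually written in your execution directory):

```python
# Load the saved place embedding with gensim (gensim >= 4 API).
from gensim.models import KeyedVectors, Word2Vec

# if the output is a word2vec-format export:
vectors = KeyedVectors.load_word2vec_format("geonames_embedding.txt")
# if it was saved with gensim's own model.save() instead:
# vectors = Word2Vec.load("geonames_embedding.model").wv
print(vectors.most_similar("2968815"))  # neighbours of a place, keyed by geonameid (example id)
```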
Embedding: train using concatenation of close places
Prepare required data
- Download the Geonames data used to train the network here
- Download the hierarchy data here
- Unzip both files in the directory of your choice
- Run the script:
train_test_split_geonames.py <geoname_filename>
Train the network
The script combination_embeddings.py is responsible for training the neural network. To train the network with default parameters, use the following command:
python3 combination_embeddings.py -a -i <geoname data filename> <hierarchy geonames data filename>
Available parameters
Parameter | Description |
---|---|
-i,--inclusion | Use inclusion relationships to train the network |
-a,--adjacency | Use adjacency relationships to train the network |
-w,--wikipedia-coo | Use Wikipedia place co-occurrences to train the network |
-n,--ngram-size | N-gram size |
-t,--tolerance-value | K value in the computation of accuracy@k |
-e,--epochs | Number of epochs |
-d,--dimension | Size of the n-gram embeddings |
--admin_code_1 | (Optional) Train the network on a specific region only (Geonames admin1 code) |
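For example, a hypothetical run using both inclusion and adjacency relationships, 4-grams, and 100 epochs (values are illustrative only):

python3 combination_embeddings.py -i -a -n 4 -e 100 <geoname data filename> <hierarchy geonames data filename>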