    Work on Place-embedding

    This repo contains various approaches to geographic place embedding, and more precisely to its use for geocoding. So far, we have designed three approaches:

    • Use geographic places' Wikipedia pages to learn an embedding for toponyms
    • Use the Geonames place topology to produce an embedding using graph-embedding techniques
    • Use toponym co-occurrence (concatenation of close places) based on spatial relationships (inclusion, adjacency) for geocoding

    Environment setup

    • Python 3.6+
    • OS independent (all dependencies work on Windows!)

    It is strongly advised to use Anaconda in a Windows environment!

    Install dependencies

    pip3 install -r requirements.txt

    For Anaconda users

    while read requirement; do conda install --yes $requirement; done < requirements.txt

    First approach: embedding using places' Wikipedia pages

    Figure 1: First approach general workflow

    In this first approach, the goal is to produce an embedding for place names. To do this, we designed a neural network that takes:

    • Input: text sequence (phrase)
    • Output: latitude, longitude, and the place type

    Input texts are selected using Wikidata to filter Wikipedia pages about geographic places. The filtered pages are then retrieved from the Wikipedia corpus file. For each page, we extract the following (a hypothetical record is sketched after the list):

    • Title
    • Introduction text
    • Coordinates of the place (latitude, longitude)
    • Place type (using a mapping between Wikidata and DBpedia Place subclasses)
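
    For illustration, one extracted record could look like the sketch below. The field names and storage format are illustrative only, not the exact output of 2_extractLearningDataset.py:

    # Hypothetical example of one extracted record (illustrative field names only)
    record = {
        "title": "Lyon",
        "introduction": "Lyon is the third-largest city of France...",
        "latitude": 45.76,
        "longitude": 4.84,
        "place_type": "City",  # Wikidata type mapped to a DBpedia Place subclass
    }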

    Step 1: Parse Wikipedia data

    First, download the Wikipedia corpus in the desired language, e.g. enwiki-latest-pages-articles.xml.bz2

    Then, use the gensim parser (doc here) with the following command:

    python3 -m gensim.scripts.segment_wiki -i -f <wikipedia_dump_file> -o <output>
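
    The parser writes one article per line as JSON. A minimal sketch for inspecting the result, assuming the <output> path above is a gzipped file named enwiki-latest.json.gz:

    # Sketch: read the first article produced by gensim's segment_wiki
    import gzip
    import json

    with gzip.open("enwiki-latest.json.gz", "rt", encoding="utf-8") as f:  # the <output> file
        article = json.loads(next(f))

    print(article["title"])
    print(article["section_titles"][:3])
    print(article["section_texts"][0][:200])  # start of the introduction section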

    Step 2: Select and filter entities from Wikidata

    We use Wikidata to identify which Wikipedia pages concern a place. Simply run the following command:

    python3 1_extractDataFromWikidata.py <Wikidata Dump (.gz)> <output_filename>
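
    In the Wikidata JSON dump, geographic places can be recognised by the presence of a coordinate-location claim (property P625). The sketch below shows that general idea only; it is not the actual logic of 1_extractDataFromWikidata.py, and the dump filename is a placeholder:

    # Sketch: keep Wikidata entities that have coordinates (P625), i.e. places,
    # and print the title of their English Wikipedia page if any.
    import gzip
    import json

    with gzip.open("wikidata-latest-all.json.gz", "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip().rstrip(",")
            if not line or line in ("[", "]"):
                continue  # skip the array brackets of the dump
            entity = json.loads(line)
            if "P625" in entity.get("claims", {}):  # coordinate location -> geographic place
                sitelink = entity.get("sitelinks", {}).get("enwiki")
                if sitelink:
                    print(entity["id"], sitelink["title"])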

    Step 3: Extract data from Wikipedia pages

    Using the previous outputs, we extract text data from the selected Wikipedia pages with the following command:

    python3 2_extractLearningDataset.py <wikipedia_filename (output from step 1)> <wikidata_extract (output from step 2)> <output_filename>

    Step 4: Run the embedding extraction

    To learn the place embeddings, use the embeddings_lat_lon_type.py script (an example invocation is shown after the parameter table below).

    Available Parameters

    Parameter                Description (default value)
    --max_sequence_length    Maximum sequence length (15)
    --embedding_dimension    Embedding vector size (100)
    --batch_size             Batch size used during training (100)
    --epochs                 Number of epochs (100)
    -v                       Display the Keras verbose output
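
    For example, to match the configuration of the output files shown below (100-dimensional embeddings, 20 epochs, batch size 1000); how the input file from Step 3 is passed is not documented here, so check the script's help for that part:

    python3 embeddings_lat_lon_type.py --max_sequence_length 15 --embedding_dimension 100 --batch_size 1000 --epochs 20 -v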

    Output

    The different outputs (one for each neural network architecture) are written to the outputs directory:

    • outputs/Bi-GRU_100dim_20epoch_1000batch__coord.png : coordinates accuracy plot
    • outputs/Bi-GRU_100dim_20epoch_1000batch__place_type.png : place type accuracy plot
    • outputs/Bi-GRU_100dim_20epoch_1000batch.csv : training history
    • outputs/Bi-GRU_100dim_20epoch_1000batch.txt : embeddings

    Second approach: Geonames place embedding

    From this point on, we change our vantage point and build our models primarily on spatial/geographical data, in this case a gazetteer. In this second approach, we propose to generate an embedding for places (not place toponyms) based on their topology.

    To do that, we use Geonames data to build a topology graph. This graph is generated from the intersections found between place buffers.

    (image here)
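
    The sketch below illustrates this buffer-intersection idea with shapely and networkx; it is a simplified illustration with toy coordinates, not the exact procedure implemented in geonames_embedding.py:

    # Sketch: link two places when their buffered point geometries intersect
    import networkx as nx
    from shapely.geometry import Point

    places = {
        "Lyon": (4.84, 45.76),          # (longitude, latitude), toy values
        "Villeurbanne": (4.88, 45.77),
        "Paris": (2.35, 48.86),
    }
    buffer_size = 0.05  # in degrees here; exposed as --buffer-size in the script

    buffers = {name: Point(lon, lat).buffer(buffer_size)
               for name, (lon, lat) in places.items()}

    graph = nx.Graph()
    graph.add_nodes_from(places)
    names = list(places)
    for i, a in enumerate(names):
        for b in names[i + 1:]:
            if buffers[a].intersects(buffers[b]):  # adjacency relationship
                graph.add_edge(a, b)

    print(list(graph.edges()))  # [('Lyon', 'Villeurbanne')]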

    Then, using this topology network, we apply node-embedding techniques to generate an embedding for each vertex (place); a minimal sketch of this idea is given after the parameter list below.

    Figure 2: Second approach general workflow

    Generate the embedding

    First, download the Geonames dump: here

    N.B. We advise you to take data from only one country! The topology network can become really dense and large!

    python3 geonames_embedding.py <geonames dump(*.txt)>

    Available Parameters

    Parameter                 Description
    --nbcpu                   Number of CPUs used during the learning phase
    --vector-size             Embedding size
    --walk-length             Length of the generated walks
    --num-walks               Number of walks per vertex (place)
    --word2vec-window-size    Window size used in Word2Vec
    --buffer-size             Buffer size used to detect adjacency relationships between places
    -d                        Integrate distances between places in the topology graph
    --dist                    Distance used if '-d'
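
    The parameters above correspond to a DeepWalk/node2vec-style pipeline: random walks over the topology graph are treated as sentences and fed to Word2Vec. A minimal sketch of that idea on a toy graph (assuming gensim >= 4.0; not the repository's exact implementation):

    # Sketch: DeepWalk-style node embedding on a toy graph
    import random
    import networkx as nx
    from gensim.models import Word2Vec

    def random_walk(graph, start, walk_length):
        walk = [start]
        while len(walk) < walk_length:
            neighbors = list(graph.neighbors(walk[-1]))
            if not neighbors:
                break
            walk.append(random.choice(neighbors))
        return [str(node) for node in walk]

    graph = nx.karate_club_graph()  # stand-in for the Geonames topology graph
    walks = [random_walk(graph, node, walk_length=10)   # --walk-length
             for node in graph.nodes()
             for _ in range(20)]                         # --num-walks per vertex

    model = Word2Vec(walks, vector_size=64, window=5, min_count=0, workers=4)
    print(model.wv["0"][:5])  # embedding of vertex 0 (a place)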

    Output files

    The embeddings are saved in Gensim word2vec format in the execution directory.
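
    The saved embeddings can then be reloaded with gensim (>= 4.0 assumed). The filename below is a placeholder for whatever file your run produced, and the keys depend on how the script names the vertices; if the file is a native gensim model rather than the text word2vec format, use Word2Vec.load instead:

    # Sketch: reload the saved embeddings
    from gensim.models import KeyedVectors

    vectors = KeyedVectors.load_word2vec_format("geonames_embedding.txt")  # placeholder name
    some_place = vectors.index_to_key[0]  # one of the embedded places
    print(vectors.most_similar(some_place, topn=5))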


    Third approach: embedding trained using the concatenation of close places

    Figure 3: Third approach general workflow

    Prepare required data

    • Download the Geonames data used to train the network here
    • Download the hierarchy data here
    • Unzip both files in the directory of your choice
    • Run the script train_test_split_geonames.py <geoname_filename>

    Train the network

    The script combination_embeddings.py is responsible for training the neural network.

    To train the network with default parameters, use the following command (a fuller example is given after the parameter list below):

    python3 combination_embeddings.py -a -i <geoname data filename> <hierarchy geonames data filename>

    Available parameters

    Parameter               Description
    -i, --inclusion         Use inclusion relationships to train the network
    -a, --adjacency         Use adjacency relationships to train the network
    -w, --wikipedia-coo     Use Wikipedia place co-occurrences to train the network
    -n, --ngram-size        N-gram size
    -t, --tolerance-value   K value in the computation of accuracy@k
    -e, --epochs            Number of epochs
    -d, --dimension         Size of the n-gram embeddings
    --admin_code_1          (Optional) Train the network on a specific region only
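
    For instance, to train on both inclusion and adjacency relationships with 4-grams over 100 epochs (the values are illustrative; only the parameters listed above are used):

    python3 combination_embeddings.py -i -a -n 4 -e 100 <geoname data filename> <hierarchy geonames data filename>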