Fize Jacques authored
c6cd0e18

Requirements

  • Python 3.6+
  • OS-independent (all dependencies work on Windows!)

Install dependencies

pip3 install -r requirements.txt

Running the different approaches

Embeddings using places' Wikipedia pages

Three scripts are used:

  • 1_extractDataFromWikidata.py
  • 2_extractLearningDataset.py
  • 4_embeddings_lat_lon_type.py

Step 1: Parse the Wikipedia data

First, download a Wikipedia dump in the desired language, e.g. enwiki-latest-pages-articles.xml.bz2

Then, parse it with gensim's segment_wiki script (see the gensim documentation), using the following command:

python3 -m gensim.scripts.segment_wiki -i -f <wikipedia_dump_file> -o <output>

Step 2: Select and filter entities from Wikidata

We use Wikidata to identify which Wikipedia pages concern a place. Simply run the following command:

python3 1_extractDataFromWikidata.py <Wikidata Dump (.gz)> <output_filename>
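The exact selection criterion used by 1_extractDataFromWikidata.py is not shown here. As an illustration of the idea, a Wikidata entity describing a place typically carries a coordinate-location claim (property P625); the helper below (our own sketch, not the script's code) flags such entities in a Wikidata JSON dump:

```python
import json

def is_place(entity_line):
    """Illustrative heuristic: treat a Wikidata entity as a place if it has a
    coordinate-location claim (property P625).

    `entity_line` is one JSON-encoded entity from a Wikidata dump. The real
    script may use a different or stricter criterion.
    """
    entity = json.loads(entity_line)
    return "P625" in entity.get("claims", {})
```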

Step 3: Extract data from Wikipedia pages

Using the previous outputs, we extract the text of the selected Wikipedia pages with the following command:

python3 2_extractLearningDataset.py <wikipedia_filename (output from step 1)> <wikidata_extract(output from step2)> <output_filename>

Step 4: Run the embedding extraction

To learn the place embeddings, use the 4_embeddings_lat_lon_type.py script.

Available Parameters

Parameter              Description (default)
--max_sequence_length  Maximum sequence length (15)
--embedding_dimension  Embedding vector size (100)
--batch_size           Batch size used during training (100)
--epochs               Number of epochs (100)
-v                     Display Keras verbose output
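The --max_sequence_length parameter implies that input token sequences are truncated or padded to a fixed length before training. A minimal sketch of that preprocessing step (the helper pad_sequence and the pad token are our assumptions, not the script's actual code):

```python
def pad_sequence(tokens, max_len, pad_token="<pad>"):
    """Truncate or right-pad a token sequence to exactly max_len items,
    mirroring the role of --max_sequence_length (illustrative only)."""
    # Truncate if too long, then append pad tokens if too short.
    return tokens[:max_len] + [pad_token] * max(0, max_len - len(tokens))
```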

Output

The different outputs (one for each neural network architecture) are written to the outputs directory:

  • outputs/Bi-GRU_100dim_20epoch_1000batch__coord.png : coordinates accuracy plot
  • outputs/Bi-GRU_100dim_20epoch_1000batch__place_type.png : place type accuracy plot
  • outputs/Bi-GRU_100dim_20epoch_1000batch.csv : training history
  • outputs/Bi-GRU_100dim_20epoch_1000batch.txt : embeddings

Geonames place embedding

First, download a Geonames dump from https://download.geonames.org/export/dump/

N.B. We advise you to use the data from a single country only (building the adjacency graph requires a lot of RAM).

python3 geonames_embedding.py <geonames dump(*.txt)>

Available Parameters

Parameter               Description
--nbcpu                 Number of CPUs used during the embedding learning phase
--vector-size           Embedding size
--walk-length           Length of each generated walk
--num-walks             Number of walks for each vertex (place)
--word2vec-window-size  Window size used in word2vec
--buffer-size           Buffer size used to detect adjacency relationships between places
-d                      Integrate distances between places in the topology graph
--dist                  Distance measure used if '-d' is set
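The --num-walks and --walk-length parameters suggest a random-walk-based graph embedding over the adjacency graph (node2vec/DeepWalk style): walks over the place graph are generated and then fed to word2vec as sentences. A minimal sketch with uniform walks (the actual script may use biased node2vec transitions):

```python
import random

def generate_walks(adjacency, num_walks, walk_length, seed=0):
    """Generate uniform random walks over an adjacency dict
    {node: [neighbours]}, mirroring the roles of --num-walks (walks started
    per vertex) and --walk-length (nodes per walk). Illustrative only."""
    rng = random.Random(seed)
    walks = []
    for _ in range(num_walks):
        for start in adjacency:
            walk = [start]
            while len(walk) < walk_length:
                neighbours = adjacency[walk[-1]]
                if not neighbours:
                    break  # Dead end: stop this walk early.
                walk.append(rng.choice(neighbours))
            walks.append(walk)
    return walks
```

Each walk is then treated as a "sentence" of place identifiers for word2vec training, so --word2vec-window-size applies to positions within a walk.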

Output

The embeddings are saved in gensim word2vec format in the execution directory.

Embedding: training using the concatenation of close places

Toponym Combination

positional arguments:
geoname_input         Filepath of the Geonames file you want to use.
geoname_hierachy_input
                        Filepath of the Geonames hierarchy file you want to use.

optional arguments:
-h, --help            show this help message and exit
-v, --verbose
-n NGRAM_SIZE, --ngram-size NGRAM_SIZE
-t TOLERANCE_VALUE, --tolerance-value TOLERANCE_VALUE
-e EPOCHS, --epochs EPOCHS
-m {CNN,LSTM}, --model {CNN,LSTM}
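The --ngram-size option implies that toponyms are decomposed into character n-grams before being fed to the CNN or LSTM model. A sketch of that decomposition (the helper char_ngrams is our illustration; the script's exact tokenisation may differ, e.g. with boundary padding):

```python
def char_ngrams(toponym, n):
    """Return the list of character n-grams of a place name, as controlled
    by --ngram-size. Illustrative only: no boundary padding is applied."""
    return [toponym[i:i + n] for i in range(len(toponym) - n + 1)]
```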