Requirements
- Python 3.6+
- OS independent (all dependencies work on Windows!)
Install dependencies
pip3 install -r requirements.txt
Running the different approaches
Embedding using places' Wikipedia pages
Three scripts are used:
- 1_extractDataFromWikidata.py
- 2_extractLearningDataset.py
- 4_embeddings_lat_lon_type.py
Step 1: Parse the Wikipedia data
First, download the Wikipedia corpus in the desired language, e.g. enwiki-latest-pages-articles.xml.bz2.
Then, use the gensim parser (see the gensim documentation) with the following command:
python3 -m gensim.scripts.segment_wiki -i -f <wikipedia_dump_file> -o <output>
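The parser writes one article per line as a JSON object with "title", "section_titles" and "section_texts" keys. A minimal sketch of reading that output (the sample article below is made up for illustration):

```python
import json

def iter_articles(lines):
    """Yield (title, text) pairs from segment_wiki JSON-lines output."""
    for line in lines:
        article = json.loads(line)
        # Each article stores its sections as parallel title/text lists
        yield article["title"], " ".join(article["section_texts"])

# Hypothetical sample line, mimicking the parser's output format
sample = json.dumps({
    "title": "Paris",
    "section_titles": ["Introduction"],
    "section_texts": ["Paris is the capital of France."],
})

for title, text in iter_articles([sample]):
    print(title, "->", text)
```

In practice the output file is gzip-compressed JSON lines, so you would wrap this in `gzip.open(path, "rt")`.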
Step 2: Select and filter entities from Wikidata
We use Wikidata to identify which Wikipedia pages describe places. Simply run the following command:
python3 1_extractDataFromWikidata.py <Wikidata Dump (.gz)> <output_filename>
Step 3: Extract data from Wikipedia pages
Using the previous outputs, we extract the text data from the selected Wikipedia pages with the following command:
python3 2_extractLearningDataset.py <wikipedia_filename (output from step 1)> <wikidata_extract(output from step2)> <output_filename>
Step 4: Run the embedding extraction
To extract the place embeddings, use the 4_embeddings_lat_lon_type.py script.
Available Parameters
Parameter | Description (default)
---|---
--max_sequence_length | Maximum sequence length (15)
--embedding_dimension | Embedding vector size (100)
--batch_size | Batch size used during training (100)
--epochs | Number of epochs (100)
-v | Display the Keras verbose output
Output
The different outputs (one for each neural network architecture) are put in the outputs directory:
- outputs/Bi-GRU_100dim_20epoch_1000batch__coord.png : coordinates accuracy plot
- outputs/Bi-GRU_100dim_20epoch_1000batch__place_type.png : place type accuracy plot
- outputs/Bi-GRU_100dim_20epoch_1000batch.csv : training history
- outputs/Bi-GRU_100dim_20epoch_1000batch.txt : embeddings
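The base name of each output file encodes the architecture and hyperparameters. A small sketch of the naming pattern inferred from the examples above (the helper function is hypothetical, not part of the repository):

```python
def output_basename(architecture, dim, epochs, batch_size):
    # Naming pattern inferred from the example filenames above
    return f"{architecture}_{dim}dim_{epochs}epoch_{batch_size}batch"

base = output_basename("Bi-GRU", 100, 20, 1000)
print(base + "__coord.png")  # coordinates accuracy plot
print(base + ".csv")         # training history
```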
Geonames place embedding
First, download the Geonames dump from https://download.geonames.org/export/dump/
N.B. We advise you to use the data from only one country (the adjacency graph needs a lot of RAM).
python3 geonames_embedding.py <geonames dump(*.txt)>
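A Geonames dump is a tab-separated file whose 19 columns (geonameid, name, coordinates, feature codes, etc.) are documented in the readme on the download page. A minimal stdlib sketch of reading one record, independent of the repository's code (the sample row is abridged and hypothetical):

```python
GEONAMES_FIELDS = [
    "geonameid", "name", "asciiname", "alternatenames",
    "latitude", "longitude", "feature_class", "feature_code",
    "country_code", "cc2", "admin1", "admin2", "admin3", "admin4",
    "population", "elevation", "dem", "timezone", "modification_date",
]

def parse_geonames_line(line):
    """Turn one tab-separated Geonames row into a dict with numeric coordinates."""
    record = dict(zip(GEONAMES_FIELDS, line.rstrip("\n").split("\t")))
    record["latitude"] = float(record["latitude"])
    record["longitude"] = float(record["longitude"])
    return record

# Hypothetical sample row (some fields left empty for brevity)
sample = "\t".join(["2988507", "Paris", "Paris", "", "48.85341", "2.3488",
                    "P", "PPLC", "FR", "", "11", "75", "751", "75056",
                    "2138551", "", "42", "Europe/Paris", "2022-09-14"])
place = parse_geonames_line(sample)
print(place["name"], place["latitude"], place["longitude"])
```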
Available Parameters
Parameter | Description
---|---
--nbcpu | Number of CPUs used for the embedding learning phase
--vector-size | Embedding vector size
--walk-length | Length of the generated walks
--num-walks | Number of walks for each vertex (place)
--word2vec-window-size | Window size used in Word2vec
--buffer-size | Buffer size used to detect adjacency relationships between places
-d | Integrate distances between places into the topology graph
--dist | Distance measure used if '-d' is given
Output
The embedding is saved in gensim word2vec format in the execution directory.
Embedding: training using the concatenation of close places
Toponym Combination
positional arguments:
  geoname_input         Filepath of the Geonames file you want to use.
  geoname_hierachy_input
                        Filepath of the Geonames hierarchy file you want to use.

optional arguments:
  -h, --help            show this help message and exit
  -v, --verbose
  -n NGRAM_SIZE, --ngram-size NGRAM_SIZE
  -t TOLERANCE_VALUE, --tolerance-value TOLERANCE_VALUE
  -e EPOCHS, --epochs EPOCHS
  -m {CNN,LSTM}, --model {CNN,LSTM}
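The `-n/--ngram-size` option suggests that place names are decomposed into character n-grams before training. A minimal illustrative sketch of such a decomposition (an assumption about the script's behaviour, not its actual code):

```python
def char_ngrams(toponym, n=3):
    """Split a toponym into overlapping character n-grams."""
    return [toponym[i:i + n] for i in range(len(toponym) - n + 1)]

print(char_ngrams("Paris"))  # ['Par', 'ari', 'ris']
```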