Commit b648cf9e authored by Jacques Fize
Prune the README

<hr>
## First approach: Embedding using places' Wikipedia pages
<div style="text-align:center">
<img src="documentation/imgs/first_approach.png"/>
<p>Figure 1 : First approach general workflow</p>
</div>
In this first approach, the goal is to produce embeddings for place names. To do this, we designed a neural network that takes:

* **Input:** a text sequence (phrase)
* **Output:** latitude, longitude, and the place type
Input texts are selected using Wikidata to filter Wikipedia pages about geographic places. The filtered pages are then retrieved from the Wikipedia corpus file. For each page, we get:

* Title
* Introduction text
* Coordinates of the place (latitude-longitude)
* Place type (using a mapping between Wikidata and DBpedia Place subclasses)
### Step 1: Parse Wikipedia data
First, download the Wikipedia corpus in the desired language, e.g. *enwiki-latest-pages-articles.xml.bz2*.
Then, use the `gensim` parser (doc [here](https://radimrehurek.com/gensim/scripts/segment_wiki.html)) with the following command:

    python3 -m gensim.scripts.segment_wiki -i -f <wikipedia_dump_file> -o <output>
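To get a feel for what step 1 produces, here is a small sketch of reading the parser's output. `segment_wiki` writes one JSON object per line (gzip-compressed), each carrying the page title plus parallel lists of section titles and texts; the sample page below is invented to show the shape.

```python
import gzip
import json

# Build a tiny stand-in for the segment_wiki output file.
sample = {
    "title": "Montpellier",
    "section_titles": ["Introduction", "History"],
    "section_texts": ["Montpellier is a city in southern France ...",
                      "Founded in ..."],
}
with gzip.open("sample_wiki.json.gz", "wt", encoding="utf-8") as f:
    f.write(json.dumps(sample) + "\n")

# Read it back: the introduction is the text of the first section.
with gzip.open("sample_wiki.json.gz", "rt", encoding="utf-8") as f:
    for line in f:
        page = json.loads(line)
        intro = page["section_texts"][0]
        print(page["title"], "->", intro[:25])
```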
### Step 2: Select and filter entities from Wikidata
We use Wikidata to identify which Wikipedia pages concern a place. Simply run the following command:

    python3 1_extractDataFromWikidata.py <Wikidata Dump (.gz)> <output_filename>
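The selection idea can be sketched as follows (a toy illustration, not the script's actual code): in a Wikidata dump, geographic places are the entities that carry a coordinate location, property `P625`.

```python
# Two hand-made entities in (simplified) Wikidata JSON shape:
# one place with a P625 claim, one non-place without.
entities = [
    {"id": "Q6441", "labels": {"en": {"value": "Montpellier"}},
     "claims": {"P625": [{"mainsnak": {}}]}},
    {"id": "Q7187", "labels": {"en": {"value": "Gene"}},
     "claims": {}},
]

# Keep only entities that have a coordinate location (P625).
places = [e for e in entities if "P625" in e.get("claims", {})]
print([e["labels"]["en"]["value"] for e in places])  # → ['Montpellier']
```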
### Step 3: Extract data from Wikipedia pages
Using the previous output, we extract text data from the selected Wikipedia pages with the following command:

    python3 2_extractLearningDataset.py <wikipedia_filename (output from step 1)> <wikidata_extract (output from step 2)> <output_filename>
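The join performed by this step can be sketched like so (field names and data are illustrative, not the script's actual format): pages from step 1 are matched against the place entities selected in step 2, keeping title, introduction, coordinates, and place type.

```python
# Toy inputs: parsed Wikipedia pages and the Wikidata place extract.
wikipedia_pages = {
    "Montpellier": "Montpellier is a city in southern France ...",
    "Gene": "In biology, a gene is ...",
}
wikidata_places = {
    "Montpellier": {"lat": 43.611, "lon": 3.877, "type": "City"},
}

# Keep only pages that Wikidata identified as places.
dataset = [
    {"title": t, "introduction": text, **wikidata_places[t]}
    for t, text in wikipedia_pages.items()
    if t in wikidata_places
]
print(len(dataset))  # → 1
```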
### Step 4: Run embedding extraction
To learn the place embeddings, use the `embeddings_lat_lon_type.py` script.
#### Available Parameters
| Parameter              | Description (default value)           |
|------------------------|---------------------------------------|
| --max_sequence_length  | Maximum sequence length (15)          |
| --embedding_dimension  | Embedding vector size (100)           |
| --batch_size           | Batch size used in training (100)     |
| --epochs               | Number of epochs (100)                |
| -v                     | Display the Keras verbose output      |
#### Output
The different outputs (one for each neural network architecture) are put in the `outputs` directory:
* outputs/Bi-GRU_100dim_20epoch_1000batch__coord.png : **coordinates accuracy plot**
* outputs/Bi-GRU_100dim_20epoch_1000batch__place_type.png : **place type accuracy plot**
* outputs/Bi-GRU_100dim_20epoch_1000batch.csv : **training history**
* outputs/Bi-GRU_100dim_20epoch_1000batch.txt : **embeddings**
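The embeddings file follows the word2vec text format: a header line `"<vocab_size> <dimension>"`, then one line per entry (`token v1 v2 ...`). A minimal sketch of parsing it, on invented data (gensim's `KeyedVectors.load_word2vec_format` does the same job):

```python
# A tiny embeddings file in word2vec text format: 2 tokens, 3 dimensions.
raw = "2 3\nparis 0.1 0.2 0.3\nlyon 0.4 0.5 0.6\n"

lines = raw.strip().split("\n")
vocab_size, dim = map(int, lines[0].split())

# Each remaining line: token followed by its vector components.
embeddings = {}
for line in lines[1:]:
    parts = line.split()
    embeddings[parts[0]] = [float(x) for x in parts[1:]]

print(embeddings["paris"])  # → [0.1, 0.2, 0.3]
```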
<hr>
## Second approach: Geonames place embedding
From this point, we change our vantage point: our models now rely heavily on spatial/geographical data, here a gazetteer. In this second approach, we propose to generate an embedding for places (not place toponyms) based on their topology.
In order to do that, we use Geonames data to build a topology graph. This graph is generated from the intersections found between place buffers.
(image here)
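The buffer-intersection idea can be sketched with circular buffers (a toy illustration; the radius and coordinates below are invented): two places are adjacent when the buffers of radius `r` around their points overlap, i.e. when the distance between their centres is below `2 * r`.

```python
import math

# Toy place coordinates (lat, lon): A and B are close, C is far away.
places = {
    "A": (43.61, 3.87),
    "B": (43.62, 3.88),
    "C": (45.76, 4.83),
}
r = 0.05  # hypothetical buffer radius, in degrees

# Add an edge whenever two buffers intersect.
edges = set()
names = sorted(places)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        if math.dist(places[a], places[b]) < 2 * r:
            edges.add((a, b))

print(edges)  # → {('A', 'B')}
```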
Then, using the topology network, we apply node-embedding techniques to generate an embedding for each vertex (place).
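Node-embedding methods in this family (DeepWalk, node2vec) treat random walks over the graph as "sentences" and feed them to word2vec. A sketch of the walk-generation half on a toy adjacency list (the graph and walk parameters are illustrative):

```python
import random

# Toy topology graph as an adjacency list.
graph = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}

def random_walk(start, length, rng):
    """Walk `length` vertices from `start`, moving to a random neighbour."""
    walk = [start]
    for _ in range(length - 1):
        walk.append(rng.choice(graph[walk[-1]]))
    return walk

rng = random.Random(0)  # seeded for reproducibility
# A few walks per vertex; these sequences would then be fed to word2vec.
walks = [random_walk(v, 5, rng) for v in graph for _ in range(2)]
print(len(walks))  # → 6
```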
<div style="text-align:center">
<img src="documentation/imgs/second_approach.png"/>
<p><strong>Figure 2</strong> : Second approach general workflow</p>
</div>
### Generate the embedding
First, download a Geonames dump: [here](https://download.geonames.org/export/dump/)
*N.B.* We advise you to take only the data from one country: the topology network can get really dense and large!

    python3 geonames_embedding.py <geonames dump (*.txt)>
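The Geonames dump is tab-separated; the fields relevant here are essentially the id, the name, and the coordinates (columns 0, 1, 4 and 5 of the main dump format). A minimal parsing sketch on one illustrative line:

```python
# One (illustrative) line from a Geonames country dump.
line = "2992166\tMontpellier\tMontpellier\t\t43.61093\t3.87635\tP\tPPLA\tFR\n"

fields = line.rstrip("\n").split("\t")
place = {
    "geonameid": int(fields[0]),   # column 0: numeric id
    "name": fields[1],             # column 1: place name
    "lat": float(fields[4]),       # column 4: latitude
    "lon": float(fields[5]),       # column 5: longitude
}
print(place["name"], place["lat"], place["lon"])
```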
### Available Parameters
| Parameter | Description (default) |
|------------------------|-------------------------------------------------------------------|
| --nbcpu                | Number of CPUs used during the learning phase                     |
| --vector-size | Embedding size |
| --walk-length | Generated walk length |
| --num-walks | Number of walks for each vertex (place) |
| --word2vec-window-size | Window-size used in Word2vec |
| --buffer-size | Buffer size used to detect adjacency relationships between places |
| -d | Integrate distances between places in the topology graph |
| --dist                 | Distance measure used if '-d' is set                              |
### Output files
The embedding is saved in Gensim word2vec format in the execution directory.
<hr>
## Embedding: train using concatenation of close places
<div style="text-align:center">