diff --git a/README.md b/README.md
index 3eee5842dba0dccf55dd541f5655c33ea92186ee..f184ed7a1c7cf6a237b96d9856f07eb92c3069c7 100644
--- a/README.md
+++ b/README.md
@@ -25,112 +25,6 @@ For Anaconda users
 
 <hr>
 
-## First approach : Embedding using places' Wikipedia pages
-
-<div style="text-align:center">
-<img src="documentation/imgs/first_approach.png"/>
-<p>Figure 1 : First approach general workflow</p>
-</div>
-
-In this first approach, the goal is to produce embeddings for place names. In order to do this, we designed a neural network that takes :
-
-* **Input:** Text sequence (phrase)
-* **Output:** Latitude, Longitude, and the place type
-
-Input texts are selected using Wikidata to filter Wikipedia pages about geographic places. Then, the filtered pages are retrieved from the Wikipedia corpus file. For each page, we get :
-
-* Title
-* Introduction text
-* Coordinates of the place (Latitude-Longitude)
-* Place type (using a mapping between Wikidata and DBpedia Place subclasses)
-
-### Step 1: Parse Wikipedia data !
-
-First, download the Wikipedia corpus in the desired language, e.g. *enwiki-latest-pages-articles.xml.bz2*
-
-Then, use the `gensim` parser (doc [here](https://radimrehurek.com/gensim/scripts/segment_wiki.html)). Use the following command :
-
-    python3 -m gensim.scripts.segment_wiki -i -f <wikipedia_dump_file> -o <output>
-
-### Step 2: Select and filter entities from Wikidata
-
-We use Wikidata to identify which Wikipedia pages concern a place. Simply run the following command :
-
-    python3 1_extractDataFromWikidata.py <Wikidata Dump (.gz)> <output_filename>
-
-### Step 3: Extract data from Wikipedia pages
-
-Using the previous output, we extract text data from the selected Wikipedia pages with the following command:
-
-    python3 2_extractLearningDataset.py <wikipedia_filename (output from step 1)> <wikidata_extract (output from step 2)> <output_filename>
-
-### Step 4 : Run embedding extraction
-
-To learn the place embeddings, use the `embeddings_lat_lon_type.py` script.
-
-#### Available Parameters
-
-| Parameter              | Description (default)                 |
-|------------------------|---------------------|
-| --max_sequence_length  | Maximum sequence length (15)          |
-| --embedding_dimension  | Embedding vector size (100)           |
-| --batch_size           | Batch size used in training (100)     |
-| --epochs               | Number of epochs (100)                |
-| -v                     | Display the Keras verbose output      |
-
-#### Output
-
-The different outputs (one for each neural network architecture) are put in the `outputs` directory :
-
-* outputs/Bi-GRU_100dim_20epoch_1000batch__coord.png : **coordinates accuracy plot**
-* outputs/Bi-GRU_100dim_20epoch_1000batch__place_type.png : **place type accuracy plot**
-* outputs/Bi-GRU_100dim_20epoch_1000batch.csv : **training history**
-* outputs/Bi-GRU_100dim_20epoch_1000batch.txt : **embeddings**
-
-<hr>
-
-## 2nd Approach: Geonames place embedding
-
-From this point on, we change our vantage point: our models now rely heavily on spatial/geographical data, in this case a gazetteer. In this second approach, we propose to generate an embedding for places (not for their toponyms) based on their topology.
-
-In order to do that, we use Geonames data to build a topology graph. This graph is generated from the intersections found between the buffers drawn around places.
-
-(image here)
-
-Then, using this topology network, we use node-embedding techniques to generate an embedding for each vertex (place).
-
-<div style="text-align:center">
-<img src="documentation/imgs/second_approach.png"/>
-<p><strong>Figure 2</strong> : Second approach general workflow</p>
-</div>
-
-### Generate the embedding
-
-First, download the Geonames dump : [here](https://download.geonames.org/export/dump/)
-
-*N.B.* We advise you to take only the data from one country ! The topology network can be really dense and large !
-
-    python3 geonames_embedding.py <geonames dump (*.txt)>
-
-### Available Parameters
-
-| Parameter              | Description (default)                                              |
-|------------------------|-------------------------------------------------------------------|
-| --nbcpu                | Number of CPUs used during the learning phase                      |
-| --vector-size          | Embedding size                                                     |
-| --walk-length          | Generated walk length                                              |
-| --num-walks            | Number of walks for each vertex (place)                            |
-| --word2vec-window-size | Window size used in Word2vec                                       |
-| --buffer-size          | Buffer size used to detect adjacency relationships between places  |
-| -d                     | Integrate distances between places in the topology graph           |
-| --dist                 | Distance used if '-d' is set                                       |
-
-### Output files
-
-The embeddings are saved in the Gensim word2vec format in the execution directory.
-
-<hr>
-
 ## Embedding : train using concatenation of close places
 
 <div style="text-align:center">
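The removed "First approach" section describes a network that maps a text sequence to coordinates and a place type through a Bi-GRU encoder. The following is a minimal sketch of such an architecture, assuming TensorFlow/Keras and illustrative vocabulary and place-type counts; it is not the actual `embeddings_lat_lon_type.py` code.

```python
# Minimal sketch of a text -> (coordinates, place type) network,
# in the spirit of the removed "first approach" section.
# Vocabulary size and number of place types are illustrative assumptions.
from tensorflow.keras import layers, Model

VOCAB_SIZE = 50_000        # assumed vocabulary size
MAX_SEQUENCE_LENGTH = 15   # matches the --max_sequence_length default
EMBEDDING_DIM = 100        # matches the --embedding_dimension default
NUM_PLACE_TYPES = 20       # assumed number of place types

tokens = layers.Input(shape=(MAX_SEQUENCE_LENGTH,), dtype="int32", name="tokens")
x = layers.Embedding(VOCAB_SIZE, EMBEDDING_DIM, name="word_embedding")(tokens)
x = layers.Bidirectional(layers.GRU(EMBEDDING_DIM))(x)  # Bi-GRU text encoder

# Two heads: coordinate regression and place-type classification
coords = layers.Dense(2, name="coordinates")(x)  # latitude, longitude
place_type = layers.Dense(NUM_PLACE_TYPES, activation="softmax", name="place_type")(x)

model = Model(inputs=tokens, outputs=[coords, place_type])
model.compile(
    optimizer="adam",
    loss={"coordinates": "mse", "place_type": "categorical_crossentropy"},
)
model.summary()
```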
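The removed "Geonames place embedding" section builds a topology graph from buffer intersections before running node embeddings. Below is a minimal sketch of that adjacency step on toy data, assuming `shapely` and `networkx`; the actual `geonames_embedding.py` implementation may differ.

```python
# Minimal sketch: build an adjacency graph of places from buffer intersections,
# as described for the second approach. The libraries (shapely, networkx),
# the buffer size, and the toy records are assumptions for illustration only.
import networkx as nx
from shapely.geometry import Point

# Toy (place_id, latitude, longitude) records standing in for rows
# parsed from a Geonames dump.
places = [
    ("A", 45.75, 4.85),
    ("B", 45.77, 4.87),
    ("C", 48.85, 2.35),
]

BUFFER_SIZE = 0.05  # in degrees here, purely illustrative

# Draw a buffer around each place
buffers = {pid: Point(lon, lat).buffer(BUFFER_SIZE) for pid, lat, lon in places}

# Two places are adjacent when their buffers intersect
graph = nx.Graph()
graph.add_nodes_from(buffers)
ids = list(buffers)
for i, a in enumerate(ids):
    for b in ids[i + 1:]:
        if buffers[a].intersects(buffers[b]):
            graph.add_edge(a, b)

print(graph.number_of_nodes(), graph.number_of_edges())  # 3 nodes, 1 edge (A-B)

# Random walks over this graph (node2vec-style, cf. --walk-length and
# --num-walks) can then be fed to gensim's Word2Vec to obtain one vector
# per place.
```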