Commit b648cf9e authored by Jacques Fize
Prune the README

<hr>
## First approach: Embedding using places' Wikipedia pages
<div style="text-align:center">
<img src="documentation/imgs/first_approach.png"/>
<p>Figure 1 : First approach general workflow</p>
</div>
In this first approach, the goal is to produce embeddings for place names. To do this, we designed a neural network that takes:

* **Input:** a text sequence (phrase)
* **Output:** latitude, longitude, and the place type
Input texts are selected using Wikidata to filter Wikipedia pages about geographic places. The filtered pages are then retrieved from the Wikipedia corpus file. For each page, we get:

* Title
* Introduction text
* Coordinates of the place (latitude-longitude)
* Place type (using a mapping between Wikidata and DBpedia Place subclasses)
### Step 1: Parse Wikipedia data
First, download the Wikipedia corpus in the desired language, e.g. *enwiki-latest-pages-articles.xml.bz2*.
Then, use the `gensim` parser (doc [here](https://radimrehurek.com/gensim/scripts/segment_wiki.html)) with the following command:

    python3 -m gensim.scripts.segment_wiki -i -f <wikipedia_dump_file> -o <output>
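To get a feel for what step 1 produces, here is a small sketch of reading the parser's output. `segment_wiki` writes one JSON object per line (gzip-compressed), each carrying the page title plus parallel lists of section titles and texts; the sample page below is invented to show the shape.

```python
import gzip
import json

# Build a tiny stand-in for the segment_wiki output file.
sample = {
    "title": "Montpellier",
    "section_titles": ["Introduction", "History"],
    "section_texts": ["Montpellier is a city in southern France ...",
                      "Founded in ..."],
}
with gzip.open("sample_wiki.json.gz", "wt", encoding="utf-8") as f:
    f.write(json.dumps(sample) + "\n")

# Read it back: the introduction is the text of the first section.
with gzip.open("sample_wiki.json.gz", "rt", encoding="utf-8") as f:
    for line in f:
        page = json.loads(line)
        intro = page["section_texts"][0]
        print(page["title"], "->", intro[:25])
```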
### Step 2: Select and filter entities from Wikidata
We use Wikidata to identify which Wikipedia pages concern a place. Simply run the following command:

    python3 1_extractDataFromWikidata.py <Wikidata Dump (.gz)> <output_filename>
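The selection idea can be sketched as follows (a toy illustration, not the script's actual code): in a Wikidata dump, geographic places are the entities that carry a coordinate location, property `P625`.

```python
# Two hand-made entities in (simplified) Wikidata JSON shape:
# one place with a P625 claim, one non-place without.
entities = [
    {"id": "Q6441", "labels": {"en": {"value": "Montpellier"}},
     "claims": {"P625": [{"mainsnak": {}}]}},
    {"id": "Q7187", "labels": {"en": {"value": "Gene"}},
     "claims": {}},
]

# Keep only entities that have a coordinate location (P625).
places = [e for e in entities if "P625" in e.get("claims", {})]
print([e["labels"]["en"]["value"] for e in places])  # → ['Montpellier']
```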
### Step 3: Extract data from Wikipedia pages
Using the previous output, we extract text data from the selected Wikipedia pages with the following command:

    python3 2_extractLearningDataset.py <wikipedia_filename (output from step 1)> <wikidata_extract (output from step 2)> <output_filename>
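The join performed by this step can be sketched like so (field names and data are illustrative, not the script's actual format): pages from step 1 are matched against the place entities selected in step 2, keeping title, introduction, coordinates, and place type.

```python
# Toy inputs: parsed Wikipedia pages and the Wikidata place extract.
wikipedia_pages = {
    "Montpellier": "Montpellier is a city in southern France ...",
    "Gene": "In biology, a gene is ...",
}
wikidata_places = {
    "Montpellier": {"lat": 43.611, "lon": 3.877, "type": "City"},
}

# Keep only pages that Wikidata identified as places.
dataset = [
    {"title": t, "introduction": text, **wikidata_places[t]}
    for t, text in wikipedia_pages.items()
    if t in wikidata_places
]
print(len(dataset))  # → 1
```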
### Step 4: Run embedding extraction
To learn the place embeddings, use the `embeddings_lat_lon_type.py` script.
#### Available Parameters
| Parameter              | Description (default value)           |
|------------------------|---------------------------------------|
| --max_sequence_length  | Maximum sequence length (15)          |
| --embedding_dimension  | Embedding vector size (100)           |
| --batch_size           | Batch size used in training (100)     |
| --epochs               | Number of epochs (100)                |
| -v                     | Display the Keras verbose output      |
#### Output
The different outputs (one for each neural network architecture) are put in the `outputs` directory:
* outputs/Bi-GRU_100dim_20epoch_1000batch__coord.png : **coordinates accuracy plot**
* outputs/Bi-GRU_100dim_20epoch_1000batch__place_type.png : **place type accuracy plot**
* outputs/Bi-GRU_100dim_20epoch_1000batch.csv : **training history**
* outputs/Bi-GRU_100dim_20epoch_1000batch.txt : **embeddings**
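The embeddings file follows the word2vec text format: a header line `"<vocab_size> <dimension>"`, then one line per entry (`token v1 v2 ...`). A minimal sketch of parsing it, on invented data (gensim's `KeyedVectors.load_word2vec_format` does the same job):

```python
# A tiny embeddings file in word2vec text format: 2 tokens, 3 dimensions.
raw = "2 3\nparis 0.1 0.2 0.3\nlyon 0.4 0.5 0.6\n"

lines = raw.strip().split("\n")
vocab_size, dim = map(int, lines[0].split())

# Each remaining line: token followed by its vector components.
embeddings = {}
for line in lines[1:]:
    parts = line.split()
    embeddings[parts[0]] = [float(x) for x in parts[1:]]

print(embeddings["paris"])  # → [0.1, 0.2, 0.3]
```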
<hr>
## Second approach: Geonames place embedding
From this point, we change our vantage point: our models now rely heavily on spatial/geographical data, here a gazetteer. In this second approach, we propose to generate an embedding for places (not place toponyms) based on their topology.
In order to do that, we use Geonames data to build a topology graph. This graph is generated from the intersections found between place buffers.
(image here)
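The buffer-intersection idea can be sketched with circular buffers (a toy illustration; the radius and coordinates below are invented): two places are adjacent when the buffers of radius `r` around their points overlap, i.e. when the distance between their centres is below `2 * r`.

```python
import math

# Toy place coordinates (lat, lon): A and B are close, C is far away.
places = {
    "A": (43.61, 3.87),
    "B": (43.62, 3.88),
    "C": (45.76, 4.83),
}
r = 0.05  # hypothetical buffer radius, in degrees

# Add an edge whenever two buffers intersect.
edges = set()
names = sorted(places)
for i, a in enumerate(names):
    for b in names[i + 1:]:
        if math.dist(places[a], places[b]) < 2 * r:
            edges.add((a, b))

print(edges)  # → {('A', 'B')}
```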
Then, using the topology network, we apply node-embedding techniques to generate an embedding for each vertex (place).
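Node-embedding methods in this family (DeepWalk, node2vec) treat random walks over the graph as "sentences" and feed them to word2vec. A sketch of the walk-generation half on a toy adjacency list (the graph and walk parameters are illustrative):

```python
import random

# Toy topology graph as an adjacency list.
graph = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}

def random_walk(start, length, rng):
    """Walk `length` vertices from `start`, moving to a random neighbour."""
    walk = [start]
    for _ in range(length - 1):
        walk.append(rng.choice(graph[walk[-1]]))
    return walk

rng = random.Random(0)  # seeded for reproducibility
# A few walks per vertex; these sequences would then be fed to word2vec.
walks = [random_walk(v, 5, rng) for v in graph for _ in range(2)]
print(len(walks))  # → 6
```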
<div style="text-align:center">
<img src="documentation/imgs/second_approach.png"/>
<p><strong>Figure 2</strong> : Second approach general workflow</p>
</div>
### Generate the embedding
First, download a Geonames dump: [here](https://download.geonames.org/export/dump/)
*N.B.* We advise you to take only the data from one country: the topology network can get really dense and large!

    python3 geonames_embedding.py <geonames dump (*.txt)>
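The Geonames dump is tab-separated; the fields relevant here are essentially the id, the name, and the coordinates (columns 0, 1, 4 and 5 of the main dump format). A minimal parsing sketch on one illustrative line:

```python
# One (illustrative) line from a Geonames country dump.
line = "2992166\tMontpellier\tMontpellier\t\t43.61093\t3.87635\tP\tPPLA\tFR\n"

fields = line.rstrip("\n").split("\t")
place = {
    "geonameid": int(fields[0]),   # column 0: numeric id
    "name": fields[1],             # column 1: place name
    "lat": float(fields[4]),       # column 4: latitude
    "lon": float(fields[5]),       # column 5: longitude
}
print(place["name"], place["lat"], place["lon"])
```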
### Available Parameters
| Parameter | Description (default) |
|------------------------|-------------------------------------------------------------------|
| --nbcpu                | Number of CPUs used during the learning phase                     |
| --vector-size | Embedding size |
| --walk-length | Generated walk length |
| --num-walks | Number of walks for each vertex (place) |
| --word2vec-window-size | Window-size used in Word2vec |
| --buffer-size | Buffer size used to detect adjacency relationships between places |
| -d | Integrate distances between places in the topology graph |
| --dist                 | Distance measure used if '-d' is set                              |
### Output files
The embedding is saved in Gensim word2vec format in the execution directory.
<hr>
## Embedding: train using concatenation of close places
<div style="text-align:center">