Commit 5a46b9f9 authored by Jacques Fize

CHANGE README + DEBUG and CLEANING

parent c1530d9e
*.png filter=lfs diff=lfs merge=lfs -text
# Toponym Geocoding
Use of ngram representation and co-location of toponyms in geography and text for geocoding.
<div style="text-align:center">
<img src="documentation/imgs/LSTM_arch.png"/>
<p><strong>Figure 1</strong>: General workflow</p>
</div>
This repository also contains various approaches around geographic place embedding and, more precisely, its use for geocoding. So far, three approaches have been designed:
* Use of the Wikipedia pages of geographic places to learn an embedding for toponyms
* Use of the Geonames place topology to produce an embedding using graph-embedding techniques
* Use of toponym co-location based on spatial relationships (inclusion, adjacency) for geocoding
<hr>
## Setup environment
- Python 3.6+
- OS independent (all dependencies should work on Windows!)

It is strongly advised to use Anaconda in a Windows environment!
...@@ -23,24 +25,32 @@ For Anaconda users
while read requirement; do conda install --yes $requirement; done < requirements.txt
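To quickly check that the environment is usable, the sketch below (not part of the repository) tries to import the main libraries visible in the code of this commit; the deep-learning layers suggest Keras, so adjust the list if your install provides them through `tensorflow.keras`. The authoritative dependency list remains `requirements.txt`.

```python
# Minimal environment sanity check (a sketch, not part of the repository).
# Only modules visible in this commit's code are listed; see requirements.txt
# for the complete dependency list.
import importlib

for module_name in ["pandas", "gensim", "keras", "tqdm", "joblib"]:
    try:
        importlib.import_module(module_name)
        print("{0}: OK".format(module_name))
    except ImportError as error:
        print("{0}: MISSING ({1})".format(module_name, error))
```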
<hr>
## Embedding: training using the concatenation of close places
<div style="text-align:center">
<img src="documentation/imgs/third_approach.png"/>
<p><strong>Figure 3</strong>: Third approach general workflow</p>
</div>
<hr>
## Prepare required data
### Geonames data
* download the Geonames data used to train the network [here](download.geonames.org/export/dump/)
* download the hierarchy data [here](http://download.geonames.org/export/dump/hierarchy.zip)
* unzip both files in the directory of your choice
* run the script `train_test_split_geonames.py <geoname_filename>` (a quick sketch for inspecting the Geonames dump is given below)
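As a sanity check on the downloaded dump, the sketch below (not part of the repository) loads a Geonames file with pandas. The column names are an assumption taken from the standard Geonames dump layout, and the file path is a placeholder; the training script only relies on `geonameid`, `name`, `latitude` and `longitude`.

```python
# Sketch: inspect a Geonames dump (e.g. FR.txt). Column names are assumed
# from the official Geonames dump layout; the path is a placeholder.
import pandas as pd

GEONAMES_COLUMNS = [
    "geonameid", "name", "asciiname", "alternatenames", "latitude", "longitude",
    "feature_class", "feature_code", "country_code", "cc2", "admin1_code",
    "admin2_code", "admin3_code", "admin4_code", "population", "elevation",
    "dem", "timezone", "modification_date",
]

geonames = pd.read_csv("FR.txt", sep="\t", names=GEONAMES_COLUMNS, header=None)
print(geonames[["geonameid", "name", "latitude", "longitude"]].head())
```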
### Cooccurrence data
* First, download the Wikipedia corpus from which you want to extract co-occurrences: [English Wikipedia Corpus](https://dumps.wikimedia.org/enwiki/20200201/enwiki-20200201-pages-articles.xml.bz2)
* Parse the corpus with the Gensim script using the following command: `python3 -m gensim.scripts.segment_wiki -i -f <wikicorpus> -o <1stoutputname>.json.gz`
* Build a page of interest file that contains a list of Wikipedia pages. The file must be a CSV with the following columns: title, latitude, longitude.<br> You can find [here](https://projet.liris.cnrs.fr/hextgeo/files/place_en_fr_page_clean.csv) a page of interest file that contains places that appear in both the French and English Wikipedia.
* Then, using the page of interest file, run the command: `python3 script/get_cooccurrence.py <page_of_interest_file> <2ndoutputname> -c <1stoutputname>.json.gz`
* Finally, split the resulting dataset with the script `train_test_split_cooccurrence_data.py <2ndoutputname>` (a read-back sketch for the resulting file follows this list)
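For reference, `script/get_cooccurrence.py` writes a tab-separated file with the header `title`, `interlinks`, `longitude`, `latitude`, where `interlinks` is a `|`-joined list of co-occurring page titles. Below is a minimal read-back sketch; the file name is a placeholder.

```python
# Sketch: read the co-occurrence file produced by script/get_cooccurrence.py.
# The format (TSV with a "|"-joined interlinks column) comes from the script
# itself; the file name below is a placeholder.
import pandas as pd

cooc = pd.read_csv("cooccurrence_EN.txt", sep="\t")
cooc["interlinks"] = cooc["interlinks"].str.split("|")
print(cooc[["title", "longitude", "latitude"]].head())
print(cooc["interlinks"].iloc[0][:5])  # a few pages co-occurring with the first place
```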
### If you're in a hurry
The French Geonames data, the French Wikipedia co-occurrence data, and their train/test splits can be found here: [https://projet.liris.cnrs.fr/hextgeo/files/](https://projet.liris.cnrs.fr/hextgeo/files/)
<hr>
## Train the network
The script `combination_embeddings.py` is responsible for training the neural network (an illustrative invocation sketch follows the parameter table below).
...@@ -51,13 +61,16 @@ To train the network with default parameter use the following command :
### Available parameters
| Parameter             | Description                                                                             |
|-----------------------|-----------------------------------------------------------------------------------------|
| -i,--inclusion        | Use inclusion relationships to train the network                                        |
| -a,--adjacency        | Use adjacency relationships to train the network                                        |
| -w,--wikipedia-cooc   | Use Wikipedia place co-occurrences to train the network                                 |
| --wikipedia-cooc-fn   | File that contains the co-occurrence data                                               |
| --cooc-sample-size    | Number of co-occurrence relations selected for each location in the co-occurrence data  |
| --adjacency-iteration | Number of iterations in the adjacency extraction process                                |
| -n,--ngram-size       | ngram size                                                                              |
| -t,--tolerance-value  | K value in the computation of the accuracy@k                                            |
| -e,--epochs           | number of epochs                                                                        |
| -d,--dimension        | size of the ngram embeddings                                                            |
| --admin_code_1        | (Optional) Train the network on a specific region only                                  |
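As an illustration only (the README's full default command is not visible in this diff), the sketch below assembles a training run from the flags above and launches it with `subprocess`. The two positional arguments (Geonames file, then hierarchy file) are assumed from `args.geoname_input` and `args.geoname_hierachy_input` in `combination_embeddings.py`, and every file path and flag value is a placeholder.

```python
# Hypothetical invocation sketch: flags come from the table above; positional
# arguments and file paths are assumptions/placeholders, not repository defaults.
import subprocess

cmd = [
    "python3", "combination_embeddings.py",
    "-i",                                          # use inclusion relationships
    "-w",                                          # use Wikipedia co-occurrences
    "--wikipedia-cooc-fn", "cooccurrence_FR.txt",  # placeholder co-occurrence file
    "-n", "4",                                     # ngram size
    "-e", "100",                                   # number of epochs
    "FR.txt",                                      # assumed positional arg: Geonames file
    "hierarchy.txt",                               # assumed positional arg: hierarchy file
]
subprocess.run(cmd, check=True)
```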
...@@ -80,18 +80,20 @@ EPOCHS = args.epochs
ITER_ADJACENCY = args.adjacency_iteration
COOC_SAMPLING_NUMBER = args.cooc_sample_size
WORDVEC_ITER = args.ngram_word2vec_iter
EMBEDDING_DIM = 256
#################################################
########## FILENAME VARIABLE ####################
#################################################
GEONAME_FN = args.geoname_input
DATASET_NAME = args.geoname_input.split("/")[-1]
GEONAMES_HIERARCHY_FN = args.geoname_hierachy_input
REGION_SUFFIX_FN = "" if args.admin_code_1 == "None" else "_" + args.admin_code_1
ADJACENCY_REL_FILENAME = "{0}_{1}{2}adjacency.json".format(
    GEONAME_FN,
    ITER_ADJACENCY,
    REGION_SUFFIX_FN)
COOC_FN = args.wikipedia_cooc_fn
PREFIX_OUTPUT_FN = "{0}_{1}_{2}_{3}_{4}".format(
    GEONAME_FN.split("/")[-1],
    EPOCHS,
...@@ -99,15 +101,39 @@ PREFIX_OUTPUT_FN = "{0}_{1}_{2}_{3}_{4}".format(
    ACCURACY_TOLERANCE,
    REGION_SUFFIX_FN)
REL_CODE=""
if args.adjacency:
    PREFIX_OUTPUT_FN += "_A"
    REL_CODE += "A"
if args.inclusion:
    PREFIX_OUTPUT_FN += "_I"
    REL_CODE += "I"
if args.wikipedia_cooc:
    PREFIX_OUTPUT_FN += "_C"
    REL_CODE += "C"

MODEL_OUTPUT_FN = "outputs/{0}.h5".format(PREFIX_OUTPUT_FN)
INDEX_FN = "outputs/{0}_index".format(PREFIX_OUTPUT_FN)
HISTORY_FN = "outputs/{0}.csv".format(PREFIX_OUTPUT_FN)
from lib.utils import MetaDataSerializer

meta_data = MetaDataSerializer(
    DATASET_NAME,
    REL_CODE,
    COOC_SAMPLING_NUMBER,
    ITER_ADJACENCY,
    NGRAM_SIZE,
    ACCURACY_TOLERANCE,
    EPOCHS,
    EMBEDDING_DIM,
    WORDVEC_ITER,
    INDEX_FN,
    MODEL_OUTPUT_FN,
    HISTORY_FN
)
meta_data.save("outputs/{0}.json".format(PREFIX_OUTPUT_FN))
#############################################################################################
################################# LOAD DATA #################################################
...@@ -231,7 +257,7 @@ geoname_vec = {row.geonameid : zero_one_encoding(row.longitude,row.latitude) for
del filtered
EMBEDDING_DIM = 256
num_words = len(index.index_ngram) # necessary for the embedding matrix
logging.info("Preparing Input and Output data...")
...@@ -288,7 +314,7 @@ if not os.path.exists("outputs/"):
logging.info("Generating N-GRAM Embedding...") logging.info("Generating N-GRAM Embedding...")
embedding_weights = index.get_embedding_layer(geoname2encodedname.values(),dim= embedding_dim,iter=WORDVEC_ITER) embedding_weights = index.get_embedding_layer(geoname2encodedname.values(),dim= EMBEDDING_DIM,iter=WORDVEC_ITER)
logging.info("Embedding generated !") logging.info("Embedding generated !")
############################################################################################# #############################################################################################
...@@ -299,7 +325,7 @@ logging.info("Embedding generated !") ...@@ -299,7 +325,7 @@ logging.info("Embedding generated !")
input_1 = Input(shape=(index.max_len,)) input_1 = Input(shape=(index.max_len,))
input_2 = Input(shape=(index.max_len,)) input_2 = Input(shape=(index.max_len,))
embedding_layer = Embedding(num_words, EMBEDDING_DIM,input_length=index.max_len,weights=[embedding_weights],trainable=False)#, trainable=True)
x1 = embedding_layer(input_1)
x2 = embedding_layer(input_2)
...@@ -311,14 +337,14 @@ x2 = Bidirectional(LSTM(98))(x2)
x = concatenate([x1,x2])#,x3])
x1 = Dense(500,activation="relu")(x)
# x1 = Dropout(0.3)(x1)
x1 = Dense(500,activation="relu")(x1)
# x1 = Dropout(0.3)(x1)
x2 = Dense(500,activation="relu")(x)
# x2 = Dropout(0.3)(x2)
x2 = Dense(500,activation="relu")(x2)
# x2 = Dropout(0.3)(x2)
output_lon = Dense(1,activation="sigmoid",name="Output_LON")(x1)
output_lat = Dense(1,activation="sigmoid",name="Output_LAT")(x2)
...@@ -344,7 +370,7 @@ history = model.fit(x=[X_1_train,X_2_train],
hist_df = pd.DataFrame(history.history)
hist_df.to_csv(HISTORY_FN)
model.save(MODEL_OUTPUT_FN)
...
Image files changed in documentation/imgs/ (sizes shown as before → after):
documentation/imgs/LSTM_arch.png: 130 B
documentation/imgs/first_approach.png: 291 KiB → 131 B
documentation/imgs/second_approach.png: 447 KiB → 131 B
documentation/imgs/third_approach.png: 30.4 KiB → 130 B
...@@ -78,3 +78,47 @@ class ConfigurationReader(object):
        if not input_:
            return self.parser.parse_args()
        return self.parser.parse_args(input_)
class MetaDataSerializer(object):
    def __init__(self,
                 dataset_name,
                 rel_code,
                 cooc_sample_size,
                 adj_iteration,
                 ngram_size,
                 tolerance_value,
                 epochs,
                 embedding_dim,
                 word2vec_iter_nb,
                 index_fn,
                 keras_model_fn,
                 train_test_history_fn):
        self.dataset_name = dataset_name
        self.rel_code = rel_code
        self.cooc_sample_size = cooc_sample_size
        self.adj_iteration = adj_iteration
        self.ngram_size = ngram_size
        self.tolerance_value = tolerance_value
        self.epochs = epochs
        self.embedding_dim = embedding_dim
        self.word2vec_iter_nb = word2vec_iter_nb
        self.index_fn = index_fn
        self.keras_model_fn = keras_model_fn
        self.train_test_history_fn = train_test_history_fn

    def save(self, fn):
        json.dump({
            "dataset_name" : self.dataset_name,
            "rel_code" : self.rel_code,
            "cooc_sample_size" : self.cooc_sample_size,
            "adj_iteration" : self.adj_iteration,
            "ngram_size" : self.ngram_size,
            "tolerance_value" : self.tolerance_value,
            "epochs" : self.epochs,
            "embedding_dim" : self.embedding_dim,
            "word2vec_iter_nb" : self.word2vec_iter_nb,
            "index_fn" : self.index_fn,
            "keras_model_fn" : self.keras_model_fn,
            "train_test_history_fn" : self.train_test_history_fn
        }, open(fn, 'w'))
\ No newline at end of file
...@@ -7,6 +7,7 @@
{ "short": "-i", "long": "--inclusion", "action": "store_true" }, { "short": "-i", "long": "--inclusion", "action": "store_true" },
{ "short": "-a", "long": "--adjacency", "action": "store_true" }, { "short": "-a", "long": "--adjacency", "action": "store_true" },
{ "short": "-w", "long": "--wikipedia-cooc", "action": "store_true" }, { "short": "-w", "long": "--wikipedia-cooc", "action": "store_true" },
{ "long": "--wikipedia-cooc-fn","help":"Cooccurrence data filename"},
{ "long": "--cooc-sample-size", "type": "int", "default": 3 }, { "long": "--cooc-sample-size", "type": "int", "default": 3 },
{"long": "--adjacency-iteration", "type":"int","default":1}, {"long": "--adjacency-iteration", "type":"int","default":1},
{ "short": "-n", "long": "--ngram-size", "type": "int", "default": 2 }, { "short": "-n", "long": "--ngram-size", "type": "int", "default": 2 },
...
import gzip
import json
import re
import argparse
import pandas as pd
from joblib import Parallel,delayed
from tqdm import tqdm
parser = argparse.ArgumentParser()
parser.add_argument("page_of_interest_fn")
parser.add_argument("output_fn")
parser.add_argument("-c","--corpus",action="append")
args = parser.parse_args()#("../wikidata/sample/place_en_fr_page_clean_onlyfrplace.csv test.txt -c frwiki-latest.json.gz -c enwiki-latest.json.gz".split())
PAGES_OF_INTEREST_FILE = args.page_of_interest_fn
WIKIPEDIA_CORPORA = args.corpus
OUTPUT_FN = args.output_fn
if not WIKIPEDIA_CORPORA:  # args.corpus is None when no -c option is given
    raise Exception('No corpus was given!')
df = pd.read_csv(PAGES_OF_INTEREST_FILE)
page_of_interest = set(df.title.values)
page_coord = {row.title : (row.longitude,row.latitude) for ix,row in df.iterrows()}
output = open(OUTPUT_FN,'w')
output.write("title\tinterlinks\tlongitude\tlatitude\n")
for wikipedia_corpus in WIKIPEDIA_CORPORA:
    for line in tqdm(gzip.GzipFile(wikipedia_corpus,'rb')):
        data = json.loads(line)
        if data["title"] in page_of_interest:
            occ = page_of_interest.intersection(data["interlinks"].keys())
            coord = page_coord[data["title"]]
            if len(occ) > 0:
                output.write(data["title"]+"\t"+"|".join(occ)+"\t{0}\t{1}".format(*coord)+"\n")
...@@ -20,7 +20,7 @@ from tqdm import tqdm
parser = argparse.ArgumentParser()
parser.add_argument("cooccurrence_file")
args = parser.parse_args("data/wikipedia/cooccurrence_FR.txt".split())#("data/geonamesData/FR.txt".split()) args = parser.parse_args()#("data/wikipedia/cooccurrence_FR.txt".split())#("data/geonamesData/FR.txt".split())
# LOAD DATA
COOC_FN = args.cooccurrence_file
...@@ -82,4 +82,4 @@ del X_test["nn"]
# SAVING THE DATA
logging.info("Saving Output !")
X_train.to_csv(COOC_FN+"_train.csv")
X_test.to_csv(COOC_FN+"_test.csv")
\ No newline at end of file