diff --git a/LICENSE b/LICENSE
new file mode 100644
index 0000000000000000000000000000000000000000..ff31b15b996add91de07bb25a7936cee5a4207f1
--- /dev/null
+++ b/LICENSE
@@ -0,0 +1,21 @@
+MIT License
+
+Copyright (c) 2020 Jacques Fize, Ludovic Moncla and Bruno Martins
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
diff --git a/README.md b/README.md
index e69de29bb2d1d6434b8b29ae775ad8c2e48c5391..f5c52880524af61be5a7628a5e0646e81e8514b8 100644
--- a/README.md
+++ b/README.md
@@ -0,0 +1,194 @@
+
+
+This repository contains the code for *"Using a deep neural network for toponym geocoding based on co-occurrences and spatial relations"*. In a nutshell, we propose to geocode place names using the least information available (two place names: one to geocode and a second used as context) and rely on a deep neural network architecture.
+
+
+
+<hr>
+
+# Model architecture
+
+The model is a neural network. The first model is illustrated in Figure 1. In a nutshell, the model predicts coordinates (output) from two place names: the first is the one we want to geocode and the second is used as context.
+
+In an experiment (presented [here](https://jacobe2169.github.io/mapthetoponymsim/)), we found that specific toponym affixes (suffixes or prefixes, for example) are bound to certain geographic areas. Based on this assumption, we use an n-gram sequence representation of input toponyms. For example, Paris is transformed into Par, ari, ris.
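+
+For illustration, here is a minimal sketch of this n-gram decomposition (the repository's `NgramIndex` class in `lib/ngram_index.py` handles the actual encoding; the helper below is only a simplified stand-in):
+
+```python
+def char_ngrams(toponym, n=3):
+    """Split a toponym into overlapping character n-grams (simplified sketch)."""
+    return [toponym[i:i + n] for i in range(len(toponym) - n + 1)]
+
+print(char_ngrams("Paris"))  # ['Par', 'ari', 'ris']
+```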
+
+<div style="text-align:center">
+<img src="documentation/imgs/LSTM_archv2.png"/>
+<p><strong>Figure 1</strong> : General workflow</p>
+</div>
+
+<hr>
+
+
+# Setup environment
+
+- Python 3.6+
+- OS independent**
+
+***It is strongly advised to use Anaconda in a Windows environment!*
+
+## Install dependencies
+
+    pip3 install -r requirements.txt
+
+For Anaconda users
+
+    while read requirement; do conda install --yes $requirement; done < requirements.txt
+
+
+
+<hr>
+
+# Get Started
+
+## Get pre-trained model
+
+Pre-trained models are available:
+
+| Geographic Area | Description                                                             | URL                                                                      |
+|-----------|-------------------------------------------------------------------------|--------------------------------------------------------------------------|
+| FR        | Model trained on French populated places and areas                       | https://projet.liris.cnrs.fr/hextgeo/files/trained_models/FR_MODEL_2.zip |
+| GB        | Model trained on populated places and areas in England                   | https://projet.liris.cnrs.fr/hextgeo/files/trained_models/GB_MODEL_2.zip |
+| US        | Model trained on populated places and areas in the United States         | https://projet.liris.cnrs.fr/hextgeo/files/trained_models/US_MODEL_2.zip |
+
+## Load and use the model
+
+First, import the dedicated module and load the pre-trained model files. Here, we'll use the France model.
+
+```python
+from lib.geocoder.our_geocoder import Geocoder
+g = Geocoder("FR_MODEL_2/FR.txt_100_4_100__A_C.h5","FR_MODEL_2/FR.txt_100_4_100__A_C_index")
+
+```
+
+To geocode a pair of toponyms, use the `get_coord` method:
+```python
+print(g.get_coord("Paris","France"))
+#(2.7003836631774902, 41.24913454055786) #lon,lat
+```
+
+To reduce computation time, use the `get_coords` method to geocode multiple pairs of toponyms at once:
+
+```python
+print(g.get_coords(["Paris","Paris"],["Cherbourg","Montpellier"]))
+#(array([2.6039734, 3.480011 ], dtype=float32),
+# array([48.27507 , 48.075943], dtype=float32))
+
+```
+
+<hr>
+
+# Train your own model
+
+We provide an implementation of the model illustrated in Figure 1, and a second model based on the same input but built on a pre-trained BERT model.
+
+
+## Prepare data
+
+The data preparation is divided into three steps. First, we retrieve the required data from Geonames. Second, we retrieve place name co-occurrences from Wikipedia. Finally, we generate the datasets used to train the model.
+
+### Geonames data
+ 1. Download the Geonames data used to train the network [here](http://download.geonames.org/export/dump/)
+ 2. Download the hierarchy data [here](http://download.geonames.org/export/dump/hierarchy.zip)
+ 3. Unzip both files in a directory of your choice
+
+### Co-occurrence data
+
+ 1. First, download the Wikipedia corpus from which you want to extract co-occurrences: [English Wikipedia Corpus](https://dumps.wikimedia.org/enwiki/20200201/enwiki-20200201-pages-articles.xml.bz2)
+ 2. Parse the corpus with the Gensim script using the following command: `python3 -m gensim.scripts.segment_wiki -i -f <wikicorpus> -o <1stoutputname>.json.gz`
+ 3. Build a page-of-interest file that contains a list of Wikipedia pages. The file must be a CSV with the following columns: title, latitude, longitude (see the example below the list).<br> You can find [here](https://projet.liris.cnrs.fr/hextgeo/files/place_en_fr_page_clean.csv) a page-of-interest file that contains places appearing in both the French and English Wikipedia.
+ 4. Then, using the page-of-interest file, run the command: `python3 script/get_cooccurrence.py <page_of_interest_file> <2ndoutputname> -c <1stoutputname>.json.gz`
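+
+For reference, a page-of-interest file could look like the following (the rows below are only illustrative; the real file linked above follows the same layout):
+
+```
+title,latitude,longitude
+Paris,48.8566,2.3522
+Montpellier,43.6108,3.8767
+```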
+
+### Generate dataset
+
+Use the following command to generate the datasets for training your model.
+
+    python3 generate_dataset.py <geonames_dataset> <wikipedia_dataset> <geonames_hierarchy_data>
+
+
+| Parameter       | Description                                                                                                                                          |
+|-----------------|------------------------------------------------------------------------------------------------------------------------------------------------------|
+| --cooc-sampling | Number of co-occurrences sampled for each place in the co-occurrence dataset                                                                          |
+| --adj-sampling  | Number of adjacency relations extracted for each place in a Healpix cell                                                                              |
+| --adj-nside     | Healpix resolution within which places are considered adjacent                                                                                        |
+| --split-nside   | Healpix resolution of the zones used for the train/test split                                                                                         |
+| --split-method  | [per_pair\|per_place] Split each dataset based on places (a place cannot appear in both train and test) or pairs (a place can appear in both train and test) |
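+
+For example, assuming the Geonames file `FR.txt`, the Wikipedia co-occurrence file `cooccurrence_FR.txt`, and `hierarchy.txt` obtained in the previous steps (the file names are placeholders), a typical invocation could be:
+
+    python3 generate_dataset.py FR.txt cooccurrence_FR.txt hierarchy.txt --cooc-sampling 4 --adj-nside 128 --split-nside 128 --split-method per_pair
+
+With these arguments, the script should write `FR_per_pair_inclusion.csv`, `FR_per_pair_adjacent.csv`, and `FR_per_pair_cooc.csv` to the current directory.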
+
+### If you're in a hurry
+
+French Geonames data, French Wikipedia co-occurrence data, and the corresponding train/test split datasets can be found here: [https://projet.liris.cnrs.fr/hextgeo/files/](https://projet.liris.cnrs.fr/hextgeo/files/)
+
+
+## Our model
+
+To train the first model, use the following command:
+
+    python3 train_geocoder_v2.py <dataset_name> <inclusion_dataset> <adjacent_dataset> <cooccurrence_dataset> [-i | -a | -w ]+ [optional args]
+
+| Parameter             | Description                                                                     |
+|-----------------------|---------------------------------------------------------------------------------|
+| -i,--inclusion        | Use inclusion relationships to train the network                                |
+| -a,--adjacency        | Use adjacency relationships to train the network                                |
+| -w,--wikipedia-coo    | Use Wikipedia place co-occurrences to train the network                         |
+| -n,--ngram-size       | N-gram size                                                                     |
+| -t,--tolerance-value  | K value used when computing accuracy@K (in kilometers)                          |
+| -e,--epochs           | Number of epochs                                                                |
+| -d,--dimension        | Size of the n-gram embeddings                                                   |
+| --admin_code_1        | (Optional) Train the network on a specific region only                          |
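+
+For example, to train on the datasets produced by `generate_dataset.py` with all three relation types, 4-grams, and 100 epochs (the dataset names below are illustrative):
+
+    python3 train_geocoder_v2.py FR FR_per_pair_inclusion.csv FR_per_pair_adjacent.csv FR_per_pair_cooc.csv -i -a -w -n 4 -e 100 -d 100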
+
+
+<hr>
+
+# [In Progress] BERT model
+In recent years, the BERT architecture proposed by Google researchers has outperformed state-of-the-art methods on several NLP tasks (POS tagging, NER, classification). To check whether BERT embeddings could improve the performance of our approach, we wrote a script that uses BERT with our data. Our previous model returns two values, each in [0,1]. With BERT, the task shifts to classification (softmax), where each class corresponds to a cell on the globe. We use the hierarchical Healpix projection; other projection models such as S2 Geometry could also be considered: https://s2geometry.io/about/overview.
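+
+As a rough sketch of how coordinates are mapped to Healpix classes, the `latlon2healpix` helper from `lib/utils_geo.py` can be used as follows (the `nside` value is only an example):
+
+```python
+from lib.utils_geo import latlon2healpix
+
+nside = 256  # example Healpix resolution; larger values mean smaller cells
+cell_id = latlon2healpix(48.8566, 2.3522, nside)  # latitude, longitude of Paris
+print(cell_id)  # Healpix cell index, used as the classification label
+```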
+
+To train this model, run the `train_bert_geocoder.py` script:
+
+    python3 train_bert_geocoder.py \
+    <train_dataset>\
+    <test_dataset>\
+    <output_dir>\
+    [--batch_size BATCH_SIZE | --epochs EPOCHS]
+
+The train and test datasets are tabular data with two columns: sentence and label.
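+
+A minimal illustration of the expected layout (assuming a CSV file; the rows, delimiter, and label values below are hypothetical, with the label expected to be a Healpix cell class):
+
+```
+sentence,label
+"Paris Montpellier",2047
+"Cherbourg Paris",513
+```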
+
+### Pre-trained model
+
+Pre-trained models can be found [here](https://projet.liris.cnrs.fr/hextgeo/files/trained_models/BERT_MODELS/).
+
+### Use the BERT model
+
+```python
+from lib.geocoder.bert_geocoder import BertGeocoder
+geocoder = BertGeocoder(<bert_model_dir>,<label_healpix_file>)
+geocoder.geocode(<toponyms>,<context_toponyms>)
+```
+
+<hr>
+
+# Train multiple models with different parameters
+
+We built a small module that runs the network training with different parameter combinations. To do so, use the `GridSearchModel` class in `lib.run`. You can find
+an example in the following code:
+
+```python
+from lib.run import GridSearchModel
+from collections import OrderedDict
+
+grid = GridSearchModel(\
+    "python3 train_geocoder_v2.py",
+    **OrderedDict({ # We use an OrderedDict since the order of parameters is important
+    "rel":["-i","-a","-c"],
+    "-n":[4],
+    "geoname_fn":"../data/geonamesData/US_FR.txt".split(),
+    "hierarchy_fn":"../data/geonamesData/hierarchy.txt".split(),
+    "store_true":["rel"]
+    }.items()))
+grid.run()
+```
+
+# Authors and Acknowledgment
+
+Proposed by **Jacques Fize**, **Ludovic Moncla** and **Bruno Martins**
+
+This research is supported by an IDEXLYON project of the University of Lyon within the framework of the Investments for the Future Program (ANR-16-IDEX-0005). Bruno Martins was supported by the Fundação para a Ciência e a Tecnologia (FCT), through the project grants PTDC/CCI-CIF/32607/2017 (CMIMU) and UIBD/50021/2020 (INESC-ID multi-annual funding).
\ No newline at end of file
diff --git a/desamb_eval.py b/desamb_eval.py
new file mode 100644
index 0000000000000000000000000000000000000000..8fcd5febb04f7df4938dd5634949f15fc6ac3ce0
--- /dev/null
+++ b/desamb_eval.py
@@ -0,0 +1,66 @@
+from glob import glob
+import json
+
+import argparse
+import logging
+
+import pandas as pd
+
+
+
+parser = argparse.ArgumentParser()
+parser.add_argument("eval_dataset")
+parser.add_argument("models_directory")
+parser.add_argument("-g","--gpu",action="store_true")
+args = parser.parse_args()#("-g ../data/geocoding_evaluation/fr_cooc_test.csv outputs/FR_RESULT".split())
+
+
+if not args.gpu:
+    import os
+    os.environ['CUDA_VISIBLE_DEVICES'] = '-1' # No need for GPU
+
+
+from predict_toponym_coordinates import Geocoder
+from lib.utils_geo import haversine_pd
+
+logging.getLogger("tensorflow").setLevel(logging.CRITICAL)
+logging.getLogger("tensorflow_hub").setLevel(logging.CRITICAL)
+
+EVAL_DATASET_FN= args.eval_dataset#"./test_dataset_ambiguity.csv"
+
+
+def eval_model(eval_dataset_fn,model_fn,model_index_fn):
+    print("Dataset -- {0} -- Model -- {1}".format(\
+        eval_dataset_fn.split("/")[-1],
+        model_fn.split("/")[-1]))
+    df = pd.read_csv(eval_dataset_fn)
+    geocoder = Geocoder(model_fn,model_index_fn)
+    lon,lat = geocoder.get_coords(df.name1.values,df.name2.values)                  
+    lon,lat = geocoder.wgs_coord(lon,lat)
+
+    df["p_longitude"] = lon
+    df["p_latitude"] = lat
+
+    df["dist"] = haversine_pd(df.longitude,df.latitude,df.p_longitude,df.p_latitude)
+
+    print("100km",(df.dist<100).sum()/len(df))
+    print("50km",(df.dist<50).sum()/len(df))
+    print("20km",(df.dist<20).sum()/len(df))
+    return df
+
+# glob the model files and strip the ".h5" suffix (str.rstrip removes characters, not a suffix)
+prefixes = [x[:-len(".h5")] for x in glob(args.models_directory+"/*.h5")]
+
+final_output = []
+for prefix in prefixes:
+    try:
+        df = eval_model(EVAL_DATASET_FN,prefix + ".h5",prefix + "_index")
+        data = json.load(open(prefix+".json"))
+        data["acccuracy@100km"] = (df.dist<100).sum()/len(df)
+        data["acccuracy@50km"] = (df.dist<50).sum()/len(df)
+        data["acccuracy@25km"] = (df.dist<25).sum()/len(df)
+        final_output.append(data)
+    except Exception as e:
+        print("Could not evaluate {0}: {1}".format(prefix, e))
+    
+
+# strip the ".csv" suffix properly (str.rstrip removes characters, not a suffix)
+output_prefix = EVAL_DATASET_FN[:-len(".csv")] if EVAL_DATASET_FN.endswith(".csv") else EVAL_DATASET_FN
+pd.DataFrame(final_output).to_csv("{0}_RESULT.csv".format(output_prefix))
\ No newline at end of file
diff --git a/desamb_eval_runs.sh b/desamb_eval_runs.sh
new file mode 100644
index 0000000000000000000000000000000000000000..20eb682c6a8e00856a6d20d05100452d913b13d0
--- /dev/null
+++ b/desamb_eval_runs.sh
@@ -0,0 +1,3 @@
+python3 desamb_eval.py -g ../data/geocoding_evaluation/fr_dataset_ambiguity_sample50percent.csv outputs/FR_RESULT 
+python3 desamb_eval.py -g ../data/geocoding_evaluation/us_fr_cooc_test.csv outputs/USFR_WORD
+python3 desamb_eval.py -g ../data/geocoding_evaluation/us_fr_dataset_ambiguity.csv outputs/USFR_WORD
diff --git a/documentation/imgs/LSTM_archv2.png b/documentation/imgs/LSTM_archv2.png
new file mode 100644
index 0000000000000000000000000000000000000000..794909f8d568f9f153a774bba997d97f9d6a89be
Binary files /dev/null and b/documentation/imgs/LSTM_archv2.png differ
diff --git a/generate_dataset.py b/generate_dataset.py
new file mode 100644
index 0000000000000000000000000000000000000000..788fe37c40f4822f0745f20f1795837fc74134f9
--- /dev/null
+++ b/generate_dataset.py
@@ -0,0 +1,149 @@
+import argparse
+
+import pandas as pd
+import numpy as np
+
+from helpers import read_geonames
+from lib.utils_geo import latlon2healpix
+
+from tqdm import tqdm
+from sklearn.model_selection import train_test_split
+
+parser = argparse.ArgumentParser()
+
+parser.add_argument("geonames_dataset")
+parser.add_argument("wikipedia_dataset")
+parser.add_argument("geonames_hierarchy_data")
+parser.add_argument("--cooc-sampling", default=4, type=int)
+parser.add_argument("--adj-sampling", default=4, type=int)
+parser.add_argument("--adj-nside", default=128, type=int)
+parser.add_argument("--split-nside", default=128, type=int)
+parser.add_argument("--split-method", default="per_pair", type=str, choices="per_pair per_place".split())
+
+args = parser.parse_args()#("../data/geonamesData/FR.txt ../data/wikipedia/cooccurrence_FR.txt ../data/geonamesData/hierarchy.txt".split())
+
+PREFIX = args.geonames_dataset.split("/")[-1].split(".")[0]  # Ouch !
+PREFIX = PREFIX + "_" + args.split_method
+
+#  LOAD DATA
+geonames_data = read_geonames(args.geonames_dataset)
+wikipedia_data = pd.read_csv(args.wikipedia_dataset, sep="\t")
+geonames_hierarchy_data = pd.read_csv(args.geonames_hierarchy_data, sep="\t", header=None,
+                                      names="parentId,childId,type".split(",")).fillna("")
+
+# Add IDs for the Wikipedia Cooc Dataset
+min_id = geonames_data.geonameid.max() + 1
+max_id = min_id + len(wikipedia_data)
+wikipedia_data["geonameid"] = np.arange(min_id, max_id)
+
+#  Healpix cell id computation
+geonames_data["adj_split"] = geonames_data.apply(lambda x: latlon2healpix(x.latitude, x.longitude, args.adj_nside),
+                                                 axis=1)
+
+
+def get_adjacent_pairs(dataframe, sampling_nb):
+    """
+    Return pairs of place toponyms that are geographically adjacent.
+    Parameters
+    ----------
+    dataframe : pandas.DataFrame
+        geonames data
+    sampling_nb : int
+        number of adjacent places drawn per place
+
+    Returns
+    -------
+    list of list
+        [[ID,place toponym,adjacent place toponym, latitude, longitude],...]
+    """
+    new_pairs = []
+    for ix, row in tqdm(dataframe.iterrows(), total=len(dataframe), desc="Get Adjacent Toponym Pairs"):
+        healpix_cell = row.adj_split
+        topo_prin = row["name"]
+        lat, lon = row.latitude, row.longitude
+        within_cell = dataframe[dataframe.adj_split == healpix_cell]["name"].values
+        selected = np.random.choice(within_cell, sampling_nb)
+        new_pairs.extend([[row.geonameid, topo_prin, sel, lat, lon] for sel in selected])
+    return new_pairs
+
+
+def get_cooccurrence_pairs(dataframe, sampling_nb):
+    """
+    Return pairs of place toponyms that appear in the same Wikipedia page
+    Parameters
+    ----------
+    dataframe : pandas.DataFrame
+        wikipedia cooccurrence data
+    sampling_nb : int
+        number of co-occurring places drawn per page
+
+    Returns
+    -------
+    list of list
+        [[ID,place toponym,co-occurring place toponym, latitude, longitude],...]
+    """
+    new_pairs = []
+    dataframe["interlinks"] = dataframe.interlinks.apply(lambda x: np.random.choice(x.split("|"), sampling_nb))
+    for ix, row in tqdm(dataframe.iterrows(), total=len(dataframe), desc="Get Cooccurrent Toponym Pairs"):
+        topo_prin = row.title
+        lat, lon = row.latitude, row.longitude
+        new_pairs.extend([[row.geonameid, topo_prin, sel, lat, lon] for sel in row["interlinks"]])
+    return new_pairs
+
+
+def get_inclusion_pairs(geoname_df, hierarchy_df):
+    """
+    Return pairs of place toponyms that share an inclusion relationship. Ex. Paris, France (Paris geometry is included in France geometry)
+    Parameters
+    ----------
+    geoname_df : pandas.DataFrame
+        geonames data
+    hierarchy_df : pandas.DataFrame
+        geonames hierarchy data
+
+    Returns
+    -------
+    list of list
+        [[ID,child place toponym,parent place toponym, latitude, longitude],...]
+    """
+    geonamesIDS = set(geoname_df.geonameid.values)
+    id_label = dict(geonames_data["geonameid name".split()].values)
+    id_lat = dict(geonames_data["geonameid latitude".split()].values)
+    id_lon = dict(geonames_data["geonameid longitude".split()].values)
+    filter_mask = (hierarchy_df.childId.isin(geonamesIDS) & hierarchy_df.parentId.isin(geonamesIDS))
+    pairs_id = hierarchy_df[filter_mask]["childId parentId".split()].values.tolist()
+    return [[p[0], id_label[p[0]], id_label[p[1]], id_lat[p[0]], id_lon[p[0]]] for p in pairs_id]
+
+# EXTRACT PAIRS FROM INPUT DATA
+cooc_pairs = pd.DataFrame(get_cooccurrence_pairs(wikipedia_data, args.cooc_sampling),
+                          columns="ID toponym toponym_context latitude longitude".split())
+adjacent_pairs = pd.DataFrame(get_adjacent_pairs(geonames_data, args.adj_sampling),
+                              columns="ID toponym toponym_context latitude longitude".split())
+inclusion_pairs = pd.DataFrame(get_inclusion_pairs(geonames_data, geonames_hierarchy_data),
+                               columns="ID toponym toponym_context latitude longitude".split())
+
+# FOR EACH PAIR, COMPUTE THE HEALPIX CELL ID FOR EACH COORDINATES ASSOCIATED
+cooc_pairs["hp_split"] = cooc_pairs.apply(lambda x: latlon2healpix(x.latitude, x.longitude, args.split_nside), axis=1)
+adjacent_pairs["hp_split"] = adjacent_pairs.apply(lambda x: latlon2healpix(x.latitude, x.longitude, args.split_nside),
+                                                  axis=1)
+inclusion_pairs["hp_split"] = inclusion_pairs.apply(lambda x: latlon2healpix(x.latitude, x.longitude, args.split_nside),
+                                                    axis=1)
+
+# SPLIT DATASETS BETWEEN TRAIN AND TEST GEOGRAPHICALY
+field = "hp_split"
+if args.split_method == "per_place":
+    field = "ID"
+
+for df in [cooc_pairs, adjacent_pairs]:
+    df_train, _ = train_test_split(df, stratify=df[field].values, test_size=0.33)
+    df["split"] = "test"
+    df.loc[df_train.index.values, "split"] = "train"
+
+inc_train, _ = train_test_split(inclusion_pairs, test_size=0.33)
+inclusion_pairs["split"] = "test"
+inclusion_pairs.loc[inc_train.index.values, "split"] = "train"
+
+# SAVE DATA
+inclusion_pairs.to_csv("{0}_inclusion.csv".format(PREFIX), sep="\t")
+adjacent_pairs.to_csv("{0}_adjacent.csv".format(PREFIX), sep="\t")
+cooc_pairs.to_csv("{0}_cooc.csv".format(PREFIX), sep="\t")
diff --git a/geocoder_app.py b/geocoder_app.py
new file mode 100644
index 0000000000000000000000000000000000000000..81d4a1e0e7ce7713d00e47ca4f862b807e3dee3a
--- /dev/null
+++ b/geocoder_app.py
@@ -0,0 +1,93 @@
+from flask import Flask, escape, request, render_template,jsonify,Markup, redirect, url_for
+from lib.geocoder.our_geocoder import Geocoder,TextGeocoder
+from lib.geocoder.heuristics import *
+
+import spacy
+
+app = Flask(__name__)
+
+dict_model = {
+    "FR_AIC":("./outputs/FR_MODEL_2/FR.txt_100_4_100__A_I_C.h5","./outputs/FR_MODEL_2/FR.txt_100_4_100__A_I_C_index"),
+    "FR_C":("./outputs/FR_MODEL_2/FR.txt_100_4_100__C.h5","./outputs/FR_MODEL_2/FR.txt_100_4_100__C_index"),
+    "FR_AC":("./outputs/FR_MODEL_2/FR.txt_100_4_100__A_C.h5","./outputs/FR_MODEL_2/FR.txt_100_4_100__A_C_index"),
+    "FR_IC":("./outputs/FR_MODEL_2/FR.txt_100_4_100__I_C.h5","./outputs/FR_MODEL_2/FR.txt_100_4_100__I_C_index"),
+
+    "GB_AIC":("./outputs/GB_MODEL_2/GB.txt_100_4_100__A_I_C.h5","./outputs/GB_MODEL_2/GB.txt_100_4_100__A_I_C_index"),
+    "GB_C":("./outputs/GB_MODEL_2/GB.txt_100_4_100__C.h5","./outputs/GB_MODEL_2/GB.txt_100_4_100__C_index"),
+    "GB_AC":("./outputs/GB_MODEL_2/GB.txt_100_4_100__A_C.h5","./outputs/GB_MODEL_2/GB.txt_100_4_100__A_C_index"),
+    "GB_IC":("./outputs/GB_MODEL_2/GB.txt_100_4_100__I_C.h5","./outputs/GB_MODEL_2/GB.txt_100_4_100__I_C_index")
+    ,"FR_IGN":("./outputs/IGN/onlyAdjac/IGN_4_100_A_C.h5","./outputs/IGN/onlyAdjac/IGN_4_100_A_C_index")
+}
+
+MODEL = "FR_AC"
+LANG = "fr"
+NER = "spacy"
+
+heuristic_func = heuristic_cluster
+
+geocoder = Geocoder(*dict_model[MODEL])
+g_t = TextGeocoder(geocoder,NER,LANG,heuristic_func)
+
+
+@app.route('/')
+def home():
+    toponym = request.args.get("top", "")
+    c_toponym = request.args.get("c_top", "")
+    msg = request.args.get("msg", "")
+    msg_code = request.args.get("msg_code", "info")
+    if toponym and c_toponym:
+        lon,lat = geocoder.get_coords([toponym],[c_toponym])
+        lon,lat = lon[0],lat[0]
+        print(lon,lat)
+        return  render_template("pair_topo.html",lat=lat,lon=lon,title="Toponyms Pair Geocoder",dict_model=dict_model,msg_code=msg_code)
+    else:
+        return  render_template("pair_topo.html",title="Toponyms Pair Geocoder",dict_model=dict_model,msg_code=msg_code)
+
+@app.route('/text')
+def text():
+    return render_template("text.html",title="Text Geocoder",dict_model=dict_model)
+
+@app.route('/geocode', methods=['POST', 'GET'])
+def geocode():
+    if request.method == 'POST':
+        text = request.form["text"]
+        
+        results = g_t.geocode(g_t.extract_geo_entities(text))
+        
+        html_, pos_ = "", 0
+        for item in results:
+            start,end = item["start"], item["end"]
+            html_ = html_ + text[pos_:start] + "<span class=\"annotation place\">{0}</span>".format(text[start:end])
+            pos_ = end
+        
+        place_coords = {}
+        for r in results:
+            if r["text"] in place_coords:
+                continue
+            place_coords[r["text"]]={"lat":float(r["coord"]["lat"]),"lon":float(r["coord"]["lon"])}
+        return render_template("text.html",title="Text Geocoder",data={"type":"success","output":Markup(html_),"place_coords":place_coords},dict_model=dict_model)
+
+
+@app.route("/loadmodel/<model_id>")
+def loadModel(model_id):
+    global geocoder,g_t,LANG
+    if not model_id in dict_model:
+        return redirect(url_for(".home",msg="An error happend when loading the model \"{0}\"!".format(model_id),msg_code="danger"))
+    else: 
+        geocoder = Geocoder(*dict_model[model_id])
+        g_t = TextGeocoder(geocoder,NER,LANG,heuristic_func)
+        return redirect(url_for(".home",msg="Model \"{0}\" was loaded successfuly!".format(model_id),msg_code="success"))
+
+@app.route("/loadlang/<lang>")
+def loadLang(lang):
+    global geocoder,g_t,LANG
+    try:
+        g_t = TextGeocoder(geocoder,NER,lang,heuristic_func)
+        LANG = lang
+        return redirect(url_for(".home",msg="Language is now set to \"{0}\"!".format(LANG),msg_code="success"))
+    except:
+        return redirect(url_for(".home",msg="\"{}\" language is not available!".format(lang),msg_code="danger"))
+        
+
+if __name__ == "__main__":
+    app.run(host="0.0.0.0",debug=True)
\ No newline at end of file
diff --git a/helpers.py b/helpers.py
new file mode 100644
index 0000000000000000000000000000000000000000..b093f62f25a9dce7913cca5373141f13f1753554
--- /dev/null
+++ b/helpers.py
@@ -0,0 +1,201 @@
+import os
+import time
+import re
+
+import numpy as np
+import pandas as pd
+
+
+def read_geonames(file):
+    """
+    Return a dataframe that contains Geonames data.
+    
+    Parameters
+    ----------
+    file : str
+        path of the Geonames Csv file
+    
+    Returns
+    -------
+    pd.DataFrame
+        geonames data
+    """
+    dtypes_dict = {
+        0: int,  # geonameid
+        1: str,  # name
+        2: str,  # asciiname
+        3: str,  # alternatenames
+        4: float,  # latitude
+        5: float,  # longitude
+        6: str,  # feature class
+        7: str,  # feature code
+        8: str,  # country code
+        9: str,  # cc2
+        10: str,  # admin1 code
+        11: str,  # admin2 code
+        12: str,  # admin3 code
+        13: str,  # admin4 code
+        14: int,  # population
+        15: str,  # elevation
+        16: int,  # dem (digital elevation model)
+        17: str,  # timezone
+        18: str,  # modification date yyyy-MM-dd
+    }
+    rename_cols = {
+        0: "geonameid",  # geonameid
+        1: "name",  # name
+        2: "asciiname",  # asciiname
+        3: "alternatenames",  # alternatenames
+        4: "latitude",  # latitude
+        5: "longitude",  # longitude
+        6: "feature_class",  # feature class
+        7: "feature_code",  # feature code
+        8: "country_code",  # country code
+        9: "cc2",  # cc2
+        10: "admin1_code",  # admin1 code
+        11: "admin2_code",  # admin2 code
+        12: "admin3_code",  # admin3 code
+        13: "admin4_code",  # admin4 code
+        14: "population",  # population
+        15: "elevation",  # elevation
+        16: "dem",  # dem (digital elevation model)
+        17: "timezone",  # timezone
+        18: "modification_date",  # modification date yyyy-MM-dd
+    }
+    data = pd.read_csv(
+        file,
+        sep="\t",
+        header=None,
+        quoting=3,
+        dtype=dtypes_dict,
+        na_values="",
+        keep_default_na=False,
+        error_bad_lines=False,
+    )
+    data.rename(columns=rename_cols, inplace=True)
+    return data
+
+
+def parse_title_wiki(title_wiki):
+    """
+    Parse Wikipedia title
+    
+    Parameters
+    ----------
+    title_wiki : str
+        wikipedia title
+    
+    Returns
+    -------
+    str
+        parsed wikipedia title
+    """
+    return re.sub(r"\(.*\)", "", str(title_wiki)).strip().lower()
+
+
+def _split(lst, n, complete_chunk_value):
+    """
+    Split a list into chunk of n-size.
+    
+    Parameters
+    ----------
+    lst : list
+        input list
+    n : int
+        chunk size
+    complete_chunk_value : object
+        if last chunk size not equal to n, this value is used to complete it
+    
+    Returns
+    -------
+    list
+        chunked list
+    """
+    chunks = [lst[i : i + n] for i in range(0, len(lst), n)]
+    if not chunks:
+        return chunks
+    if len(chunks[-1]) != n:
+        chunks[-1].extend([complete_chunk_value] * (n - len(chunks[-1])))
+    return np.array(chunks)
+
+
+class Chronometer:
+    def __init__(self):
+        self.__task_begin_timestamp = {}
+
+    def start(self, task_name):
+        """
+        Start a new task chronometer
+        
+        Parameters
+        ----------
+        task_name : str
+            task id
+        
+        Raises
+        ------
+        ValueError
+            if a running task already exists with that name
+        """
+        if task_name in self.__task_begin_timestamp:
+            raise ValueError(
+                "A running task exists with the name {0}!".format(task_name)
+            )
+        self.__task_begin_timestamp[task_name] = time.time()
+
+    def stop(self, task_name):
+        """
+        Stop and return the duration of the task
+        
+        Parameters
+        ----------
+        task_name : str
+            task id
+        
+        Returns
+        -------
+        float
+            duration of the task in seconds
+        
+        Raises
+        ------
+        ValueError
+            if no task exist with the id `task_name`
+        """
+        if not task_name in self.__task_begin_timestamp:
+            raise ValueError("The {0} task does not exist!".format(task_name))
+        
+        duration = time.time() - self.__task_begin_timestamp[task_name]
+        del self.__task_begin_timestamp[task_name]
+
+        return duration
+
+
+from keras.callbacks import Callback
+import time
+
+class EpochTimer(Callback):
+    def __init__(self,log_filename):
+        self.epoch = 0
+        self.timer = time.time()
+        self.output = open(log_filename,'w')
+        self.output.write("{0},{1}\n".format("Epoch","Execution Time"))
+        self.output.flush()
+
+    def on_epoch_begin(self,epoch, logs={}):
+        self.timer = time.time()
+
+    def on_epoch_end(self, epoch, logs=None):
+        end_time = time.time() - self.timer
+        self.output.write("{0},{1}\n".format(self.epoch,end_time))
+        self.output.flush()
+        self.epoch += 1 
+
+if __name__ == "__main__":
+    chrono = Chronometer()
+    chrono.start("test")
+    chrono.start("test2")
+    time.sleep(3)
+    print(chrono.stop("test"))
+    time.sleep(3)
+    print(chrono.stop("test2"))
diff --git a/lib/__init__.py b/lib/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git a/lib/custom_layer.py b/lib/custom_layer.py
new file mode 100644
index 0000000000000000000000000000000000000000..72045734ef4951b1f0e4585d27575ed0aa3a7228
--- /dev/null
+++ b/lib/custom_layer.py
@@ -0,0 +1,31 @@
+# This Layer implementation comes from the toponym-resolution repository by Ana Bárbara Cardoso (see the class docstring below for the full reference).
+
+import keras
+from keras import backend as K
+from keras.engine.topology import Layer
+
+
+
+class Pentanh(Layer):
+    """
+    Implementation for the "Penalized Tanh" activation function presented in :
+        Xu, Bing, Ruitong Huang, et Mu Li. « Revise saturated activation functions ». arXiv preprint arXiv:1602.05980, 2016.
+    
+    Code Author: Ana Bárbara Cardoso https://github.com/barbarainacioc/toponym-resolution/blob/master/system/nn_model.py
+    """
+
+    def __init__(self, **kwargs):
+        super(Pentanh, self).__init__(**kwargs)
+        self.supports_masking = True
+        self.__name__ = 'pentanh'
+
+    def call(self, inputs):
+        return K.switch(K.greater(inputs,0), K.tanh(inputs), 0.25 * K.tanh(inputs))
+
+    def get_config(self):
+        return super(Pentanh, self).get_config()
+
+    def compute_output_shape(self, input_shape):
+        return input_shape
+
+keras.utils.generic_utils.get_custom_objects().update({'pentanh': Pentanh()})
\ No newline at end of file
diff --git a/lib/data_generator.py b/lib/data_generator.py
new file mode 100644
index 0000000000000000000000000000000000000000..7eae387e651be140976239112403b9b1a47e855c
--- /dev/null
+++ b/lib/data_generator.py
@@ -0,0 +1,347 @@
+import os
+from gzip import GzipFile
+
+import keras
+from keras.utils import to_categorical
+import numpy as np
+import pandas as pd
+
+from .utils_geo import zero_one_encoding
+
+from helpers import parse_title_wiki,read_geonames
+from gensim.models.keyedvectors import KeyedVectors
+
+from sklearn.preprocessing import LabelEncoder
+
+
+def wc_l(filename,gzip=True):
+    lc = 0
+    if not gzip:
+        f = open(filename)
+    if gzip:
+        f = GzipFile(filename)
+    while f.readline():
+        lc += 1 
+    f.close()       
+    return lc
+
+class SamplingProbabilities:
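+    """Return 1/(number of times an item has been requested so far), used as a
+    sampling weight so that frequently drawn context toponyms are down-weighted."""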
+    def __init__(self):
+        self.count = {}
+    
+    def get_probs(self,item):
+        if not item in self.count:
+            self.count[item] = 0
+        self.count[item]+=1
+        return 1/self.count[item]
+    def __call__(self,a):
+        return self.get_probs(a)
+        
+
+class DataSource(object):
+    def __init__(self,name,input_filename):
+        self.name = name
+        assert os.path.exists(input_filename)
+        self.input_filename = input_filename
+        self.len = 0
+
+        self.is_there_healpix = False
+
+    def __next__(self):
+        raise NotImplementedError()
+
+    def __iter__(self):
+        return self
+    
+    def __len__(self):
+        return self.len
+    
+    def __reset__(self):
+        raise NotImplementedError()
+
+    def isOver(self):
+        raise NotImplementedError()
+
+class Adjacency(DataSource):
+    def __init__(self,filename,geonames_filename,sampling=3,len_=None,gzip=True):
+        DataSource.__init__(self,"Adjacency SRC",filename)
+
+        assert os.path.exists(geonames_filename)
+        self.geonames_data_dict = {row.geonameid:row.name for row in read_geonames(geonames_filename).itertuples()}
+        
+        self.gzip = gzip
+        if not self.gzip:
+            self.data_src = open(self.input_filename,'rb')
+        else:
+            self.data_src = GzipFile(self.input_filename,'rb')
+
+        if len_:
+            self.len = len_*sampling
+        else:
+            self.len = wc_l(filename,gzip=gzip)
+        
+        self.data_src.readline() # header line
+
+        self.sampling = sampling
+        if self.sampling:
+            self.probs_storage = SamplingProbabilities() 
+
+        self.topo = None
+        self.context_topo_context = []
+        self.curr_probs = None
+        self.lat, self.lon = None, None
+
+
+        self.i = 0
+        self.is_over = False
+    
+    def __next__(self):
+        if  self.i >= len(self.context_topo_context):
+            line = self.data_src.readline()
+            if not line:
+                self.is_over = True
+                raise StopIteration
+            line = line.decode("utf-8").rstrip("\n")
+            _,geonameid, adjacent_geoname_id,latitude,longitude = tuple(line.split(","))
+
+            self.topo = int(geonameid)
+            self.context_topo_context = [int(x) for x in adjacent_geoname_id.split("|")]
+            if self.sampling:
+                # normalise the inverse-frequency weights and pass them as probabilities (keyword p=)
+                self.curr_probs = np.array([self.probs_storage(x) for x in self.context_topo_context],dtype=float)
+                self.curr_probs /= self.curr_probs.sum()
+                self.context_topo_context = np.random.choice(self.context_topo_context,self.sampling,p=self.curr_probs)
+            self.lat, self.lon = float(latitude),float(longitude)
+
+            self.i = 0
+        
+        self.i += 1
+        return (self.geonames_data_dict[self.topo],
+        self.geonames_data_dict[self.context_topo_context[self.i-1]],
+        self.lat,self.lon)
+
+    def __reset__(self):
+        if not self.gzip:
+            self.data_src = open(self.input_filename,'rb')
+        else:
+            self.data_src = GzipFile(self.input_filename,'rb')
+
+        self.data_src.readline() # header line
+        self.is_over = False
+
+    def isOver(self):
+        return self.is_over
+
+
+class Inclusion(DataSource):
+    def __init__(self, geonames_filename,hierarchy_filename,mask_ids=None):
+        super().__init__("Inclusion SRC",hierarchy_filename)
+        assert os.path.exists(geonames_filename)
+        self.geonames_data_dict = {row.geonameid:(row.name,row.latitude,row.longitude) for row in read_geonames(geonames_filename).itertuples()}
+        
+        self.data_src = pd.read_csv(self.input_filename,
+            sep="\t",
+            header=None,
+            names="parentId,childId,type".split(",")
+        ).fillna("")
+        
+        if mask_ids:
+            self.data_src = self.data_src[self.data_src.childId.isin(mask_ids)]
+        self.data_src= self.data_src[self.data_src.childId.isin(self.geonames_data_dict)]
+        self.data_src= self.data_src[self.data_src.parentId.isin(self.geonames_data_dict)]
+
+        self.data_src = self.data_src["childId parentId".split()].values.tolist()
+        self.len = len(self.data_src)
+
+        self.i = 0
+
+        self.is_over = False
+
+    def __next__(self):
+        if self.i+1 >= self.len:
+            self.eof = True
+            raise StopIteration
+        else:
+            self.i += 1
+            tup_ = tuple(self.data_src[self.i-1])
+            return (self.geonames_data_dict[tup_[0]][0],
+            self.geonames_data_dict[tup_[1]][0],
+            self.geonames_data_dict[tup_[0]][2],
+            self.geonames_data_dict[tup_[0]][1])
+
+    def __reset__(self):
+        self.i = 0
+        self.is_over = False
+    
+    def isOver(self):
+        return (self.i == self.len)
+    
+
+
+
+class CoOccurrences(DataSource):
+    def __init__(self, filename, label_encoder,sampling=3,resolution = 256,use_healpix=False):
+        super().__init__("Co-Occurrence data",filename)
+        self.is_there_healpix = use_healpix
+        # LOAD DATA
+
+        self.data_src = pd.read_csv(filename,sep="\t")
+
+        # CHECK IF THE HEALPIX RESOLUTION DATA APPEARS IN THE DATA
+        if not "healpix_{0}".format(resolution) in self.data_src.columns:
+            raise KeyError("healpix_{0} column does not exists ! ".format(resolution))
+        
+        # PARSE TOPONYMS
+        self.data_src["title"] = self.data_src.title.apply(parse_title_wiki)
+        try:
+            self.data_src["interlinks"] = self.data_src.interlinks.apply(parse_title_wiki)
+        except:
+            pass
+
+        # LOOP parameter
+        self.sampling = sampling
+        if self.sampling:
+            self.probs_storage = SamplingProbabilities()
+            
+        # LOOP INDICES
+        self.i = 0
+        self.j = 0
+        self.is_over = False
+        self.len = len(self.data_src)*self.sampling
+
+        
+        # BUFFER VARIABLE
+        self.topo = None
+        self.context_topo_context = []
+        self.curr_probs = None
+        self.lat, self.lon = None, None
+
+
+        self.resolution = resolution
+        self.classes = self.data_src["healpix_{0}".format(self.resolution)].unique().tolist()
+
+        self.class_encoder = label_encoder
+        self.class_encoder.fit(self.classes)
+
+        self.healpix = None
+
+    def __next__(self):
+        if self.isOver() or self.i*self.sampling == self.len:
+            self.is_over = True
+            raise StopIteration 
+
+        if  self.j >= len(self.context_topo_context):
+            line = self.data_src.iloc[self.i]
+            
+            self.topo = line.title
+            self.context_topo_context = [x for x in str(line.interlinks).split("|")]
+            if self.sampling:
+                # normalise the inverse-frequency weights and pass them as probabilities (keyword p=)
+                self.curr_probs = np.array([self.probs_storage(x) for x in self.context_topo_context],dtype=float)
+                self.curr_probs /= self.curr_probs.sum()
+                self.context_topo_context = np.random.choice(self.context_topo_context,self.sampling,p=self.curr_probs)
+            self.lat, self.lon = line.latitude,line.longitude
+            
+            self.healpix = line["healpix_{0}".format(self.resolution)]
+            
+            self.i += 1
+            self.j = 0
+        
+        self.j += 1
+        return (self.topo,
+        self.context_topo_context[self.j-1],
+        self.lat,self.lon,self.class_encoder.transform([self.healpix])[0])
+
+    def __reset__(self):
+        self.i = 0
+        self.is_over = False
+    
+    def isOver(self):
+        return self.is_over
+    
+class DataGenerator(keras.utils.Sequence):
+    'Generates data for Keras'
+    def __init__(self,data_sources,ngram_index,class_encoder,**kwargs):
+        'Initialization'
+        self.data_src = data_sources
+        self.ngram_index = ngram_index
+
+        self.batch_size = kwargs.get("batch_size",1000)
+        self.only_healpix = kwargs.get("only_healpix",False)
+        
+        self.len = sum([len(d) for d in self.data_src])
+        self.datasrc_index = 0
+
+        self.num_classes = class_encoder.get_num_classes()
+
+        self.is_there_healpix = self.data_src[self.datasrc_index].is_there_healpix
+    def __len__(self):
+        'Denotes the number of batches per epoch'
+        return int(np.floor(self.len / self.batch_size))
+
+    def return_(self,X,y,y2=None):
+        if self.is_there_healpix and self.only_healpix:
+            return [X[:,0],X[:,1]],y2
+
+        elif self.is_there_healpix:
+            return [X[:,0],X[:,1]],[y,y2]
+        else:
+            return [X[:,0],X[:,1]],y
+
+    def __getitem__(self, index):
+        'Generate one batch of data'
+        X = np.empty((self.batch_size,2,self.ngram_index.max_len),dtype=np.int32) # toponym
+        y = np.empty((self.batch_size,2),dtype=float) #lat lon coord
+
+        y2=None # For healpix
+        if self.is_there_healpix:
+            y2 = np.empty((self.batch_size,self.num_classes),dtype=float) # healpix class
+
+        # move to the next data source when the current one is exhausted
+        if self.data_src[self.datasrc_index].isOver():
+            self.datasrc_index += 1
+            # no data source left: return the batch as-is
+            if self.datasrc_index >= len(self.data_src):
+                return self.return_(X,y,y2)
+            self.is_there_healpix = self.data_src[self.datasrc_index].is_there_healpix
+        
+        for i in range(self.batch_size):
+            if self.data_src[self.datasrc_index].isOver():
+                return self.return_(X,y,y2)
+            try:
+                topo, topo_context, latitude, longitude, healpix_class = self.data_src[self.datasrc_index].__next__()
+            except StopIteration as e:
+                return self.return_(X,y,y2)
+            
+            X[i] = [ self.ngram_index.encode(topo),self.ngram_index.encode(topo_context)]
+            y[i] =  [*zero_one_encoding(longitude,latitude)]
+            if self.is_there_healpix:
+                y2[i] = to_categorical(healpix_class, num_classes=self.num_classes, dtype='int32'
+)
+
+            #y[i] = [longitude,latitude]
+        return self.return_(X,y,y2)
+
+    def on_epoch_end(self):
+        'Updates indexes after each epoch'
+        [d.__reset__() for d in self.data_src]
+        self.datasrc_index = 0
+
+
+    
+def load_embedding(model_fn,dim_vector=100):
+    model = KeyedVectors.load(model_fn)
+    N = len(model.wv.vocab)
+    M = np.zeros((N,dim_vector))
+    for i in range(N):
+        try:
+            M[i] = model.wv[str(i)]
+        except KeyError:
+            pass
+    return M
+
+if __name__ == "__main__":
+    # All adj nb of line :7955000-1
+    from lib.ngram_index import NgramIndex
+    from tqdm import tqdm
+    ng = NgramIndex.load("../data/embeddings/word2vec4gram/4gramWiki+geonames_index.json")
+    c= CoOccurrences("../data/wikipedia/cooccurrence_FR.txt_test.csv",sampling=3)
+    a = Adjacency("/home/jacques/sample_adjacency.txt",geonames_filename="../data/geonamesData/allCountries.txt",gzip=False,sampling=10)
+    i= Inclusion(geonames_filename="../data/geonamesData/allCountries.txt",hierarchy_filename="../data/geonamesData/hierarchy.txt")
+    d= DataGenerator([c,a,i],ng) 
+    for x in tqdm(range(len(d))):d[x]
diff --git a/lib/data_generatorv3.py b/lib/data_generatorv3.py
new file mode 100644
index 0000000000000000000000000000000000000000..9bd88db97206808f1bc10db7b5c9a3e182de194b
--- /dev/null
+++ b/lib/data_generatorv3.py
@@ -0,0 +1,354 @@
+import os
+from gzip import GzipFile
+
+import keras
+from keras.utils import to_categorical
+import numpy as np
+import pandas as pd
+
+from .utils_geo import zero_one_encoding
+
+from helpers import parse_title_wiki,read_geonames
+from gensim.models.keyedvectors import KeyedVectors
+
+from sklearn.preprocessing import LabelEncoder
+
+
+def wc_l(filename,gzip=True):
+    lc = 0
+    if not gzip:
+        f = open(filename)
+    if gzip:
+        f = GzipFile(filename)
+    while f.readline():
+        lc += 1 
+    f.close()       
+    return lc
+
+class SamplingProbabilities:
+    def __init__(self):
+        self.count = {}
+    
+    def get_probs(self,item):
+        if not item in self.count:
+            self.count[item] = 0
+        self.count[item]+=1
+        return 1/self.count[item]
+    def __call__(self,a):
+        return self.get_probs(a)
+        
+
+class DataSource(object):
+    def __init__(self,name,input_filename):
+        self.name = name
+        assert os.path.exists(input_filename)
+        self.input_filename = input_filename
+        self.len = 0
+
+        self.is_there_healpix = False
+
+    def __next__(self):
+        raise NotImplementedError()
+
+    def __iter__(self):
+        return self
+    
+    def __len__(self):
+        return self.len
+    
+    def __reset__(self):
+        raise NotImplementedError()
+
+    def isOver(self):
+        raise NotImplementedError()
+
+class Adjacency(DataSource):
+    def __init__(self,filename,geonames_filename,sampling=3,len_=None,gzip=True):
+        DataSource.__init__(self,"Adjacency SRC",filename)
+
+        assert os.path.exists(geonames_filename)
+        self.geonames_data_dict = {row.geonameid:row.name for row in read_geonames(geonames_filename).itertuples()}
+        
+        self.gzip = gzip
+        if not self.gzip:
+            self.data_src = open(self.input_filename,'rb')
+        else:
+            self.data_src = GzipFile(self.input_filename,'rb')
+
+        if len_:
+            self.len = len_*sampling
+        else:
+            self.len = wc_l(filename,gzip=gzip)
+        
+        self.data_src.readline() # header line
+
+        self.sampling = sampling
+        if self.sampling:
+            self.probs_storage = SamplingProbabilities() 
+
+        self.topo = None
+        self.context_topo_context = []
+        self.curr_probs = None
+        self.lat, self.lon = None, None
+
+
+        self.i = 0
+        self.is_over = False
+    
+    def __next__(self):
+        if  self.i >= len(self.context_topo_context):
+            line = self.data_src.readline()
+            if not line:
+                self.is_over = True
+                raise StopIteration
+            line = line.decode("utf-8").rstrip("\n")
+            _,geonameid, adjacent_geoname_id,latitude,longitude = tuple(line.split(","))
+
+            self.topo = int(geonameid)
+            self.context_topo_context = [int(x) for x in adjacent_geoname_id.split("|")]
+            if self.sampling:
+                # normalise the inverse-frequency weights and pass them as probabilities (keyword p=)
+                self.curr_probs = np.array([self.probs_storage(x) for x in self.context_topo_context],dtype=float)
+                self.curr_probs /= self.curr_probs.sum()
+                self.context_topo_context = np.random.choice(self.context_topo_context,self.sampling,p=self.curr_probs)
+            self.lat, self.lon = float(latitude),float(longitude)
+
+            self.i = 0
+        
+        self.i += 1
+        return (self.geonames_data_dict[self.topo],
+        self.geonames_data_dict[self.context_topo_context[self.i-1]],
+        self.lat,self.lon)
+
+    def __reset__(self):
+        if not self.gzip:
+            self.data_src = open(self.input_filename,'rb')
+        else:
+            self.data_src = GzipFile(self.input_filename,'rb')
+
+        self.data_src.readline() # header line
+        self.is_over = False
+
+    def isOver(self):
+        return self.is_over
+
+
+class Inclusion(DataSource):
+    def __init__(self, geonames_filename,hierarchy_filename,mask_ids=None):
+        super().__init__("Inclusion SRC",hierarchy_filename)
+        assert os.path.exists(geonames_filename)
+        self.geonames_data_dict = {row.geonameid:(row.name,row.latitude,row.longitude) for row in read_geonames(geonames_filename).itertuples()}
+        
+        self.data_src = pd.read_csv(self.input_filename,
+            sep="\t",
+            header=None,
+            names="parentId,childId,type".split(",")
+        ).fillna("")
+        
+        if mask_ids:
+            self.data_src = self.data_src[self.data_src.childId.isin(mask_ids)]
+        self.data_src= self.data_src[self.data_src.childId.isin(self.geonames_data_dict)]
+        self.data_src= self.data_src[self.data_src.parentId.isin(self.geonames_data_dict)]
+
+        self.data_src = self.data_src["childId parentId".split()].values.tolist()
+        self.len = len(self.data_src)
+
+        self.i = 0
+
+        self.is_over = False
+
+    def __next__(self):
+        if self.i+1 >= self.len:
+            self.eof = True
+            raise StopIteration
+        else:
+            self.i += 1
+            tup_ = tuple(self.data_src[self.i-1])
+            return (self.geonames_data_dict[tup_[0]][0],
+            self.geonames_data_dict[tup_[1]][0],
+            self.geonames_data_dict[tup_[0]][2],
+            self.geonames_data_dict[tup_[0]][1])
+
+    def __reset__(self):
+        self.i = 0
+        self.is_over = False
+    
+    def isOver(self):
+        return (self.i == self.len)
+    
+
+
+
+class CoOccurrences(DataSource):
+    def __init__(self, filename, label_encoder,sampling=3,resolution = 256,use_healpix=False):
+        super().__init__("Co-Occurrence data",filename)
+        self.is_there_healpix = use_healpix
+        # LOAD DATA
+
+        self.data_src = pd.read_csv(filename,sep="\t")
+
+        # CHECK IF THE HEALPIX RESOLUTION DATA APPEARS IN THE DATA
+        if not "healpix_{0}".format(resolution) in self.data_src.columns:
+            raise KeyError("healpix_{0} column does not exists ! ".format(resolution))
+        
+        # PARSE TOPONYMS
+        self.data_src["title"] = self.data_src.title.apply(parse_title_wiki)
+        try:
+            self.data_src["interlinks"] = self.data_src.interlinks.apply(parse_title_wiki)
+        except:
+            pass
+
+        # LOOP parameter
+        self.sampling = sampling
+        if self.sampling:
+            self.probs_storage = SamplingProbabilities()
+            
+        # LOOP INDICES
+        self.i = 0
+        self.j = 0
+        self.is_over = False
+        self.len = len(self.data_src)*(self.sampling-1)
+
+        
+        # BUFFER VARIABLE
+        self.topo = None
+        self.context_topo_context = []
+        self.curr_probs = None
+        self.lat, self.lon = None, None
+
+
+        self.resolution = resolution
+        self.classes = self.data_src["healpix_{0}".format(self.resolution)].unique().tolist()
+
+        self.class_encoder = label_encoder
+        self.class_encoder.fit(self.classes)
+
+        self.healpix = None
+
+    def __next__(self):
+        if self.isOver() or self.i*self.sampling == self.len:
+            self.is_over = True
+            raise StopIteration 
+
+        if  self.j >= len(self.context_topo_context):
+            line = self.data_src.iloc[self.i]
+            
+            self.topo = line.title
+            self.context_topo_context = [x for x in str(line.interlinks).split("|")]
+            N = len(self.context_topo_context)
+            triple = []
+            for i in range(N):
+                if i+1 == N:
+                    break
+                triple.append((self.context_topo_context[i],self.context_topo_context[i+1]))
+                
+
+            self.context_topo_context = triple
+            np.random.shuffle(self.context_topo_context)
+            self.lat, self.lon = line.latitude,line.longitude
+            
+            self.healpix = line["healpix_{0}".format(self.resolution)]
+            
+            self.i += 1
+            self.j = 0
+        
+        self.j += 1
+        return (self.topo,
+        *self.context_topo_context[self.j-1],
+        self.lat,self.lon,self.class_encoder.transform([self.healpix])[0])
+
+    def __reset__(self):
+        self.i = 0
+        self.is_over = False
+    
+    def isOver(self):
+        return self.is_over
+    
+class DataGenerator(keras.utils.Sequence):
+    'Generates data for Keras'
+    def __init__(self,data_sources,ngram_index,class_encoder,**kwargs):
+        'Initialization'
+        self.data_src = data_sources
+        self.ngram_index = ngram_index
+
+        self.batch_size = kwargs.get("batch_size",1000)
+        self.only_healpix = kwargs.get("only_healpix",False)
+        
+        self.len = sum([len(d) for d in self.data_src])
+        self.datasrc_index = 0
+
+        self.num_classes = class_encoder.get_num_classes()
+
+        self.is_there_healpix = self.data_src[self.datasrc_index].is_there_healpix
+    def __len__(self):
+        'Denotes the number of batches per epoch'
+        return int(np.floor(self.len / self.batch_size))
+
+    def return_(self,X,y,y2=None):
+        if self.is_there_healpix and self.only_healpix:
+            return [X[:,0],X[:,1],X[:,2]],y2
+
+        elif self.is_there_healpix:
+            return [X[:,0],X[:,1],X[:,2]],[y,y2]
+        else:
+            return [X[:,0],X[:,1],X[:,2]],y
+
+    def __getitem__(self, index):
+        'Generate one batch of data'
+        X = np.empty((self.batch_size,3,self.ngram_index.max_len),dtype=np.int32) # toponym
+        y = np.empty((self.batch_size,2),dtype=float) #lat lon coord
+
+        y2=None # For healpix
+        if self.is_there_healpix:
+            y2 = np.empty((self.batch_size,self.num_classes),dtype=float) # healpix class
+
+        # move to the next data source when the current one is exhausted
+        if self.data_src[self.datasrc_index].isOver():
+            self.datasrc_index += 1
+            # no data source left: return the batch as-is
+            if self.datasrc_index >= len(self.data_src):
+                return self.return_(X,y,y2)
+            self.is_there_healpix = self.data_src[self.datasrc_index].is_there_healpix
+        
+        for i in range(self.batch_size):
+            if self.data_src[self.datasrc_index].isOver():
+                return self.return_(X,y,y2)
+            try:
+                topo, topo_context_1,topo_context_2, latitude, longitude, healpix_class = self.data_src[self.datasrc_index].__next__()
+            except StopIteration as e:
+                return self.return_(X,y,y2)
+            
+            X[i] = [ self.ngram_index.encode(topo),self.ngram_index.encode(topo_context_1),self.ngram_index.encode(topo_context_2)]
+            y[i] =  [*zero_one_encoding(longitude,latitude)]
+            if self.is_there_healpix:
+                y2[i] = to_categorical(healpix_class, num_classes=self.num_classes, dtype='int32'
+)
+
+            #y[i] = [longitude,latitude]
+        return self.return_(X,y,y2)
+
+    def on_epoch_end(self):
+        'Updates indexes after each epoch'
+        [d.__reset__() for d in self.data_src]
+        self.datasrc_index = 0
+
+
+    
+def load_embedding(model_fn,dim_vector=100):
+    model = KeyedVectors.load(model_fn)
+    N = len(model.wv.vocab)
+    M = np.zeros((N,dim_vector))
+    for i in range(N):
+        try:
+            M[i] = model.wv[str(i)]
+        except KeyError:
+            pass
+    return M
+
+if __name__ == "__main__":
+    # All adj nb of line :7955000-1
+    from lib.ngram_index import NgramIndex
+    from tqdm import tqdm
+    ng = NgramIndex.load("../data/embeddings/word2vec4gram/4gramWiki+geonames_index.json")
+    c= CoOccurrences("../data/wikipedia/cooccurrence_FR.txt_test.csv",sampling=3)
+    a = Adjacency("/home/jacques/sample_adjacency.txt",geonames_filename="../data/geonamesData/allCountries.txt",gzip=False,sampling=10)
+    i= Inclusion(geonames_filename="../data/geonamesData/allCountries.txt",hierarchy_filename="../data/geonamesData/hierarchy.txt")
+    d= DataGenerator([c,a,i],ng) 
+    for x in tqdm(range(len(d))): d[x]
diff --git a/lib/datageneratorv4.py b/lib/datageneratorv4.py
new file mode 100644
index 0000000000000000000000000000000000000000..e451035084fdbdb8f6a0ea0d5c799a35a63cdfd5
--- /dev/null
+++ b/lib/datageneratorv4.py
@@ -0,0 +1,65 @@
+import os
+from gzip import GzipFile
+
+import keras
+from keras.utils import to_categorical
+import numpy as np
+import pandas as pd
+
+from lib.utils_geo import zero_one_encoding
+
+from helpers import parse_title_wiki,read_geonames
+from gensim.models.keyedvectors import KeyedVectors
+
+from sklearn.preprocessing import LabelEncoder
+
+
+class DataGenerator(keras.utils.Sequence):
+    'Generates data for Keras'
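+    # Usage sketch (assumed inputs): `pairs_of_toponyms` is a pandas DataFrame with
+    # columns toponym, toponym_context, latitude and longitude, and `encoder` is a
+    # fitted NgramIndex/WordIndex. For example:
+    #   gen = DataGenerator(pairs_df, NgramIndex.load("index.json"), batch_size=64)
+    #   model.fit_generator(gen, epochs=10)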
+    def __init__(self, pairs_of_toponyms,encoder, batch_size=32, shuffle=True):
+        'Initialization'
+        self.data= pairs_of_toponyms
+        self.encoder = encoder
+        self.dim = self.encoder.max_len
+        self.shuffle = shuffle
+    
+        self.batch_size = batch_size
+        self.on_epoch_end()
+
+    def __len__(self):
+        'Denotes the number of batches per epoch'
+        return int(np.floor(len(self.data) / self.batch_size))
+
+    def __getitem__(self, index):
+        'Generate one batch of data'
+        # Generate indexes of the batch
+        indexes = self.indexes[index*self.batch_size:(index+1)*self.batch_size]
+
+        # Generate data
+        return self.__data_generation(indexes)
+
+
+    def on_epoch_end(self):
+        'Updates indexes after each epoch'
+        self.indexes = np.arange(len(self.data))
+        if self.shuffle == True:
+            np.random.shuffle(self.indexes)
+
+    def __data_generation(self, list_ids):
+        'Generates data containing batch_size samples' # X : (n_samples, *dim, n_channels)
+        # Initialization
+        X1 = np.empty((self.batch_size, self.dim))
+        X2 = np.empty((self.batch_size, self.dim))
+        y = np.zeros((self.batch_size,2), dtype=float)
+
+        # Generate data
+        for ix,i in enumerate(list_ids):
+            # Store sample
+            X1[ix,] = self.encoder.encode(self.data.toponym.iloc[i])
+            X2[ix,] = self.encoder.encode(self.data.toponym_context.iloc[i])
+            # Store class
+            y[ix,] = list(zero_one_encoding(self.data.longitude.iloc[i],self.data.latitude.iloc[i]))
+
+        return [X1,X2],y
\ No newline at end of file
diff --git a/lib/geocoder/__init__.py b/lib/geocoder/__init__.py
new file mode 100644
index 0000000000000000000000000000000000000000..e69de29bb2d1d6434b8b29ae775ad8c2e48c5391
diff --git a/lib/geocoder/bert_geocoder.py b/lib/geocoder/bert_geocoder.py
new file mode 100644
index 0000000000000000000000000000000000000000..037ddad3bdeee4d647b56fbf97c925ea4c5f8418
--- /dev/null
+++ b/lib/geocoder/bert_geocoder.py
@@ -0,0 +1,67 @@
+import os
+import sys
+import time
+import random
+import argparse
+import datetime
+
+import pandas as pd
+import numpy as np
+
+import tensorflow as tf
+import torch
+
+from tqdm import tqdm
+tqdm.pandas()
+
+from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
+from keras.preprocessing.sequence import pad_sequences
+from transformers import BertTokenizer
+from transformers import BertForSequenceClassification, AdamW, BertConfig
+from transformers import get_linear_schedule_with_warmup
+
+from ..torch_generator import SentenceDataset
+from ..utils_geo import latlon2healpix,healpix2latlon
+
+import pickle
+
+# If there's a GPU available...
+if torch.cuda.is_available():    
+
+    # Tell PyTorch to use the GPU.    
+    device = torch.device("cuda")
+    print('There are %d GPU(s) available.' % torch.cuda.device_count())
+    print('We will use the GPU:', torch.cuda.get_device_name(0))
+# If not...
+else:
+    print('No GPU available, using the CPU instead.')
+    device = torch.device("cpu")
+
+
+class BertGeocoder():
+    def __init__(self,bert_model_dir,label_healpix_file,healpix_nside=128,batch_size=1):
+        self.bert_model = BertForSequenceClassification.from_pretrained(bert_model_dir)
+        self.bert_model.to(device)
+        self.tokenizer = BertTokenizer.from_pretrained(bert_model_dir,truncation=True)
+        self.label_healpix = {v:k for k, v in pickle.load(open(label_healpix_file,'rb')).items()}
+
+        self.nside = healpix_nside
+
+        self.batch_size = batch_size
+
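+    # Usage sketch (paths are placeholders for your own fine-tuned artifacts):
+    #   geocoder = BertGeocoder("outputs/bert_model/", "outputs/label_healpix.pkl")
+    #   geocoder.geocode(["Paris"], ["Montpellier"])   # -> array of [lat, lon] rows
+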
+    def geocode(self,toponyms, context_toponyms):
+        data = SentenceDataset(pd.DataFrame([[toponyms[i] + " " + context_toponyms[i],0] for i in range(len(toponyms))],columns=["sentence","label"]),self.tokenizer,batch_size=len(toponyms),shuffle=False)
+        dataloader = DataLoader(data,  batch_size=self.batch_size)
+        results = []
+        for step, batch in enumerate(dataloader):
+            b_input_ids = batch[0].to(device)
+            b_input_mask = batch[1].to(device)
+            with torch.no_grad():
+                outputs = self.bert_model(b_input_ids, 
+                                token_type_ids=None, 
+                                attention_mask=b_input_mask)
+            results.append(outputs[0].detach().cpu().numpy())
+        label = np.argmax(np.concatenate(results),axis=1)
+        healpix_label = [self.label_healpix[l] for l in label]
+        lat,lon = healpix2latlon(healpix_label,self.nside)
+        return np.concatenate((lat.reshape(-1,1),lon.reshape(-1,1)),axis=1)
\ No newline at end of file
diff --git a/lib/geocoder/heuristics.py b/lib/geocoder/heuristics.py
new file mode 100644
index 0000000000000000000000000000000000000000..82d7110a5b6066031a15044c1a62d3ef9391f1c9
--- /dev/null
+++ b/lib/geocoder/heuristics.py
@@ -0,0 +1,55 @@
+import pandas as pd
+import numpy as np
+
+from haversine import haversine_vector, Unit
+from sklearn.cluster import DBSCAN
+
+def heuristic_mean(geocoder,toponyms):
+    input_ = np.asarray([[t1,t2] for t2 in toponyms for t1 in toponyms if t2 != t1])
+    res_geocode = pd.DataFrame(input_,columns="t tc".split())
+    lons,lats = geocoder.get_coords(input_[:,0],input_[:,1])
+    res_geocode["lon"] = lons
+    res_geocode["lat"] = lats
+    results = {}
+    for tp in toponyms:
+        lat = res_geocode[res_geocode.t == tp].lat.mean()
+        lon = res_geocode[res_geocode.t == tp].lon.mean()
+        results[tp]={"lat":lat,"lon":lon}
+    return results
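+
+# Usage sketch (assumes a loaded geocoder exposing get_coords(), e.g. lib.geocoder.our_geocoder.Geocoder):
+#   results = heuristic_mean(geocoder, ["Paris", "Lyon", "Montpellier"])
+#   results["Paris"]   # -> {"lat": ..., "lon": ...}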
+
+def heuristic_no_context(geocoder,toponyms):
+    input_ = np.asarray([[t1,t1] for t2 in toponyms for t1 in toponyms if t2 != t1])
+    res_geocode = pd.DataFrame(input_,columns="t tc".split())
+    lons,lats = geocoder.get_coords(input_[:,0],input_[:,1])
+    res_geocode["lon"] = lons
+    res_geocode["lat"] = lats
+    results = {}
+    for tp in toponyms:
+        lat = res_geocode[res_geocode.t == tp].lat.mean()
+        lon = res_geocode[res_geocode.t == tp].lon.mean()
+        results[tp]={"lat":lat,"lon":lon}
+    return results
+
+def heuristic_cluster(geocoder,toponyms,eps=100):
+    results = {}
+    input_ = np.asarray([[t1,t2] for t2 in toponyms for t1 in toponyms if t2 != t1])
+    res_geocode = pd.DataFrame(input_,columns="t tc".split())
+    lons,lats = geocoder.get_coords(input_[:,0],input_[:,1])
+    res_geocode["lon"] = lons
+    res_geocode["lat"] = lats
+
+    # DBSCAN on the pairwise haversine distance matrix (eps is expressed in km)
+    clf = DBSCAN(eps=eps,metric="precomputed")
+    for t in toponyms:
+        tp_df = res_geocode[res_geocode.tc == t].copy()
+
+        coords = tp_df["lat lon".split()].values # haversine_vector expects (lat,lon) pairs
+        clf.fit(haversine_vector(coords,coords,unit=Unit.KILOMETERS,comb=True))
+
+        tp_df["cluster"] = clf.labels_
+        counts_ = dict(tp_df.cluster.value_counts())
+        max_cluster = max(counts_, key=counts_.get)
+        tp_df = tp_df[tp_df.cluster == max_cluster]
+        lat = tp_df.lat.median()
+        lon = tp_df.lon.median() #
+        results[t]={"lat":lat,"lon":lon}
+    return results
\ No newline at end of file
diff --git a/lib/geocoder/our_geocoder.py b/lib/geocoder/our_geocoder.py
new file mode 100644
index 0000000000000000000000000000000000000000..0345ea4d8c555086314277cdd8799ef80ba31bcd
--- /dev/null
+++ b/lib/geocoder/our_geocoder.py
@@ -0,0 +1,113 @@
+# NATIVE LIB
+import os
+
+# DATA LIB
+import numpy as np
+import pandas as pd
+
+# DL LIB
+import tensorflow as tf
+import keras.backend as K
+from keras.models import load_model
+from tensorflow.python.keras.backend import set_session
+from tensorflow.python.keras.models import load_model
+
+# CUSTOM LIB
+from lib.word_index import WordIndex
+from lib.ngram_index import NgramIndex
+from lib.utils_geo import haversine_tf_1circle
+
+
+import stanza
+import spacy
+os.environ['CUDA_VISIBLE_DEVICES'] = '-1'
+
+class Geocoder(object):
+    """
+    >>> geocoder = Geocoder("LSTM_FR.txt_20_4_0.002_None_A_I_C.h5","index_4gram_FR_backup.txt")
+    >>> lon,lat = geocoder.get_coord("Paris","New-York") # already converted to WGS84 degrees
+    >>> geocoder.plot_coord("Paris,New-York",lat,lon)
+
+    if you want an interactive map using leafletJS, set to True the `interactive_map` parameter of `Geocoder.plot_coord()`
+    """
+    def __init__(self,keras_model_fn,ngram_index_file,word_index=False):
+        self.keras_model = load_model(keras_model_fn,custom_objects={"loss":haversine_tf_1circle},compile=False)#custom_objects={"accuracy_at_k_lat":lat_accuracy(),"accuracy_at_k_lon":lon_accuracy()})
+        if not word_index:
+            self.ngram_encoder = NgramIndex.load(ngram_index_file)
+        else:
+            self.ngram_encoder = WordIndex.load(ngram_index_file)
+
+    def get_coord(self,toponym,context_toponym):
+        p = self.ngram_encoder.complete(self.ngram_encoder.encode(toponym),self.ngram_encoder.max_len)
+        c = self.ngram_encoder.complete(self.ngram_encoder.encode(context_toponym),self.ngram_encoder.max_len)
+        p = np.array(p)
+        c = np.array(c)       
+        coord = self.keras_model.predict([[p],[c]])
+        return self.wgs_coord(coord[0][0],coord[0][1])
+    
+    def get_coords(self,list_toponym,list_toponym_context):
+        p = [self.ngram_encoder.complete(self.ngram_encoder.encode(toponym),self.ngram_encoder.max_len) for toponym in list_toponym]
+        c = [self.ngram_encoder.complete(self.ngram_encoder.encode(toponym),self.ngram_encoder.max_len) for toponym in list_toponym_context]
+
+        p = np.array(p)
+        c = np.array(c)
+        
+        coords = self.keras_model.predict([p,c])
+        return self.wgs_coord(coords[:,0],coords[:,1]) #lon lat
+
+    def wgs_coord(self,lon,lat):
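+        # Map the model's [0,1] normalised outputs back to WGS84 degrees
+        # (inverse of lib.utils_geo.zero_one_encoding)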
+        return ((lon*360)-180),((lat*180)-90)
+    
+    def plot_coord(self,toponym,lat,lon,interactive_map=False,**kwargs):
+        if interactive_map:
+            import folium
+            import tempfile
+            import webbrowser
+            fp = tempfile.NamedTemporaryFile(delete=False)
+            m = folium.Map()
+            folium.Marker([lat, lon], popup=toponym).add_to(m)
+            m.save(fp.name)
+            webbrowser.open('file://' + fp.name)
+        else:
+            import matplotlib.pyplot as plt
+            import geopandas
+            fig, ax = plt.subplots(1,**kwargs)
+            world = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
+            world.plot(color='white', edgecolor='black',ax=ax)
+            ax.plot(lon,lat,marker='o', color='red', markersize=5)
+            plt.show()
+
+
+    
+class TextGeocoder():
+    def __init__(self,geocoder_model,ner_name,lang,heuristic_func,n_jobs=None):
+        self.geocoder_model = geocoder_model
+        self.ner_name = ner_name
+        self.ner_model = None
+        if self.ner_name == "stanza":
+            self.ner_model = stanza.Pipeline(lang)
+        else:
+            self.ner_model = spacy.load(lang)
+        self.heuristic_func = heuristic_func
+    def __call__(self,a):
+        pass
+
+    def extract_geo_entities(self,text):
+        if self.ner_name == "stanza":
+            entities = [{"text":en.text,"type":en.type,"start":en.start_char,"end":en.end_char} for en in self.ner_model(text).entities  if en.type == "LOC"]
+        else:
+            entities = [{"text":en.text,"type":en.label_,"start":en.start_char,"end":en.end_char} for en in self.ner_model(text).ents if en.label_ in "LOC GPE".split()]
+        return entities
+
+    def geocode(self,entities):
+        df = pd.DataFrame(entities)
+        heuristic_results = self.heuristic_func(self.geocoder_model,df.text.values)
+        for e in range(len(entities)):
+            entities[e]["coord"] = heuristic_results[entities[e]["text"]]
+        return entities
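+
+    # Usage sketch (model/index paths and the spaCy model name are placeholders):
+    #   from lib.geocoder.heuristics import heuristic_mean
+    #   geocoder = Geocoder("FR_MODEL.h5", "FR_index.json")
+    #   tg = TextGeocoder(geocoder, "spacy", "fr_core_news_sm", heuristic_mean)
+    #   entities = tg.geocode(tg.extract_geo_entities("A trip from Paris to Montpellier"))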
+    
+    
diff --git a/lib/geocoder/svm_geocoder.py b/lib/geocoder/svm_geocoder.py
new file mode 100644
index 0000000000000000000000000000000000000000..2997b2966ccaec79b53849e2a4254f8625c7d873
--- /dev/null
+++ b/lib/geocoder/svm_geocoder.py
@@ -0,0 +1,36 @@
+import numpy as np
+
+from joblib import dump, load
+from tensorflow.keras.utils import to_categorical
+
+from lib.utils_geo import latlon2healpix
+from lib.ngram_index import NgramIndex
+
+
+def parse_bow(x,index):
+    return np.sum(to_categorical(x,num_classes=index.cpt+1),axis=0)
+
+def is_in(lat,lon,hp_predicted,hp_nside):
+    hp_truth = latlon2healpix(lat,lon,hp_nside)
+    return hp_truth == hp_predicted
+
+class SVMGeocoder(object):
+    
+    def __init__(self,model_fn,ngram_index_filename):
+        self.model = load(model_fn)
+        self.ng_index = NgramIndex.load(ngram_index_filename)
+    
+    def geocode(self,phrase1,phrase2):
+        if not phrase1 or not phrase2:
+            return None
+        vec = parse_bow(np.array(self.ng_index.encode(phrase1)),self.ng_index)+\
+            parse_bow(np.array(self.ng_index.encode(phrase2)),self.ng_index)
+        return self.model.predict([vec])[0]
+    
+    def geocode_multi(self,phrases1,phrases2):
+        vecs = np.array([ parse_bow(np.array(self.ng_index.encode(ph)),self.ng_index) for ph in phrases1 if ph])
+        vecs += np.array([ parse_bow(np.array(self.ng_index.encode(ph)),self.ng_index) for ph in phrases2 if ph])
+        return self.model.predict(vecs)
+
+if __name__ == "__main__":
+    hp = SVMGeocoder("SVMLINEAR_US_FR_AC.bin", "../../outputs/US_FR.txt_100_4_0.002__A_C_index")
+    print(hp.geocode("paris","montpellier"))
diff --git a/lib/metrics.py b/lib/metrics.py
new file mode 100644
index 0000000000000000000000000000000000000000..e82c54809aa2a6bece60cd74875140d3719c1ea6
--- /dev/null
+++ b/lib/metrics.py
@@ -0,0 +1,37 @@
+import tensorflow as tf
+
+def lat_accuracy(LAT_TOL =1/180.):
+    def accuracy_at_k_lat(y_true, y_pred):
+        """
+        Metric used to measure the accuracy of the coordinate prediction. Compared to the standard
+        accuracy metric, a tolerance threshold is added because it is (almost) impossible for a
+        neural network to predict the exact coordinate.
+
+        Parameters
+        ----------
+        y_true : tf.Tensor
+            truth data
+        y_pred : tf.Tensor
+            predicted output
+        """
+        diff = tf.abs(y_true - y_pred)
+        fit = tf.dtypes.cast(tf.less(diff,LAT_TOL),tf.int64)
+        return tf.reduce_sum(fit)/tf.size(y_pred,out_type=tf.dtypes.int64)
+    return accuracy_at_k_lat
+
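+# Usage sketch (assumed model): the lat_accuracy/lon_accuracy factories return Keras metric functions, e.g.
+#   model.compile(optimizer="adam", loss="mean_squared_error",
+#                 metrics=[lat_accuracy(), lon_accuracy()])
+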
+def lon_accuracy(LON_TOL=1/360.):
+    def accuracy_at_k_lon(y_true, y_pred):
+        """
+        Metric used to measure the accuracy of the coordinate prediction. Compared to the standard
+        accuracy metric, a tolerance threshold is added because it is (almost) impossible for a
+        neural network to predict the exact coordinate.
+
+        Parameters
+        ----------
+        y_true : tf.Tensor
+            truth data
+        y_pred : tf.Tensor
+            predicted output
+        """
+        diff = tf.abs(y_true - y_pred)
+        fit = tf.dtypes.cast(tf.less(diff,LON_TOL),tf.int64)
+        return tf.reduce_sum(fit)/tf.size(y_pred,out_type=tf.dtypes.int64)
+    return accuracy_at_k_lon
\ No newline at end of file
diff --git a/lib/ngram_index.py b/lib/ngram_index.py
new file mode 100644
index 0000000000000000000000000000000000000000..5f862206aa2f3c852d09a1a2438906b57541b5d6
--- /dev/null
+++ b/lib/ngram_index.py
@@ -0,0 +1,229 @@
+import json
+
+import numpy as np
+
+from ngram import NGram
+from transformers import BertTokenizer
+
+# Machine learning 
+from gensim.models import Word2Vec
+
+
+class bertTokenizer:
+    def __init__(self):
+        self.tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased',do_lower_case=False)
+    
+    def split(self,string):
+        return self.tokenizer.tokenize(string)
+
+class NgramIndex():
+    """
+    Class used for encoding words in ngram representation
+    """
+    def __init__(self,n,bert_tokenization=False,loaded = False):
+        """
+        Constructor
+        
+        Parameters
+        ----------
+        n : int
+            ngram size
+        """
+        self.ngram_gen = NGram(N=n)
+        self.empty_char = "$"
+        if bert_tokenization:
+            self.ngram_gen = bertTokenizer()
+            self.empty_char = "#"
+
+        self.size = n
+        self.ngram_index = {"":0}
+        self.index_ngram = {0:""}
+        self.cpt = 0
+        self.max_len = 0
+
+        self.loaded = loaded
+
+        
+    def split_and_add(self,word):
+        """
+        Split a word into multiple ngrams and add each of them to the index
+        
+        Parameters
+        ----------
+        word : str
+            a word
+        """
+        ngrams = str(word).lower().replace(" ",self.empty_char)
+        ngrams = list(self.ngram_gen.split(ngrams))
+        [self.add(ngram) for ngram in ngrams]
+        self.max_len = max(self.max_len,len(ngrams))
+
+    def add(self,ngram):
+        """
+        Add a ngram to the index
+        
+        Parameters
+        ----------
+        ngram : str
+            ngram
+        """
+        if not ngram in self.ngram_index:
+            self.cpt+=1
+            self.ngram_index[ngram]=self.cpt
+            self.index_ngram[self.cpt]=ngram
+        
+
+    def encode(self,word):
+        """
+        Return a ngram representation of a word
+        
+        Parameters
+        ----------
+        word : str
+            a word
+        
+        Returns
+        -------
+        list of int
+            list of ngram index
+        """
+        ngrams = str(word).lower().replace(" ",self.empty_char)
+        ngrams = list(self.ngram_gen.split(ngrams))
+        ngrams = [ng for ng in ngrams if ng.count(self.empty_char)<2]
+        if not self.loaded:
+            [self.add(ng) for ng in ngrams if not ng in self.ngram_index]
+        return self.complete([self.ngram_index[ng] for ng in ngrams if ng in self.ngram_index],self.max_len)
+
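+    # Example (sketch): the input string is lowercased, split into character n-grams and
+    # mapped to integer ids, then padded with 0 up to self.max_len:
+    #   index = NgramIndex(4)
+    #   index.split_and_add("Paris")   # registers the n-grams and updates max_len
+    #   index.encode("Paris")          # -> list of int ids of length index.max_len
+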
+    def complete(self,ngram_encoding,MAX_LEN,filling_item=0):
+        """
+        Pad an ngram-encoded word with empty ngrams so that all inputs share the same length, as required by the neural network.
+        
+        Parameters
+        ----------
+        ngram_encoding : list of int
+            first encoding of a word
+        MAX_LEN : int
+            desired length of the encoding
+        filling_item : int, optional
+            ngram index you wish to use, by default 0
+        
+        Returns
+        -------
+        list of int
+            list of ngram index
+        """
+        if self.loaded and len(ngram_encoding) >=MAX_LEN:
+            return ngram_encoding[:MAX_LEN]
+        assert len(ngram_encoding) <= MAX_LEN
+        diff = MAX_LEN - len(ngram_encoding)
+        ngram_encoding.extend([filling_item]*diff)  
+        return ngram_encoding
+    
+    def get_embedding_layer(self,texts,dim=100,**kwargs):
+        """
+        Return an embedding matrix for each ngram using encoded texts. Using gensim.Word2vec model.
+        
+        Parameters
+        ----------
+        texts : list of [list of int]
+            list of encoded word
+        dim : int, optional
+            embedding dimension, by default 100
+        
+        Returns
+        -------
+        np.array
+            embedding matrix
+        """
+        model = Word2Vec([[str(w) for w in t] for t in texts], size=dim,window=5, min_count=1, workers=4,**kwargs)
+        N = len(self.ngram_index)
+        embedding_matrix = np.zeros((N,dim))
+        for i in range(N):
+            if str(i) in model.wv:
+                embedding_matrix[i] = model.wv[str(i)]
+        return embedding_matrix
+
+    def get_glove_embedding_layer(self,texts,dim=100,**kwargs):
+        """
+        Return an embedding matrix for each ngram using encoded texts. Uses the GloVe model from the glove package.
+        
+        Parameters
+        ----------
+        texts : list of [list of int]
+            list of encoded word
+        dim : int, optional
+            embedding dimension, by default 100
+        
+        Returns
+        -------
+        np.array
+            embedding matrix
+        """
+        from glove import Corpus, Glove
+        corpus = Corpus()
+        corpus.fit([[str(w) for w in t] for t in texts], window=10)
+        glove = Glove(no_components=dim, learning_rate=0.05)
+        glove.fit(corpus.matrix, epochs=30, no_threads=4, verbose=True)
+        glove.add_dictionary(corpus.dictionary)
+        N = len(self.ngram_index)
+        embedding_matrix = np.zeros((N,dim))
+        for i in range(N):
+            if str(i) in glove.dictionary:
+                embedding_matrix[i] = glove.word_vectors[glove.dictionary[str(i)]]
+        return embedding_matrix
+        
+
+    def save(self,fn):
+        """
+
+        Save the NgramIndex
+        
+        Parameters
+        ----------
+        fn : str
+            output filename
+        """
+        data = {
+            "ngram_size": self.size,
+            "ngram_index": self.ngram_index,
+            "cpt_state": self.cpt,
+            "max_len_state": self.max_len
+        }
+        json.dump(data,open(fn,'w'))
+
+    @staticmethod
+    def load(fn):
+        """
+        
+        Load a NgramIndex state from a file.
+        
+        Parameters
+        ----------
+        fn : str
+            input filename
+        
+        Returns
+        -------
+        NgramIndex
+            ngram index
+        
+        Raises
+        ------
+        KeyError
+            raised if a required field does not appear in the input file
+        """
+        try:
+            data = json.load(open(fn))
+        except json.JSONDecodeError:
+            raise ValueError("Data file must be a valid JSON file")
+        for key in ["ngram_size","ngram_index","cpt_state","max_len_state"]:
+            if not key in data:
+                raise KeyError("{0} field cannot be found in given file".format(key))
+        new_obj = NgramIndex(data["ngram_size"],loaded=True)
+        new_obj.ngram_index = data["ngram_index"]
+        new_obj.index_ngram = {v:k for k,v in new_obj.ngram_index.items()}
+        new_obj.cpt = data["cpt_state"]
+        new_obj.max_len = data["max_len_state"]
+        return new_obj
+
diff --git a/lib/run.py b/lib/run.py
new file mode 100644
index 0000000000000000000000000000000000000000..f2a7b79bc6cb5534b748326ee0db155057f24c35
--- /dev/null
+++ b/lib/run.py
@@ -0,0 +1,223 @@
+
+import subprocess
+import time
+from itertools import product
+
+import numpy as np
+
+class Chronometer:
+    """
+    Used to measure the execution time of a block of code.
+    >>> import time
+    >>> chrono = Chronometer()
+    >>> chrono.start("task1")
+    >>> time.sleep(1)
+    >>> duration = chrono.stop("task1")
+    >>> print(duration) #Should display '1'
+
+    """
+    def __init__(self):
+        self.__task_begin_timestamp = {}
+
+    def start(self, task_name):
+        """
+        Start a new task chronometer
+        
+        Parameters
+        ----------
+        task_name : str
+            task id
+        
+        Raises
+        ------
+        ValueError
+            if a running task already exists with that name
+        """
+        if task_name in self.__task_begin_timestamp:
+            raise ValueError(
+                "A running task exists with the name {0}!".format(task_name)
+            )
+        self.__task_begin_timestamp[task_name] = time.time()
+
+    def stop(self, task_name):
+        """
+        Stop and return the duration of the task
+        
+        Parameters
+        ----------
+        task_name : str
+            task id
+        
+        Returns
+        -------
+        float
+            duration of the task in seconds
+        
+        Raises
+        ------
+        ValueError
+            if no task exist with the id `task_name`
+        """
+        if not task_name in self.__task_begin_timestamp:
+            raise ValueError("The {0} task does not exist!".format(task_name))
+        
+        duration = time.time() - self.__task_begin_timestamp[task_name]
+        del self.__task_begin_timestamp[task_name]
+
+        return duration
+
+class Run(object):
+    """
+    Define a task to execute. A task wraps a command line and is defined by two entities:
+     * base_command : the runnable command
+     * kwargs : parameters associated with the command
+
+    Parameter formatting follows the `argparse` convention:
+    * "-i" or "--input" : optional parameter
+    * "input" : required parameter
+
+    >>> task1 = Run("task1","echo",text="hello world")
+    >>> task1.run()
+
+    For optional (dashed) parameters, pass them through a dict:
+
+    >>> task1 = Run("task1","echo",**{"text":"hello world","-e":"args"})
+    >>> task1.run()
+
+    To save the output, indicate an output filename when the task is run :
+    >>> task1.run("output_file.txt")
+    """
+    def __init__(self,task_name,base_command,**kwargs):
+        """
+        Constructor
+        
+        Parameters
+        ----------
+        task_name : str
+            task identifier
+        base_command : str
+            base command to run
+        **kwargs : dict
+            parameters
+        """
+        self.chrono = Chronometer()
+        self.task_name = task_name
+        self.base_command = base_command
+
+        self.run_args = kwargs
+        
+
+    def get_command(self):
+        """
+        Return the shell command build on the task attributes (basic command and parameters)
+        
+        Returns
+        -------
+        str
+            command
+        """
+        command = self.base_command
+        for key,value in self.run_args.items():
+            if "-" in key:
+                command = command + " {0} {1}".format(key,value)
+            else:
+                command = command + " {0}".format(value)
+        return command
+
+    def add_parameter(self, key, value):
+        """
+        Add a parameter to the task
+        
+        Parameters
+        ----------
+        key : str
+            key
+        value : object
+            value
+        """ 
+        self.run_args[key] = value
+
+    def run(self,log_filename = None):
+        """
+        Run the task
+        
+        Parameters
+        ----------
+        log_filename : str, optional
+            log filename, by default None
+        """
+        self.chrono.start(self.task_name)
+
+        out_proc = subprocess.PIPE
+        if log_filename:
+            out_proc = open(log_filename,'a')
+        process = subprocess.Popen(self.get_command().split(),stdout=out_proc)
+        _, _ = process.communicate() # We don't care of the output (if so, we use the log_filename argument)
+
+        duration = self.chrono.stop(self.task_name)
+        print("RUN {0} finished in {1}seconds OR {2}minutes OR {3}hours".format(\
+            self.task_name,duration,duration/60,(duration/60)/60
+            ))
+    def __repr__(self):
+        return "; ".join(["{0}={1}".format(k,v) for k,v in self.run_args.items()])
+
+
+class GridSearchModel:
+    """
+    Define a set of command executions generated from the combinations of parameter values.
+
+    For the parameter format, please check the `Run` documentation.
+    
+    >>> grid = GridSearchModel("ls",test=["-l", "-h","-lh"])
+    >>> grid.run()
+    """
+    def __init__(self,command_base,**kwargs):
+        """
+        Constructor
+        
+        Parameters
+        ----------
+        command_base : str
+            command base
+        **kwargs : dict
+            parameters
+        """
+        self.parameters = kwargs
+        self.cpt = 0
+        self.number_of_combination = np.prod([len(v) for _,v in self.parameters.items()])
+
+        # Build one Run per combination of parameter values (cartesian product)
+        self.tasks = []
+        keys = list(self.parameters.keys())
+        for cpt, combination in enumerate(product(*[self.parameters[key] for key in keys])):
+            new_task = Run(str(cpt), command_base)
+            for key, val in zip(keys, combination):
+                new_task.add_parameter(key, val)
+            self.tasks.append(new_task)
+
+    def __repr__(self):
+        return "\n".join([ t.__repr__() for t in self.tasks])           
+        
+    def run(self,log_filename=None):
+        """
+        Run all the tasks defined
+        
+        Parameters
+        ----------
+        log_filename : str, optional
+            log filename, by default None
+        """
+        for i, task in enumerate(self.tasks):
+            task.run(log_filename=None if log_filename is None else log_filename+"_"+str(i))
+
+    
+if __name__ == "__main__":
+    g = GridSearchModel("ls",test=["-l", "-h","-lh"],rel=["-i"])
+    print(g)
+
+    #g.run()
diff --git a/lib/torch_generator.py b/lib/torch_generator.py
new file mode 100644
index 0000000000000000000000000000000000000000..3613086bf6d5262572eead956baeb18dac71ff1a
--- /dev/null
+++ b/lib/torch_generator.py
@@ -0,0 +1,50 @@
+import torch
+from keras.preprocessing.sequence import pad_sequences
+import numpy as np 
+
+def chunks(lst, n):
+    """Yield successive n-sized chunks from lst."""
+    for i in range(0, len(lst), n):
+        yield lst[i:i + n]
+
+class SentenceDataset(torch.utils.data.Dataset):
+    'Characterizes a dataset for PyTorch'
+    def __init__(self, dataframe,tokenizer,max_len=96,batch_size=32,shuffle=True):
+        'Initialization'
+        self.sentences = dataframe["sentence"].values
+        self.labels = dataframe["label"].values
+        self.tokenizer = tokenizer
+        self.max_len = max_len
+
+        self.batch_size = batch_size
+        a = np.arange(len(dataframe))
+        if shuffle:
+            np.random.shuffle(a)
+        self.batch_tokenization = list(chunks(a,batch_size))
+        assert(len(self.batch_tokenization[0])==batch_size)
+        self.current_batch_id = 0
+        self.boundaries = (0,0+batch_size)
+        self.current_batch_tokenized = self.tokenize(self.current_batch_id)
+
+    def tokenize(self,batch_index):
+        X = [ self.tokenizer.encode(self.sentences[x],add_special_tokens = True,max_length=512,truncation=True) for x in self.batch_tokenization[batch_index]]# Tokenizer
+        X = pad_sequences(X, maxlen=self.max_len, dtype="long", value=0, truncating="post", padding="post").tolist()
+        return X
+
+    def __len__(self):
+        'Denotes the total number of samples'
+        return len(self.sentences)
+    def __getitem__(self, index):
+        'Generates one sample of data'
+        if index >= self.boundaries[1] or index < self.boundaries[0]:
+            self.current_batch_id = index//self.batch_size
+            self.current_batch_tokenized = self.tokenize(self.current_batch_id)
+            self.boundaries= (self.current_batch_id*self.batch_size,self.current_batch_id*self.batch_size + self.batch_size)
+        # Load data and get label
+        
+        index_in_batch = index-self.boundaries[0]
+        X = self.current_batch_tokenized[index_in_batch]
+        M = [int(token_id > 0) for token_id in X] # attention mask
+        y = self.labels[index]
+        return torch.tensor(np.array(X)),torch.tensor(np.array(M)),torch.tensor(np.array(y))
\ No newline at end of file
diff --git a/lib/utils.py b/lib/utils.py
new file mode 100644
index 0000000000000000000000000000000000000000..82531b38dea65bd305a04e37d2f07d3fae0fbd77
--- /dev/null
+++ b/lib/utils.py
@@ -0,0 +1,220 @@
+# Basic import 
+import math
+import argparse
+import os
+import json
+import time
+import datetime
+
+# Data Structure
+import numpy as np
+import geopandas as gpd
+from shapely.geometry import Point,box
+
+# NLP 
+from nltk.tokenize import word_tokenize
+from ngram import NGram
+
+# Visualisation and parallelisation
+from tqdm import tqdm
+
+class LabelEncoder():
+    def __init__(self):
+        self.dict_ = {}
+        self.cpt = 0
+    
+    def fit_transform(self,list_element):
+        self.fit(list_element)
+        return self.transform(list_element)
+
+    def fit(self,list_element):
+        for l in list_element:
+            if not l in self.dict_:
+                self.dict_[l] = self.cpt
+                self.cpt+=1
+    def transform(self,list_element):
+        return [self.dict_[l] for l in list_element]
+    
+    def get_num_classes(self):
+        return self.cpt
+
+class TokenizerCustom():
+    def __init__(self,vocab):
+        self.word_index = {vocab[i]:i for i in range(len(vocab))}
+        self.index_word = {i:vocab[i] for i in range(len(vocab))}
+        self.N = len(self.index_word)
+    def texts_to_sequences(self,listText):
+        seqs = []
+        for text in listText:
+            seqs.append([self.word_index[word] for word in word_tokenize(text) if word in self.word_index])
+        return seqs
+
+
+class ConfigurationReader(object):
+    def __init__(self,configuration_file):
+        if not os.path.exists(configuration_file):
+            raise FileNotFoundError("{0} file could not be found!".format(configuration_file))
+
+        self.configuration = json.load(open(configuration_file))
+
+        self.__argparser_desc = ("" if not "description" in self.configuration else self.configuration["description"])
+        self.parser = argparse.ArgumentParser(description=self.__argparser_desc)
+
+        self.parse_conf()
+    
+    def parse_conf(self):
+        if not "args" in self.configuration:
+            raise argparse.ArgumentError("","No args given in the configuration file")
+        
+        for dict_args in self.configuration["args"]:
+            if not isinstance(dict_args,dict):
+                raise ValueError("Args must be dictionnary")
+
+            short_command = dict_args.get("short",None)
+            long_command = dict_args.get("long",None)
+            
+            if not short_command and not long_command:
+                raise ValueError("No command name was given !") 
+            
+            add_func_dict_= {}
+            if "help" in dict_args:
+                add_func_dict_["help"]= dict_args["help"]
+            if "default" in dict_args:
+                add_func_dict_["default"]= dict_args["default"]
+            if "action" in dict_args:
+                add_func_dict_["action"]= dict_args["action"]
+            if "type" in dict_args:
+                add_func_dict_["type"]= eval(dict_args["type"])
+            if "choices" in dict_args:
+                add_func_dict_["choices"]= dict_args["choices"]
+
+            if not (short_command and long_command):
+                command = (short_command if not long_command else long_command)
+                self.parser.add_argument(command,**add_func_dict_)
+
+            elif long_command and short_command:
+                self.parser.add_argument(short_command,long_command,**add_func_dict_)
+    
+    def parse_args(self,input_=None):
+        if not input_:
+            return self.parser.parse_args()
+        return self.parser.parse_args(input_)
+
+
+class MetaDataSerializer(object):
+    def __init__(self,
+    model_name,
+    dataset_name,
+    rel_code,
+    cooc_sample_size,
+    adj_iteration,
+    ngram_size,
+    tolerance_value,
+    epochs,
+    embedding_dim,
+    word2vec_iter_nb,
+    index_fn,
+    keras_model_fn,
+    train_test_history_fn):
+        self.model_name = model_name
+        self.dataset_name = dataset_name
+        self.rel_code = rel_code
+        self.cooc_sample_size = cooc_sample_size
+        self.adj_iteration = adj_iteration
+        self.ngram_size = ngram_size
+        self.tolerance_value = tolerance_value
+        self.epochs = epochs
+        self.embedding_dim = embedding_dim
+        self.word2vec_iter_nb = word2vec_iter_nb
+        self.index_fn = index_fn
+        self.keras_model_fn = keras_model_fn
+        self.train_test_history_fn = train_test_history_fn
+    
+    def save(self,fn):
+        json.dump({
+        "model_name":self.model_name,
+        "dataset_name" : self.dataset_name,
+        "rel_code" : self.rel_code,
+        "cooc_sample_size" : self.cooc_sample_size,
+        "adj_iteration" : self.adj_iteration,
+        "ngram_size" : self.ngram_size,
+        "tolerance_value" : self.tolerance_value,
+        "epochs" : self.epochs,
+        "embedding_dim" : self.embedding_dim,
+        "word2vec_iter_nb" : self.word2vec_iter_nb,
+        "index_fn" : self.index_fn,
+        "keras_model_fn" : self.keras_model_fn,
+        "train_test_history_fn" : self.train_test_history_fn
+        },open(fn,'w'))
+
+
+class Chronometer:
+    def __init__(self):
+        self.__task_begin_timestamp = {}
+
+    def start(self, task_name):
+        """
+        Start a new task chronometer
+        
+        Parameters
+        ----------
+        task_name : str
+            task id
+        
+        Raises
+        ------
+        ValueError
+            if a running task already exists with that name
+        """
+        if task_name in self.__task_begin_timestamp:
+            raise ValueError(
+                "A running task exists with the name {0}!".format(task_name)
+            )
+        self.__task_begin_timestamp[task_name] = time.time()
+
+    def stop(self, task_name):
+        """
+        Stop and return the duration of the task
+        
+        Parameters
+        ----------
+        task_name : str
+            task id
+        
+        Returns
+        -------
+        float
+            duration of the task in seconds
+        
+        Raises
+        ------
+        ValueError
+            if no task exist with the id `task_name`
+        """
+        if not task_name in self.__task_begin_timestamp:
+            raise ValueError("The {0} task does not exist!".format(task_name))
+        
+        duration = time.time() - self.__task_begin_timestamp[task_name]
+        del self.__task_begin_timestamp[task_name]
+
+        return duration
+
+
+# Function to calculate the accuracy of our predictions vs labels
+def flat_accuracy(preds, labels):
+    pred_flat = np.argmax(preds, axis=1).flatten()
+    labels_flat = labels.flatten()
+    return np.sum(pred_flat == labels_flat) / len(labels_flat)
+
+
+
+def format_time(elapsed):
+    '''
+    Takes a time in seconds and returns a string hh:mm:ss
+    '''
+    # Round to the nearest second.
+    elapsed_rounded = int(round((elapsed)))
+    
+    # Format as hh:mm:ss
+    return str(datetime.timedelta(seconds=elapsed_rounded))
\ No newline at end of file
diff --git a/lib/utils_geo.py b/lib/utils_geo.py
new file mode 100644
index 0000000000000000000000000000000000000000..00b8a1a2fba238cd5b01ad3710a0dd8b4f2aec8d
--- /dev/null
+++ b/lib/utils_geo.py
@@ -0,0 +1,444 @@
+
+import geopandas as gpd
+import numpy as np
+import pandas as pd
+
+from shapely.geometry import Point,box 
+import healpy
+
+from tqdm import tqdm
+
+
+from numba import njit
+from helpers import read_geonames
+from joblib import Parallel,delayed
+
+import tensorflow as tf
+import keras.backend as K
+
+def tf_deg2rad(deg):
+    pi_on_180 = 0.017453292519943295
+    return deg * pi_on_180
+
+# convert lat and lon to a healpix code encoding a region, with a given resolution
+def latlon2healpix( lat , lon , res ):
+    lat = np.radians(lat)
+    lon = np.radians(lon)
+    xs = ( np.cos(lat) * np.cos(lon) ) #
+    ys = ( np.cos(lat) * np.sin(lon) ) # -> Sphere coordinates: https://vvvv.org/blog/polar-spherical-and-geographic-coordinates
+    zs = ( np.sin(lat) ) # 
+    return healpy.vec2pix( int(res) , xs , ys , zs )
+
+def healpix2latlon( code , nside ):
+    xs, ys, zs = healpy.pix2vec( nside , code )
+    lat =  np.arctan2(zs, np.sqrt(xs * xs + ys * ys)) * 180.0 / np.pi 
+    lon =  np.arctan2(ys, xs) * 180.0 / np.pi 
+    return lat, lon
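+
+# Example (sketch): round trip through a healpix cell; healpix2latlon returns the cell
+# centre, so the values only come back approximately (nside=128 is an illustrative choice):
+#   code = latlon2healpix(48.85, 2.35, 128)
+#   lat, lon = healpix2latlon(code, 128)   # close to 48.85, 2.35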
+
+def haversine_tf(y_true,y_pred):
+    """
+    Return the geodesic (haversine) distance, in kilometres, between true and predicted coordinates.
+
+    Parameters
+    ----------
+    y_true : tf.Tensor
+        true coordinates, one (lon,lat) pair per row, in degrees
+    y_pred : tf.Tensor
+        predicted coordinates, one (lon,lat) pair per row, in degrees
+
+    Returns
+    -------
+    tf.Tensor
+        distance value(s) in kilometres
+    """
+    lon1, lat1, lon2, lat2 = map(tf_deg2rad, [y_true[:,0], y_true[:,1], y_pred[:,0], y_pred[:,1]])
+    dlon = lon2 - lon1
+    dlat = lat2 - lat1
+    a = K.sin(dlat/2.0)**2 + K.cos(lat1) * K.cos(lat2) * K.sin(dlon/2.0)**2
+    
+    return 6367 * 2 * tf.math.asin(K.sqrt(a))
+
+def haversine_tf_1circle(y_true,y_pred):
+    """
+    Same as `haversine_tf` but computed on a sphere of radius 1 (no scaling by the Earth radius).
+
+    Parameters
+    ----------
+    y_true : tf.Tensor
+        true coordinates, one (lon,lat) pair per row
+    y_pred : tf.Tensor
+        predicted coordinates, one (lon,lat) pair per row
+
+    Returns
+    -------
+    tf.Tensor
+        distance value(s) on the unit sphere
+    """
+    lon1, lat1, lon2, lat2 = map(tf_deg2rad, [y_true[:,0], y_true[:,1], y_pred[:,0], y_pred[:,1]])
+    dlon = lon2 - lon1
+    dlat = lat2 - lat1
+    a = K.sin(dlat/2.0)**2 + K.cos(lat1) * K.cos(lat2) * K.sin(dlon/2.0)**2
+    
+    return 1 * 2 * tf.math.asin(K.sqrt(a))
+
+def to_wgs84_lat(lat):
+    return ((lat*180)-90)
+def to_wgs84_lon(lon):
+    return ((lon*360)-180)
+
+def to_wgs84(x):
+    lon=to_wgs84_lon(x[:,0])
+    lat=to_wgs84_lat(x[:,1])
+    return tf.stack([lon,lat],axis=1)
+
+def accuracy_k(k=100):#km
+    def compute_metric(y_true,y_pred):
+        return K.less_equal(haversine_tf(to_wgs84(y_true),to_wgs84(y_pred)),k) 
+    return compute_metric
+
+def haversine_pd(lon1, lat1, lon2, lat2):
+    """
+    Return the geodesic distance between (lon1,lat1) and (lon2,lat2) coordinates
+    
+    Parameters
+    ----------
+    lon1 : numeric or array-like (pandas Dataframe works also)
+        longitude of first coordinates
+    lat1 :  numeric or array-like (pandas Dataframe works also)
+        latitude of first coordinates
+    lon2 :  numeric or array-like (pandas Dataframe works also)
+        longitude of second coordinates
+    lat2 :  numeric or array-like (pandas Dataframe works also)
+        latitude of second coordinates
+    
+    Returns
+    -------
+    float or array-like
+        distance(s) value(s)
+    """
+    lon1, lat1, lon2, lat2 = map(np.radians, [lon1, lat1, lon2, lat2])
+    dlon = lon2 - lon1
+    dlat = lat2 - lat1
+    a = np.sin(dlat/2.0)**2 + np.cos(lat1) * np.cos(lat2) * np.sin(dlon/2.0)**2
+    
+    return 6367 * 2 * np.arcsin(np.sqrt(a))
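+
+# Example (sketch): Paris (lon=2.35, lat=48.85) to Montpellier (lon=3.88, lat=43.61):
+#   haversine_pd(2.35, 48.85, 3.88, 43.61)   # -> roughly 594 km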
+
+
+def get_adjacent(ids,lon1, lat1, lon2, lat2,threshold):
+    dist_ = haversine_pd(lon1, lat1, lon2, lat2)
+    return ids[dist_<threshold]
+
+def get_geonames_adjacency(geoname_data,threshold):
+    return Parallel(n_jobs=-1,backend="multiprocessing")(delayed(get_adjacent)(geoname_data.geonameid.values,
+    geoname_data.longitude,
+    geoname_data.latitude,
+    row.longitude,
+    row.latitude,
+    threshold) for ix,row in tqdm(geoname_data.iterrows(),total=len(geoname_data)))
+
+
+def generate_couple(object_list):
+    """
+    Return a list of randomly generated couples (pairs) from an object list.
+    
+    Parameters
+    ----------
+    object_list : list
+        object list
+    
+    Returns
+    -------
+    list
+        list of coupled object
+    """
+    couples = []
+    lst = np.arange(len(object_list))
+    for _ in range(len(object_list)):
+        if len(lst) == 1:
+            break
+        idx = np.random.choice(np.arange(len(lst)))
+        idx2 = np.random.choice(np.arange(len(lst)))
+        while idx2 == idx:
+            idx2 = np.random.choice(np.arange(len(lst)))
+        couples.append([object_list[lst[idx]],object_list[lst[idx2]]])
+        lst = np.delete(lst,idx)
+    return couples
+
+def _hash_couple(o1,o2):
+    """
+    Return a hash for two object ids.
+    
+    Parameters
+    ----------
+    o1 : str or int
+        id of the first object
+    o2 : str of int
+        id of the second object
+    
+    Returns
+    -------
+    str
+        hash
+    """
+    return "|".join(map(str,sorted([int(o1),int(o2)])))
+
+
+
+def zero_one_encoding(long,lat):
+    """
+    Encode coordinates (WGS84) between 0 and 1
+    
+    Parameters
+    ----------
+    long : float
+        longitude value
+    lat : float
+        latitude value
+    
+    Returns
+    -------
+    float,float
+        longitude, latitude
+    """
+    return ((long + 180.0 ) / 360.0), ((lat + 90.0 ) / 180.0) 
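+
+# Example (sketch): zero_one_encoding(2.35, 48.85) -> (~0.5065, ~0.7714)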
+
+class Cell(object):
+    """
+    A cell is a box placed in geographical space.
+    """
+    def __init__(self,upperleft_x,upperleft_y,bottomright_x,bottomright_y,x,y):
+        """
+        Constructor
+        
+        Parameters
+        ----------
+        upperleft_x : float
+            upperleft longitude
+        upperleft_y : float
+            upperleft latitude
+        bottomright_x : float
+            bottom right longitude
+        bottomright_y : float
+            bottom right latitude
+        x : int
+            cell x coordinates in the grid
+        y : int
+            cell y coordinates in the grid
+        """
+        self.upperleft_x,self.upperleft_y,self.bottomright_x,self.bottomright_y = upperleft_x,upperleft_y,bottomright_x,bottomright_y
+        self.box_ = box(self.upperleft_x,self.upperleft_y,self.bottomright_x,self.bottomright_y)
+        self.list_object={} # {id:Point(coord)}
+
+        self.x,self.y = x, y
+
+    def contains(self,lat,lon):
+        """
+        Return true if the cell contains a point at given coordinates
+        
+        Parameters
+        ----------
+        lat : float
+            latitude
+        lon : float
+            longitude
+        
+        Returns
+        -------
+        bool
+            true if contains
+        """ 
+        x,y = lon,lat
+        if x < self.upperleft_x or x > self.bottomright_x:
+            return False
+        if y < self.upperleft_y or y > self.bottomright_y:
+            return False
+        return True
+    
+    def add_object(self,id_,lat,lon):
+        """
+        Connect an object to the cell
+        
+        Parameters
+        ----------
+        id_ : int
+            id
+        lat : float
+            latitude
+        lon : float
+            longitude
+        """
+        self.list_object[id_] = Point(lon,lat)
+            
+    def __repr__(self):
+        return  "upperleft:{0}_{1}_;bottom_right:{2}_{3}".format(self.upperleft_x,self.upperleft_y,self.bottomright_x,self.bottomright_y)
+    
+        
+class Grid(object):
+    """
+    Define a grid 
+    
+    """
+    def __init__(self,upperleft_x,upperleft_y,bottomright_x,bottomright_y,cell_sub_div_index=[100,50]):
+        """
+        Constructor
+        
+        Parameters
+        ----------
+        upperleft_x : float
+            upperleft longitude
+        upperleft_y : float
+            upperleft latitude
+        bottomright_x : float
+            bottom right longitude
+        bottomright_y : float
+            bottom right latitude
+        cell_sub_div_index : list, optional
+            number of division in both latitude and longitude axis (longitude first), by default [100,50]
+        """
+        self.upperleft_x,self.upperleft_y,self.bottomright_x,self.bottomright_y = upperleft_x,upperleft_y,bottomright_x,bottomright_y
+        
+        self.x_r = abs(self.bottomright_x - self.upperleft_x)/cell_sub_div_index[0]
+        self.y_r = abs(self.upperleft_y - self.bottomright_y )/cell_sub_div_index[1]
+        
+        self.c_x_r = self.x_r/cell_sub_div_index[0] # Redivide
+        self.c_y_r = self.y_r/cell_sub_div_index[1]
+        
+        self.cells = []
+        self.inter_cells = []
+        for i in range(cell_sub_div_index[1]):
+            self.cells.append([])
+            for j in range(cell_sub_div_index[0]):
+                self.cells[-1].append(Cell(
+                    self.upperleft_x+j*self.x_r,
+                    self.upperleft_y+i*self.y_r,
+                    self.upperleft_x+((j+1)*self.x_r),
+                    self.upperleft_y+((i+1)*self.y_r),
+                    j,i)
+                )
+        dec_y = 0 
+        for i in range(cell_sub_div_index[1]):
+            self.inter_cells.append([])
+            dec_x = 0 
+            for j in range(cell_sub_div_index[0]):                 
+                self.inter_cells[-1].append(Cell(
+                    self.upperleft_x+(j*self.x_r)-self.c_x_r, # TOP
+                    self.upperleft_y+(i*self.y_r)-dec_y,
+                    self.upperleft_x+((j+1)*self.x_r)-self.c_x_r,#(self.u_pos*self.c_x_r),
+                    self.upperleft_y+((i+1)*self.y_r)+self.c_y_r,
+                    j,i)
+                )
+                self.inter_cells[-1].append(Cell(
+                    self.upperleft_x+(j*self.x_r)-self.c_x_r, # CENTER
+                    self.upperleft_y+(i*self.y_r)-self.c_y_r,
+                    self.upperleft_x+((j+1)*self.x_r)+self.c_x_r,
+                    self.upperleft_y+((i+1)*self.y_r)+self.c_y_r,
+                    j,i)
+                )
+                self.inter_cells[-1].append(Cell(
+                    self.upperleft_x+(j*self.x_r)+dec_x, # CENTER
+                    self.upperleft_y+(i*self.y_r)-self.c_y_r,
+                    self.upperleft_x+((j+1)*self.x_r)-self.c_x_r, #LEFT
+                    self.upperleft_y+((i+1)*self.y_r)+self.c_y_r,
+                    j,i)
+                )
+                dec_x = self.c_x_r
+            dec_y = self.c_y_r
+    
+    def fit_data(self,data=None):
+        """
+        
+        To avoid unnecessary checks when connecting an entity to one or multiple cells, we
+        filter out cells that do not appear in our geographic context (here, country surfaces).
+        
+        Parameters
+        ----------
+        data : GeoDataFrame, optional
+            geographic context; by default the geopandas `naturalearth_lowres` dataset
+        """
+        if data is None:
+            data = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
+        world = data
+        world["nn"] = 1
+        dissolved = world.dissolve(by="nn").iloc[0].geometry
+        new_cells= []
+        new_inter_cells=[]
+        for i in tqdm(range(len(self.cells))):
+            for j in range(len(self.cells[i])):
+                if dissolved.intersects(self.cells[i][j].box_):
+                    new_cells.append(self.cells[i][j])
+                    new_inter_cells.extend(self.inter_cells[i][j*3:(j+1)*3])
+                    
+        self.cells=new_cells
+        self.inter_cells = new_inter_cells
+        
+                    
+    def __add__(self,a): 
+        """
+        Add an object to the grid
+        
+        Parameters
+        ----------
+        a : tuple
+            (id, latitude, longitude)
+        """
+        for c1 in range(len(self.cells)):
+            if self.cells[c1].contains(a[1],a[2]):
+                self.cells[c1].add_object(*a)
+                
+        for c1 in range(len(self.inter_cells)):
+            if self.inter_cells[c1].contains(a[1],a[2]):
+                self.inter_cells[c1].add_object(*a)
+                
+    def get_adjacent_relationships(self,random_iteration=10):
+        """
+        Return a list of adjacent relationships founds in each cell.
+        
+        Parameters
+        ----------
+        random_iteration : int, optional
+            number of iteration for random selection of adjacency relationships, by default 10
+        
+        Returns
+        -------
+        list
+            adjacency relationships
+        """
+        relationships = set([])
+        for c1 in tqdm(range(len(self.cells))):
+            for _ in range(random_iteration):
+                for t in generate_couple(list(self.cells[c1].list_object.keys())):
+                    relationships.add(_hash_couple(t[0],t[1]))
+
+        for c1 in tqdm(range(len(self.inter_cells))):
+            for _ in range(random_iteration):
+                for t in generate_couple(list(self.inter_cells[c1].list_object.keys())):
+                    relationships.add(_hash_couple(t[0],t[1]))
+        return relationships
+    
+
+
+def get_adjacency_rels(geodataframe,bounds,subdiv_tuple,random_iter_adjacency):
+    g = Grid(*bounds,subdiv_tuple)
+    g.fit_data()
+    [g+(int(row.geonameid),row.latitude,row.longitude) for ix,row in tqdm(geodataframe["geonameid longitude latitude".split()].iterrows(),total=len(geodataframe))]
+    return [[int(i) for i in r.split("|")] for r in g.get_adjacent_relationships(random_iter_adjacency)]
+
+def get_geonames_inclusion_rel(geonames_data,geonames_hierarchy_data_fn):
+    geonames_hierarchy_data = pd.read_csv(geonames_hierarchy_data_fn,sep="\t",header=None,names="parentId,childId,type".split(",")).fillna("")
+    geonamesIDS = set(geonames_data.geonameid.values)
+    filter_mask = (geonames_hierarchy_data.childId.isin(geonamesIDS) & geonames_hierarchy_data.parentId.isin(geonamesIDS))
+    return (geonames_hierarchy_data[filter_mask]["childId parentId".split()].values.tolist())
+
+def get_bounds(geodataframe):
+    geodataframe["geometry"] = geodataframe["longitude latitude".split()].apply(lambda x: Point(x.longitude,x.latitude),axis=1)
+    geodataframe = gpd.GeoDataFrame(geodataframe)
+    geodataframe["i"]=1
+    return geodataframe.dissolve("i").bounds.values[0] # Required to get adjacency relationships
diff --git a/lib/word_index.py b/lib/word_index.py
new file mode 100644
index 0000000000000000000000000000000000000000..98e66f8fd383a636aa5b74ffed257943202e6b33
--- /dev/null
+++ b/lib/word_index.py
@@ -0,0 +1,180 @@
+import json
+
+import numpy as np
+
+from ngram import NGram
+
+# Machine learning 
+from gensim.models import Word2Vec
+
+class WordIndex():
+    """
+    Class used for encoding phrases as sequences of word indexes
+    """
+    def __init__(self,loaded = False):
+        """
+        Constructor
+        
+        Parameters
+        ----------
+        loaded : bool
+            if loaded from external file
+        """
+        self.ngram_index = {"":0}
+        self.index_ngram = {0:""}
+        self.cpt = 0
+        self.max_len = 0
+
+        self.loaded = loaded
+
+    def split_and_add(self,word):
+        """
+        Split a phrase into words and add each of them to the index
+        
+        Parameters
+        ----------
+        word : str
+            a word
+        """
+        grams = word.lower().split(" ")
+        [self.add(subword) for subword in grams ]
+        self.max_len = max(self.max_len,len(grams))
+
+    def add(self,subword):
+        """
+        Add a word to the index
+        
+        Parameters
+        ----------
+        subword : str
+            word to add
+        """
+        if not subword in self.ngram_index:
+            self.cpt+=1
+            self.ngram_index[subword]=self.cpt
+            self.index_ngram[self.cpt]=subword
+        
+
+    def encode(self,word):
+        """
+        Return a ngram representation of a word
+        
+        Parameters
+        ----------
+        word : str
+            a word
+        
+        Returns
+        -------
+        list of int
+            list of word index
+        """
+        subwords = [w.lower() for w in word.split(" ")]
+        if not self.loaded:
+            [self.add(ng) for ng in subwords if not ng in self.ngram_index]
+        if self.max_len < len(subwords):
+            self.max_len = max(self.max_len,len(subwords))
+        return self.complete([self.ngram_index[ng] for ng in subwords if ng in self.ngram_index],self.max_len)
+
+    def complete(self,ngram_encoding,MAX_LEN,filling_item=0):
+        """
+        Pad an encoded word with a filler index so it reaches the fixed length expected by the neural network.
+        
+        Parameters
+        ----------
+        ngram_encoding : list of int
+            initial encoding of a word
+        MAX_LEN : int
+            desired length of the encoding
+        filling_item : int, optional
+            index used for padding, by default 0
+        
+        Returns
+        -------
+        list of int
+            list of word indices of length MAX_LEN
+        """
+        if self.loaded and len(ngram_encoding) >=MAX_LEN:
+            return ngram_encoding[:MAX_LEN]
+        assert len(ngram_encoding) <= MAX_LEN
+        diff = MAX_LEN - len(ngram_encoding)
+        ngram_encoding.extend([filling_item]*diff)  
+        return ngram_encoding
+    
+    def get_embedding_layer(self,texts,dim=100,**kwargs):
+        """
+        Build an embedding matrix for every indexed word from the encoded texts, using a gensim Word2Vec model.
+        
+        Parameters
+        ----------
+        texts : list of [list of int]
+            list of encoded word
+        dim : int, optional
+            embedding dimension, by default 100
+        
+        Returns
+        -------
+        np.array
+            embedding matrix
+        """
+        model = Word2Vec([[str(w) for w in t] for t in texts], size=dim,window=5, min_count=1, workers=4,**kwargs)
+        N = len(self.ngram_index)
+        embedding_matrix = np.zeros((N,dim))
+        for i in range(N):
+            if str(i) in model.wv:
+                embedding_matrix[i] = model.wv[str(i)]
+        return embedding_matrix
+
+    def save(self,fn):
+        """
+
+        Save the WordIndex
+        
+        Parameters
+        ----------
+        fn : str
+            output filename
+        """
+        data = {
+            "word_index": self.ngram_index,
+            "cpt_state": self.cpt,
+            "max_len_state": self.max_len
+        }
+        json.dump(data,open(fn,'w'))
+
+    @staticmethod
+    def load(fn):
+        """
+        
+        Load a WordIndex state from a file.
+        
+        Parameters
+        ----------
+        fn : str
+            input filename
+        
+        Returns
+        -------
+        WordIndex
+            word index
+        
+        Raises
+        ------
+        KeyError
+            raised if a required field does not appear in the input file
+        """
+        try:
+            data = json.load(open(fn))
+        except json.JSONDecodeError:
+            print("Data file must be a JSON")
+        for key in ["word_index","cpt_state","max_len_state"]:
+            if not key in data:
+                raise KeyError("{0} field cannot be found in given file".format(key))
+        new_obj = WordIndex(loaded=True)
+        new_obj.ngram_index = data["ngram_index"]
+        new_obj.index_ngram = {v:k for k,v in new_obj.ngram_index.items()}
+        new_obj.cpt = data["cpt_state"]
+        new_obj.max_len = data["max_len_state"]
+        return new_obj
+
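+# Usage sketch (illustrative only; "Le Havre" is an arbitrary example phrase):
+#
+#   wi = WordIndex()
+#   wi.split_and_add("Le Havre")        # registers the words "le" and "havre"
+#   encoded = wi.encode("Le Havre")     # e.g. [1, 2], padded to wi.max_len
+#   padded = wi.complete(encoded, 5)    # [1, 2, 0, 0, 0]
+#   matrix = wi.get_embedding_layer([padded], dim=100)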
diff --git a/parser_config/toponym_combination_embedding.json b/parser_config/toponym_combination_embedding.json
new file mode 100644
index 0000000000000000000000000000000000000000..260d6ec129527b76e4f0ff2002a0dd83d4e80e01
--- /dev/null
+++ b/parser_config/toponym_combination_embedding.json
@@ -0,0 +1,20 @@
+{
+    "description": "Toponym Combination",
+    "args": [
+        { "short": "geoname_input", "help": "Filepath of the Geonames file you want to use." },
+        { "short": "geoname_hierachy_input", "help": "Filepath of the Geonames file you want to use." },
+        { "short": "-v", "long": "--verbose", "action": "store_true" },
+        { "short": "-i", "long": "--inclusion", "action": "store_true" },
+        { "short": "-a", "long": "--adjacency", "action": "store_true" },
+        { "short": "-w", "long": "--wikipedia-cooc", "action": "store_true" },
+        { "long": "--wikipedia-cooc-fn","help":"Cooccurrence data filename"},
+        { "long": "--cooc-sample-size", "type": "int", "default": 1 },
+        {"long": "--adjacency-iteration", "type":"int","default":1},
+        { "short": "-n", "long": "--ngram-size", "type": "int", "default": 4 },
+        { "long": "--ngram-word2vec-iter", "type": "int", "default": 50 },
+        { "short": "-t", "long": "--tolerance-value", "type": "float", "default": 0.002 },
+        { "short": "-e", "long": "--epochs", "type": "int", "default": 100 },
+        { "short": "-d", "long": "--dimension", "type": "int", "default": 256 },
+        {  "long": "--admin_code_1", "default": "None" }
+    ]
+}
\ No newline at end of file
diff --git a/parser_config/toponym_combination_embedding_v2.json b/parser_config/toponym_combination_embedding_v2.json
new file mode 100644
index 0000000000000000000000000000000000000000..345c1d7d49f767b0076b11574538c46952bad788
--- /dev/null
+++ b/parser_config/toponym_combination_embedding_v2.json
@@ -0,0 +1,20 @@
+{
+    "description": "Toponym Combination",
+    "args": [
+        { "short": "geoname_input", "help": "Filepath of the Geonames file you want to use." },
+        { "short": "geoname_hierachy_input", "help": "Filepath of the Geonames file you want to use." },
+        { "short": "-v", "long": "--verbose", "action": "store_true" },
+        { "short": "-i", "long": "--inclusion", "action": "store_true" },
+        { "short": "-a", "long": "--adjacency", "action": "store_true" },
+        { "short": "-w", "long": "--wikipedia-cooc", "action": "store_true" },
+        { "long": "--wikipedia-cooc-fn","help":"Cooccurrence data filename"},
+        { "long": "--cooc-sample-size", "type": "int", "default": 1 },
+        {"long": "--adjacency-iteration", "type":"int","default":1},
+        { "short": "-n", "long": "--ngram-size", "type": "int", "default": 4 },
+        { "long": "--ngram-word2vec-iter", "type": "int", "default": 50 },
+        { "short": "-t", "long": "--tolerance-value", "type": "float", "default": 100 },
+        { "short": "-e", "long": "--epochs", "type": "int", "default": 100 },
+        { "short": "-d", "long": "--dimension", "type": "int", "default": 256 },
+        {  "long": "--admin_code_1", "default": "None" }
+    ]
+}
\ No newline at end of file
diff --git a/parser_config/toponym_combination_embedding_v3.json b/parser_config/toponym_combination_embedding_v3.json
new file mode 100644
index 0000000000000000000000000000000000000000..507e9c5123fcf2df11ff8c7e5348f3a50ce32dd4
--- /dev/null
+++ b/parser_config/toponym_combination_embedding_v3.json
@@ -0,0 +1,20 @@
+{
+    "description": "Toponym Combination",
+    "args": [
+        { "short": "dataset_name", "help": "Filepath of the Geonames file you want to use." },
+        { "short": "geoname_inclusion", "help": "Filepath of the Geonames file you want to use." },
+        { "short": "geonames_adjacent", "help": "Filepath of the Geonames file you want to use." },
+        { "long": "wikipedia_cooc", "help": "Cooccurrence data filename" },
+        { "short": "-v", "long": "--verbose", "action": "store_true" },
+        { "short": "-i", "long": "--inclusion", "action": "store_true" },
+        { "short": "-a", "long": "--adjacency", "action": "store_true" },
+        { "short": "-w", "long": "--wikipedia", "action": "store_true" },
+        { "short": "-n", "long": "--ngram-size", "type": "int", "default": 4 },
+        { "long": "--ngram-word2vec-iter", "type": "int", "default": 50 },
+        { "short": "-t", "long": "--tolerance-value", "type": "float", "default": 100 },
+        { "short": "-e", "long": "--epochs", "type": "int", "default": 100 },
+        { "short": "-d", "long": "--dimension", "type": "int", "default": 256 },
+        { "short": "-l", "long": "--lstm-layer", "type": "int", "default": 2, "choices": [1, 2] },
+        { "long": "--tokenization-method", "type": "str", "default": "char-level", "choices": ["char-level", "word-level", "bert"] }
+    ]
+}
\ No newline at end of file
diff --git a/predict_toponym_coordinates.py b/predict_toponym_coordinates.py
new file mode 100644
index 0000000000000000000000000000000000000000..ec5d967b0a4f4c5f473147aa96bb66c7bc3737cf
--- /dev/null
+++ b/predict_toponym_coordinates.py
@@ -0,0 +1,133 @@
+from keras.models import load_model
+import os
+import tensorflow as tf
+import keras.backend as K
+from lib.ngram_index import NgramIndex
+from lib.word_index import WordIndex
+import numpy as np
+
+from tensorflow.python.keras.backend import set_session
+from tensorflow.python.keras.models import load_model
+
+
+from lib.utils_geo import haversine_tf_1circle
+sess = None
+graph = None
+
+def lat_accuracy(LAT_TOL =1/180.):
+    def accuracy_at_k_lat(y_true, y_pred):
+        """
+        Metric used to measure the accuracy of the coordinate prediction. Unlike standard accuracy,
+        a tolerance threshold is applied, since it is (almost) impossible for a neural network
+        to predict the exact coordinate.
+
+        Parameters
+        ----------
+        y_true : tf.Tensor
+            truth data
+        y_pred : tf.Tensor
+            predicted output
+        """
+        diff = tf.abs(y_true - y_pred)
+        fit = tf.dtypes.cast(tf.less(diff,LAT_TOL),tf.int64)
+        return tf.reduce_sum(fit)/tf.size(y_pred,out_type=tf.dtypes.int64)
+    return accuracy_at_k_lat
+
+def lon_accuracy(LON_TOL=1/360.):
+    def accuracy_at_k_lon(y_true, y_pred):
+        """
+        Metric used to measure the accuracy of the coordinate prediction. Unlike standard accuracy,
+        a tolerance threshold is applied, since it is (almost) impossible for a neural network
+        to predict the exact coordinate.
+
+        Parameters
+        ----------
+        y_true : tf.Tensor
+            truth data
+        y_pred : tf.Tensor
+            predicted output
+        """
+        diff = tf.abs(y_true - y_pred)
+        fit = tf.dtypes.cast(tf.less(diff,LON_TOL),tf.int64)
+        return tf.reduce_sum(fit)/tf.size(y_pred,out_type=tf.dtypes.int64)
+    return accuracy_at_k_lon
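+
+# Worked example (values chosen for illustration): with the default LAT_TOL = 1/180,
+# a normalised latitude error below 1/180 (about one degree once mapped back to the
+# [-90, 90] range) counts as a hit, so y_true = [0.50] with y_pred = [0.503] is
+# counted as correct, while y_pred = [0.52] is not.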
+
+class Geocoder(object):
+    """
+    >>>geocoder = Geocoder("LSTM_FR.txt_20_4_0.002_None_A_I_C.h5","index_4gram_FR_backup.txt")
+    >>>lon,lat = geocoder.get_coord("Paris","New-York")
+    >>>lon,lat = geocoder.wgs_coord(lon,lat)
+    >>>geocoder.plot_coord("Paris,New-York",lat,lon)
+
+    if you want an interactive map using leafletJS, set to True the `interactive_map` parameter of `Geocoder.plot_coord()`
+    """
+    def __init__(self,keras_model_fn,ngram_index_file):
+        # global sess
+        # global graph
+        # sess = tf.compat.v1.Session()
+        # graph = tf.compat.v1.get_default_graph()
+        # set_session(sess)
+        self.keras_model = load_model(keras_model_fn,custom_objects={"loss":haversine_tf_1circle},compile=False)#custom_objects={"accuracy_at_k_lat":lat_accuracy(),"accuracy_at_k_lon":lon_accuracy()})
+        self.ngram_encoder = NgramIndex.load(ngram_index_file)
+
+    def get_coord(self,toponym,context_toponym):
+        global sess
+        global graph
+        p = self.ngram_encoder.complete(self.ngram_encoder.encode(toponym),self.ngram_encoder.max_len)
+        c = self.ngram_encoder.complete(self.ngram_encoder.encode(context_toponym),self.ngram_encoder.max_len)
+        p = np.array(p)
+        c = np.array(c)       
+        # with sess.as_default():
+        #     with graph.as_default():
+        coord = self.keras_model.predict([[p],[c]])
+        return coord[0][0],coord[0][1]
+    
+    def get_coords(self,list_toponym,list_toponym_context):
+        p = [self.ngram_encoder.complete(self.ngram_encoder.encode(toponym),self.ngram_encoder.max_len) for toponym in list_toponym]
+        c = [self.ngram_encoder.complete(self.ngram_encoder.encode(toponym),self.ngram_encoder.max_len) for toponym in list_toponym_context]
+
+        p = np.array(p)
+        c = np.array(c)
+        
+        coords = self.keras_model.predict([p,c])
+        return coords[:,0],coords[:,1] # one (lon, lat) prediction per input pair
+
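+    # wgs_coord assumes the model outputs coordinates normalised to [0, 1] (see
+    # zero_one_encoding in lib.utils_geo) and maps them back to WGS84 degrees:
+    # for example, a normalised prediction of (0.5, 0.5) corresponds to
+    # (longitude 0.0, latitude 0.0), and (1.0, 1.0) to (180.0, 90.0).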
+    def wgs_coord(self,lon,lat):
+        return ((lon*360)-180),((lat*180)-90)
+    
+    def plot_coord(self,toponym,lat,lon,interactive_map=False,**kwargs):
+        if interactive_map:
+            import folium
+            import tempfile
+            import webbrowser
+            fp = tempfile.NamedTemporaryFile(delete=False)
+            m = folium.Map()
+            folium.Marker([lat, lon], popup=toponym).add_to(m)
+            m.save(fp.name)
+            webbrowser.open('file://' + fp.name)
+        else:
+            import matplotlib.pyplot as plt
+            import geopandas
+            fig, ax = plt.subplots(1,**kwargs)
+            world = geopandas.read_file(geopandas.datasets.get_path('naturalearth_lowres'))
+            world.plot(color='white', edgecolor='black',ax=ax)
+            ax.plot(lon,lat,marker='o', color='red', markersize=5)
+            plt.show()
+
+
+
+if __name__ == "__main__":
+    from flask import Flask, escape, request, render_template
+
+    app = Flask(__name__)
+
+
+    geocoder = Geocoder("outputs/LSTM_FR.txt_100_4_0.002_None_A_I_C.h5","./outputs/FR.txt_100_4_0.002_None_A_I_C_index")
+
+    @app.route('/',methods=["GET"])
+    def display():
+        toponym = request.args.get("top", "Paris")
+        c_toponym = request.args.get("c_top", "Cherbourg")
+        lon,lat = geocoder.get_coord(toponym,c_toponym)
+        lon,lat = geocoder.wgs_coord(lon,lat)
+        return  render_template("skeleton.html",lat=lat,lon=lon)
+
+    app.run(host='0.0.0.0')
\ No newline at end of file
diff --git a/region_model.py b/region_model.py
new file mode 100644
index 0000000000000000000000000000000000000000..b7a66738ccc07b3403da4da5772b55afa7c816c5
--- /dev/null
+++ b/region_model.py
@@ -0,0 +1,199 @@
+# Base module 
+import os
+
+# Structure
+import pandas as pd
+
+# DEEPL module
+from keras.layers import Dense, Input, Embedding,concatenate,Bidirectional,LSTM,Dropout
+from keras.models import Model
+from keras.callbacks import ModelCheckpoint
+from tensorflow.keras.layers import Lambda
+import keras.backend as K 
+import tensorflow as tf 
+from lib.custom_layer import *
+
+# Custom module
+from lib.ngram_index import NgramIndex
+from lib.utils import ConfigurationReader, MetaDataSerializer,LabelEncoder
+from lib.metrics import lat_accuracy,lon_accuracy
+from lib.data_generator import DataGenerator,CoOccurrences,load_embedding,Inclusion,Adjacency
+from lib.utils_geo import haversine_tf,accuracy_k,haversine_tf_1circle
+
+# Logging
+import logging
+
+logging.getLogger('gensim').setLevel(logging.WARNING)
+
+from helpers import EpochTimer
+
+# LOGGING CONF
+logging.basicConfig(
+    format='[%(asctime)s][%(levelname)s] %(message)s ', 
+    datefmt='%m/%d/%Y %I:%M:%S %p',
+    level=logging.INFO  
+    )
+
+args = ConfigurationReader("./parser_config/toponym_combination_embedding_v2.json")\
+    .parse_args()#("-i --inclusion-fn ../data/geonamesData/hierarchy.txt ../data/geonamesData/allCountries.txt ../data/embeddings/word2vec4gram/4gramWiki+geonames_index.json ../data/embeddings/word2vec4gram/embedding4gramWiki+Geonames.bin".split())
+
+#.parse_args("-w  --wikipedia-cooc-fn  subsetCoocALLv2.csv ../data/geonamesData/allCountries.txt ../data/embeddings/word2vec4gram/4gramWiki+geonames_index.json ../data/embeddings/word2vec4gram/embedding4gramWiki+Geonames.bin".split())
+
+#
+#################################################
+############# MODEL TRAINING PARAMETER ##########
+#################################################
+NGRAM_SIZE = args.ngram_size
+ACCURACY_TOLERANCE = args.k_value
+EPOCHS = args.epochs
+ADJACENCY_SAMPLING = args.adjacency_sample
+COOC_SAMPLING = args.cooc_sample
+WORDVEC_ITER = 50
+EMBEDDING_DIM = args.dimension
+BATCH_SIZE = args.batch_size
+#################################################
+########## FILENAME VARIABLE ####################
+#################################################
+# check for output dir
+if not os.path.exists("outputs/"):
+    os.makedirs("outputs/")
+
+GEONAME_FN = args.geoname_input
+DATASET_NAME = args.geoname_input.split("/")[-1]
+GEONAMES_HIERARCHY_FN = args.inclusion_fn
+ADJACENCY_REL_FILENAME = args.adjacency_fn
+COOC_FN = args.wikipedia_cooc_fn
+
+PREFIX_OUTPUT_FN = "REGION_{0}_{1}_{2}_{3}".format(
+    GEONAME_FN.split("/")[-1],
+    EPOCHS,
+    NGRAM_SIZE,
+    ACCURACY_TOLERANCE)
+
+REL_CODE=""
+if args.adjacency:
+    PREFIX_OUTPUT_FN += "_A"
+    REL_CODE+= "A"
+if args.inclusion:
+    PREFIX_OUTPUT_FN += "_I"
+    REL_CODE+= "I"
+if args.wikipedia_cooc:
+    PREFIX_OUTPUT_FN += "_C"
+    REL_CODE+= "C"
+
+MODEL_OUTPUT_FN = "outputs/{0}.h5".format(PREFIX_OUTPUT_FN)
+INDEX_FN = "outputs/{0}_index".format(PREFIX_OUTPUT_FN)
+HISTORY_FN = "outputs/{0}.csv".format(PREFIX_OUTPUT_FN)
+
+
+meta_data = MetaDataSerializer(
+    DATASET_NAME,
+    REL_CODE,
+    COOC_SAMPLING,
+    ADJACENCY_SAMPLING,
+    NGRAM_SIZE,
+    ACCURACY_TOLERANCE,
+    EPOCHS,
+    EMBEDDING_DIM,
+    WORDVEC_ITER,
+    INDEX_FN,
+    MODEL_OUTPUT_FN,
+    HISTORY_FN
+)
+meta_data.save("outputs/{0}.json".format(PREFIX_OUTPUT_FN))
+
+
+### PUT DATASRC + GENERATOR
+
+index = NgramIndex.load(args.ngram_index_fn)
+
+train_src = []
+test_src = []
+
+class_encoder = LabelEncoder()
+if args.wikipedia_cooc:
+    train_src.append(CoOccurrences(COOC_FN + "_train.csv",class_encoder,sampling=4))
+    test_src.append(CoOccurrences(COOC_FN + "_test.csv",class_encoder,sampling=4))
+
+if args.adjacency:
+    a_train = Adjacency(ADJACENCY_REL_FILENAME + "_train.csv",GEONAME_FN,sampling=ADJACENCY_SAMPLING,gzip=False)
+    a_test = Adjacency(ADJACENCY_REL_FILENAME + "_test.csv",GEONAME_FN,sampling=ADJACENCY_SAMPLING,gzip=False)
+    train_src.append(a_train)
+    test_src.append(a_test)
+
+if args.inclusion:
+    i_train = Inclusion(GEONAME_FN,GEONAMES_HIERARCHY_FN+"_train.csv")
+    i_test = Inclusion(GEONAME_FN,GEONAMES_HIERARCHY_FN+"_test.csv")
+    train_src.append(i_train)
+    test_src.append(i_test)
+#Adjacency
+
+
+
+d_train = DataGenerator(train_src,index,class_encoder,batch_size=BATCH_SIZE,only_healpix=True) 
+d_test = DataGenerator(test_src,index,class_encoder,batch_size=BATCH_SIZE,only_healpix=True) 
+
+num_words = len(index.index_ngram)  
+
+#############################################################################################
+################################# NGRAM EMBEDDINGS ##########################################
+#############################################################################################
+
+embedding_weights = load_embedding(args.embedding_fn) 
+
+
+#############################################################################################
+################################# MODEL DEFINITION ##########################################
+#############################################################################################
+
+from keras import regularizers
+
+input_1 = Input(shape=(index.max_len,))
+input_2 = Input(shape=(index.max_len,))
+
+embedding_layer = Embedding(num_words, EMBEDDING_DIM,input_length=index.max_len,trainable=False)#, trainable=True)
+
+x1 = embedding_layer(input_1)
+x2 = embedding_layer(input_2)
+
+# A single bidirectional LSTM is shared by both inputs, so it learns from the toponym and its context alike
+biLSTM = Bidirectional(LSTM(32,activation="pentanh", recurrent_activation="pentanh"))
+x1 = biLSTM(x1)
+x2 = biLSTM(x2)
+x = concatenate([x1,x2])#,x3])
+
+#x = Dense(class_encoder.get_num_classes()*2,activation="relu")(x)
+
+
+aux_layer = Dense(class_encoder.get_num_classes(),activation="softmax",name="aux_layer")(x)
+
+model = Model(inputs = [input_1,input_2], outputs = aux_layer)#input_3
+
+model.compile(loss={"aux_layer":"categorical_crossentropy"}, optimizer='adam',metrics={"aux_layer":"accuracy"})
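+
+# Shape summary (for reference): each input is a padded n-gram index sequence of
+# length index.max_len; both go through the shared embedding and bidirectional LSTM
+# (2 x 32 units), their encodings are concatenated, and the softmax "aux_layer"
+# outputs one probability per healpix region class (class_encoder.get_num_classes()).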
+
+
+#############################################################################################
+################################# TRAINING LAUNCH ###########################################
+#############################################################################################
+
+checkpoint = ModelCheckpoint(MODEL_OUTPUT_FN + ".part", monitor='loss', verbose=1,
+    save_best_only=True, mode='auto', period=1)
+
+epoch_timer = EpochTimer("outputs/"+PREFIX_OUTPUT_FN+"_epoch_timer_output.csv")
+
+
+history = model.fit_generator(generator=d_train,
+    validation_data=d_test,
+    verbose=True,
+    epochs=EPOCHS,
+    callbacks=[checkpoint,epoch_timer])
+
+
+hist_df = pd.DataFrame(history.history)
+hist_df.to_csv(HISTORY_FN)
+
+model.save(MODEL_OUTPUT_FN)
+
+# Erase Model Checkpoint file
+if os.path.exists(MODEL_OUTPUT_FN + ".part"):
+    os.remove(MODEL_OUTPUT_FN + ".part")
\ No newline at end of file
diff --git a/requirements.txt b/requirements.txt
new file mode 100644
index 0000000000000000000000000000000000000000..a2016e76db0a4bfbfa04c06ff282a42a904bd83b
--- /dev/null
+++ b/requirements.txt
@@ -0,0 +1,27 @@
+#pyroutelib3
+node2vec
+#osrm
+geopandas
+pandas
+numpy
+tqdm
+networkx
+matplotlib
+joblib
+gensim
+sklearn
+tensorflow
+keras
+ngram
+shapely
+sqlitedict
+nltk
+folium
+flask
+numba
+healpy
+stanza
+spacy
+torch
+torchvision
+transformers
\ No newline at end of file
diff --git a/run_multiple_configurations.py b/run_multiple_configurations.py
new file mode 100644
index 0000000000000000000000000000000000000000..49753c5308bb6519a631a3fc7d578480571bf45b
--- /dev/null
+++ b/run_multiple_configurations.py
@@ -0,0 +1,22 @@
+from lib.run import GridSearchModel
+from collections import OrderedDict
+c_f = "--wikipedia-cooc-fn ../data/wikipedia/cooccurrence_FR.txt"
+
+# Init GridsearchModel
+grid = GridSearchModel(\
+    "python3 combination_embeddingsv3inverse.py",
+    **OrderedDict({ # necessary because some args have to be given in a certain order
+    "rel":["-w "+c_f,("-i -w "+c_f),"-a -w "+c_f,"-a -i -w "+c_f], # ,"-a -i -w "+c_f ,"-i -a"
+    "-n":[4],
+    "--ngram-word2vec-iter" :[100],
+    "-e":[100],
+    "geoname_fn":"../data/geonamesData/FR.txt".split(),
+    "hierarchy_fn":"../data/geonamesData/hierarchy.txt".split()
+    }.items()))
+
+print("########### THE FOLLOWING COMMAND(S) WILL BE EXECUTED ###########" )
+[print(task.get_command()) for task in grid.tasks]
+print("#################################################################")
+grid.run("outputs/log_{0}".format("FR_model_v2"))
+
+#["-w --wikipedia-cooc-fn ../data/wikipedia/cooccurrence_FR.txt","-w --wikipedia-cooc-fn ../data/wikipedia/cooccurrence_FR.txt -a","-w --wikipedia-cooc-fn ../data/wikipedia/cooccurrence_FR.txt -i"]
\ No newline at end of file
diff --git a/scripts/embeddingngram.py b/scripts/embeddingngram.py
new file mode 100644
index 0000000000000000000000000000000000000000..a9773eb58578780cef70eb28dd2315f9f61fbd97
--- /dev/null
+++ b/scripts/embeddingngram.py
@@ -0,0 +1,59 @@
+#!/usr/bin/env python
+# coding: utf-8
+
+
+from lib.ngram_index import NgramIndex
+from lib.utils_geo import read_geonames
+
+
+
+import pandas as pd
+import numpy as np
+from tqdm import tqdm
+
+
+from gensim.models import Word2Vec
+import logging
+logging.basicConfig(level="INFO")
+
+
+
+df_cooc = pd.read_csv("../data/wikipedia/cooccurrence_ALL.txt",sep="\t")
+df_geo = read_geonames("../data/geonamesData/allCountries.txt") 
+
+
+geonames_label = df_geo.name.values.tolist()
+wiki_labels = df_cooc.title.values.tolist()
+p= [wiki_labels.extend(x.split("|")) for x in df_cooc["interlinks"].values]
+
+
+del df_geo
+del df_cooc
+
+N = 5
+
+
+ng = NgramIndex(N)
+p = [ng.split_and_add(x) for x in tqdm(geonames_label)]
+p = [ng.split_and_add(x) for x in tqdm(wiki_labels)]
+ng.save("{0}gramWiki+Geonames_index.json".format(N))
+
+geonames_label.extend(wiki_labels)
+
+class MySentences(object):
+    def __init__(self, texts):
+        self.texts = texts
+
+    def __iter__(self):
+        for w in self.texts:
+            yield [str(x)for x in ng.encode(w)]
+
+model = Word2Vec(MySentences(geonames_label), size=100, window=5, min_count=1, workers=4,sg=1)
+model.save("embedding{0}gramWiki+Geonames.bin".format(5))
+
+
+
diff --git a/scripts/generate_cooc_geocoding_dataset.py b/scripts/generate_cooc_geocoding_dataset.py
new file mode 100644
index 0000000000000000000000000000000000000000..232f40b3b0cdb0ee59698759be3f1b5e1c2c874f
--- /dev/null
+++ b/scripts/generate_cooc_geocoding_dataset.py
@@ -0,0 +1,41 @@
+import pandas as pd 
+import re
+
+#### TODO NEED TO add ARGPARSE !!!
+def parse_title_wiki(title_wiki):
+    """
+    Parse Wikipedia title
+    
+    Parameters
+    ----------
+    title_wiki : str
+        wikipedia title
+    
+    Returns
+    -------
+    str
+        parsed wikipedia title
+    """
+    return re.sub("\(.*\)", "", title_wiki).strip().lower()
+
+
+df = pd.read_csv("./cooccurrence_US_FR.txt",sep="\t")
+
+df["interlinks"] = df.interlinks.apply(lambda x : x.split("|"))
+df["interlinks"] = df.interlinks.apply(lambda x : [parse_title_wiki(i) for i in x])
+
+df["title"] = df.title.apply(parse_title_wiki)
+
+def generated_inputs(x):
+    output = []
+    for interlink in x.interlinks:
+        output.append([x.title,interlink,x.longitude,x.latitude])
+    return output
+
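+# For illustration (coordinates are hypothetical example values): a row with title
+# "paris", interlinks ["lyon", "marseille"] and coordinates (2.35, 48.85) is expanded
+# into two training pairs:
+#   ["paris", "lyon", 2.35, 48.85] and ["paris", "marseille", 2.35, 48.85]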
+output_ =  []
+for ix,row in df.iterrows():
+    output_.extend(generated_inputs(row))
+
+new_df = pd.DataFrame(output_,columns="name1 name2 longitude latitude".split())  
+new_df = new_df.sample(frac=1)
+new_df.to_csv("us_fr_cooc_test.csv",index=False)
\ No newline at end of file
diff --git a/scripts/get_all_adjacency_rel.py b/scripts/get_all_adjacency_rel.py
new file mode 100644
index 0000000000000000000000000000000000000000..23382a6a79b5f75594ca640169afeaa4fcedee9a
--- /dev/null
+++ b/scripts/get_all_adjacency_rel.py
@@ -0,0 +1,88 @@
+import pandas as pd, numpy as np
+from numba import njit
+from helpers import read_geonames
+from tqdm import tqdm
+from joblib import Parallel,delayed
+import geopandas as gpd
+from lib.utils_geo import Grid,haversine_pd
+import matplotlib.pyplot as plt
+
+import argparse
+
+parser = argparse.ArgumentParser()
+
+parser.add_argument("geoname_fn")
+parser.add_argument("kilometer_threshold",type=int,default=20)
+parser.add_argument("output_fn_prefix")
+
+args = parser.parse_args("../data/geonamesData/allCountries.txt 20 /home/jacques/ALL_ADJ_224+_".split())
+
+GEONAME_FN = args.geoname_fn
+PREFIX_OUTPUT_FN = args.output_fn_prefix
+KM_THRESHOLD = args.kilometer_threshold
+
+df = read_geonames(GEONAME_FN)
+
+def to_str(list_):
+    """
+    Return str representation for each value in list_
+    
+    Parameters
+    ----------
+    list_ : array
+        array
+    
+    Returns
+    -------
+    array
+        str list
+    """
+    return list(map(str,list_))
+
+def get_adjacent(geonameid,ids,lon1, lat1, lon2, lat2,threshold):
+    """
+    Write the GeoNames entries within `threshold` kilometers of a selected entry to the output file
+    """
+    dist_ = haversine_pd(lon1, lat1, lon2, lat2)
+    adj_ids = ids[dist_<threshold]
+    out_.write("\n{0},{1},{2},{3}".format(geonameid,"|".join(to_str(adj_ids)),lat2,lon2))
+    out_.flush()
+
+
+# WE BUILD a grid over the world map
+# It allows to limit unnecessary calculus thus accelerate the whole process
+world = gpd.read_file("/media/jacques/DATA/GEODATA/WORLD/world.geo.50m.dissolved")
+g = Grid(*world.bounds.values[0],[40,20]) #We build a grid of cell of 40° by 20°
+g.fit_data(world)
+
+# Prepare first output
+first_output_fn = "{1}{0}_cells.csv".format(KM_THRESHOLD,PREFIX_OUTPUT_FN)
+out_ = open(first_output_fn,'w')
+out_.write("geonameid,adjacent_geonameid,latitude,longitude") # HEADER
+out_.flush() # Avoid writing bugs
+
+def get_rels(cells_list):
+    for c in tqdm(cells_list):
+        
+        mask1 = (df.latitude <= c.bottomright_y) & (df.latitude >= c.upperleft_y)
+        new_df = df[mask1].copy() 
+        mask2 = (new_df.longitude >= c.upperleft_x) & (new_df.longitude <= c.bottomright_x)
+        new_df = new_df[mask2]
+        for ix,row in new_df.iterrows():
+            get_adjacent(row.geonameid,new_df.geonameid.values,new_df.longitude,new_df.latitude,row.longitude,row.latitude,KM_THRESHOLD)
+        #Parallel(n_jobs=-1,backend="multiprocessing",temp_folder="/home/jacques/temp/")(delayed(get_adjacent)(row.geonameid,new_df.geonameid.values,new_df.longitude,new_df.latitude,row.longitude,row.latitude,KM_THRESHOLD) for ix,row in new_df.iterrows())
+
+world = gpd.read_file(gpd.datasets.get_path('naturalearth_lowres'))
+ax = world.plot(color="white",edgecolor="black")
+for c in g.cells[224:]:
+    ax.plot(*c.box_.exterior.xy)
+plt.show()
+get_rels(g.cells[224:]) #~3h
+
+# Prepare second output
+# second_output_fn = "{1}{0}_inter_cells.csv".format(KM_THRESHOLD,PREFIX_OUTPUT_FN)
+# out_ = open(second_output_fn,'w')
+# out_.write("geonameid,adjacent_geonameid,latitude,longitude") # HEADER
+# out_.flush()# Avoid writing bugs
+
+# get_rels(g.inter_cells) 594
diff --git a/scripts/get_cooccurrence.py b/scripts/get_cooccurrence.py
new file mode 100644
index 0000000000000000000000000000000000000000..4f2bcedf006b029dc907a0cc13e20e18c6158e67
--- /dev/null
+++ b/scripts/get_cooccurrence.py
@@ -0,0 +1,40 @@
+import gzip
+import json
+import re
+
+import argparse
+
+import pandas as pd
+
+from joblib import Parallel,delayed
+from tqdm import tqdm
+
+parser = argparse.ArgumentParser()
+
+parser.add_argument("page_of_interest_fn")
+parser.add_argument("output_fn")
+parser.add_argument("-c","--corpus",action="append")
+
+args = parser.parse_args()#("../wikidata/sample/place_en_fr_page_clean_onlyfrplace.csv test.txt -c frwiki-latest.json.gz -c enwiki-latest.json.gz".split())
+
+PAGES_OF_INTEREST_FILE = args.page_of_interest_fn
+WIKIPEDIA_CORPORA = args.corpus
+OUTPUT_FN = args.output_fn
+
+if len(WIKIPEDIA_CORPORA)<1:
+    raise Exception('No corpora was given!')
+
+df = pd.read_csv(PAGES_OF_INTEREST_FILE)
+page_of_interest = set(df.title.values)
+
+page_coord = {row.title : (row.longitude,row.latitude) for ix,row in df.iterrows()}
+
+output = open(OUTPUT_FN,'w')
+output.write("title\tinterlinks\tlongitude\tlatitude\n")
+for wikipedia_corpus in WIKIPEDIA_CORPORA:
+    for line in tqdm(gzip.GzipFile(wikipedia_corpus,'rb')):
+        data = json.loads(line)
+        if data["title"] in page_of_interest:
+            occ = page_of_interest.intersection(data["interlinks"].keys())
+            coord = page_coord[data["title"]]
+            if len(occ) >0:output.write(data["title"]+"\t"+"|".join(occ)+"\t{0}\t{1}".format(*coord)+"\n")
diff --git a/scripts/gethealpix.py b/scripts/gethealpix.py
new file mode 100644
index 0000000000000000000000000000000000000000..6e572fdb256e92c9a6690df2b7e806e1b11e1573
--- /dev/null
+++ b/scripts/gethealpix.py
@@ -0,0 +1,32 @@
+
+
+import pandas as pd
+
+from tqdm import tqdm
+tqdm.pandas()
+import argparse
+
+import numpy as np
+import healpy
+# convert lat and lon to a healpix code encoding a region, with a given resolution
+def latlon2healpix( lat , lon , res ):
+    lat = np.radians(lat)
+    lon = np.radians(lon)
+    xs = ( np.cos(lat) * np.cos(lon) )#
+    ys = ( np.cos(lat) * np.sin(lon) )# -> Sphere coordinates: https://vvvv.org/blog/polar-spherical-and-geographic-coordinates
+    zs = ( np.sin(lat) )#
+    return healpy.vec2pix( int(res) , xs , ys , zs )
+
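+# Usage sketch: latlon2healpix(48.85, 2.35, 32) returns the HEALPix pixel index
+# (nside=32) of the cell containing that latitude/longitude; the exact index value
+# depends on healpy's pixelisation and is not reproduced here.
+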
+parser = argparse.ArgumentParser()
+parser.add_argument("input_file")
+parser.add_argument("output_file")
+
+args = parser.parse_args()
+
+df = pd.read_csv(args.input_file,sep="\t") 
+df["healpix_256"] = df.progress_apply(lambda row:latlon2healpix(lat=row.latitude,lon=row.longitude,res=256),axis=1)
+df["healpix_64"] = df.progress_apply(lambda row:latlon2healpix(lat=row.latitude,lon=row.longitude,res=64),axis=1)
+df["healpix_32"] = df.progress_apply(lambda row:latlon2healpix(lat=row.latitude,lon=row.longitude,res=32),axis=1)
+df["healpix_1"] = df.progress_apply(lambda row:latlon2healpix(lat=row.latitude,lon=row.longitude,res=1),axis=1)
+
+df.to_csv(args.output_file,sep="\t",index=False)
\ No newline at end of file
diff --git a/scripts/randoludo.py b/scripts/randoludo.py
new file mode 100644
index 0000000000000000000000000000000000000000..3862d526b15f9d337f791cefa4e81e037c64682f
--- /dev/null
+++ b/scripts/randoludo.py
@@ -0,0 +1,50 @@
+import pandas as pd
+import numpy as np
+
+from lib.geocoder import Geocoder
+
+geocoder = Geocoder("./outputs/IGN_4_100_A_C.h5","./outputs/IGN_4_100_A_C_index")
+
+df = pd.read_csv("data/rando_toponymes.tsv",sep="\t")
+df["name"]=df.name.apply(lambda x:x.split("¦")[0])
+
+def heuristic_mean(toponyms):
+	input_ = np.asarray([[t1,t2] for t2 in toponyms for t1 in toponyms if t2 != t1])
+	if len(input_)<1:
+		input_=np.asarray([[toponyms[0],toponyms[0]]])
+	res_geocode = pd.DataFrame(input_,columns="t tc".split())
+	lons,lats = geocoder.wgs_coord(*geocoder.get_coords(input_[:,0],input_[:,1]))
+	res_geocode["lon"] = lons
+	res_geocode["lat"] = lats
+	results = {}
+	for tp in toponyms:
+		lat = res_geocode[res_geocode.t == tp].lat.mean()
+		lon = res_geocode[res_geocode.t == tp].lon.mean()
+		results[tp]={"lat":lat,"lon":lon}
+	return results
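+
+# heuristic_mean geocodes every ordered pair of distinct toponyms in a document and,
+# for each toponym, averages the coordinates predicted with each other toponym as
+# context; heuristic_one_couple (below) instead pairs each toponym with itself.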
+
+def heuristic_one_couple(toponyms):
+	input_ = np.asarray([[t1,t2] for t2 in toponyms for t1 in toponyms if t2 == t1])
+	if len(input_)<1:
+		input_=np.asarray([[toponyms[0],toponyms[0]]])
+	res_geocode = pd.DataFrame(input_,columns="t tc".split())
+	lons,lats = geocoder.wgs_coord(*geocoder.get_coords(input_[:,0],input_[:,1]))
+	res_geocode["lon"] = lons
+	res_geocode["lat"] = lats
+	results = {}
+	for tp in toponyms:
+		lat = res_geocode[res_geocode.t == tp].lat.mean()
+		lon = res_geocode[res_geocode.t == tp].lon.mean()
+		results[tp]={"lat":lat,"lon":lon}
+	return results
+
+results_fin = []
+for ix,group in df.groupby("filename"):
+    res_geocode = heuristic_one_couple(group.name_gazetteer.values)
+    results_fin.extend(group.name_gazetteer.apply(lambda x : res_geocode[x]).values.tolist())
+dd = pd.DataFrame(results_fin).rename(columns={"lat":"lat_pred","lon":"lon_pred"})
+df2 = pd.concat((df,dd),axis=1)
+
+from lib.utils_geo import haversine_pd
+df2["dist_error"] = haversine_pd(df2.longitude,df2.latitude,df2.lon_pred,df2.lat_pred)
+print(df2.dist_error.mean())
diff --git a/templates/pair_topo.html b/templates/pair_topo.html
new file mode 100644
index 0000000000000000000000000000000000000000..3366643bf04e1fe1f04333ab99749d021094ac6a
--- /dev/null
+++ b/templates/pair_topo.html
@@ -0,0 +1,55 @@
+{% extends 'skeleton.html' %}
+
+{% block content %}
+<!-- MAP RENDER -->
+<div id="mapid"></div>
+
+<!-- TOPONYM FORM -->
+<div class="container" style="background-color: white;padding: 5px;">
+    <p>
+
+    </p>
+    <form action="/" method="get">
+        <div class="form-group">
+            <label for="formGroupExampleInput">Toponym</label>
+            <input type="text" class="form-control" name="top"
+                placeholder="Paris">
+        </div>
+        <div class="form-group">
+            <label for="formGroupExampleInput2">Context Toponym</label>
+            <input type="text" class="form-control" name="c_top"
+                placeholder="Cherbourg">
+        </div>
+        <button type="submit" class="btn btn-primary">Get Coords !</button>
+    </form>
+</div>
+{% endblock %}
+
+{% block script %}
+<script>
+
+    // Initialize the map
+    // [50, -0.1] are the latitude and longitude
+    // 4 is the zoom
+    // mapid is the id of the div where the map will appear
+    var mymap = L
+        .map('mapid')
+        .setView([50, -0.1], 4);
+
+    // Add a tile to the map = a background. Comes from OpenStreetmap
+    L.tileLayer(
+        'https://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png', {
+        attribution: 'Map data &copy; <a href="https://www.openstreetmap.org/">OpenStreetMap</a>',
+    }).addTo(mymap);
+    {% if lat and lon %}
+    var marker = L.marker([{{lat}}, {{lon}}]).addTo(mymap);
+    var circle = L.circle([{{lat}}, {{lon}}], {
+          color: "red",
+          fillColor: "#f03",
+          fillOpacity: 0.5,
+          radius: 100000.0
+      }).addTo(mymap); 
+    {% endif %}
+
+</script>
+{% endblock %}
\ No newline at end of file
diff --git a/templates/skeleton.html b/templates/skeleton.html
new file mode 100644
index 0000000000000000000000000000000000000000..8687b4e0782413fc883a05ab8bac486814f76840
--- /dev/null
+++ b/templates/skeleton.html
@@ -0,0 +1,116 @@
+<!DOCTYPE html>
+<html lang="en">
+
+<head>
+    <meta charset="UTF-8">
+    <meta name="viewport" content="width=auto, initial-scale=1.0">
+    <meta http-equiv="X-UA-Compatible" content="ie=edge">
+    <title>Geocoder Interface</title>
+    <link rel="stylesheet" href="https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/css/bootstrap.min.css" integrity="sha384-Vkoo8x4CGsO3+Hhxv8T/Q5PaXtkKtu6ug5TOeNV6gBiFeWPGFN9MuhOf23Q9Ifjh" crossorigin="anonymous">
+
+    <link href="https://fonts.googleapis.com/css2?family=Kumbh+Sans:wght@300;400;700&display=swap" rel="stylesheet">
+
+    <!-- Load Leaflet -->
+    <link rel="stylesheet" href="https://unpkg.com/leaflet@1.3.4/dist/leaflet.css" integrity="sha512-puBpdR0798OZvTTbP4A8Ix/l+A4dHDD0DGqYW6RQ+9jxkRFclaxxQb/SJAWZfWAkuyeQUytO7+7N4QKrDh+drA==" crossorigin="" />
+    <script src="https://unpkg.com/leaflet@1.3.4/dist/leaflet.js" integrity="sha512-nMMmRyTVoLYqjP9hrbed9S+FzjZHW5gY1TWCHA5ckwXZBadntCNs8kEqAWdrb9O7rxbCaA4lKTIWjDXZxflOcA==" crossorigin=""></script>
+</head>
+
+<body>
+    <style>
+        body {
+            font-family: 'Kumbh Sans', sans-serif;
+        }
+        
+        #mapid {
+            height: 400px;
+            width: 100%;
+        }
+        
+        .container-fluid {
+            padding: 0 !important;
+        }
+        
+        .text_annotated {
+            line-height: 2em;
+        }
+        
+        .annotation {
+            border-radius: 5px;
+            color: white;
+            padding: 4px;
+        }
+        
+        .person {
+            background-color: #cf000f;
+        }
+        
+        .place {
+            background-color: #2c82c9;
+        }
+        
+        .org {
+            background-color: #00b16a;
+        }
+    </style>
+
+    <main class="container-fluid">
+        <!-- NAVBAR -->
+        <nav class="navbar navbar-expand-lg navbar-light bg-light">
+            <a class="navbar-brand" href="#">Geocoding using pair of toponyms</a>
+            <button class="navbar-toggler" type="button" data-toggle="collapse" data-target="#navbarNavAltMarkup" aria-controls="navbarNavAltMarkup" aria-expanded="false" aria-label="Toggle navigation">
+              <span class="navbar-toggler-icon"></span>
+            </button>
+            <div class="collapse navbar-collapse" id="navbarNavAltMarkup">
+                <div class="navbar-nav">
+                    <a class="nav-link" href="/">Toponyms Pair Geocoder</a>
+                    <a class="nav-link" href="/text">Text Geocoder</a>
+
+
+                </div>
+                <div class="navbar-nav ml-auto">
+                    <li class="nav-item dropdown">
+                        <a class="nav-link dropdown-toggle" href="#" id="navbarModelDropdown" role="button" data-toggle="dropdown" aria-haspopup="true" aria-expanded="false">
+                          Choose Model
+                        </a>
+
+                        <div class="dropdown-menu" aria-labelledby="navbarModelDropdown">
+                            {% for id_ in dict_model %}
+                            <a class="dropdown-item" href="/loadmodel/{{id_}}">{{id_}}</a>
+                            <br>{% endfor %}
+                        </div>
+                    </li>
+                    <li class="nav-item dropdown">
+                        <a class="nav-link dropdown-toggle" href="#" id="navbarDropdown" role="button" data-toggle="dropdown" aria-haspopup="true" aria-expanded="false">
+                        Choose Lang for Spacy
+                        </a>
+                        <div class="dropdown-menu" aria-labelledby="navbarDropdown">
+                            <a class="dropdown-item" href="/loadlang/fr">fr</a>
+                            <a class="dropdown-item" href="/loadlang/en">en</a>
+                        </div>
+                    </li>
+                </div>
+            </div>
+        </nav>
+        {% if request.args.get("msg","") != "" %}
+        <div class="alert alert-{{msg_code}} alert-dismissible fade show" role="alert">
+            {{request.args.get("msg") }}
+            <button type="button" class="close" data-dismiss="alert" aria-label="Close">
+              <span aria-hidden="true">&times;</span>
+            </button>
+        </div>
+
+        {% endif %}
+        <h2 class="text-center" style="margin-top: 0.5em;">{{title}}</h2>
+
+        <br>{% block content %}{% endblock %}
+    </main>
+
+    <!-- JS SCRIPTS -->
+    <script src="https://code.jquery.com/jquery-3.4.1.slim.min.js" integrity="sha384-J6qa4849blE2+poT4WnyKhv5vZF5SrPo0iEjwBvKU7imGFAV0wwj1yYfoRSJoZ+n" crossorigin="anonymous"></script>
+    <script src="https://cdn.jsdelivr.net/npm/popper.js@1.16.0/dist/umd/popper.min.js" integrity="sha384-Q6E9RHvbIyZFJoft+2mJbHaEWldlvI9IOYy5n3zV9zzTtmI3UksdQRVvoxMfooAo" crossorigin="anonymous"></script>
+    <script src="https://stackpath.bootstrapcdn.com/bootstrap/4.4.1/js/bootstrap.min.js" integrity="sha384-wfSDF2E50Y2D1uUdj0O3uMBJnjuUD4Ih7YwaYd1iqfktj0Uod8GCExl3Og8ifwB6" crossorigin="anonymous"></script>
+
+    {% block script%} {% endblock %}
+</body>
+
+</html>
\ No newline at end of file
diff --git a/templates/text.html b/templates/text.html
new file mode 100644
index 0000000000000000000000000000000000000000..677b26e94220b349f834c331c5a92a5ac6dee561
--- /dev/null
+++ b/templates/text.html
@@ -0,0 +1,60 @@
+{% extends 'skeleton.html' %} {% block content %}
+<!-- MAP RENDER -->
+<div id="mapid"></div>
+
+<!-- TOPONYM FORM -->
+<div class="container">
+    <p>
+        
+    </p>
+    <form action="/geocode" method="post">
+        <div class="form-group">
+            <label for="text"><h5>Your text</h5></label>
+            <textarea class="form-control" id="text" name="text" rows="3"></textarea>
+            <br>
+            <button class="btn btn-info" id="submit">Geocode</button>
+        </div>
+    </form>
+    <h5>Results :</h5>
+    <div class="col-lg-12" style="border:1px solid #999;border-radius:5px;">
+        <p class="text_annotated" id="result_container">
+            {% if data %} {{data["output"]}} {% endif %}
+        </p>
+    </div>
+</div>
+
+{% endblock %} {% block script %}
+<script>
+    // Initialize the map
+    // [50, -0.1] are the latitude and longitude
+    // 4 is the zoom
+    // mapid is the id of the div where the map will appear
+    var mymap = L
+        .map('mapid')
+        .setView([50, -0.1], 4);
+
+    // Add a tile to the map = a background. Comes from OpenStreetmap
+    L.tileLayer(
+        'https://{s}.tile.openstreetmap.org/{z}/{x}/{y}.png', {
+            attribution: 'Map data &copy; <a href="https://www.openstreetmap.org/">OpenStreetMap</a>',
+        }).addTo(mymap);
+    
+</script>
+{% if data %} 
+<script>
+    {% for place,coords in data["place_coords"].items() %}
+        var mark = L.marker([{{coords["lat"]}}, {{coords["lon"]}}],);
+        mark.bindPopup("{{place}}")
+        mark.addTo(mymap);
+        var circle = L.circle([{{coords["lat"]}}, {{coords["lon"]}}], {
+          color: "red",
+          fillColor: "#f03",
+          fillOpacity: 0.5,
+          radius: 100000.0
+      }).addTo(mymap); 
+
+    {% endfor %}
+</script>
+{% endif %}
+
+{% endblock %}
\ No newline at end of file
diff --git a/train_bert_geocoder.py b/train_bert_geocoder.py
new file mode 100644
index 0000000000000000000000000000000000000000..d0749b6bbd2c13bd490310805e0bed4abde237e6
--- /dev/null
+++ b/train_bert_geocoder.py
@@ -0,0 +1,353 @@
+# REQUIREMENTS : pandas keras torch numpy transformers
+
+"""
+Based from the article : https://mccormickml.com/2019/07/22/BERT-fine-tuning/
+by Chris McCormick
+
+"""
+import os
+import sys
+import time
+import random
+import argparse
+import datetime
+
+import pandas as pd
+import numpy as np
+
+import tensorflow as tf
+import torch
+
+from tqdm import tqdm
+tqdm.pandas()
+
+from torch.utils.data import TensorDataset, DataLoader, RandomSampler, SequentialSampler
+from keras.preprocessing.sequence import pad_sequences
+from transformers import BertTokenizer
+from transformers import BertForSequenceClassification, AdamW, BertConfig
+from transformers import get_linear_schedule_with_warmup
+
+from sklearn import preprocessing
+from joblib import dump
+
+def flat_accuracy(preds, labels):
+    pred_flat = np.argmax(preds, axis=1).flatten()
+    labels_flat = labels.flatten()
+    return np.sum(pred_flat == labels_flat) / len(labels_flat)
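+
+# Worked example: with preds = np.array([[0.1, 0.9], [0.8, 0.2]]) and
+# labels = np.array([1, 0]), the argmax predictions are [1, 0], so
+# flat_accuracy(preds, labels) == 1.0; flipping one label would give 0.5.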
+
+
+def format_time(elapsed):
+    '''
+    Takes a time in seconds and returns a string hh:mm:ss
+    '''
+    # Round to the nearest second.
+    elapsed_rounded = int(round((elapsed)))
+
+    # Format as hh:mm:ss
+    return str(datetime.timedelta(seconds=elapsed_rounded))
+
+parser = argparse.ArgumentParser()
+
+parser.add_argument("train" ,help="TSV with two columns : 'sentence' and 'label'")
+parser.add_argument("test",help="TSV with two columns : 'sentence' and 'label'")
+parser.add_argument("outputdir",help="TSV with two columns : 'sentence' and 'label'")
+parser.add_argument("-e","--epochs",type=int,default=5)
+parser.add_argument("-b","--batch_size",default=32,type=int)
+
+args = parser.parse_args()#("-b 32 -e 10 cooc_adj_bert_train.csv cooc_adj_bert_test.csv output_bert_allcooc_adjsampling3radius20km_batch32_epoch10".split())
+
+if not os.path.exists(args.train) or not os.path.exists(args.test):
+    raise FileNotFoundError("Train or Test filepath is incorrect !")
+
+# Number of training epochs (authors recommend between 2 and 4)
+epochs = args.epochs
+
+# The DataLoader needs to know the batch size for training, so I specify it here.
+# For fine-tuning BERT on a specific task, the authors recommend a batch size of
+# 16 or 32.
+
+batch_size = args.batch_size
+
+# OUTPUT DIR
+output_dir = args.outputdir
+
+if not os.path.exists(args.outputdir):
+    raise FileNotFoundError("{0} directory does not exist !".format(args.outputdir))
+if not os.path.isdir(args.outputdir):
+    raise NotADirectoryError("{0} is not a directory".format(args.outputdir))
+
+df_train = pd.read_csv(args.train, sep="\t")
+df_test = pd.read_csv(args.test, sep="\t")
+
+label_encoder = preprocessing.LabelEncoder()
+label_encoder.fit(np.concatenate((df_train.label.values,df_test.label.values)))
+dump(label_encoder,filename=output_dir+"/label_encoder.dump")
+
+df_train["label"] = label_encoder.transform(df_train.label.values)
+df_test["label"] = label_encoder.transform(df_test.label.values)
+
+# Get the GPU device name.
+device_name = tf.test.gpu_device_name()
+
+# The device name should look like the following:
+if device_name == '/device:GPU:0':
+    print('Found GPU at: {}'.format(device_name))
+else:
+    raise SystemError('GPU device not found')
+
+# If there's a GPU available...
+if torch.cuda.is_available():    
+
+    # Tell PyTorch to use the GPU.    
+    device = torch.device("cuda")
+    print('There are %d GPU(s) available.' % torch.cuda.device_count())
+    print('We will use the GPU:', torch.cuda.get_device_name(0))
+# If not...
+else:
+    print('No GPU available, using the CPU instead.')
+    device = torch.device("cpu")
+
+
+# Load the BERT tokenizer.
+print('Loading {0} tokenizer...'.format("bert-base-multilingual-cased"))
+tokenizer = BertTokenizer.from_pretrained('bert-base-multilingual-cased',do_lower_case=False)
+
+from lib.torch_generator import SentenceDataset
+# Create the DataLoader for training set.
+train_data = SentenceDataset(df_train,tokenizer,batch_size=batch_size)
+train_dataloader = DataLoader(train_data,  batch_size=batch_size)#,sampler=train_sampler,)
+
+# Create the DataLoader for validation set.
+validation_data = SentenceDataset(df_test,tokenizer,batch_size=batch_size)
+#validation_sampler = SequentialSampler(validation_data)
+validation_dataloader = DataLoader(validation_data, batch_size=batch_size)#, sampler=validation_sampler)
+
+# Load BertForSequenceClassification, the pretrained BERT model with a single
+# linear classification layer on top. 
+model = BertForSequenceClassification.from_pretrained(
+    "bert-base-multilingual-cased", # Use the 12-layer BERT model, with an uncased vocab.
+    num_labels = max(df_test.label.max(),df_train.label.max())+1, # The number of output labels--2 for binary classification.
+                    # You can increase this for multi-class tasks.   
+    output_attentions = False, # Whether the model returns attentions weights.
+    output_hidden_states = False, # Whether the model returns all hidden-states.
+)
+
+# Tell pytorch to run this model on the GPU.
+model.cuda()
+
+optimizer = AdamW(model.parameters(),
+                  lr = 2e-5, # args.learning_rate - default is 5e-5, our notebook had 2e-5
+                  eps = 1e-8 # args.adam_epsilon  - default is 1e-8.
+                )
+
+
+
+# Total number of training steps is number of batches * number of epochs.
+total_steps = len(train_data) * epochs
+
+# Create the learning rate scheduler.
+scheduler = get_linear_schedule_with_warmup(optimizer, 
+                                            num_warmup_steps = 0, # Default value in run_glue.py
+                                            num_training_steps = total_steps)
+
+
+
+# Set the seed value all over the place to make this reproducible.
+seed_val = 42
+
+random.seed(seed_val)
+np.random.seed(seed_val)
+torch.manual_seed(seed_val)
+torch.cuda.manual_seed_all(seed_val)
+
+# Store the average loss after each epoch so I can plot them.
+loss_values = []
+
+history = []
+# For each epoch...
+for epoch_i in range(0, epochs):
+    epoch_data={}
+    
+    # ========================================
+    #               Training
+    # ========================================
+    
+    # Perform one full pass over the training set.
+
+    print("")
+    print('======== Epoch {:} / {:} ========'.format(epoch_i + 1, epochs))
+    print('Training...')
+
+    # Measure how long the training epoch takes.
+    t0 = time.time()
+
+    # Reset the total loss for this epoch.
+    total_loss = 0
+
+    # Put the model into training mode.
+    model.train()
+
+    # For each batch of training data...
+    for step, batch in enumerate(train_dataloader):
+
+        # Progress update every 100 batches.
+        if step % 100 == 0 and not step == 0:
+            # Calculate elapsed time in minutes.
+            elapsed = format_time(time.time() - t0)
+            
+            # Report progress.#Changed to sys.stdout to avoid uneccessary \n
+            sys.stdout.write('\r  Batch {:>5,}  of  {:>5,}.    Elapsed: {:}.'.format(step, len(train_dataloader), elapsed))
+
+        # Unpack this training batch from the dataloader. 
+        #
+        # As I unpack the batch, I'll also copy each tensor to the GPU using the 
+        # `to` method.
+        #
+        # `batch` contains three pytorch tensors:
+        #   [0]: input ids 
+        #   [1]: attention masks
+        #   [2]: labels 
+        b_input_ids = batch[0].to(device)
+        b_input_mask = batch[1].to(device)
+        b_labels = batch[2].to(device)
+
+        # Always clear any previously calculated gradients before performing a
+        # backward pass. PyTorch doesn't do this automatically because 
+        # accumulating the gradients is "convenient while training RNNs". 
+        # (source: https://stackoverflow.com/questions/48001598/why-do-we-need-to-call-zero-grad-in-pytorch)
+        model.zero_grad()        
+
+        # Perform a forward pass (evaluate the model on this training batch).
+        # This will return the loss (rather than the model output) because I
+        # have provided the `labels`.
+        # The documentation for this `model` function is here: 
+        # https://huggingface.co/transformers/v2.2.0/model_doc/bert.html#transformers.BertForSequenceClassification
+        outputs = model(b_input_ids, 
+                    token_type_ids=None, 
+                    attention_mask=b_input_mask, 
+                    labels=b_labels)
+        
+        # The call to `model` always returns a tuple, so I need to pull the 
+        # loss value out of the tuple.
+        loss = outputs[0]
+
+        # Accumulate the training loss over all of the batches so that I can
+        # calculate the average loss at the end. `loss` is a Tensor containing a
+        # single value; the `.item()` function just returns the Python value 
+        # from the tensor.
+        
+        total_loss += loss.item()
+
+        # Perform a backward pass to calculate the gradients.
+        loss.backward()
+
+        # Clip the norm of the gradients to 1.0.
+        # This is to help prevent the "exploding gradients" problem.
+        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
+
+        # Update parameters and take a step using the computed gradient.
+        # The optimizer dictates the "update rule"--how the parameters are
+        # modified based on their gradients, the learning rate, etc.
+        optimizer.step()
+
+        # Update the learning rate.
+        scheduler.step()
+
+    # Calculate the average loss over the training data.
+    avg_train_loss = total_loss / len(train_dataloader)            
+    
+    # Store the loss value for plotting the learning curve.
+    loss_values.append(avg_train_loss)
+
+    print("")
+    print("  Average training loss: {0:.2f}".format(avg_train_loss))
+    print("  Training epoch took: {:}".format(format_time(time.time() - t0)))
+    epoch_data["loss"]=avg_train_loss
+    epoch_data["epoch_duration"] = time.time() - t0
+
+    # ========================================
+    #               Validation
+    # ========================================
+    # After the completion of each training epoch, measure the performance on
+    # the validation set.
+
+    print("")
+    print("Running Validation...")
+
+    t0 = time.time()
+
+    # Put the model in evaluation mode--the dropout layers behave differently
+    # during evaluation.
+    model.eval()
+
+    # Tracking variables 
+    eval_loss, eval_accuracy = 0, 0
+    nb_eval_steps, nb_eval_examples = 0, 0
+
+    # Evaluate data for one epoch
+    for batch in validation_dataloader:
+        
+        # Add batch to GPU
+        batch = tuple(t.to(device) for t in batch)
+        
+        # Unpack the inputs from dataloader
+        b_input_ids, b_input_mask, b_labels = batch
+        
+        # Telling the model not to compute or store gradients, saving memory and
+        # speeding up validation
+        with torch.no_grad():        
+
+            # Forward pass, calculate logit predictions.
+            # This will return the logits rather than the loss because we have
+            # not provided labels.
+            # token_type_ids is the same as the "segment ids", which 
+            # differentiates sentence 1 and 2 in 2-sentence tasks.
+            # The documentation for this `model` function is here: 
+            # https://huggingface.co/transformers/v2.2.0/model_doc/bert.html#transformers.BertForSequenceClassification
+            outputs = model(b_input_ids, 
+                            token_type_ids=None, 
+                            attention_mask=b_input_mask)
+        
+        # Get the "logits" output by the model. The "logits" are the output
+        # values prior to applying an activation function like the softmax.
+        logits = outputs[0]
+
+        # Move logits and labels to CPU
+        logits = logits.detach().cpu().numpy()
+        label_ids = b_labels.to('cpu').numpy()
+        
+        # Calculate the accuracy for this batch of test sentences.
+        tmp_eval_accuracy = flat_accuracy(logits, label_ids)
+        
+        # Accumulate the total accuracy.
+        eval_accuracy += tmp_eval_accuracy
+
+        # Track the number of batches
+        nb_eval_steps += 1
+
+    
+    # Report the final accuracy for this validation run.
+    print("  Accuracy: {0:.2f}".format(eval_accuracy/nb_eval_steps))
+    print("  Validation took: {:}".format(format_time(time.time() - t0)))
+    epoch_data["accuracy"] = eval_accuracy/nb_eval_steps
+    epoch_data["validation_duration"] = time.time() - t0
+    history.append(epoch_data)
+print("")
+print("Training complete!")
+
+print("Save History")
+pd.DataFrame(history).to_csv(output_dir+"/history_bert.csv",sep="\t")
+
+
+
+# Create output directory if needed
+if not os.path.exists(output_dir):
+    os.makedirs(output_dir)
+
+print("Saving model to %s" % output_dir)
+
+# Save a trained model, configuration and tokenizer using `save_pretrained()`.
+# They can then be reloaded using `from_pretrained()`
+model_to_save = model.module if hasattr(model, 'module') else model  # Take care of distributed/parallel training
+model_to_save.save_pretrained(output_dir)
+tokenizer.save_pretrained(output_dir)
\ No newline at end of file
diff --git a/train_geocoder.py b/train_geocoder.py
new file mode 100644
index 0000000000000000000000000000000000000000..71ff546eba853c4fdbf0c8eeff263021df737c15
--- /dev/null
+++ b/train_geocoder.py
@@ -0,0 +1,401 @@
+# Base module 
+import re
+import os
+import json
+
+# Structure
+import pandas as pd
+import numpy as np
+import geopandas as gpd
+
+# DEEPL module
+from keras.layers import Dense, Input, Embedding,concatenate,Bidirectional,LSTM, Dropout
+from keras.models import Model
+from keras import backend as K
+from keras.callbacks import ModelCheckpoint
+
+import tensorflow as tf
+
+# Geometry
+from shapely.geometry import Point
+
+# Custom module
+from helpers import read_geonames
+from lib.utils_geo import Grid,zero_one_encoding, get_adjacency_rels, get_geonames_inclusion_rel,get_bounds
+from lib.ngram_index import NgramIndex
+from lib.utils import ConfigurationReader
+from lib.metrics import lat_accuracy,lon_accuracy
+from lib.utils_geo import haversine_tf,accuracy_k,haversine_tf_1circle
+
+
+# Logging
+from tqdm import tqdm
+import logging
+from helpers import parse_title_wiki,EpochTimer
+
+logging.getLogger('gensim').setLevel(logging.WARNING)
+
+def get_new_ids(cooc_data,id_first_value):
+    """
+    Return new ids for the toponyms found in the cooccurrence data
+    
+    Parameters
+    ----------
+    cooc_data : pd.DataFrame
+        cooccurrence data
+    id_first_value : int
+        value above which the new ids are assigned
+    
+    Returns
+    -------
+    dict
+        mapping from new id to toponym title
+    """
+    topo_id = {}
+    seen = set() # titles already assigned an id (the dict is keyed by id, so membership must be checked on titles)
+    id_ = id_first_value
+    for title in cooc_data.title.values:
+        if title not in seen:
+            id_+=1
+            topo_id[id_]=title
+            seen.add(title)
+    for interlinks in cooc_data.interlinks.values:
+        for interlink in interlinks.split("|"):
+            if interlink not in seen:
+                id_+=1
+                topo_id[id_]=interlink
+                seen.add(interlink)
+    return topo_id
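+# Illustrative example (hypothetical values): with id_first_value=3000000 and titles/interlinks
+# containing "Lyon" and "Rhône", the function returns ids strictly greater than 3000000 mapped
+# to the toponym titles, e.g. {3000001: "Lyon", 3000002: "Rhône"}.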
+
+# LOGGING CONF
+logging.basicConfig(
+    format='[%(asctime)s][%(levelname)s] %(message)s ', 
+    datefmt='%m/%d/%Y %I:%M:%S %p',
+    level=logging.INFO  
+    )
+
+args = ConfigurationReader("./parser_config/toponym_combination_embedding_v2.json")\
+    .parse_args()#("-i -a -w --wikipedia-cooc-fn ../data/wikipedia/cooccurrence_FR.txt -n 4 --ngram-word2vec-iter 1 -e 100 ../data/geonamesData/FR.txt ../data/geonamesData/hierarchy.txt".split())
+
+#
+#################################################
+############# MODEL TRAINING PARAMETER ##########
+#################################################
+MODEL_NAME = "Bi-LSTM_NGRAM"
+NGRAM_SIZE = args.ngram_size
+ACCURACY_TOLERANCE = args.tolerance_value
+EPOCHS = args.epochs
+ITER_ADJACENCY = args.adjacency_iteration
+COOC_SAMPLING_NUMBER = args.cooc_sample_size
+WORDVEC_ITER = args.ngram_word2vec_iter
+EMBEDDING_DIM = 256
+#################################################
+########## FILENAME VARIABLE ####################
+#################################################
+GEONAME_FN = args.geoname_input
+DATASET_NAME = args.geoname_input.split("/")[-1]
+GEONAMES_HIERARCHY_FN = args.geoname_hierachy_input
+REGION_SUFFIX_FN = "" if args.admin_code_1 == "None" else "_" + args.admin_code_1
+ADJACENCY_REL_FILENAME = "{0}_{1}{2}adjacency.json".format(
+        GEONAME_FN,
+        ITER_ADJACENCY,
+        REGION_SUFFIX_FN)
+
+COOC_FN = args.wikipedia_cooc_fn
+PREFIX_OUTPUT_FN = "{0}_{1}_{2}_{3}_{4}".format(
+    GEONAME_FN.split("/")[-1],
+    EPOCHS,
+    NGRAM_SIZE,
+    ACCURACY_TOLERANCE,
+    REGION_SUFFIX_FN)
+
+REL_CODE=""
+if args.adjacency:
+    PREFIX_OUTPUT_FN += "_A"
+    REL_CODE+= "A"
+if args.inclusion:
+    PREFIX_OUTPUT_FN += "_I"
+    REL_CODE+= "I"
+if args.wikipedia_cooc:
+    PREFIX_OUTPUT_FN += "_C"
+    REL_CODE+= "C"
+
+MODEL_OUTPUT_FN = "outputs/{0}.h5".format(PREFIX_OUTPUT_FN)
+INDEX_FN = "outputs/{0}_index".format(PREFIX_OUTPUT_FN)
+HISTORY_FN = "outputs/{0}.csv".format(PREFIX_OUTPUT_FN)
+
+from lib.utils import MetaDataSerializer
+
+meta_data = MetaDataSerializer(
+    MODEL_NAME,
+    DATASET_NAME,
+    REL_CODE,
+    COOC_SAMPLING_NUMBER,
+    ITER_ADJACENCY,
+    NGRAM_SIZE,
+    ACCURACY_TOLERANCE,
+    EPOCHS,
+    EMBEDDING_DIM,
+    WORDVEC_ITER,
+    INDEX_FN,
+    MODEL_OUTPUT_FN,
+    HISTORY_FN
+)
+meta_data.save("outputs/{0}.json".format(PREFIX_OUTPUT_FN))
+
+#############################################################################################
+################################# LOAD DATA #################################################
+#############################################################################################
+
+# LOAD  Geonames DATA
+logging.info("Load Geonames data...")
+geoname_data = read_geonames(GEONAME_FN).fillna("")
+
+train_indices = set(pd.read_csv(GEONAME_FN+"_train.csv").geonameid.values)
+test_indices = set(pd.read_csv(GEONAME_FN+"_test.csv").geonameid.values)
+
+logging.info("Geonames data loaded!")
+
+# SELECT ENTRIES whose feature class is A or P (Areas and Populated Places)
+filtered = geoname_data[geoname_data.feature_class.isin("A P".split())].copy() # Only keep areas and populated places
+#CLEAR RAM
+del geoname_data
+
+
+# IF REGION
+if args.admin_code_1 != "None":
+    filtered = filtered[filtered.admin1_code == args.admin_code_1].copy()
+
+# GET BOUNDS AND REDUCE DATA AVAILABLE FIELDS
+filtered = filtered["geonameid name longitude latitude".split()] # KEEP ONLY ID LABEL AND COORD
+
+
+
+#############################################################################################
+################################# RETRIEVE RELATIONSHIPS ####################################
+#############################################################################################
+
+
+# INITIALIZE RELATION STORE
+rel_store = []
+
+# Retrieve adjacency relationships
+if args.adjacency:
+    logging.info("Retrieve adjacency relationships ! ")
+
+    if not os.path.exists(ADJACENCY_REL_FILENAME):
+        bounds = get_bounds(filtered) # Required to get adjacency relationships
+        rel_store.extend(get_adjacency_rels(filtered,bounds,[360,180],ITER_ADJACENCY))
+        json.dump(rel_store,open(ADJACENCY_REL_FILENAME,'w'))
+    else:
+        logging.info("Open and load data from previous computation!")
+        rel_store=json.load(open(ADJACENCY_REL_FILENAME))
+
+    logging.info("{0} adjacency relationships retrieved ! ".format(len(rel_store)))
+
+# Retrieve inclusion relationships
+if args.inclusion:
+    logging.info("Retrieve inclusion relationships ! ")
+
+    cpt_rel = len(rel_store)
+    rel_store.extend(get_geonames_inclusion_rel(filtered,GEONAMES_HIERARCHY_FN))
+
+    logging.info("{0} inclusion relationships retrieved ! ".format(len(rel_store)-cpt_rel))
+
+
+
+if args.wikipedia_cooc:
+    logging.info("Load Wikipedia Cooccurrence data and merge with geonames")
+    
+    cooc_data = pd.read_csv(COOC_FN,sep="\t")
+    cooc_data["title"] = cooc_data.title.apply(parse_title_wiki)
+    cooc_data["interlinks"] = cooc_data.interlinks.apply(parse_title_wiki)
+    id_wikipediatitle = get_new_ids(cooc_data,filtered.geonameid.max())
+    wikipediatitle_id = {v:k for k,v in id_wikipediatitle.items()}
+    title_coord = {row.title: (row.longitude,row.latitude) for _,row in tqdm(cooc_data.iterrows(),total=len(cooc_data))}
+    cooc_data["geonameid"] = cooc_data.title.apply(lambda x: wikipediatitle_id[x])
+    filtered = pd.concat((filtered,cooc_data["geonameid title longitude latitude".split()].rename(columns={"title":"name"}).copy()))
+    train_cooc_indices,test_cooc_indices = pd.read_csv(COOC_FN+"_train.csv",sep="\t"), pd.read_csv(COOC_FN+"_test.csv",sep="\t")
+    if not "title" in train_cooc_indices:
+        train_cooc_indices,test_cooc_indices = pd.read_csv(COOC_FN+"_train.csv"), pd.read_csv(COOC_FN+"_test.csv")
+    train_indices = train_indices.union(set(train_cooc_indices.title.apply(lambda x: wikipediatitle_id[parse_title_wiki(x)]).values))
+    test_indices = test_indices.union(set(test_cooc_indices.title.apply(lambda x: wikipediatitle_id[parse_title_wiki(x)]).values))
+
+    logging.info("Merged with Geonames data !")
+
+    # EXTRACT rel
+    logging.info("Extracting cooccurrence relationships")
+    cpt=0
+    for ix, row in tqdm(cooc_data.iterrows(),total=len(cooc_data),desc="Extracting Wikipedia Cooccurrence"):
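+        # np.random.choice samples with replacement, so for pages with few interlinks the same
+        # context toponym can be drawn (and stored) several times.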
+        for inter in np.random.choice(row.interlinks.split("|"),COOC_SAMPLING_NUMBER):
+            cpt+=1
+            rel_store.extend([[row.geonameid,wikipediatitle_id[inter]]])
+    logging.info("Extract {0} cooccurrence relationships !".format(cpt))
+
+
+# STORE ID to name
+geoname2name = dict(filtered["geonameid name".split()].values)
+
+# ENCODING NAME USING N-GRAM SPLITTING
+logging.info("Encoding toponyms to ngram...")
+index = NgramIndex(NGRAM_SIZE)
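+# For illustration (assumed NgramIndex behaviour): with NGRAM_SIZE=4, split_and_add("Toulouse")
+# registers the 4-grams "Toul", "oulo", "ulou", "lous", "ouse"; encode() later maps a toponym to
+# the (padded) sequence of their integer ids.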
+
+# Identify all available n-grams
+filtered.name.apply(lambda x: index.split_and_add(x))
+if args.wikipedia_cooc:
+    for title in wikipediatitle_id:
+        index.split_and_add(title)
+
+geoname2encodedname = {row.geonameid : index.encode(row.name) for row in filtered.itertuples()} #init a dict with the 'geonameid' --> 'encoded toponym' association
+
+if args.wikipedia_cooc:
+    geoname2encodedname.update({v:index.encode(k) for k,v in wikipediatitle_id.items()})
+
+# SAVE THE INDEX TO REUSE THE MODEL
+index.save(INDEX_FN)
+
+logging.info("Done !")
+
+
+#############################################################################################
+################################# ENCODE COORDINATES ########################################
+#############################################################################################
+
+
+
+# Encode each geonames entry coordinates
+geoname_vec = {row.geonameid : zero_one_encoding(row.longitude,row.latitude) for row in filtered.itertuples()}
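+# (zero_one_encoding is assumed to rescale longitude/latitude into [0, 1], which matches the
+#  sigmoid activations of the output layers defined further below.)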
+# CLEAR RAM
+del filtered
+
+
+EMBEDDING_DIM = 256
+num_words = len(index.index_ngram) # necessary for the embedding matrix 
+
+logging.info("Preparing Input and Output data...")
+
+
+#############################################################################################
+################################# BUILD TRAIN/TEST DATASETS #################################
+#############################################################################################
+
+X_1_train,X_2_train,y_lat_train,y_lon_train=[],[],[],[]
+X_1_test,X_2_test,y_lat_test,y_lon_test=[],[],[],[]
+y_train,y_test = [],[]
+
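+# Build (toponym, context toponym) pairs from the relation store: the target coordinates are
+# those of the first toponym, and a pair goes to the training set when that first toponym
+# belongs to the train split, to the test set otherwise.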
+for couple in rel_store:
+    geonameId_1,geonameId_2 = couple[0],couple[1]
+    if not geonameId_1 in geoname2encodedname:
+        continue
+    top1,top2 = geoname2encodedname[geonameId_1],geoname2encodedname[geonameId_2]
+    if geonameId_1 in train_indices: #and geonameId_2 in train_indices:
+        
+        X_1_train.append(top1)
+        X_2_train.append(top2)
+
+        y_train.append([geoname_vec[geonameId_1][0],geoname_vec[geonameId_1][1]])
+        #y_lon_train.append(geoname_vec[geonameId_1][0])
+        #y_lat_train.append(geoname_vec[geonameId_1][1])
+    
+    else:
+        X_1_test.append(top1)
+        X_2_test.append(top2)
+
+        y_test.append([geoname_vec[geonameId_1][0],geoname_vec[geonameId_1][1]])
+        #y_lon_test.append(geoname_vec[geonameId_1][0])
+        #y_lat_test.append(geoname_vec[geonameId_1][1])
+
+# NUMPYZE inputs and output lists
+X_1_train = np.array(X_1_train)
+X_2_train = np.array(X_2_train)
+y_lat_train = np.array(y_lat_train)
+y_lon_train = np.array(y_lon_train)
+y_train = np.array(y_train)
+
+X_1_test = np.array(X_1_test)
+X_2_test = np.array(X_2_test)
+y_lat_test = np.array(y_lat_test)
+y_lon_test = np.array(y_lon_test)
+y_test = np.array(y_test)
+
+logging.info("Data prepared !")
+
+
+# check for output dir
+if not os.path.exists("outputs/"):
+    os.makedirs("outputs/")
+
+#############################################################################################
+################################# NGRAM EMBEDDINGS ##########################################
+#############################################################################################
+
+
+logging.info("Generating N-GRAM Embedding...")
+embedding_weights = index.get_embedding_layer(geoname2encodedname.values(),dim= EMBEDDING_DIM,iter=WORDVEC_ITER)
+logging.info("Embedding generated !")
+
+#############################################################################################
+################################# MODEL DEFINITION ##########################################
+#############################################################################################
+
+
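+# Architecture sketch (as built below): the two n-gram sequences share a frozen embedding layer,
+# each is encoded by its own Bi-LSTM, the two encodings are concatenated, and two separate dense
+# towers predict longitude and latitude in [0, 1]; the concatenated (lon, lat) output is trained
+# with a haversine-based loss.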
+input_1 = Input(shape=(index.max_len,))
+input_2 = Input(shape=(index.max_len,))
+
+embedding_layer = Embedding(num_words, EMBEDDING_DIM,input_length=index.max_len,weights=[embedding_weights],trainable=False)#, trainable=True)
+
+x1 = embedding_layer(input_1)
+x2 = embedding_layer(input_2)
+
+# Each input toponym is encoded by its own Bi-LSTM
+x1 = Bidirectional(LSTM(98))(x1)
+x2 = Bidirectional(LSTM(98))(x2)
+
+x = concatenate([x1,x2])#,x3])
+
+x1 = Dense(500,activation="relu")(x)
+# x1 = Dropout(0.3)(x1)
+x1 = Dense(500,activation="relu")(x1)
+# x1 = Dropout(0.3)(x1)
+
+x2 = Dense(500,activation="relu")(x)
+# x2 = Dropout(0.3)(x2)
+x2 = Dense(500,activation="relu")(x2)
+# x2 = Dropout(0.3)(x2)
+
+output_lon = Dense(1,activation="sigmoid",name="Output_LON")(x1)
+output_lat = Dense(1,activation="sigmoid",name="Output_LAT")(x2)
+
+output_coord = concatenate([output_lon,output_lat],name="output_coord")
+
+model = Model(inputs = [input_1,input_2], outputs = output_coord)#input_3
+
+model.compile(loss={"output_coord":haversine_tf_1circle}, optimizer='adam',metrics={"output_coord":accuracy_k(ACCURACY_TOLERANCE)})
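+# accuracy_k(ACCURACY_TOLERANCE) is assumed to report the fraction of predictions whose haversine
+# distance to the true location falls below the given tolerance.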
+
+# model = Model(inputs = [input_1,input_2], outputs = [output_lon,output_lat])#input_3
+
+# model.compile(loss=['mean_squared_error','mean_squared_error'], optimizer='adam',metrics={"Output_LON":lon_accuracy(),"Output_LAT":lat_accuracy()})
+
+
+#############################################################################################
+################################# TRAINING LAUNCH ###########################################
+#############################################################################################
+
+checkpoint = ModelCheckpoint(MODEL_OUTPUT_FN + ".part", monitor='loss', verbose=1,
+    save_best_only=True, mode='auto', period=1)
+
+epoch_timer = EpochTimer("outputs/"+PREFIX_OUTPUT_FN+"_epoch_timer_output.csv")
+
+
+history = model.fit(x=[X_1_train,X_2_train],
+    y=y_train,#[y_lon_train,y_lat_train],
+    verbose=True, batch_size=100,
+    epochs=EPOCHS,
+    validation_data=([X_1_test,X_2_test],y_test),#[y_lon_test,y_lat_test]),
+    callbacks=[checkpoint,epoch_timer])
+
+
+hist_df = pd.DataFrame(history.history)
+hist_df.to_csv(HISTORY_FN)
+
+model.save(MODEL_OUTPUT_FN)
+
+# Erase the model checkpoint file (a file or a directory depending on the Keras version)
+if os.path.exists(MODEL_OUTPUT_FN + ".part"):
+    try:
+        import shutil
+        shutil.rmtree(MODEL_OUTPUT_FN + ".part")
+    except NotADirectoryError:
+        os.remove(MODEL_OUTPUT_FN + ".part")
\ No newline at end of file
diff --git a/train_geocoder_v2.py b/train_geocoder_v2.py
new file mode 100644
index 0000000000000000000000000000000000000000..b44fada124eba4215d899f4350a68d1719d52956
--- /dev/null
+++ b/train_geocoder_v2.py
@@ -0,0 +1,242 @@
+# Base module 
+import os
+import sys
+
+# Structure
+import pandas as pd
+import numpy as np
+
+# DEEPL module
+from keras.layers import Dense, Input, Embedding,concatenate,Bidirectional,LSTM, Dropout
+from keras.models import Model
+from keras.callbacks import ModelCheckpoint
+
+# Custom module
+from lib.utils_geo import zero_one_encoding
+from lib.ngram_index import NgramIndex
+from lib.word_index import WordIndex
+from lib.utils import ConfigurationReader
+from lib.utils_geo import accuracy_k,haversine_tf_1circle
+from helpers import EpochTimer
+from lib.datageneratorv4 import DataGenerator
+
+# Logging
+import logging
+logging.getLogger('gensim').setLevel(logging.WARNING)
+logging.basicConfig( # LOGGING CONF
+    format='[%(asctime)s][%(levelname)s] %(message)s ', 
+    datefmt='%m/%d/%Y %I:%M:%S %p',
+    level=logging.INFO  
+    )
+
+import tensorflow as tf
+try:
+    physical_devices = tf.config.list_physical_devices('GPU')
+    tf.config.experimental.set_memory_growth(physical_devices[0], enable=True)
+except:
+    print("NO GPU FOUND...")
+
+# COMMAND ARGS
+args = ConfigurationReader("./parser_config/toponym_combination_embedding_v3.json")\
+    .parse_args()#("IGN ../data/IGN/IGN_inclusion.csv ../data/IGN/IGN_adjacent_corrected.csv ../data/IGN/IGN_cooc.csv -i -w  -a -n 4 --ngram-word2vec-iter 1".split())
+
+#
+#################################################
+############# MODEL TRAINING PARAMETER ##########
+#################################################
+MODEL_NAME = "Bi-LSTM_NGRAM"
+NGRAM_SIZE = args.ngram_size
+ACCURACY_TOLERANCE = args.tolerance_value
+EPOCHS = args.epochs
+WORDVEC_ITER = args.ngram_word2vec_iter
+EMBEDDING_DIM = args.dimension
+#################################################
+########## FILENAME VARIABLE ####################
+#################################################
+INCLUSION_FN = args.geoname_inclusion
+ADJACENT_FN = args.geonames_adjacent
+COOC_FN = args.wikipedia_cooc
+
+DATASET_NAME = args.dataset_name
+
+PREFIX_OUTPUT_FN = DATASET_NAME
+PREFIX_OUTPUT_FN+="_{0}".format(NGRAM_SIZE)
+PREFIX_OUTPUT_FN+="_{0}".format(EPOCHS)
+
+if args.adjacency:
+    PREFIX_OUTPUT_FN += "_A"
+if args.inclusion:
+    PREFIX_OUTPUT_FN += "_I"
+if args.wikipedia:
+    PREFIX_OUTPUT_FN += "_P"
+
+MODEL_OUTPUT_FN = "outputs/{0}.h5".format(PREFIX_OUTPUT_FN)
+INDEX_FN = "outputs/{0}_index".format(PREFIX_OUTPUT_FN)
+HISTORY_FN = "outputs/{0}.csv".format(PREFIX_OUTPUT_FN)
+
+#############################################################################################
+################################# LOAD DATA #################################################
+#############################################################################################
+
+data_used = []
+
+if args.wikipedia:
+    data_used.append(pd.read_csv(COOC_FN,sep="\t"))
+
+if args.inclusion:
+    data_used.append(pd.read_csv(INCLUSION_FN,sep="\t"))
+
+if args.adjacency:
+    data_used.append(pd.read_csv(ADJACENT_FN, sep="\t"))
+
+if len(data_used) < 1:
+    print("No toponym relation type selected. Stopping the program...")
+    sys.exit(1)
+
+pairs_of_toponym = pd.concat(data_used)
+
+#############################################################################################
+################################# RETRIEVE RELATIONSHIPS ####################################
+#############################################################################################
+
+# ENCODING NAME USING N-GRAM SPLITTING
+logging.info("Encoding toponyms to ngram...")
+index = NgramIndex(NGRAM_SIZE)
+if args.tokenization_method == "word-level":
+    index = WordIndex()
+if args.tokenization_method == "bert":
+    index = NgramIndex(NGRAM_SIZE,bert_tokenization=True)
+
+# Identify all available n-grams
+pairs_of_toponym.toponym.apply(lambda x: index.split_and_add(x))
+pairs_of_toponym.toponym_context.apply(lambda x: index.split_and_add(x))
+
+num_words = len(index.index_ngram) # necessary for the embedding matrix
+
+# SAVE THE INDEX TO REUSE THE MODEL
+index.save(INDEX_FN)
+logging.info("Done !")
+
+#############################################################################################
+################################# NGRAM EMBEDDINGS ##########################################
+#############################################################################################
+
+logging.info("Generating N-GRAM Embedding...")
+embedding_weights = index.get_embedding_layer([index.encode(p) for p in np.concatenate((pairs_of_toponym.toponym.unique(),pairs_of_toponym.toponym_context.unique()))],dim= EMBEDDING_DIM,iter=WORDVEC_ITER)
+logging.info("Embedding generated !")
+
+#############################################################################################
+################################# BUILD TRAIN/TEST DATASETS #################################
+#############################################################################################
+logging.info("Preparing Input and Output data...")
+
+training_generator = DataGenerator(pairs_of_toponym[pairs_of_toponym.split == "train"],index)
+validation_generator = DataGenerator(pairs_of_toponym[pairs_of_toponym.split == "test"],index)
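+# DataGenerator (lib.datageneratorv4) is assumed to yield batches of encoded (toponym, context
+# toponym) inputs with [0, 1]-rescaled (lon, lat) targets built on the fly, replacing the
+# commented-out in-memory preparation below.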
+# X_1_train,X_2_train=[],[]
+# X_1_test,X_2_test=[],[]
+# y_train,y_test = [],[]
+
+# for couple in pairs_of_toponym["toponym toponym_context split longitude latitude".split()].itertuples():
+#     top,top_c,split_ = couple[1], couple[2], couple[3]
+#     coord = zero_one_encoding(couple[-2],couple[-1]) # 0 and 1 encoding
+#     enc_top, enc_top_c = index.encode(top),index.encode(top_c)
+#     if split_ == "train":
+#         X_1_train.append(enc_top)
+#         X_2_train.append(enc_top_c)
+#         y_train.append(list(coord))
+#     else:
+#         X_1_test.append(enc_top)
+#         X_2_test.append(enc_top_c)
+#         y_test.append(list(coord))
+
+# # "NUMPYZE" inputs and output lists
+# X_1_train = np.array(X_1_train)
+# X_2_train = np.array(X_2_train)
+# y_train = np.array(y_train)
+
+# X_1_test = np.array(X_1_test)
+# X_2_test = np.array(X_2_test)
+# y_test = np.array(y_test)
+
+logging.info("Data prepared !")
+
+
+# check for output dir
+if not os.path.exists("outputs/"):
+    os.makedirs("outputs/")
+
+
+#############################################################################################
+################################# MODEL DEFINITION ##########################################
+#############################################################################################
+
+input_1 = Input(shape=(index.max_len,))
+input_2 = Input(shape=(index.max_len,))
+
+embedding_layer = Embedding(num_words, EMBEDDING_DIM,input_length=index.max_len,weights=[embedding_weights],trainable=False)#, trainable=True)
+
+x1 = embedding_layer(input_1)
+x2 = embedding_layer(input_2)
+
+# Each input toponym is encoded with a Bi-LSTM: two separate layers when args.lstm_layer == 2, otherwise a single shared layer
+if args.lstm_layer == 2:
+    x1 = Bidirectional(LSTM(100))(x1)
+    x2 = Bidirectional(LSTM(100))(x2)
+    x = concatenate([x1,x2])
+else:
+    lstm_unique_layer = Bidirectional(LSTM(100))
+    x1 = lstm_unique_layer(x1)
+    x2 = lstm_unique_layer(x2)
+    x = concatenate([x1,x2])
+
+x1 = Dense(500,activation="relu")(x)
+x1 = Dense(500,activation="relu")(x1)
+
+x2 = Dense(500,activation="relu")(x)
+x2 = Dense(500,activation="relu")(x2)
+
+output_lon = Dense(1,activation="sigmoid",name="Output_LON")(x1)
+output_lat = Dense(1,activation="sigmoid",name="Output_LAT")(x2)
+
+output_coord = concatenate([output_lon,output_lat],name="output_coord")
+
+model = Model(inputs = [input_1,input_2], outputs = output_coord)#input_3
+model.compile(loss={"output_coord":haversine_tf_1circle}, optimizer='adam',metrics={"output_coord":accuracy_k(ACCURACY_TOLERANCE)})
+
+print("Neural Network Architecture : ")
+print(model.summary())
+#############################################################################################
+################################# TRAINING LAUNCH ###########################################
+#############################################################################################
+
+checkpoint = ModelCheckpoint(MODEL_OUTPUT_FN + ".part", monitor='loss', verbose=1,
+    save_best_only=True, mode='auto', period=1)
+
+epoch_timer = EpochTimer("outputs/"+PREFIX_OUTPUT_FN+"_epoch_timer_output.csv")
+
+
+
+history = model.fit(training_generator,verbose=True,
+                    validation_data=validation_generator,
+                    callbacks=[checkpoint,epoch_timer],epochs=EPOCHS)
+
+# history = model.fit(x=[X_1_train,X_2_train],
+#     y=y_train,
+#     verbose=True, batch_size=100,
+#     epochs=EPOCHS,
+#     validation_data=([X_1_test,X_2_test],y_test),#[y_lon_test,y_lat_test]),
+#     callbacks=[checkpoint,epoch_timer])
+
+
+hist_df = pd.DataFrame(history.history)
+hist_df.to_csv(HISTORY_FN)
+
+model.save(MODEL_OUTPUT_FN)
+
+# Erase Model Checkpoint file
+if os.path.exists(MODEL_OUTPUT_FN + ".part"):
+    try:
+        import shutil
+        shutil.rmtree(MODEL_OUTPUT_FN + ".part")
+    except: # Depends on Keras version
+        os.remove(MODEL_OUTPUT_FN + ".part")
\ No newline at end of file
diff --git a/train_test_split_cooccurrence_data.py b/train_test_split_cooccurrence_data.py
new file mode 100644
index 0000000000000000000000000000000000000000..47fb607a8b2a64eade2da20d1a3de7159fff5f18
--- /dev/null
+++ b/train_test_split_cooccurrence_data.py
@@ -0,0 +1,68 @@
+import argparse
+
+import pandas as pd
+import numpy as np
+import geopandas as gpd
+
+import logging
+logging.basicConfig(
+    format='[%(asctime)s][%(levelname)s] %(message)s ', 
+    datefmt='%m/%d/%Y %I:%M:%S %p',
+    level=logging.INFO
+    )
+
+from sklearn.model_selection import train_test_split
+from shapely.geometry import Point
+
+from lib.utils_geo import latlon2healpix
+
+from tqdm import tqdm 
+
+parser = argparse.ArgumentParser()
+parser.add_argument("cooccurrence_file")
+parser.add_argument("-s",action="store_true")
+
+args = parser.parse_args()#("data/wikipedia/cooccurrence_FR.txt".split())#("data/geonamesData/FR.txt".split())
+
+# LOAD DATA
+COOC_FN = args.cooccurrence_file
+
+logging.info("Load Cooc DATA data...")
+cooc_data = pd.read_csv(COOC_FN,sep="\t").fillna("")
+logging.info("Cooc data loaded!")
+
+
+cooc_data["cat"] = cooc_data.apply(lambda x:latlon2healpix(x.latitude,x.longitude,64),axis=1)
+
+# TRAIN AND TEST SPLIT
+logging.info("Split Between Train and Test")
+
+# Some cells can be empty or hold a single row, so first find a cell that can be split
+# in order to initialise X_train/X_test, then split the remaining cells.
+first_cat = 0
+while True:
+    if len(cooc_data[cooc_data.cat == first_cat]) > 1:
+        X_train,X_test = train_test_split(cooc_data[cooc_data.cat == first_cat])
+        break
+    first_cat += 1
+
+for i in np.unique(cooc_data.cat.values):
+    if i == first_cat:
+        continue # this cell was already split above
+    try:
+        if not args.s:
+            x_train,x_test = train_test_split(cooc_data[cooc_data.cat == i])
+        else:
+            x_train,x_test = train_test_split(cooc_data[cooc_data.cat == i].sample(frac=0.1))
+
+        X_train,X_test = pd.concat((X_train,x_train)),pd.concat((X_test,x_test))
+    except Exception as e:
+        print(e) #print("Error",len(cooc_data[cooc_data.cat == i]))
+
+del X_train["cat"]
+del X_test["cat"]
+
+# SAVING THE DATA
+logging.info("Saving Output !")
+suffix =""
+if args.s:
+    suffix = "10per"
+X_train.to_csv(COOC_FN+suffix+"_train.csv")
+X_test.to_csv(COOC_FN+suffix+"_test.csv")
diff --git a/train_test_split_geonames.py b/train_test_split_geonames.py
new file mode 100644
index 0000000000000000000000000000000000000000..9aaf44907cbe713af2570c4256ad27d447c6ad97
--- /dev/null
+++ b/train_test_split_geonames.py
@@ -0,0 +1,66 @@
+import argparse
+
+import numpy as np
+import pandas as pd
+import geopandas as gpd
+
+import logging
+logging.basicConfig(
+    format='[%(asctime)s][%(levelname)s] %(message)s ', 
+    datefmt='%m/%d/%Y %I:%M:%S %p',
+    level=logging.INFO
+    )
+
+from sklearn.model_selection import train_test_split
+
+from lib.utils_geo import latlon2healpix
+from helpers import read_geonames
+
+from tqdm import tqdm 
+
+parser = argparse.ArgumentParser()
+parser.add_argument("geoname_file")
+parser.add_argument("--feature_classes",help="List of class",default="A P")
+
+args = parser.parse_args()#("data/geonamesData/FR.txt".split())
+
+# LOAD DATA
+GEONAME_FN = args.geoname_file
+FEATURE_CLASSES = args.feature_classes
+
+
+logging.info("Load Geonames data...")
+geoname_data = read_geonames(GEONAME_FN).fillna("")
+logging.info("Geonames data loaded!")
+
+# SELECT ENTRIES whose feature class is in FEATURE_CLASSES (default: A and P, i.e. areas and populated places)
+filtered = geoname_data[geoname_data.feature_class.isin(FEATURE_CLASSES.split())].copy()
+
+filtered["cat"] = filtered.apply(lambda x:latlon2healpix(x.latitude,x.longitude,64),axis=1)
+# TRAIN AND TEST SPLIT
+logging.info("Split Between Train and Test")
+
+# Some cells can be empty or hold a single row, so first find a cell that can be split
+# in order to initialise X_train/X_test, then split the remaining cells.
+cat_unique = filtered.cat.unique()
+ci = 0
+while True:
+    if len(filtered[filtered.cat == cat_unique[ci]]) > 1:
+        X_train,X_test = train_test_split(filtered[filtered.cat == cat_unique[ci]])
+        break
+    ci += 1
+
+for i in cat_unique[ci+1:]: # the cell used for initialisation is already split
+    try:
+        x_train,x_test = train_test_split(filtered[filtered.cat == i])
+        X_train,X_test = pd.concat((X_train,x_train)),pd.concat((X_test,x_test))
+    except ValueError:
+        pass # cells with fewer than 2 rows cannot be split
+
+
+del X_train["cat"]
+del X_test["cat"]
+
+# SAVING THE DATA
+logging.info("Saving Output !")
+X_train.to_csv(GEONAME_FN+"_train.csv")
+X_test.to_csv(GEONAME_FN+"_test.csv")
\ No newline at end of file