Commit 5a46b9f9 authored by Jacques Fize

CHANGE README + DEBUG and CLEANING

parent c1530d9e
*.png filter=lfs diff=lfs merge=lfs -text
# Toponym Geocoding
Use of ngram representation and co-location of toponyms in geography and text for geocoding.
<div style="text-align:center">
<img src="documentation/imgs/LSTM_arch.png"/>
<p><strong>Figure 1</strong>: General workflow</p>
</div>
This repository also contains various approaches around geographic place embedding and, more precisely, its use for geocoding. So far, three approaches have been designed:
* Use of the Wikipedia pages of geographic places to learn an embedding for toponyms
* Use of the Geonames place topology to produce an embedding using graph-embedding techniques
* Use of toponym co-location based on spatial relationships (inclusion, adjacency) for geocoding
<hr>
## Setup environment
- Python 3.6+
- OS independent (all dependencies should work on Windows!)

It is strongly advised to use Anaconda in a Windows environment!
...@@ -23,24 +25,32 @@ For Anaconda users
while read requirement; do conda install --yes $requirement; done < requirements.txt
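To quickly check that the environment is usable, the sketch below (not part of the repository) tries to import the main libraries visible in the code of this commit; the deep-learning layers suggest Keras, so adjust the list if your install provides them through `tensorflow.keras`. The authoritative dependency list remains `requirements.txt`.

```python
# Minimal environment sanity check (a sketch, not part of the repository).
# Only modules visible in this commit's code are listed; see requirements.txt
# for the complete dependency list.
import importlib

for module_name in ["pandas", "gensim", "keras", "tqdm", "joblib"]:
    try:
        importlib.import_module(module_name)
        print("{0}: OK".format(module_name))
    except ImportError as error:
        print("{0}: MISSING ({1})".format(module_name, error))
```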
<hr>
## Embedding: training using the concatenation of close places
<div style="text-align:center">
<img src="documentation/imgs/third_approach.png"/>
<p><strong>Figure 3</strong>: Third approach general workflow</p>
</div>
<hr>
## Prepare required data
### Geonames data
* download the Geonames data used to train the network [here](download.geonames.org/export/dump/)
* download the hierarchy data [here](http://download.geonames.org/export/dump/hierarchy.zip)
* unzip both files in the directory of your choice
* run the script `train_test_split_geonames.py <geoname_filename>` (a quick sketch for inspecting the Geonames dump is given below)
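As a sanity check on the downloaded dump, the sketch below (not part of the repository) loads a Geonames file with pandas. The column names are an assumption taken from the standard Geonames dump layout, and the file path is a placeholder; the training script only relies on `geonameid`, `name`, `latitude` and `longitude`.

```python
# Sketch: inspect a Geonames dump (e.g. FR.txt). Column names are assumed
# from the official Geonames dump layout; the path is a placeholder.
import pandas as pd

GEONAMES_COLUMNS = [
    "geonameid", "name", "asciiname", "alternatenames", "latitude", "longitude",
    "feature_class", "feature_code", "country_code", "cc2", "admin1_code",
    "admin2_code", "admin3_code", "admin4_code", "population", "elevation",
    "dem", "timezone", "modification_date",
]

geonames = pd.read_csv("FR.txt", sep="\t", names=GEONAMES_COLUMNS, header=None)
print(geonames[["geonameid", "name", "latitude", "longitude"]].head())
```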
### Cooccurrence data
* First, download the Wikipedia corpus from which you want to extract co-occurrences: [English Wikipedia Corpus](https://dumps.wikimedia.org/enwiki/20200201/enwiki-20200201-pages-articles.xml.bz2)
* Parse the corpus with the Gensim script using the following command: `python3 -m gensim.scripts.segment_wiki -i -f <wikicorpus> -o <1stoutputname>.json.gz`
* Build a page of interest file that contains a list of Wikipedia pages. The file must be a CSV with the following columns: title, latitude, longitude.<br> You can find [here](https://projet.liris.cnrs.fr/hextgeo/files/place_en_fr_page_clean.csv) a page of interest file that contains places that appear in both the French and English Wikipedia.
* Then, using the page of interest file, run the command: `python3 script/get_cooccurrence.py <page_of_interest_file> <2ndoutputname> -c <1stoutputname>.json.gz`
* Finally, split the resulting dataset with the script `train_test_split_cooccurrence_data.py <2ndoutputname>` (a read-back sketch for the resulting file follows this list)
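For reference, `script/get_cooccurrence.py` writes a tab-separated file with the header `title`, `interlinks`, `longitude`, `latitude`, where `interlinks` is a `|`-joined list of co-occurring page titles. Below is a minimal read-back sketch; the file name is a placeholder.

```python
# Sketch: read the co-occurrence file produced by script/get_cooccurrence.py.
# The format (TSV with a "|"-joined interlinks column) comes from the script
# itself; the file name below is a placeholder.
import pandas as pd

cooc = pd.read_csv("cooccurrence_EN.txt", sep="\t")
cooc["interlinks"] = cooc["interlinks"].str.split("|")
print(cooc[["title", "longitude", "latitude"]].head())
print(cooc["interlinks"].iloc[0][:5])  # a few pages co-occurring with the first place
```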
### If you're in a hurry
The French Geonames data, the French Wikipedia co-occurrence data, and their train/test splits can be found here: [https://projet.liris.cnrs.fr/hextgeo/files/](https://projet.liris.cnrs.fr/hextgeo/files/)
<hr>
## Train the network
The script `combination_embeddings.py` is responsible for training the neural network (an illustrative invocation sketch follows the parameter table below).
...@@ -51,13 +61,16 @@ To train the network with default parameter use the following command :
### Available parameters
| Parameter             | Description                                                                             |
|-----------------------|-----------------------------------------------------------------------------------------|
| -i,--inclusion        | Use inclusion relationships to train the network                                        |
| -a,--adjacency        | Use adjacency relationships to train the network                                        |
| -w,--wikipedia-cooc   | Use Wikipedia place co-occurrences to train the network                                 |
| --wikipedia-cooc-fn   | File that contains the co-occurrence data                                               |
| --cooc-sample-size    | Number of co-occurrence relations selected for each location in the co-occurrence data  |
| --adjacency-iteration | Number of iterations in the adjacency extraction process                                |
| -n,--ngram-size       | ngram size                                                                              |
| -t,--tolerance-value  | K value in the computation of the accuracy@k                                            |
| -e,--epochs           | number of epochs                                                                        |
| -d,--dimension        | size of the ngram embeddings                                                            |
| --admin_code_1        | (Optional) Train the network on a specific region only                                  |
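As an illustration only (the README's full default command is not visible in this diff), the sketch below assembles a training run from the flags above and launches it with `subprocess`. The two positional arguments (Geonames file, then hierarchy file) are assumed from `args.geoname_input` and `args.geoname_hierachy_input` in `combination_embeddings.py`, and every file path and flag value is a placeholder.

```python
# Hypothetical invocation sketch: flags come from the table above; positional
# arguments and file paths are assumptions/placeholders, not repository defaults.
import subprocess

cmd = [
    "python3", "combination_embeddings.py",
    "-i",                                          # use inclusion relationships
    "-w",                                          # use Wikipedia co-occurrences
    "--wikipedia-cooc-fn", "cooccurrence_FR.txt",  # placeholder co-occurrence file
    "-n", "4",                                     # ngram size
    "-e", "100",                                   # number of epochs
    "FR.txt",                                      # assumed positional arg: Geonames file
    "hierarchy.txt",                               # assumed positional arg: hierarchy file
]
subprocess.run(cmd, check=True)
```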
...@@ -80,18 +80,20 @@ EPOCHS = args.epochs
ITER_ADJACENCY = args.adjacency_iteration
COOC_SAMPLING_NUMBER = args.cooc_sample_size
WORDVEC_ITER = args.ngram_word2vec_iter
EMBEDDING_DIM = 256
#################################################
########## FILENAME VARIABLE ####################
#################################################
GEONAME_FN = args.geoname_input
DATASET_NAME = args.geoname_input.split("/")[-1]
GEONAMES_HIERARCHY_FN = args.geoname_hierachy_input
REGION_SUFFIX_FN = "" if args.admin_code_1 == "None" else "_" + args.admin_code_1
ADJACENCY_REL_FILENAME = "{0}_{1}{2}adjacency.json".format(
    GEONAME_FN,
    ITER_ADJACENCY,
    REGION_SUFFIX_FN)
COOC_FN = args.wikipedia_cooc_fn
PREFIX_OUTPUT_FN = "{0}_{1}_{2}_{3}_{4}".format(
    GEONAME_FN.split("/")[-1],
    EPOCHS,
...@@ -99,15 +101,39 @@ PREFIX_OUTPUT_FN = "{0}_{1}_{2}_{3}_{4}".format(
    ACCURACY_TOLERANCE,
    REGION_SUFFIX_FN)
REL_CODE=""
if args.adjacency:
    PREFIX_OUTPUT_FN += "_A"
    REL_CODE += "A"
if args.inclusion:
    PREFIX_OUTPUT_FN += "_I"
    REL_CODE += "I"
if args.wikipedia_cooc:
    PREFIX_OUTPUT_FN += "_C"
    REL_CODE += "C"

MODEL_OUTPUT_FN = "outputs/{0}.h5".format(PREFIX_OUTPUT_FN)
INDEX_FN = "outputs/{0}_index".format(PREFIX_OUTPUT_FN)
HISTORY_FN = "outputs/{0}.csv".format(PREFIX_OUTPUT_FN)
from lib.utils import MetaDataSerializer

meta_data = MetaDataSerializer(
    DATASET_NAME,
    REL_CODE,
    COOC_SAMPLING_NUMBER,
    ITER_ADJACENCY,
    NGRAM_SIZE,
    ACCURACY_TOLERANCE,
    EPOCHS,
    EMBEDDING_DIM,
    WORDVEC_ITER,
    INDEX_FN,
    MODEL_OUTPUT_FN,
    HISTORY_FN
)
meta_data.save("outputs/{0}.json".format(PREFIX_OUTPUT_FN))
#############################################################################################
################################# LOAD DATA #################################################
...@@ -231,7 +257,7 @@ geoname_vec = {row.geonameid : zero_one_encoding(row.longitude,row.latitude) for
del filtered
EMBEDDING_DIM = 256
num_words = len(index.index_ngram) # necessary for the embedding matrix
logging.info("Preparing Input and Output data...")
...@@ -288,7 +314,7 @@ if not os.path.exists("outputs/"):
logging.info("Generating N-GRAM Embedding...") logging.info("Generating N-GRAM Embedding...")
embedding_weights = index.get_embedding_layer(geoname2encodedname.values(),dim= embedding_dim,iter=WORDVEC_ITER) embedding_weights = index.get_embedding_layer(geoname2encodedname.values(),dim= EMBEDDING_DIM,iter=WORDVEC_ITER)
logging.info("Embedding generated !") logging.info("Embedding generated !")
############################################################################################# #############################################################################################
...@@ -299,7 +325,7 @@ logging.info("Embedding generated !") ...@@ -299,7 +325,7 @@ logging.info("Embedding generated !")
input_1 = Input(shape=(index.max_len,)) input_1 = Input(shape=(index.max_len,))
input_2 = Input(shape=(index.max_len,)) input_2 = Input(shape=(index.max_len,))
embedding_layer = Embedding(num_words, EMBEDDING_DIM,input_length=index.max_len,weights=[embedding_weights],trainable=False)#, trainable=True)
x1 = embedding_layer(input_1)
x2 = embedding_layer(input_2)
...@@ -311,14 +337,14 @@ x2 = Bidirectional(LSTM(98))(x2)
x = concatenate([x1,x2])#,x3])
x1 = Dense(500,activation="relu")(x)
# x1 = Dropout(0.3)(x1)
x1 = Dense(500,activation="relu")(x1)
# x1 = Dropout(0.3)(x1)
x2 = Dense(500,activation="relu")(x)
# x2 = Dropout(0.3)(x2)
x2 = Dense(500,activation="relu")(x2)
# x2 = Dropout(0.3)(x2)
output_lon = Dense(1,activation="sigmoid",name="Output_LON")(x1)
output_lat = Dense(1,activation="sigmoid",name="Output_LAT")(x2)
...@@ -344,7 +370,7 @@ history = model.fit(x=[X_1_train,X_2_train],
hist_df = pd.DataFrame(history.history)
hist_df.to_csv(HISTORY_FN)
model.save(MODEL_OUTPUT_FN)
...
Image files changed in documentation/imgs/ (sizes shown as before → after):
documentation/imgs/LSTM_arch.png: 130 B
documentation/imgs/first_approach.png: 291 KiB → 131 B
documentation/imgs/second_approach.png: 447 KiB → 131 B
documentation/imgs/third_approach.png: 30.4 KiB → 130 B
...@@ -78,3 +78,47 @@ class ConfigurationReader(object):
        if not input_:
            return self.parser.parse_args()
        return self.parser.parse_args(input_)
class MetaDataSerializer(object):
    def __init__(self,
                 dataset_name,
                 rel_code,
                 cooc_sample_size,
                 adj_iteration,
                 ngram_size,
                 tolerance_value,
                 epochs,
                 embedding_dim,
                 word2vec_iter_nb,
                 index_fn,
                 keras_model_fn,
                 train_test_history_fn):
        self.dataset_name = dataset_name
        self.rel_code = rel_code
        self.cooc_sample_size = cooc_sample_size
        self.adj_iteration = adj_iteration
        self.ngram_size = ngram_size
        self.tolerance_value = tolerance_value
        self.epochs = epochs
        self.embedding_dim = embedding_dim
        self.word2vec_iter_nb = word2vec_iter_nb
        self.index_fn = index_fn
        self.keras_model_fn = keras_model_fn
        self.train_test_history_fn = train_test_history_fn

    def save(self, fn):
        json.dump({
            "dataset_name" : self.dataset_name,
            "rel_code" : self.rel_code,
            "cooc_sample_size" : self.cooc_sample_size,
            "adj_iteration" : self.adj_iteration,
            "ngram_size" : self.ngram_size,
            "tolerance_value" : self.tolerance_value,
            "epochs" : self.epochs,
            "embedding_dim" : self.embedding_dim,
            "word2vec_iter_nb" : self.word2vec_iter_nb,
            "index_fn" : self.index_fn,
            "keras_model_fn" : self.keras_model_fn,
            "train_test_history_fn" : self.train_test_history_fn
        }, open(fn, 'w'))
\ No newline at end of file
...@@ -7,6 +7,7 @@
{ "short": "-i", "long": "--inclusion", "action": "store_true" }, { "short": "-i", "long": "--inclusion", "action": "store_true" },
{ "short": "-a", "long": "--adjacency", "action": "store_true" }, { "short": "-a", "long": "--adjacency", "action": "store_true" },
{ "short": "-w", "long": "--wikipedia-cooc", "action": "store_true" }, { "short": "-w", "long": "--wikipedia-cooc", "action": "store_true" },
{ "long": "--wikipedia-cooc-fn","help":"Cooccurrence data filename"},
{ "long": "--cooc-sample-size", "type": "int", "default": 3 }, { "long": "--cooc-sample-size", "type": "int", "default": 3 },
{"long": "--adjacency-iteration", "type":"int","default":1}, {"long": "--adjacency-iteration", "type":"int","default":1},
{ "short": "-n", "long": "--ngram-size", "type": "int", "default": 2 }, { "short": "-n", "long": "--ngram-size", "type": "int", "default": 2 },
...
import gzip
import json
import re
import argparse
import pandas as pd
from joblib import Parallel,delayed
from tqdm import tqdm
parser = argparse.ArgumentParser()
parser.add_argument("page_of_interest_fn")
parser.add_argument("output_fn")
parser.add_argument("-c","--corpus",action="append")
args = parser.parse_args()#("../wikidata/sample/place_en_fr_page_clean_onlyfrplace.csv test.txt -c frwiki-latest.json.gz -c enwiki-latest.json.gz".split())
PAGES_OF_INTEREST_FILE = args.page_of_interest_fn
WIKIPEDIA_CORPORA = args.corpus
OUTPUT_FN = args.output_fn
if not WIKIPEDIA_CORPORA:  # args.corpus is None when no -c option is given
    raise Exception('No corpus was given!')
df = pd.read_csv(PAGES_OF_INTEREST_FILE)
page_of_interest = set(df.title.values)
page_coord = {row.title : (row.longitude,row.latitude) for ix,row in df.iterrows()}
output = open(OUTPUT_FN,'w')
output.write("title\tinterlinks\tlongitude\tlatitude\n")
for wikipedia_corpus in WIKIPEDIA_CORPORA:
    for line in tqdm(gzip.GzipFile(wikipedia_corpus,'rb')):
        data = json.loads(line)
        if data["title"] in page_of_interest:
            occ = page_of_interest.intersection(data["interlinks"].keys())
            coord = page_coord[data["title"]]
            if len(occ) > 0:
                output.write(data["title"]+"\t"+"|".join(occ)+"\t{0}\t{1}".format(*coord)+"\n")
...@@ -20,7 +20,7 @@ from tqdm import tqdm
parser = argparse.ArgumentParser()
parser.add_argument("cooccurrence_file")
args = parser.parse_args("data/wikipedia/cooccurrence_FR.txt".split())#("data/geonamesData/FR.txt".split()) args = parser.parse_args()#("data/wikipedia/cooccurrence_FR.txt".split())#("data/geonamesData/FR.txt".split())
# LOAD DATA
COOC_FN = args.cooccurrence_file
...@@ -82,4 +82,4 @@ del X_test["nn"]
# SAVING THE DATA
logging.info("Saving Output !")
X_train.to_csv(COOC_FN+"_train.csv")
X_test.to_csv(COOC_FN+"_test.csv")
\ No newline at end of file