{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {
    "colab_type": "text",
    "id": "YrOKr9pwkxJw"
   },
   "source": [
    "![CNRS](https://anf-tdm-2022.sciencesconf.org/data/header/LOGO_CNRS_CMJN_150x150.png)\n",
    "\n",
    "\n",
    "# Tutoriel - ANF TDM 2022 Python Geoparsing \n",
    "\n",
    "Supports pour l'atelier [Librairies Python et Services Web pour la reconnaissance d’entités nommées et la résolution de toponymes](https://anf-tdm-2022.sciencesconf.org/resource/page/id/11) de la formation CNRS [ANF TDM 2022](https://anf-tdm-2022.sciencesconf.org).\n",
    "\n",
    "\n",
    "## 1. En bref\n",
    "\n",
    "\n",
    "Dans ce tutoriel, nous allons apprendre plusieurs choses :\n",
    "\n",
    "- Charger des jeux de données :\n",
    "  - à partir de fichiers txt importés depuis le disque dur ;\n",
    "  - à partir de la librairie Python [Perdido](https://github.com/ludovicmoncla/perdido) dans un [Pandas dataframe](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) (articles encyclopédiques et descriptions de randonnées).\n",
    "- Manipuler et interroger un dataframe\n",
    "- Utiliser les librairies [Stanza](https://stanfordnlp.github.io/stanza/index.html), [spaCy](https://spacy.io) et [Perdido](https://github.com/ludovicmoncla/perdido) pour la reconnaissance d'entités nommées\n",
    "  - afficher les entités nommées annotées ;\n",
    "  - comparer les résultats de `Stanza`, `spaCy` et `Perdido` ;\n",
    "  - discuter les limites des 3 outils pour la tâche de NER.\n",
    "- Utiliser la librarie `Perdido` pour le geoparsing :\n",
    "  - cartographier les lieux geocodés ;\n",
    "  - illustrer la problématique de désambiguïsation des toponymes."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2. Introduction"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3. Configurer l'environnement\n",
    "\n",
    "### 3.1 Installer les librairies Python\n",
    "\n",
    "* Si vous avez configuré votre environnement Conda en utilisant le fichier `requirements.txt`, vous pouvez sauter cette étape et aller à la section `3.2 Importer les librairies`.\n",
    "* Si vous avez configuré votre environnement Conda en utilisant le fichier `environment.yml` ou si vous utilisez un environnement Google Colab / Binder, vous devez installer `perdido` en utilisant `pip` :"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install --upgrade perdido"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Si vous avez déjà configuré votre environnement conda, soit avec conda, soit avec pip (voir le fichier readme), vous pouvez ignorer la cellule suivante.\n",
    "* Si vous exécutez ce notebook depuis Google Colab / Binder, vous devez exécuter la cellule suivante :\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!pip install stanza"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.2 Importer les librairies\n",
    "\n",
    "\n",
    "Tout d'abord, nous allons charger certaines bibliothèques spécifiques de `Perdido` que nous utiliserons dans ce notebook. Ensuite, nous importons quelques outils qui nous aideront à analyser et à visualiser le texte."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "metadata": {},
   "outputs": [
    {
     "ename": "ImportError",
     "evalue": "cannot import name 'load_choucas_perdido' from 'perdido.datasets' (/opt/homebrew/Caskroom/miniforge/base/envs/sunoikisis-py39/lib/python3.9/site-packages/perdido/datasets.py)",
     "output_type": "error",
     "traceback": [
      "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
      "\u001b[0;31mImportError\u001b[0m                               Traceback (most recent call last)",
      "\u001b[1;32m/Users/lmoncla/git/gitlab.liris/lmoncla/tutoriel-anf-tdm-2022-python-geoparsing/Tutoriel-geoparsing.ipynb Cellule 8\u001b[0m in \u001b[0;36m<cell line: 7>\u001b[0;34m()\u001b[0m\n\u001b[1;32m      <a href='vscode-notebook-cell:/Users/lmoncla/git/gitlab.liris/lmoncla/tutoriel-anf-tdm-2022-python-geoparsing/Tutoriel-geoparsing.ipynb#X10sZmlsZQ%3D%3D?line=3'>4</a>\u001b[0m \u001b[39mfrom\u001b[39;00m \u001b[39mperdido\u001b[39;00m\u001b[39m.\u001b[39;00m\u001b[39mgeoparser\u001b[39;00m \u001b[39mimport\u001b[39;00m Geoparser\n\u001b[1;32m      <a href='vscode-notebook-cell:/Users/lmoncla/git/gitlab.liris/lmoncla/tutoriel-anf-tdm-2022-python-geoparsing/Tutoriel-geoparsing.ipynb#X10sZmlsZQ%3D%3D?line=4'>5</a>\u001b[0m \u001b[39mfrom\u001b[39;00m \u001b[39mperdido\u001b[39;00m\u001b[39m.\u001b[39;00m\u001b[39mgeocoder\u001b[39;00m \u001b[39mimport\u001b[39;00m Geocoder\n\u001b[0;32m----> <a href='vscode-notebook-cell:/Users/lmoncla/git/gitlab.liris/lmoncla/tutoriel-anf-tdm-2022-python-geoparsing/Tutoriel-geoparsing.ipynb#X10sZmlsZQ%3D%3D?line=6'>7</a>\u001b[0m \u001b[39mfrom\u001b[39;00m \u001b[39mperdido\u001b[39;00m\u001b[39m.\u001b[39;00m\u001b[39mdatasets\u001b[39;00m \u001b[39mimport\u001b[39;00m load_edda_artfl, load_edda_perdido, load_choucas_perdido\n\u001b[1;32m      <a href='vscode-notebook-cell:/Users/lmoncla/git/gitlab.liris/lmoncla/tutoriel-anf-tdm-2022-python-geoparsing/Tutoriel-geoparsing.ipynb#X10sZmlsZQ%3D%3D?line=8'>9</a>\u001b[0m \u001b[39mfrom\u001b[39;00m \u001b[39mspacy\u001b[39;00m \u001b[39mimport\u001b[39;00m displacy\n",
      "\u001b[0;31mImportError\u001b[0m: cannot import name 'load_choucas_perdido' from 'perdido.datasets' (/opt/homebrew/Caskroom/miniforge/base/envs/sunoikisis-py39/lib/python3.9/site-packages/perdido/datasets.py)"
     ]
    }
   ],
   "source": [
    "import warnings\n",
    "warnings.filterwarnings('ignore')\n",
    "\n",
    "from perdido.geoparser import Geoparser\n",
    "from perdido.geocoder import Geocoder\n",
    "\n",
    "from perdido.datasets import load_edda_artfl, load_edda_perdido, load_choucas_perdido\n",
    "\n",
    "from spacy import displacy"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4. Chargement et exploration des données\n",
    "\n",
    "### 4.1 Chargement d'un document texte à partir d'un fichier\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "metadata": {},
   "outputs": [],
   "source": [
    "filepath = 'data/volume01-4083.txt'\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "* ARQUES, (Géog.) petite ville de France, en Normandie, au pays de Caux, sur la petite riviere d'Arques. Long. 18. 50. lat. 49. 54.\n"
     ]
    }
   ],
   "source": [
    "with open(filepath) as f:\n",
    "    content = f.read()\n",
    "    print(content)\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4.2 Chargement d'un jeu de données à partir de la librairie Perdido\n",
    "\n",
    "Perdido embarque 2 jeux de données : \n",
    " 1. articles encyclopédiques (volume 7 de l'Encyclopédie de Diderot et d'Alembert), fournit par l'ARTFL dans le cadre du projet GEODE.\n",
    " 2. descriptions de randonnées (chaque description est associée à sa trace GPS. Elles proviennent du site visorando.fr et ont été collectées dans le cadre du projet ANR CHOUCAS."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "d = load_choucas_perdido()\n",
    "df = d['data'].to_dataframe()\n",
    "df.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4.3 Manipulation d'un dataframe"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5. Reconnaissance d'Entités Nommées (NER)\n",
    "\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5.1 Stanza NER"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Importer la librairie `Stanza` et télécharger le modèle pré-entrainé pour le français : "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import stanza\n",
    "\n",
    "stanza.download('fr')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Instancier et paramétrer la chaîne de traitement :"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "stanza_parser = stanza.Pipeline(lang='fr', processors='tokenize,ner')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Executer la reconnaissance d'entités nommées :"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "doc = stanza_parser(content)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Afficher la liste des entités nommées repérées :"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for ent in doc.ents:\n",
    "    print(ent.text, ent.type)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5.2 SpaCy NER"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Installer le modèle français pré-entrainé de `spaCy` :"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "!python -m spacy download fr_core_news_sm"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Importer la librarie `spaCy` :"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import spacy"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Charger le modèle français pré-entrainé de `spaCy`"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "spacy_parser = spacy.load('fr_core_news_sm')"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Executer la reconnaissance d'entités nommées :"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "doc = spacy_parser(content)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Afficher la liste des entités nommées repérées :"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for ent in doc.ents:\n",
    "    print(ent.text, ent.label_)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Afficher de manière graphique les entités nommées avec `displaCy` :"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "displacy.render(doc, style=\"ent\", jupyter=True) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5.3 Perdido Geoparser"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Instancier et paramétrer la chaîne de traitement :"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "geoparser = Geoparser(version=\"Encyclopedie\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Executer la reconnaissance d'entités nommées :"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "doc = geoparser(content)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Afficher la liste des entités nommées repérées :"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for ent in doc.named_entities:\n",
    "    print(ent.text, ent.tag)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Afficher de manière graphique les entités nommées avec `displaCy` :"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "displacy.render(doc.to_spacy_doc(), style=\"ent\", jupyter=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "* Afficher de manière graphique les entités nommées étendues avec `displaCy` :"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "displacy.render(doc.to_spacy_doc(), style=\"span\", jupyter=True)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5.4 Expérimentations et comparaison"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6. Geoparsing / Geocoding"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# geocoding avec perdido"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# afficher une carte\n",
    "d['data'][1].get_folium_map()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 6.2 Résolution de toponymes / désambiguïsation\n",
    "\n",
    "\n",
    "\n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Exemple de requetes sans stratégies de désambiguisation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Délimiter un zone restreinte lors de la requête\n",
    "\n",
    "Premier niveau : utilisation d'un code pays."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Deuxième niveau : utilisation d'une bounding box délimitant la zone de recherche"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Désambiguisation basé sur la proximité géographique\n",
    "\n",
    "Clustering avec la méthode DBSCAN. Cette stratégie est adaptée pour une description d'itinéraire où les différents lieux cités doivent être localisés à proximité les uns des autres."
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "#### Résultats avant désambiguisation"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "d['data'][1].get_folium_map()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "d['data'][1].cluster_disambiguation()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "d['data'][1].get_folium_map()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "Utilisation du contexte (autres entités nommées repérées dans le texte, relations spatiales, etc...). Développées dans le cadre du projet [Perdido]() (add ref 2014 et 2016) mais pas encore intégré à la librairie Python Perdido. Cette librairie est toujours en cours de développement et d'amélioration. Vos remarques et retours seront les bienvenues."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": []
  }
 ],
 "metadata": {
  "colab": {
   "collapsed_sections": [],
   "name": "Geoparsing.ipynb",
   "provenance": [],
   "toc_visible": true
  },
  "kernelspec": {
   "display_name": "Python 3.9.13 ('tdm-geoparsing-py39')",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.13"
  },
  "vscode": {
   "interpreter": {
    "hash": "ac6d00d97418b1a9db80d25edbc60018a184534aba90cc2f485de2791198ec07"
   }
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}