Update Tutoriel-geoparsing.ipynb

Commit 1b14fa59 authored 2 years ago by Ludovic Moncla
--- a/Tutoriel-geoparsing.ipynb
+++ b/Tutoriel-geoparsing.ipynb
@@ -24,13 +24,13 @@
    "  - from the Python library [Perdido](https://github.com/ludovicmoncla/perdido) into a [Pandas dataframe](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) (encyclopedia articles and hiking descriptions);\n",
    "  - from txt files imported from the hard drive.\n",
    "- Manipulate and query a dataframe\n",
-    "- Use named entity recognition libraries ([spaCy](https://spacy.io), [Stanza](https://stanfordnlp.github.io/stanza/index.html) and [Perdido](https://github.com/ludovicmoncla/perdido))\n",
-    "- Use the `Perdido` library for geoparsing:\n",
+    "- Use the [Stanza](https://stanfordnlp.github.io/stanza/index.html), [spaCy](https://spacy.io) and [Perdido](https://github.com/ludovicmoncla/perdido) libraries for named entity recognition\n",
    "  - display the annotated named entities;\n",
-    "  - map the geocoded places.\n",
-    "- Compare the results of`spaCy`, `Stanza` and `Perdido`\n",
-    "- Discuss the limitations of the 3 tools for the NER task\n",
-    "- Illustrate the problem of toponym disambiguation"
+    "  - compare the results of `Stanza`, `spaCy` and `Perdido`;\n",
+    "  - discuss the limitations of the 3 tools for the NER task.\n",
+    "- Use the `Perdido` library for geoparsing:\n",
+    "  - map the geocoded places;\n",
+    "  - illustrate the problem of toponym disambiguation."
   ]
  },
  {
@@ -40,19 +40,6 @@
    "## 2. Introduction"
   ]
  },
-  {
-   "cell_type": "markdown",
-   "metadata": {},
-   "source": [
-    "### 2.1 spaCy\n",
-    "\n",
-    "\n",
-    "### 2.2 Stanza NER\n",
-    "\n",
-    "\n",
-    "### 2.3 Perdido Geoparser"
-   ]
-  },
  {
   "cell_type": "markdown",
   "metadata": {},
@@ -107,13 +94,253 @@
   "metadata": {},
   "outputs": [],
   "source": [
+    "import warnings\n",
+    "warnings.filterwarnings('ignore')\n",
+    "\n",
    "from perdido.geoparser import Geoparser\n",
    "from perdido.geocoder import Geocoder\n",
-    "from perdido.datasets import load_edda_artfl, load_edda_perdido\n",
+    "\n",
+    "from perdido.datasets import load_edda_artfl, load_edda_perdido, load_choucas_perdido\n",
    "\n",
    "from spacy import displacy"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 4. Loading and exploring the data\n",
+    "\n",
+    "### 4.1 Loading a text document from a file\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### 4.2 Loading a dataset from the Perdido library\n",
+    "\n",
+    "Perdido ships with 2 datasets:\n",
+    " 1. encyclopedia articles (volume 7 of Diderot and d'Alembert's Encyclopédie), provided by ARTFL as part of the GEODE project;\n",
+    " 2. hiking descriptions, each associated with its GPS track. They come from the website visorando.fr and were collected as part of the ANR CHOUCAS project."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "d = load_choucas_perdido()\n",
+    "df = d['data'].to_dataframe()\n",
+    "df.head()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### 4.3 Manipulating a dataframe"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 5. Named Entity Recognition (NER)\n",
+    "\n",
+    "\n",
+    "### 5.1 Stanza NER\n",
+    "\n",
+    "\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### 5.2 spaCy NER"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### 5.3 Perdido Geoparser"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "displacy.render(d['data'][1].to_spacy_doc(), style=\"ent\", jupyter=True) "
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": []
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 6. Geoparsing / Geocoding"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# geocoding with perdido"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# display a map\n",
+    "d['data'][1].get_folium_map()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### 6.2 Toponym resolution / disambiguation\n",
+    "\n",
+    "\n",
+    "\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Example queries without any disambiguation strategy"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Restricting the search area at query time\n",
+    "\n",
+    "First level: using a country code."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Second level: using a bounding box that delimits the search area"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
  {
   "cell_type": "code",
   "execution_count": null,
@@ -121,6 +348,49 @@
   "outputs": [],
   "source": []
  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Disambiguation based on geographic proximity\n",
+    "\n",
+    "Clustering with the DBSCAN method. This strategy suits an itinerary description, where the places mentioned are expected to lie close to one another."
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "#### Results before disambiguation"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "d['data'][1].get_folium_map()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "d['data'][1].cluster_disambiguation()"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "d['data'][1].get_folium_map()"
+   ]
+  },
  {
   "cell_type": "code",
   "execution_count": null,
@@ -128,6 +398,13 @@
   "outputs": [],
   "source": []
  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "Use of context (other named entities found in the text, spatial relations, etc.). These strategies were developed as part of the [Perdido]() project (add ref 2014 and 2016) but are not yet integrated into the Perdido Python library. The library is still under development and improvement. Your comments and feedback are welcome."
+   ]
+  },
  {
   "cell_type": "code",
   "execution_count": null,
@@ -135,6 +412,11 @@
   "outputs": [],
   "source": []
  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": []
+  },
  {
   "cell_type": "code",
   "execution_count": null,

 %% Cell type:markdown id: tags:

 ![CNRS](https://anf-tdm-2022.sciencesconf.org/data/header/LOGO_CNRS_CMJN_150x150.png)


 # Tutorial - ANF TDM 2022 Python Geoparsing

 Materials for the workshop [Librairies Python et Services Web pour la reconnaissance d’entités nommées et la résolution de toponymes](https://anf-tdm-2022.sciencesconf.org/resource/page/id/11) of the CNRS training course [ANF TDM 2022](https://anf-tdm-2022.sciencesconf.org).


 ## 1. In brief


 In this tutorial, we will learn to:

 - Load datasets:
  - from the Python library [Perdido](https://github.com/ludovicmoncla/perdido) into a [Pandas dataframe](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) (encyclopedia articles and hiking descriptions);
  - from txt files imported from the hard drive.
 - Manipulate and query a dataframe
-- Use named entity recognition libraries ([spaCy](https://spacy.io), [Stanza](https://stanfordnlp.github.io/stanza/index.html) and [Perdido](https://github.com/ludovicmoncla/perdido))
-- Use the `Perdido` library for geoparsing:
+- Use the [Stanza](https://stanfordnlp.github.io/stanza/index.html), [spaCy](https://spacy.io) and [Perdido](https://github.com/ludovicmoncla/perdido) libraries for named entity recognition
  - display the annotated named entities;
-  - map the geocoded places.
-- Compare the results of`spaCy`, `Stanza` and `Perdido`
-- Discuss the limitations of the 3 tools for the NER task
-- Illustrate the problem of toponym disambiguation
+  - compare the results of `Stanza`, `spaCy` and `Perdido`;
+  - discuss the limitations of the 3 tools for the NER task.
+- Use the `Perdido` library for geoparsing:
+  - map the geocoded places;
+  - illustrate the problem of toponym disambiguation.

 %% Cell type:markdown id: tags:

 ## 2. Introduction

 %% Cell type:markdown id: tags:

-### 2.1 spaCy
-
-
-### 2.2 Stanza NER
-
-
-### 2.3 Perdido Geoparser
-
-%% Cell type:markdown id: tags:
-
 ## 3. Setting up the environment

 ### 3.1 Installing the Python libraries

 * If you set up your Conda environment with the `requirements.txt` file, you can skip this step and go to section 3.2.
 * If you set up your Conda environment with the `environment.yml` file, or if you are using a Google Colab / Binder environment, you need to install `perdido` with `pip`:

 %% Cell type:code id: tags:

 ``` python
 !pip install --upgrade perdido
 ```

 %% Cell type:markdown id: tags:

 * If you have already set up your environment, either with conda or with pip (see the readme file), you can skip the next cell.
 * If you are running this notebook from Google Colab / Binder, you must run the next cell:

 %% Cell type:code id: tags:

 ``` python
 !pip install stanza
 ```

 %% Cell type:markdown id: tags:

 ### 3.2 Importing the libraries


 First, we load the `Perdido` modules used in this notebook. We then import a few tools that help us analyze and visualize the text.

 %% Cell type:code id: tags:

 ``` python
+import warnings
+warnings.filterwarnings('ignore')
+
 from perdido.geoparser import Geoparser
 from perdido.geocoder import Geocoder
-from perdido.datasets import load_edda_artfl, load_edda_perdido
+
+from perdido.datasets import load_edda_artfl, load_edda_perdido, load_choucas_perdido

 from spacy import displacy
 ```

+%% Cell type:markdown id: tags:
+
+## 4. Loading and exploring the data
+
+### 4.1 Loading a text document from a file
+
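A minimal sketch of loading a plain-text document from disk, using only the standard library. The file name `description.txt` and its content are hypothetical; the example writes the file first so that it runs on its own.

```python
from pathlib import Path
import tempfile

# Hypothetical sample file, created here only so the example is self-contained.
tmp_dir = Path(tempfile.mkdtemp())
path = tmp_dir / "description.txt"
path.write_text("Départ du parking, monter vers le lac Blanc.", encoding="utf-8")

# Read the whole document into a single string, ready to be passed to an NER tool.
text = path.read_text(encoding="utf-8")
print(text)
```

Reading with an explicit `encoding="utf-8"` avoids platform-dependent defaults, which matters for accented French text.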
+%% Cell type:code id: tags:
+
+``` python
+```
+
+%% Cell type:code id: tags:
+
+``` python
+```
+
+%% Cell type:markdown id: tags:
+
+### 4.2 Loading a dataset from the Perdido library
+
+Perdido ships with 2 datasets:
+ 1. encyclopedia articles (volume 7 of Diderot and d'Alembert's Encyclopédie), provided by ARTFL as part of the GEODE project;
+ 2. hiking descriptions, each associated with its GPS track. They come from the website visorando.fr and were collected as part of the ANR CHOUCAS project.
+
+%% Cell type:code id: tags:
+
+``` python
+```
+
+%% Cell type:code id: tags:
+
+``` python
+d = load_choucas_perdido()
+df = d['data'].to_dataframe()
+df.head()
+```
+
+%% Cell type:markdown id: tags:
+
+### 4.3 Manipulating a dataframe
+
+%% Cell type:code id: tags:
+
+``` python
+```
+
+%% Cell type:markdown id: tags:
+
+## 5. Named Entity Recognition (NER)
+
+
+### 5.1 Stanza NER
+
+
+
+
+%% Cell type:code id: tags:
+
+``` python
+```
+
+%% Cell type:code id: tags:
+
+``` python
+```
+
+%% Cell type:markdown id: tags:
+
+### 5.2 spaCy NER
+
+%% Cell type:code id: tags:
+
+``` python
+```
+
+%% Cell type:code id: tags:
+
+``` python
+```
+
+%% Cell type:markdown id: tags:
+
+### 5.3 Perdido Geoparser
+
+%% Cell type:code id: tags:
+
+``` python
+```
+
+%% Cell type:code id: tags:
+
+``` python
+displacy.render(d['data'][1].to_spacy_doc(), style="ent", jupyter=True)
+```
+
+%% Cell type:markdown id: tags:
+
+
+%% Cell type:markdown id: tags:
+
+## 6. Geoparsing / Geocoding
+
+%% Cell type:code id: tags:
+
+``` python
+# geocoding with perdido
+```
+
+%% Cell type:code id: tags:
+
+``` python
+# display a map
+d['data'][1].get_folium_map()
+```
+
+%% Cell type:markdown id: tags:
+
+### 6.2 Toponym resolution / disambiguation
+
+
+
+
+%% Cell type:markdown id: tags:
+
+Example queries without any disambiguation strategy
+
+%% Cell type:code id: tags:
+
+``` python
+```
+
+%% Cell type:code id: tags:
+
+``` python
+```
+
+%% Cell type:markdown id: tags:
+
+Restricting the search area at query time
+
+First level: using a country code.
+
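As a rough illustration of this first level, the sketch below builds a search URL for a Nominatim-style gazetteer restricted to one country. It assumes the public Nominatim `search` endpoint and its `countrycodes` parameter; the helper `build_search_url` is hypothetical, and this is not necessarily how the Perdido library issues its own queries. No network request is made.

```python
from urllib.parse import urlencode

NOMINATIM_SEARCH = "https://nominatim.openstreetmap.org/search"

def build_search_url(toponym, country_code=None):
    """Build a search URL, optionally restricted to one two-letter country code."""
    params = {"q": toponym, "format": "json"}
    if country_code:
        # Restrict the results to the given country.
        params["countrycodes"] = country_code.lower()
    return f"{NOMINATIM_SEARCH}?{urlencode(params)}"

url_unrestricted = build_search_url("Saint-Étienne")
url_france = build_search_url("Saint-Étienne", country_code="FR")
print(url_unrestricted)
print(url_france)
```

A country filter removes homonyms located in other countries, but it does not help when several places share a name within the same country.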
+%% Cell type:code id: tags:
+
+``` python
+```
+
+%% Cell type:code id: tags:
+
+``` python
+```
+
+%% Cell type:markdown id: tags:
+
+Second level: using a bounding box that delimits the search area
+
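A minimal sketch of this second level: given the candidate locations returned for an ambiguous toponym, keep only those that fall inside a bounding box. The candidate coordinates are approximate and purely illustrative, and the `(west, south, east, north)` ordering is an assumption of this example, not a statement about the Perdido API.

```python
def in_bbox(lat, lon, bbox):
    """Return True if the point lies inside bbox = (west, south, east, north)."""
    west, south, east, north = bbox
    return south <= lat <= north and west <= lon <= east

# Illustrative candidates for the toponym "Vienne" (approximate coordinates).
candidates = [
    {"name": "Vienne (Isère, France)", "lat": 45.52, "lon": 4.87},
    {"name": "Vienne (Vienna, Austria)", "lat": 48.21, "lon": 16.37},
]

# Rough box around metropolitan France.
bbox_france = (-5.2, 41.3, 9.6, 51.1)

kept = [c for c in candidates if in_bbox(c["lat"], c["lon"], bbox_france)]
print([c["name"] for c in kept])
```

A tighter box (for example around the region of a hike) narrows the candidates further than a country code can.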
+%% Cell type:code id: tags:
+
+``` python
+```
+
+%% Cell type:code id: tags:
+
+``` python
+```
+
+%% Cell type:markdown id: tags:
+
+Disambiguation based on geographic proximity
+
+Clustering with the DBSCAN method. This strategy suits an itinerary description, where the places mentioned are expected to lie close to one another.
+
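To make the idea concrete, here is a small, pure-Python sketch of DBSCAN applied to candidate coordinates, using great-circle (haversine) distance. It is an illustration under stated assumptions, not the implementation behind `cluster_disambiguation()`: the points, the 10 km radius and `min_samples=2` are all hypothetical choices.

```python
from math import asin, cos, radians, sin, sqrt

def haversine_km(a, b):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    lat1, lon1 = map(radians, a)
    lat2, lon2 = map(radians, b)
    h = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(h))

def dbscan(points, eps_km, min_samples):
    """Return one label per point: -1 for noise, otherwise a cluster id starting at 0."""
    labels = [None] * len(points)

    def neighbours(i):
        # Indices of all points within eps_km of point i (including i itself).
        return [j for j, p in enumerate(points) if haversine_km(points[i], p) <= eps_km]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_samples:
            labels[i] = -1  # provisionally noise; may become a border point later
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_samples:
                queue.extend(reachable)  # j is a core point: keep expanding the cluster
    return labels

# Hypothetical candidate locations for the toponyms of one hike description.
points = [
    (45.10, 6.00),  # candidate for toponym 1
    (45.12, 6.02),  # candidate for toponym 2
    (45.11, 6.01),  # candidate for toponym 3
    (48.85, 2.35),  # distant homonym of toponym 3
]
labels = dbscan(points, eps_km=10.0, min_samples=2)
kept = [p for p, label in zip(points, labels) if label != -1]
print(labels, kept)
```

Candidates labelled as noise are the ones a proximity-based strategy would discard, since they lie far from the group of places the itinerary actually visits.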
+%% Cell type:markdown id: tags:
+
+#### Results before disambiguation
+
+%% Cell type:code id: tags:
+
+``` python
+d['data'][1].get_folium_map()
+```
+
 %% Cell type:code id: tags:

 ``` python
+d['data'][1].cluster_disambiguation()
 ```

 %% Cell type:code id: tags:

 ``` python
+d['data'][1].get_folium_map()
 ```

 %% Cell type:code id: tags:

 ``` python
 ```

+%% Cell type:markdown id: tags:
+
+Use of context (other named entities found in the text, spatial relations, etc.). These strategies were developed as part of the [Perdido]() project (add ref 2014 and 2016) but are not yet integrated into the Perdido Python library. The library is still under development and improvement. Your comments and feedback are welcome.
+
+%% Cell type:code id: tags:
+
+``` python
+```
+
+%% Cell type:markdown id: tags:
+
+
 %% Cell type:code id: tags:

 ``` python
 ```

 %% Cell type:code id: tags:

 ``` python
 ```

 %% Cell type:code id: tags:

 ``` python
 ```

 %% Cell type:code id: tags:

 ``` python
 ```