Update Tutoriel-geoparsing.ipynb

d7ead727 · Ludovic Moncla · da3edf34 · d7ead727
Commit d7ead727 authored 2 years ago by Ludovic Moncla
--- a/Tutoriel-geoparsing.ipynb
+++ b/Tutoriel-geoparsing.ipynb
@@ -176,19 +176,28 @@
   "metadata": {},
   "source": [
    "## 5. Named Entity Recognition (NER)\n",
-    "\n",
-    "\n",
-    "### 5.1 Stanza NER\n",
-    "\n",
-    "\n",
    "\n"
   ]
  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### 5.1 Stanza NER"
+   ]
+  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
-    "* Import the `Stanza` library and download the pre-trained models for French: "
+    "* Import the `Stanza` library and download the pre-trained model for French: "
   ]
  },
  {
@@ -202,6 +211,13 @@
    "stanza.download('fr')"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "* Instantiate and configure the processing pipeline:"
+   ]
+  },
  {
   "cell_type": "code",
   "execution_count": null,
@@ -211,6 +227,13 @@
    "stanza_parser = stanza.Pipeline(lang='fr', processors='tokenize,ner')"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "* Run named entity recognition:"
+   ]
+  },
  {
   "cell_type": "code",
   "execution_count": null,
@@ -220,6 +243,13 @@
    "doc = stanza_parser(content)"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "* Display the list of recognized named entities:"
+   ]
+  },
  {
   "cell_type": "code",
   "execution_count": null,
@@ -341,12 +371,28 @@
    "### 5.3 Perdido Geoparser"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "* Instantiate and configure the processing pipeline:"
+   ]
+  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
-   "source": []
+   "source": [
+    "geoparser = Geoparser(version=\"Encyclopedie\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "* Run named entity recognition:"
+   ]
  },
  {
   "cell_type": "code",
@@ -354,13 +400,57 @@
   "metadata": {},
   "outputs": [],
   "source": [
-    "displacy.render(d['data'][1].to_spacy_doc(), style=\"ent\", jupyter=True) "
+    "doc = geoparser(content)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
-   "source": []
+   "source": [
+    "* Display the list of recognized named entities:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "for ent in doc.named_entities:\n",
+    "    print(ent.text, ent.tag)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "* Display the named entities graphically with `displaCy`:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "displacy.render(doc.to_spacy_doc(), style=\"ent\", jupyter=True)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "* Display the extended named entities graphically with `displaCy`:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "displacy.render(doc.to_spacy_doc(), style=\"span\", jupyter=True)"
+   ]
  },
  {
   "cell_type": "markdown",

 %% Cell type:markdown id: tags:
 ![CNRS](https://anf-tdm-2022.sciencesconf.org/data/header/LOGO_CNRS_CMJN_150x150.png)
 # Tutorial - ANF TDM 2022 Python Geoparsing
 Materials for the workshop [Librairies Python et Services Web pour la reconnaissance d’entités nommées et la résolution de toponymes](https://anf-tdm-2022.sciencesconf.org/resource/page/id/11) (Python libraries and web services for named entity recognition and toponym resolution), part of the CNRS training course [ANF TDM 2022](https://anf-tdm-2022.sciencesconf.org).
 ## 1. Overview
 In this tutorial, we will learn how to:
 - Load datasets:
  - from the Python library [Perdido](https://github.com/ludovicmoncla/perdido) into a [Pandas dataframe](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) (encyclopedia articles and hiking descriptions);
  - from txt files read from the local disk.
 - Manipulate and query a dataframe
 - Use the [Stanza](https://stanfordnlp.github.io/stanza/index.html), [spaCy](https://spacy.io) and [Perdido](https://github.com/ludovicmoncla/perdido) libraries for named entity recognition:
  - display the annotated named entities;
  - compare the results of `Stanza`, `spaCy` and `Perdido`;
  - discuss the limitations of the three tools for the NER task.
 - Use the `Perdido` library for geoparsing:
  - map the geocoded places;
  - illustrate the problem of toponym disambiguation.
 %% Cell type:markdown id: tags:
 ## 2. Introduction
 %% Cell type:markdown id: tags:
 ## 3. Setting up the environment
 ### 3.1 Installing the Python libraries
 * If you set up your Conda environment using the `requirements.txt` file, you can skip this step and go straight to section 3.2.
 * If you set up your Conda environment using the `environment.yml` file, or if you are using a Google Colab / Binder environment, you need to install `perdido` with `pip`:
 %% Cell type:code id: tags:
 ``` python
 !pip install --upgrade perdido
 ```
 %% Cell type:markdown id: tags:
 * If you have already set up your conda environment, either with conda or with pip (see the readme file), you can skip the next cell.
 * If you are running this notebook from Google Colab / Binder, you must run the next cell:
 %% Cell type:code id: tags:
 ``` python
 !pip install stanza
 ```
 %% Cell type:markdown id: tags:
 ### 3.2 Importing the libraries
 First, we load the `Perdido` modules used in this notebook. Then we import a few tools that will help us analyse and visualise the text.
 %% Cell type:code id: tags:
 ``` python
 import warnings
 warnings.filterwarnings('ignore')
 from perdido.geoparser import Geoparser
 from perdido.geocoder import Geocoder
 from perdido.datasets import load_edda_artfl, load_edda_perdido, load_choucas_perdido
 from spacy import displacy
 ```
 %% Cell type:markdown id: tags:
 ## 4. Loading and exploring the data
 ### 4.1 Loading a text document from a file
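The cells of this subsection are still empty in this commit. As a minimal sketch, assuming the document is a UTF-8 plain-text file, the text can be read into the `content` variable that the NER cells in section 5 expect. The helper `load_text` and the file name `sample_article.txt` are hypothetical and only serve the illustration; the sample sentence is written to a temporary directory so the example runs without external data.

```python
import tempfile
from pathlib import Path


def load_text(path, encoding="utf-8"):
    """Return the full content of a plain-text file as a single string."""
    return Path(path).read_text(encoding=encoding)


# Hypothetical sample file, created only so that the sketch is runnable.
with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = Path(tmp_dir) / "sample_article.txt"
    sample_path.write_text("Lyon est une ville de France, sur le Rhône.", encoding="utf-8")

    # `content` is the variable name used by the NER cells of section 5.
    content = load_text(sample_path)

print(content)
```

With a real corpus, replace `sample_path` with the path to your own txt file.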
 %% Cell type:code id: tags:
 ``` python
 ```
 %% Cell type:code id: tags:
 ``` python
 ```
 %% Cell type:markdown id: tags:
 ### 4.2 Loading a dataset from the Perdido library
 Perdido ships with 2 datasets:
 1. encyclopedia articles (volume 7 of Diderot and d'Alembert's Encyclopédie), provided by ARTFL as part of the GEODE project.
 2. hiking descriptions (each description is associated with its GPS track). They come from the website visorando.fr and were collected as part of the ANR CHOUCAS project.
 %% Cell type:code id: tags:
 ``` python
 ```
 %% Cell type:code id: tags:
 ``` python
 d = load_choucas_perdido()
 df = d['data'].to_dataframe()
 df.head()
 ```
 %% Cell type:markdown id: tags:
 ### 4.3 Manipulating a dataframe
 %% Cell type:code id: tags:
 ``` python
 ```
 %% Cell type:markdown id: tags:
 ## 5. Named Entity Recognition (NER)
-### 5.1 Stanza NER
+%% Cell type:code id: tags:
+``` python
+```
+%% Cell type:markdown id: tags:
+### 5.1 Stanza NER
 %% Cell type:markdown id: tags:
-* Import the `Stanza` library and download the pre-trained models for French:
+* Import the `Stanza` library and download the pre-trained model for French:
 %% Cell type:code id: tags:
 ``` python
 import stanza
 stanza.download('fr')
 ```
+%% Cell type:markdown id: tags:
+* Instantiate and configure the processing pipeline:
 %% Cell type:code id: tags:
 ``` python
 stanza_parser = stanza.Pipeline(lang='fr', processors='tokenize,ner')
 ```
+%% Cell type:markdown id: tags:
+* Run named entity recognition:
 %% Cell type:code id: tags:
 ``` python
 doc = stanza_parser(content)
 ```
+%% Cell type:markdown id: tags:
+* Display the list of recognized named entities:
 %% Cell type:code id: tags:
 ``` python
 for ent in doc.ents:
    print(ent.text, ent.type)
 ```
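%% Cell type:markdown id: tags:
For geoparsing, only some entity types are of interest. The loop above can be narrowed to one type with a small filter. This is a hedged sketch: the helper `entities_of_type` is hypothetical, and the label names `LOC` and `PER` are assumptions to check against what your model actually emits. Stand-in objects replace `doc.ents` so that the example runs without downloading a model.

```python
from collections import Counter
from types import SimpleNamespace


def entities_of_type(entities, wanted="LOC"):
    """Return the text of every entity whose `type` attribute equals `wanted`."""
    return [ent.text for ent in entities if ent.type == wanted]


# Stand-in objects exposing the two attributes used above (`text`, `type`).
# With Stanza loaded, pass `doc.ents` instead of `mock_ents`.
mock_ents = [
    SimpleNamespace(text="Lyon", type="LOC"),
    SimpleNamespace(text="Diderot", type="PER"),
    SimpleNamespace(text="Rhône", type="LOC"),
]

print(entities_of_type(mock_ents))  # ['Lyon', 'Rhône']

# Number of entities per type, useful when comparing tools.
print(Counter(ent.type for ent in mock_ents))
```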
 %% Cell type:markdown id: tags:
 ### 5.2 SpaCy NER
 %% Cell type:markdown id: tags:
 * Install the pre-trained French model for `spaCy`:
 %% Cell type:code id: tags:
 ``` python
 !python -m spacy download fr_core_news_sm
 ```
 %% Cell type:markdown id: tags:
 * Import the `spaCy` library:
 %% Cell type:code id: tags:
 ``` python
 import spacy
 ```
 %% Cell type:markdown id: tags:
 * Load the pre-trained French `spaCy` model:
 %% Cell type:code id: tags:
 ``` python
 spacy_parser = spacy.load('fr_core_news_sm')
 ```
 %% Cell type:markdown id: tags:
 * Run named entity recognition:
 %% Cell type:code id: tags:
 ``` python
 doc = spacy_parser(content)
 ```
 %% Cell type:markdown id: tags:
 * Display the list of recognized named entities:
 %% Cell type:code id: tags:
 ``` python
 for ent in doc.ents:
    print(ent.text, ent.label_)
 ```
 %% Cell type:markdown id: tags:
 * Display the named entities graphically with `displaCy`:
 %% Cell type:code id: tags:
 ``` python
 displacy.render(doc, style="ent", jupyter=True)
 ```
 %% Cell type:markdown id: tags:
 ### 5.3 Perdido Geoparser
+%% Cell type:markdown id: tags:
+* Instantiate and configure the processing pipeline:
 %% Cell type:code id: tags:
 ``` python
+geoparser = Geoparser(version="Encyclopedie")
 ```
+%% Cell type:markdown id: tags:
+* Run named entity recognition:
 %% Cell type:code id: tags:
 ``` python
-displacy.render(d['data'][1].to_spacy_doc(), style="ent", jupyter=True)
+doc = geoparser(content)
 ```
 %% Cell type:markdown id: tags:
+* Display the list of recognized named entities:
+%% Cell type:code id: tags:
+``` python
+for ent in doc.named_entities:
+    print(ent.text, ent.tag)
+```
+%% Cell type:markdown id: tags:
+* Display the named entities graphically with `displaCy`:
+%% Cell type:code id: tags:
+``` python
+displacy.render(doc.to_spacy_doc(), style="ent", jupyter=True)
+```
+%% Cell type:markdown id: tags:
+* Display the extended named entities graphically with `displaCy`:
+%% Cell type:code id: tags:
+``` python
+displacy.render(doc.to_spacy_doc(), style="span", jupyter=True)
+```
 %% Cell type:markdown id: tags:
 ## 6. Geoparsing / Geocoding
 %% Cell type:code id: tags:
 ``` python
 # geocoding with perdido
 ```
 %% Cell type:code id: tags:
 ``` python
 # display a map
 d['data'][1].get_folium_map()
 ```
 %% Cell type:markdown id: tags:
 ### 6.2 Toponym resolution / disambiguation
 %% Cell type:markdown id: tags:
 Example queries without any disambiguation strategy
 %% Cell type:code id: tags:
 ``` python
 ```
 %% Cell type:code id: tags:
 ``` python
 ```
 %% Cell type:markdown id: tags:
 Restricting the query to a limited area
 First level: using a country code.
 %% Cell type:code id: tags:
 ``` python
 ```
 %% Cell type:code id: tags:
 ``` python
 ```
 %% Cell type:markdown id: tags:
 Second level: using a bounding box that delimits the search area
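The two restriction levels can be illustrated independently of any geocoding service. The sketch below does not use the Perdido API: it filters a hand-written candidate list in plain Python. The helpers `in_bounding_box` and `filter_candidates`, the dictionary keys, the `(min_lon, min_lat, max_lon, max_lat)` ordering of the box, and the approximate coordinates are all assumptions made for the illustration.

```python
def in_bounding_box(lat, lon, bbox):
    """Check whether a point lies inside bbox = (min_lon, min_lat, max_lon, max_lat)."""
    min_lon, min_lat, max_lon, max_lat = bbox
    return min_lat <= lat <= max_lat and min_lon <= lon <= max_lon


def filter_candidates(candidates, country_code=None, bbox=None):
    """Keep the candidates that match the country code and/or fall inside the box."""
    kept = []
    for cand in candidates:
        if country_code is not None and cand["country"] != country_code:
            continue
        if bbox is not None and not in_bounding_box(cand["lat"], cand["lon"], bbox):
            continue
        kept.append(cand)
    return kept


# Illustrative candidates for the ambiguous toponym "Vienne" (approximate coordinates).
candidates = [
    {"name": "Vienne (Isère)", "country": "FR", "lat": 45.52, "lon": 4.87},
    {"name": "Vienne (Wien)", "country": "AT", "lat": 48.21, "lon": 16.37},
    {"name": "Vienne (département)", "country": "FR", "lat": 46.58, "lon": 0.34},
]

# First level: the country code removes the Austrian candidate.
in_france = filter_candidates(candidates, country_code="FR")

# Second level: a box around the Lyon area keeps a single candidate.
in_box = filter_candidates(candidates, bbox=(4.0, 45.0, 6.0, 46.5))

print([c["name"] for c in in_france])
print([c["name"] for c in in_box])
```

The country code alone still leaves two French candidates, which is why the narrower bounding box is presented as a second level.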
 %% Cell type:code id: tags:
 ``` python
 ```
 %% Cell type:code id: tags:
 ``` python
 ```
 %% Cell type:markdown id: tags:
 Disambiguation based on geographic proximity
 Clustering with the DBSCAN method. This strategy is suited to route descriptions, where the places mentioned are expected to be located close to one another.
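To make the idea concrete, here is a small pure-Python sketch of DBSCAN applied to candidate locations. It is not the code behind `cluster_disambiguation()`, whose internals are not shown in this notebook. The coordinates, `eps_km` and `min_samples` are hypothetical values: each of two toponyms has two candidates, one near the route and one far away, and a third place is unambiguous.

```python
from math import asin, cos, radians, sin, sqrt


def haversine_km(p, q):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (p[0], p[1], q[0], q[1]))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))


def dbscan(points, eps_km, min_samples):
    """Minimal DBSCAN: return one cluster label per point (-1 means noise)."""
    labels = [None] * len(points)

    def neighbours(i):
        # Indices of all points within eps_km of point i (including i itself).
        return [j for j, q in enumerate(points) if haversine_km(points[i], q) <= eps_km]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_samples:
            labels[i] = -1  # noise for now, may later become a border point
            continue
        cluster += 1
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster  # noise reached from a core point: border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reach = neighbours(j)
            if len(reach) >= min_samples:
                queue.extend(reach)  # j is a core point: expand the cluster
    return labels


# Hypothetical candidates: (toponym, (lat, lon)).
candidates = [
    ("Col de la Croix", (45.10, 6.00)),
    ("Col de la Croix", (43.00, 0.50)),
    ("Lac Blanc", (45.12, 6.03)),
    ("Lac Blanc", (48.13, 7.09)),
    ("Refuge", (45.11, 6.02)),
]

labels = dbscan([coords for _, coords in candidates], eps_km=10, min_samples=3)

# Candidates labelled as noise are discarded; the rest form the route cluster.
resolved = [name_coords for name_coords, label in zip(candidates, labels) if label != -1]
print(labels)
print(resolved)
```

The isolated candidates end up as noise, so each ambiguous toponym is left with the candidate that lies near the other places of the route.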
 %% Cell type:markdown id: tags:
 #### Results before disambiguation
 %% Cell type:code id: tags:
 ``` python
 d['data'][1].get_folium_map()
 ```
 %% Cell type:code id: tags:
 ``` python
 d['data'][1].cluster_disambiguation()
 ```
 %% Cell type:code id: tags:
 ``` python
 d['data'][1].get_folium_map()
 ```
 %% Cell type:code id: tags:
 ``` python
 ```
 %% Cell type:markdown id: tags:
 Using the context (other named entities found in the text, spatial relations, etc.). These strategies were developed as part of the [Perdido]() project (add ref 2014 and 2016) but are not yet integrated into the Perdido Python library. The library is still under development and improvement. Your comments and feedback are welcome.
 %% Cell type:code id: tags:
 ``` python
 ```
 %% Cell type:markdown id: tags:
 %% Cell type:code id: tags:
 ``` python
 ```
 %% Cell type:code id: tags:
 ``` python
 ```
 %% Cell type:code id: tags:
 ``` python
 ```
 %% Cell type:code id: tags:
 ``` python
 ```