Create Tutoriel-geoparsing.ipynb

Commit 2a4cc6df authored 2 years ago by Ludovic Moncla
--- a/Tutoriel-geoparsing.ipynb
+++ b/Tutoriel-geoparsing.ipynb
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {
+    "colab_type": "text",
+    "id": "YrOKr9pwkxJw"
+   },
+   "source": [
+    "![CNRS](https://anf-tdm-2022.sciencesconf.org/data/header/LOGO_CNRS_CMJN_150x150.png)\n",
+    "\n",
+    "\n",
+    "# Tutorial - ANF TDM 2022 Python Geoparsing\n",
+    "\n",
+    "Materials for the workshop [Python Libraries and Web Services for Named Entity Recognition and Toponym Resolution](https://anf-tdm-2022.sciencesconf.org/resource/page/id/11), part of the CNRS training course [ANF TDM 2022](https://anf-tdm-2022.sciencesconf.org).\n",
+    "\n",
+    "\n",
+    "## 1. Overview\n",
+    "\n",
+    "\n",
+    "In this tutorial, we will learn how to:\n",
+    "\n",
+    "- Load datasets:\n",
+    "  - from the Python library [Perdido](https://github.com/ludovicmoncla/perdido) into a [Pandas dataframe](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) (encyclopedia articles and hiking descriptions);\n",
+    "  - from txt files imported from the local disk.\n",
+    "- Manipulate and query a dataframe\n",
+    "- Use named entity recognition libraries ([spaCy](https://spacy.io), [Stanza](https://stanfordnlp.github.io/stanza/index.html) and [Perdido](https://github.com/ludovicmoncla/perdido))\n",
+    "- Use the `Perdido` library for geoparsing:\n",
+    "  - display the annotated named entities;\n",
+    "  - map the geocoded places.\n",
+    "- Compare the results of `spaCy`, `Stanza` and `Perdido`\n",
+    "- Discuss the limitations of the three tools for the NER task\n",
+    "- Illustrate the problem of toponym disambiguation"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 2. Introduction"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### 2.1 spaCy\n",
+    "\n",
+    "\n",
+    "### 2.2 Stanza NER\n",
+    "\n",
+    "\n",
+    "### 2.3 Perdido Geoparser"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## 3. Set up the environment\n",
+    "\n",
+    "### 3.1 Install the Python libraries\n",
+    "\n",
+    "* If you set up your Conda environment using the `requirements.txt` file, you can skip this step and go to section `3.2 Import the libraries`.\n",
+    "* If you set up your Conda environment using the `environment.yml` file, or if you are using a Google Colab environment, you must install `perdido` using `pip`:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!pip install --upgrade perdido"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "* If you have already set up your environment, either with conda or with pip (see the README file), you can skip the next cell.\n",
+    "* If you are running this notebook from Google Colab, you must run the next cell:\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!pip install stanza"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### 3.2 Import the libraries\n",
+    "\n",
+    "\n",
+    "First, we load the specific `Perdido` modules that we will use in this notebook. Then we import a few tools that will help us analyze and visualize the text."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from perdido.geoparser import Geoparser\n",
+    "from perdido.geocoder import Geocoder\n",
+    "from perdido.datasets import load_edda_artfl, load_edda_perdido\n",
+    "\n",
+    "from spacy import displacy"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "metadata": {},
+   "outputs": [],
+   "source": []
+  }
+ ],
+ "metadata": {
+  "colab": {
+   "collapsed_sections": [],
+   "name": "Geoparsing.ipynb",
+   "provenance": [],
+   "toc_visible": true
+  },
+  "kernelspec": {
+   "display_name": "Python 3 (ipykernel)",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.9.13"
+  },
+  "vscode": {
+   "interpreter": {
+    "hash": "0706207376f2dbae316cdd2388509d85cbd8c808bf8c3cd698a4e4eda55d0086"
+   }
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
+%% Cell type:markdown id: tags:
+![CNRS](https://anf-tdm-2022.sciencesconf.org/data/header/LOGO_CNRS_CMJN_150x150.png)
+# Tutorial - ANF TDM 2022 Python Geoparsing
+Materials for the workshop [Python Libraries and Web Services for Named Entity Recognition and Toponym Resolution](https://anf-tdm-2022.sciencesconf.org/resource/page/id/11), part of the CNRS training course [ANF TDM 2022](https://anf-tdm-2022.sciencesconf.org).
+## 1. Overview
+In this tutorial, we will learn how to:
+- Load datasets:
+  - from the Python library [Perdido](https://github.com/ludovicmoncla/perdido) into a [Pandas dataframe](https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.html) (encyclopedia articles and hiking descriptions);
+  - from txt files imported from the local disk.
+- Manipulate and query a dataframe
+- Use named entity recognition libraries ([spaCy](https://spacy.io), [Stanza](https://stanfordnlp.github.io/stanza/index.html) and [Perdido](https://github.com/ludovicmoncla/perdido))
+- Use the `Perdido` library for geoparsing:
+  - display the annotated named entities;
+  - map the geocoded places.
+- Compare the results of `spaCy`, `Stanza` and `Perdido`
+- Discuss the limitations of the three tools for the NER task
+- Illustrate the problem of toponym disambiguation
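The last item in the list, toponym disambiguation, can be illustrated with a small standard-library sketch. It is not how `Perdido` resolves toponyms; it only shows why a place name alone is ambiguous and how one simple heuristic (choosing the candidate closest to another place mentioned in the same text) can pick a referent. The gazetteer, its approximate coordinates, and the function names are hypothetical and introduced here for illustration only.

```python
from math import radians, sin, cos, asin, sqrt

# Toy gazetteer (hypothetical, approximate coordinates): one name, several places.
GAZETTEER = {
    "Paris": [("Paris, France", 48.86, 2.35), ("Paris, Texas, USA", 33.66, -95.56)],
    "Lyon": [("Lyon, France", 45.76, 4.84)],
}


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = sin((lat2 - lat1) / 2) ** 2 + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2
    return 2 * 6371.0 * asin(sqrt(a))


def disambiguate(name, context_point):
    """Return the candidate for `name` that lies closest to a known context location."""
    candidates = GAZETTEER[name]
    return min(candidates, key=lambda c: haversine_km(c[1], c[2], *context_point))


# A text mentioning both "Lyon" and "Paris": use Lyon as the spatial context.
_, lyon_lat, lyon_lon = GAZETTEER["Lyon"][0]
print(disambiguate("Paris", (lyon_lat, lyon_lon))[0])
```

With Lyon as context the heuristic selects the French Paris; with a context point in Texas it would select the other candidate, which is exactly the ambiguity a geoparser has to resolve.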
+%% Cell type:markdown id: tags:
+## 2. Introduction
+%% Cell type:markdown id: tags:
+### 2.1 spaCy
+### 2.2 Stanza NER
+### 2.3 Perdido Geoparser
+%% Cell type:markdown id: tags:
+## 3. Set up the environment
+### 3.1 Install the Python libraries
+* If you set up your Conda environment using the `requirements.txt` file, you can skip this step and go to section `3.2 Import the libraries`.
+* If you set up your Conda environment using the `environment.yml` file, or if you are using a Google Colab environment, you must install `perdido` using `pip`:
+%% Cell type:code id: tags:
+``` python
+!pip install --upgrade perdido
+```
+%% Cell type:markdown id: tags:
+* If you have already set up your environment, either with conda or with pip (see the README file), you can skip the next cell.
+* If you are running this notebook from Google Colab, you must run the next cell:
+%% Cell type:code id: tags:
+``` python
+!pip install stanza
+```
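Whether the install cells above are needed can also be checked programmatically. The sketch below uses only the standard library to list which of the packages used in this tutorial cannot be imported in the current environment. The helper name `missing_packages` is illustrative and not part of the notebook, and the sketch assumes that each import name matches its `pip` package name.

```python
import importlib.util


def missing_packages(names):
    """Return the top-level module names in `names` that cannot be found."""
    return [name for name in names if importlib.util.find_spec(name) is None]


# Top-level import names of the packages this tutorial relies on.
required = ["perdido", "stanza", "spacy"]

missing = missing_packages(required)
if missing:
    print("Run pip install for:", ", ".join(missing))
else:
    print("All required packages are already installed.")
```

If the list is empty, both `pip install` cells can be skipped.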
+%% Cell type:markdown id: tags:
+### 3.2 Import the libraries
+First, we load the specific `Perdido` modules that we will use in this notebook. Then we import a few tools that will help us analyze and visualize the text.
+%% Cell type:code id: tags:
+``` python
+from perdido.geoparser import Geoparser
+from perdido.geocoder import Geocoder
+from perdido.datasets import load_edda_artfl, load_edda_perdido
+from spacy import displacy
+```
+%% Cell type:code id: tags:
+``` python
+```
+%% Cell type:code id: tags:
+``` python
+```
+%% Cell type:code id: tags:
+``` python
+```
+%% Cell type:code id: tags:
+``` python
+```
+%% Cell type:code id: tags:
+``` python
+```
+%% Cell type:code id: tags:
+``` python
+```
+%% Cell type:code id: tags:
+``` python
+```