diff --git a/Tutoriel-geoparsing.ipynb b/Tutoriel-geoparsing.ipynb index b41b7620f3d61f8927d4142b488afbb091317250..866711c516465a1741a7089628c5a0fab46bd053 100644 --- a/Tutoriel-geoparsing.ipynb +++ b/Tutoriel-geoparsing.ipynb @@ -42,25 +42,15 @@ "\n", "### 2.1 Installer les librairies Python\n", "\n", - "* Si vous avez configuré votre environnement Conda en utilisant le fichier `requirements.txt`, vous pouvez sauter cette étape et aller à la section `3.2 Importer les librairies`.\n", - "* Si vous avez configuré votre environnement Conda en utilisant le fichier `environment.yml` ou si vous utilisez un environnement Google Colab / Binder, vous devez installer `perdido` en utilisant `pip` :" - ] - }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "!pip install perdido" + "* Si vous avez déjà configuré votre environnement, soit avec conda, soit avec pip (voir le fichier readme), vous pouvez ignorer la cellule suivante.\n", + "* Si vous exécutez ce notebook depuis Google Colab / Binder, vous devez exécuter la cellule suivante :" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ - "* Si vous avez déjà configuré votre environnement conda, soit avec conda, soit avec pip (voir le fichier readme), vous pouvez ignorer la cellule suivante.\n", - "* Si vous exécutez ce notebook depuis Google Colab / Binder, vous devez exécuter la cellule suivante :\n" + "\n" ] }, { @@ -69,6 +59,7 @@ "metadata": {}, "outputs": [], "source": [ + "!pip install perdido\n", "!pip install stanza" ] }, @@ -84,7 +75,7 @@ }, { "cell_type": "code", - "execution_count": 1, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -110,7 +101,7 @@ }, { "cell_type": "code", - "execution_count": 2, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -123,12 +114,12 @@ }, { "cell_type": "code", - "execution_count": 3, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# On utilise la 
fonction précédente pour récupérer le contenu de l'article encyclopédique 'Arques' (volume01-4083.txt) présent dans le dossier data\n", - "arques = load_txt('data/volume01-4083.txt')" + "arques = load_txt('data/edda-volume01-4083.txt')" ] }, { @@ -140,17 +131,9 @@ }, { "cell_type": "code", - "execution_count": 4, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "* ARQUES, (Géog.) petite ville de France, en Normandie, au pays de Caux, sur la petite riviere d'Arques. Long. 18. 50. lat. 49. 54.\n" - ] - } - ], + "outputs": [], "source": [ "print(arques)" ] @@ -177,7 +160,7 @@ }, { "cell_type": "code", - "execution_count": 5, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -194,30 +177,9 @@ }, { "cell_type": "code", - "execution_count": 6, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "<class 'pandas.core.frame.DataFrame'>\n", - "RangeIndex: 3385 entries, 0 to 3384\n", - "Data columns (total 7 columns):\n", - " # Column Non-Null Count Dtype \n", - "--- ------ -------------- ----- \n", - " 0 filename 3385 non-null object\n", - " 1 volume 3385 non-null int64 \n", - " 2 number 3385 non-null int64 \n", - " 3 head 3384 non-null object\n", - " 4 normClass 3384 non-null object\n", - " 5 author 3384 non-null object\n", - " 6 text 3385 non-null object\n", - "dtypes: int64(2), object(5)\n", - "memory usage: 185.2+ KB\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "data_artfl.info()" ] @@ -233,7 +195,7 @@ }, { "cell_type": "code", - "execution_count": 7, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -249,30 +211,9 @@ }, { "cell_type": "code", - "execution_count": 8, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "<class 'pandas.core.frame.DataFrame'>\n", - "Int64Index: 3384 entries, 0 to 3384\n", - "Data columns 
(total 7 columns):\n", - " # Column Non-Null Count Dtype \n", - "--- ------ -------------- ----- \n", - " 0 filename 3384 non-null object\n", - " 1 volume 3384 non-null int64 \n", - " 2 number 3384 non-null int64 \n", - " 3 head 3384 non-null object\n", - " 4 normClass 3384 non-null object\n", - " 5 author 3384 non-null object\n", - " 6 text 3384 non-null object\n", - "dtypes: int64(2), object(5)\n", - "memory usage: 211.5+ KB\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "data_artfl.info()" ] @@ -286,115 +227,9 @@ }, { "cell_type": "code", - "execution_count": 9, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "<div>\n", - "<style scoped>\n", - " .dataframe tbody tr th:only-of-type {\n", - " vertical-align: middle;\n", - " }\n", - "\n", - " .dataframe tbody tr th {\n", - " vertical-align: top;\n", - " }\n", - "\n", - " .dataframe thead th {\n", - " text-align: right;\n", - " }\n", - "</style>\n", - "<table border=\"1\" class=\"dataframe\">\n", - " <thead>\n", - " <tr style=\"text-align: right;\">\n", - " <th></th>\n", - " <th>filename</th>\n", - " <th>volume</th>\n", - " <th>number</th>\n", - " <th>head</th>\n", - " <th>normClass</th>\n", - " <th>author</th>\n", - " <th>text</th>\n", - " </tr>\n", - " </thead>\n", - " <tbody>\n", - " <tr>\n", - " <th>0</th>\n", - " <td>volume07-1.tei</td>\n", - " <td>7</td>\n", - " <td>1</td>\n", - " <td>Title Page</td>\n", - " <td>unclassified</td>\n", - " <td>unsigned</td>\n", - " <td>ENCYCLOPÉDIE, ou DICTIONNAIRE RAISONNÉ DES SCI...</td>\n", - " </tr>\n", - " <tr>\n", - " <th>1</th>\n", - " <td>volume07-10.tei</td>\n", - " <td>7</td>\n", - " <td>10</td>\n", - " <td>FOESNE ou FOUANE</td>\n", - " <td>Marine | Pêche</td>\n", - " <td>Bellin</td>\n", - " <td>FOESNE ou FOUANE, sub. s. (Marine & Pêche.) 
c'...</td>\n", - " </tr>\n", - " <tr>\n", - " <th>2</th>\n", - " <td>volume07-100.tei</td>\n", - " <td>7</td>\n", - " <td>100</td>\n", - " <td>Fond de la hune</td>\n", - " <td>unclassified</td>\n", - " <td>Bellin</td>\n", - " <td>Fond de la hune ; ce sont les planches qu on p...</td>\n", - " </tr>\n", - " <tr>\n", - " <th>3</th>\n", - " <td>volume07-1000.tei</td>\n", - " <td>7</td>\n", - " <td>1000</td>\n", - " <td>Fronteau</td>\n", - " <td>Bourrelier | Sellier</td>\n", - " <td>Diderot</td>\n", - " <td>* Fronteau, terme de Sellier-Bourrelier ; c'es...</td>\n", - " </tr>\n", - " <tr>\n", - " <th>4</th>\n", - " <td>volume07-1001.tei</td>\n", - " <td>7</td>\n", - " <td>1001</td>\n", - " <td>FRONTIERE</td>\n", - " <td>Géographie</td>\n", - " <td>Diderot</td>\n", - " <td>* FRONTIERE, s. f. (Géog.) se dit des limites,...</td>\n", - " </tr>\n", - " </tbody>\n", - "</table>\n", - "</div>" - ], - "text/plain": [ - " filename volume number head normClass \\\n", - "0 volume07-1.tei 7 1 Title Page unclassified \n", - "1 volume07-10.tei 7 10 FOESNE ou FOUANE Marine | Pêche \n", - "2 volume07-100.tei 7 100 Fond de la hune unclassified \n", - "3 volume07-1000.tei 7 1000 Fronteau Bourrelier | Sellier \n", - "4 volume07-1001.tei 7 1001 FRONTIERE Géographie \n", - "\n", - " author text \n", - "0 unsigned ENCYCLOPÉDIE, ou DICTIONNAIRE RAISONNÉ DES SCI... \n", - "1 Bellin FOESNE ou FOUANE, sub. s. (Marine & Pêche.) c'... \n", - "2 Bellin Fond de la hune ; ce sont les planches qu on p... \n", - "3 Diderot * Fronteau, terme de Sellier-Bourrelier ; c'es... \n", - "4 Diderot * FRONTIERE, s. f. (Géog.) se dit des limites,... 
" - ] - }, - "execution_count": 9, - "metadata": {}, - "output_type": "execute_result" - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "data_artfl.head()" ] @@ -415,17 +250,9 @@ }, { "cell_type": "code", - "execution_count": 10, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Il y a 3384 articles dans le jeu de données.\n" - ] - } - ], + "outputs": [], "source": [ "n = data_artfl.shape[0]\n", "print('Il y a ' + str(n) + ' articles dans le jeu de données.')" @@ -453,67 +280,9 @@ }, { "cell_type": "code", - "execution_count": 11, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "<div>\n", - "<style scoped>\n", - " .dataframe tbody tr th:only-of-type {\n", - " vertical-align: middle;\n", - " }\n", - "\n", - " .dataframe tbody tr th {\n", - " vertical-align: top;\n", - " }\n", - "\n", - " .dataframe thead th {\n", - " text-align: right;\n", - " }\n", - "</style>\n", - "<table border=\"1\" class=\"dataframe\">\n", - " <thead>\n", - " <tr style=\"text-align: right;\">\n", - " <th></th>\n", - " <th>filename</th>\n", - " <th>volume</th>\n", - " <th>number</th>\n", - " <th>head</th>\n", - " <th>normClass</th>\n", - " <th>author</th>\n", - " <th>text</th>\n", - " </tr>\n", - " </thead>\n", - " <tbody>\n", - " <tr>\n", - " <th>5</th>\n", - " <td>volume07-1002.tei</td>\n", - " <td>7</td>\n", - " <td>1002</td>\n", - " <td>FRONTIGNAN</td>\n", - " <td>Géographie</td>\n", - " <td>Jaucourt</td>\n", - " <td>FRONTIGNAN, (Géog.) petite ville de France. au...</td>\n", - " </tr>\n", - " </tbody>\n", - "</table>\n", - "</div>" - ], - "text/plain": [ - " filename volume number head normClass author \\\n", - "5 volume07-1002.tei 7 1002 FRONTIGNAN Géographie Jaucourt \n", - "\n", - " text \n", - "5 FRONTIGNAN, (Géog.) petite ville de France. au... 
" - ] - }, - "execution_count": 11, - "metadata": {}, - "output_type": "execute_result" - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "frontignan = data_artfl.loc[data_artfl['head'] == 'FRONTIGNAN']\n", "frontignan" @@ -528,19 +297,9 @@ }, { "cell_type": "code", - "execution_count": 12, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "volume : 7\n", - "number : 1002\n", - "text : FRONTIGNAN, (Géog.) petite ville de France. au Bas-Languedoc, connue par ses excellens vins muscats, & ses raisins de caisse qu'on appelle passerilles. Quelques savans croyent, sans en donner de preuves, que cette ville est le forum Domitii des Romains. Elle est située sur l'étang de Maguelone, à six lieues N. E. d'Agde, & cinq S. O. de Montpellier. Long. 15d. 24'. lat. 43d. 28'. (D. J.)\n" - ] - } - ], + "outputs": [], "source": [ "print('volume :', frontignan.volume.item()) # similaire à frontignan['volume'].item()\n", "print('number :', frontignan.number.item())\n", @@ -558,17 +317,9 @@ }, { "cell_type": "code", - "execution_count": 13, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "698 articles ont été rédigés par Jaucourt\n" - ] - } - ], + "outputs": [], "source": [ "req = 'Jaucourt'\n", "d_Jaucourt = data_artfl.loc[data_artfl['author'] == req]\n", @@ -586,122 +337,9 @@ }, { "cell_type": "code", - "execution_count": 14, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "<div>\n", - "<style scoped>\n", - " .dataframe tbody tr th:only-of-type {\n", - " vertical-align: middle;\n", - " }\n", - "\n", - " .dataframe tbody tr th {\n", - " vertical-align: top;\n", - " }\n", - "\n", - " .dataframe thead th {\n", - " text-align: right;\n", - " }\n", - "</style>\n", - "<table border=\"1\" class=\"dataframe\">\n", - " <thead>\n", - " <tr style=\"text-align: right;\">\n", - " <th></th>\n", - " 
<th>filename</th>\n", - " <th>volume</th>\n", - " <th>number</th>\n", - " <th>head</th>\n", - " <th>normClass</th>\n", - " <th>author</th>\n", - " <th>text</th>\n", - " </tr>\n", - " </thead>\n", - " <tbody>\n", - " <tr>\n", - " <th>5</th>\n", - " <td>volume07-1002.tei</td>\n", - " <td>7</td>\n", - " <td>1002</td>\n", - " <td>FRONTIGNAN</td>\n", - " <td>Géographie</td>\n", - " <td>Jaucourt</td>\n", - " <td>FRONTIGNAN, (Géog.) petite ville de France. au...</td>\n", - " </tr>\n", - " <tr>\n", - " <th>29</th>\n", - " <td>volume07-1024.tei</td>\n", - " <td>7</td>\n", - " <td>1024</td>\n", - " <td>FROWARD, le cap.</td>\n", - " <td>Géographie</td>\n", - " <td>Jaucourt</td>\n", - " <td>FROWARD, le cap. (Géog.) & par les François le...</td>\n", - " </tr>\n", - " <tr>\n", - " <th>32</th>\n", - " <td>volume07-1027.tei</td>\n", - " <td>7</td>\n", - " <td>1027</td>\n", - " <td>FRUGALITÉ</td>\n", - " <td>Morale</td>\n", - " <td>Jaucourt</td>\n", - " <td>FRUGALITÉ, (Morale.) simplicité de moeurs & de...</td>\n", - " </tr>\n", - " <tr>\n", - " <th>37</th>\n", - " <td>volume07-1031.tei</td>\n", - " <td>7</td>\n", - " <td>1031</td>\n", - " <td>Fruit verreux</td>\n", - " <td>Histoire naturelle</td>\n", - " <td>Jaucourt</td>\n", - " <td>Fruit verreux, (Hist. nat.) c'est le nom qu'on...</td>\n", - " </tr>\n", - " <tr>\n", - " <th>38</th>\n", - " <td>volume07-1032.tei</td>\n", - " <td>7</td>\n", - " <td>1032</td>\n", - " <td>Fruit, (art de conserver le)</td>\n", - " <td>Economie rustique</td>\n", - " <td>Jaucourt</td>\n", - " <td>Fruit, (art de conserver le) Economie rustiq. ...</td>\n", - " </tr>\n", - " </tbody>\n", - "</table>\n", - "</div>" - ], - "text/plain": [ - " filename volume number head \\\n", - "5 volume07-1002.tei 7 1002 FRONTIGNAN \n", - "29 volume07-1024.tei 7 1024 FROWARD, le cap. 
\n", - "32 volume07-1027.tei 7 1027 FRUGALITÉ \n", - "37 volume07-1031.tei 7 1031 Fruit verreux \n", - "38 volume07-1032.tei 7 1032 Fruit, (art de conserver le) \n", - "\n", - " normClass author \\\n", - "5 Géographie Jaucourt \n", - "29 Géographie Jaucourt \n", - "32 Morale Jaucourt \n", - "37 Histoire naturelle Jaucourt \n", - "38 Economie rustique Jaucourt \n", - "\n", - " text \n", - "5 FRONTIGNAN, (Géog.) petite ville de France. au... \n", - "29 FROWARD, le cap. (Géog.) & par les François le... \n", - "32 FRUGALITÉ, (Morale.) simplicité de moeurs & de... \n", - "37 Fruit verreux, (Hist. nat.) c'est le nom qu'on... \n", - "38 Fruit, (art de conserver le) Economie rustiq. ... " - ] - }, - "execution_count": 14, - "metadata": {}, - "output_type": "execute_result" - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "d_Jaucourt.head()" ] @@ -725,17 +363,9 @@ }, { "cell_type": "code", - "execution_count": 15, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "496 articles sont classés en Géographie\n" - ] - } - ], + "outputs": [], "source": [ "req = 'Géographie'\n", "d_geo = data_artfl[data_artfl['normClass'].str.contains(req, case=False)]\n", @@ -746,17 +376,9 @@ }, { "cell_type": "code", - "execution_count": 16, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "177 articles contiennent l'expression 'ville de'\n" - ] - } - ], + "outputs": [], "source": [ "req = 'ville de'\n", "d_geo = data_artfl[data_artfl['text'].str.contains(req, case=False)]\n", @@ -783,40 +405,9 @@ }, { "cell_type": "code", - "execution_count": 17, - "metadata": {}, - "outputs": [ - { - "data": { - "text/plain": [ - "author\n", - "Anonymous5 1\n", - "Beauzée & Douchet 1\n", - "Boucher d'Argis 14\n", - "Bouchu 1\n", - "Desmarest 1\n", - "Diderot 2\n", - "Jaucourt 141\n", - "Le Blond 1\n", - "Le Blond & d'Alembert 1\n", - 
"Le Roy 1\n", - "Lucotte5 1\n", - "Mallet 1\n", - "Quesnay 1\n", - "Robert de Vaugondy 1\n", - "Tressan 1\n", - "Voltaire 2\n", - "d'Alembert 2\n", - "d'Holbach 1\n", - "unsigned 3\n", - "Name: filename, dtype: int64" - ] - }, - "execution_count": 17, - "metadata": {}, - "output_type": "execute_result" - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "d_geo.groupby(['author'])[\"filename\"].count()" ] @@ -867,7 +458,7 @@ }, { "cell_type": "code", - "execution_count": 19, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -883,33 +474,9 @@ }, { "cell_type": "code", - "execution_count": 20, - "metadata": {}, - "outputs": [ - { - "data": { - "application/vnd.jupyter.widget-view+json": { - "model_id": "e23cc371ca6e46e695d7e4200dbcee84", - "version_major": 2, - "version_minor": 0 - }, - "text/plain": [ - "Downloading https://raw.githubusercontent.com/stanfordnlp/stanza-resources/main/resources_1.2.2.json: 0%| …" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2022-09-29 08:23:00 INFO: Downloading default packages for language: fr (French)...\n", - "2022-09-29 08:23:01 INFO: File exists: /Users/lmoncla/stanza_resources/fr/default.zip.\n", - "2022-09-29 08:23:05 INFO: Finished downloading models and saved to /Users/lmoncla/stanza_resources.\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "stanza.download('fr')" ] @@ -923,31 +490,9 @@ }, { "cell_type": "code", - "execution_count": 21, - "metadata": {}, - "outputs": [ - { - "name": "stderr", - "output_type": "stream", - "text": [ - "2022-09-29 08:23:58 WARNING: Language fr package default expects mwt, which has been added\n", - "2022-09-29 08:23:58 INFO: Loading these models for language: fr (French):\n", - "=======================\n", - "| Processor | Package |\n", - "-----------------------\n", - "| tokenize | gsd |\n", - "| mwt | gsd |\n", - "| 
ner | wikiner |\n", - "=======================\n", - "\n", - "2022-09-29 08:23:58 INFO: Use device: cpu\n", - "2022-09-29 08:23:58 INFO: Loading: tokenize\n", - "2022-09-29 08:23:58 INFO: Loading: mwt\n", - "2022-09-29 08:23:58 INFO: Loading: ner\n", - "2022-09-29 08:23:59 INFO: Done loading processors!\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "stanza_parser = stanza.Pipeline(lang='fr', processors='tokenize,ner')" ] @@ -961,17 +506,9 @@ }, { "cell_type": "code", - "execution_count": 22, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "* ARQUES, (Géog.) petite ville de France, en Normandie, au pays de Caux, sur la petite riviere d'Arques. Long. 18. 50. lat. 49. 54.\n" - ] - } - ], + "outputs": [], "source": [ "print(arques)" ] @@ -985,7 +522,7 @@ }, { "cell_type": "code", - "execution_count": 23, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -1001,7 +538,7 @@ }, { "cell_type": "code", - "execution_count": 26, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -1013,21 +550,9 @@ }, { "cell_type": "code", - "execution_count": 27, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "ARQUES LOC\n", - "France LOC\n", - "Normandie LOC\n", - "pays de Caux LOC\n", - "Arques LOC\n" - ] - } - ], + "outputs": [], "source": [ "# On utilise la fonction précédente pour afficher la liste des entités repérées\n", "show_ents(arques_stanza)" @@ -1055,51 +580,9 @@ }, { "cell_type": "code", - "execution_count": 28, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "Collecting fr-core-news-sm==3.3.0\n", - " Downloading https://github.com/explosion/spacy-models/releases/download/fr_core_news_sm-3.3.0/fr_core_news_sm-3.3.0-py3-none-any.whl (16.3 MB)\n", - "\u001b[2K 
\u001b[90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━\u001b[0m \u001b[32m16.3/16.3 MB\u001b[0m \u001b[31m10.8 MB/s\u001b[0m eta \u001b[36m0:00:00\u001b[0m00:01\u001b[0m00:01\u001b[0m\n", - "\u001b[?25hRequirement already satisfied: spacy<3.4.0,>=3.3.0.dev0 in /usr/local/Caskroom/miniforge/base/envs/tdm-geoparsing-py39/lib/python3.9/site-packages (from fr-core-news-sm==3.3.0) (3.3.1)\n", - "Requirement already satisfied: requests<3.0.0,>=2.13.0 in /usr/local/Caskroom/miniforge/base/envs/tdm-geoparsing-py39/lib/python3.9/site-packages (from spacy<3.4.0,>=3.3.0.dev0->fr-core-news-sm==3.3.0) (2.28.1)\n", - "Requirement already satisfied: spacy-legacy<3.1.0,>=3.0.9 in /usr/local/Caskroom/miniforge/base/envs/tdm-geoparsing-py39/lib/python3.9/site-packages (from spacy<3.4.0,>=3.3.0.dev0->fr-core-news-sm==3.3.0) (3.0.10)\n", - "Requirement already satisfied: packaging>=20.0 in /usr/local/Caskroom/miniforge/base/envs/tdm-geoparsing-py39/lib/python3.9/site-packages (from spacy<3.4.0,>=3.3.0.dev0->fr-core-news-sm==3.3.0) (21.3)\n", - "Requirement already satisfied: cymem<2.1.0,>=2.0.2 in /usr/local/Caskroom/miniforge/base/envs/tdm-geoparsing-py39/lib/python3.9/site-packages (from spacy<3.4.0,>=3.3.0.dev0->fr-core-news-sm==3.3.0) (2.0.6)\n", - "Requirement already satisfied: jinja2 in /usr/local/Caskroom/miniforge/base/envs/tdm-geoparsing-py39/lib/python3.9/site-packages (from spacy<3.4.0,>=3.3.0.dev0->fr-core-news-sm==3.3.0) (3.1.2)\n", - "Requirement already satisfied: srsly<3.0.0,>=2.4.3 in /usr/local/Caskroom/miniforge/base/envs/tdm-geoparsing-py39/lib/python3.9/site-packages (from spacy<3.4.0,>=3.3.0.dev0->fr-core-news-sm==3.3.0) (2.4.4)\n", - "Requirement already satisfied: murmurhash<1.1.0,>=0.28.0 in /usr/local/Caskroom/miniforge/base/envs/tdm-geoparsing-py39/lib/python3.9/site-packages (from spacy<3.4.0,>=3.3.0.dev0->fr-core-news-sm==3.3.0) (1.0.8)\n", - "Requirement already satisfied: tqdm<5.0.0,>=4.38.0 in 
/usr/local/Caskroom/miniforge/base/envs/tdm-geoparsing-py39/lib/python3.9/site-packages (from spacy<3.4.0,>=3.3.0.dev0->fr-core-news-sm==3.3.0) (4.64.1)\n", - "Requirement already satisfied: pathy>=0.3.5 in /usr/local/Caskroom/miniforge/base/envs/tdm-geoparsing-py39/lib/python3.9/site-packages (from spacy<3.4.0,>=3.3.0.dev0->fr-core-news-sm==3.3.0) (0.6.2)\n", - "Requirement already satisfied: thinc<8.1.0,>=8.0.14 in /usr/local/Caskroom/miniforge/base/envs/tdm-geoparsing-py39/lib/python3.9/site-packages (from spacy<3.4.0,>=3.3.0.dev0->fr-core-news-sm==3.3.0) (8.0.17)\n", - "Requirement already satisfied: langcodes<4.0.0,>=3.2.0 in /usr/local/Caskroom/miniforge/base/envs/tdm-geoparsing-py39/lib/python3.9/site-packages (from spacy<3.4.0,>=3.3.0.dev0->fr-core-news-sm==3.3.0) (3.3.0)\n", - "Requirement already satisfied: blis<0.8.0,>=0.4.0 in /usr/local/Caskroom/miniforge/base/envs/tdm-geoparsing-py39/lib/python3.9/site-packages (from spacy<3.4.0,>=3.3.0.dev0->fr-core-news-sm==3.3.0) (0.7.8)\n", - "Requirement already satisfied: preshed<3.1.0,>=3.0.2 in /usr/local/Caskroom/miniforge/base/envs/tdm-geoparsing-py39/lib/python3.9/site-packages (from spacy<3.4.0,>=3.3.0.dev0->fr-core-news-sm==3.3.0) (3.0.7)\n", - "Requirement already satisfied: wasabi<1.1.0,>=0.9.1 in /usr/local/Caskroom/miniforge/base/envs/tdm-geoparsing-py39/lib/python3.9/site-packages (from spacy<3.4.0,>=3.3.0.dev0->fr-core-news-sm==3.3.0) (0.10.1)\n", - "Requirement already satisfied: pydantic!=1.8,!=1.8.1,<1.9.0,>=1.7.4 in /usr/local/Caskroom/miniforge/base/envs/tdm-geoparsing-py39/lib/python3.9/site-packages (from spacy<3.4.0,>=3.3.0.dev0->fr-core-news-sm==3.3.0) (1.8.2)\n", - "Requirement already satisfied: setuptools in /usr/local/Caskroom/miniforge/base/envs/tdm-geoparsing-py39/lib/python3.9/site-packages (from spacy<3.4.0,>=3.3.0.dev0->fr-core-news-sm==3.3.0) (65.3.0)\n", - "Requirement already satisfied: catalogue<2.1.0,>=2.0.6 in 
/usr/local/Caskroom/miniforge/base/envs/tdm-geoparsing-py39/lib/python3.9/site-packages (from spacy<3.4.0,>=3.3.0.dev0->fr-core-news-sm==3.3.0) (2.0.8)\n", - "Requirement already satisfied: spacy-loggers<2.0.0,>=1.0.0 in /usr/local/Caskroom/miniforge/base/envs/tdm-geoparsing-py39/lib/python3.9/site-packages (from spacy<3.4.0,>=3.3.0.dev0->fr-core-news-sm==3.3.0) (1.0.3)\n", - "Requirement already satisfied: numpy>=1.15.0 in /usr/local/Caskroom/miniforge/base/envs/tdm-geoparsing-py39/lib/python3.9/site-packages (from spacy<3.4.0,>=3.3.0.dev0->fr-core-news-sm==3.3.0) (1.23.3)\n", - "Requirement already satisfied: typer<0.5.0,>=0.3.0 in /usr/local/Caskroom/miniforge/base/envs/tdm-geoparsing-py39/lib/python3.9/site-packages (from spacy<3.4.0,>=3.3.0.dev0->fr-core-news-sm==3.3.0) (0.4.2)\n", - "Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in /usr/local/Caskroom/miniforge/base/envs/tdm-geoparsing-py39/lib/python3.9/site-packages (from packaging>=20.0->spacy<3.4.0,>=3.3.0.dev0->fr-core-news-sm==3.3.0) (3.0.9)\n", - "Requirement already satisfied: smart-open<6.0.0,>=5.2.1 in /usr/local/Caskroom/miniforge/base/envs/tdm-geoparsing-py39/lib/python3.9/site-packages (from pathy>=0.3.5->spacy<3.4.0,>=3.3.0.dev0->fr-core-news-sm==3.3.0) (5.2.1)\n", - "Requirement already satisfied: typing-extensions>=3.7.4.3 in /usr/local/Caskroom/miniforge/base/envs/tdm-geoparsing-py39/lib/python3.9/site-packages (from pydantic!=1.8,!=1.8.1,<1.9.0,>=1.7.4->spacy<3.4.0,>=3.3.0.dev0->fr-core-news-sm==3.3.0) (4.3.0)\n", - "Requirement already satisfied: charset-normalizer<3,>=2 in /usr/local/Caskroom/miniforge/base/envs/tdm-geoparsing-py39/lib/python3.9/site-packages (from requests<3.0.0,>=2.13.0->spacy<3.4.0,>=3.3.0.dev0->fr-core-news-sm==3.3.0) (2.1.1)\n", - "Requirement already satisfied: certifi>=2017.4.17 in /usr/local/Caskroom/miniforge/base/envs/tdm-geoparsing-py39/lib/python3.9/site-packages (from requests<3.0.0,>=2.13.0->spacy<3.4.0,>=3.3.0.dev0->fr-core-news-sm==3.3.0) 
(2022.9.14)\n", - "Requirement already satisfied: idna<4,>=2.5 in /usr/local/Caskroom/miniforge/base/envs/tdm-geoparsing-py39/lib/python3.9/site-packages (from requests<3.0.0,>=2.13.0->spacy<3.4.0,>=3.3.0.dev0->fr-core-news-sm==3.3.0) (3.4)\n", - "Requirement already satisfied: urllib3<1.27,>=1.21.1 in /usr/local/Caskroom/miniforge/base/envs/tdm-geoparsing-py39/lib/python3.9/site-packages (from requests<3.0.0,>=2.13.0->spacy<3.4.0,>=3.3.0.dev0->fr-core-news-sm==3.3.0) (1.26.12)\n", - "Requirement already satisfied: click<9.0.0,>=7.1.1 in /usr/local/Caskroom/miniforge/base/envs/tdm-geoparsing-py39/lib/python3.9/site-packages (from typer<0.5.0,>=0.3.0->spacy<3.4.0,>=3.3.0.dev0->fr-core-news-sm==3.3.0) (8.1.3)\n", - "Requirement already satisfied: MarkupSafe>=2.0 in /usr/local/Caskroom/miniforge/base/envs/tdm-geoparsing-py39/lib/python3.9/site-packages (from jinja2->spacy<3.4.0,>=3.3.0.dev0->fr-core-news-sm==3.3.0) (2.1.1)\n", - "\u001b[38;5;2m✔ Download and installation successful\u001b[0m\n", - "You can now load the package via spacy.load('fr_core_news_sm')\n" - ] - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "!python -m spacy download fr_core_news_sm" ] @@ -1113,7 +596,7 @@ }, { "cell_type": "code", - "execution_count": 4, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -1129,7 +612,7 @@ }, { "cell_type": "code", - "execution_count": 5, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -1145,7 +628,7 @@ }, { "cell_type": "code", - "execution_count": 6, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -1161,24 +644,9 @@ }, { "cell_type": "code", - "execution_count": 7, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "ARQUES LOC\n", - "Géog LOC\n", - "de France LOC\n", - "Normandie LOC\n", - "pays de Caux LOC\n", - "Arques LOC\n", - "Long LOC\n", - "lat LOC\n" - ] - } - ], + "outputs": 
[], "source": [ "for ent in arques_spacy.ents:\n", " print(ent.text, ent.label_)" @@ -1193,62 +661,9 @@ }, { "cell_type": "code", - "execution_count": 8, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "<span class=\"tex2jax_ignore\"><div class=\"entities\" style=\"line-height: 2.5; direction: ltr\">* \n", - "<mark class=\"entity\" style=\"background: #ff9561; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n", - " ARQUES\n", - " <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">LOC</span>\n", - "</mark>\n", - ", (\n", - "<mark class=\"entity\" style=\"background: #ff9561; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n", - " Géog\n", - " <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">LOC</span>\n", - "</mark>\n", - ".) petite ville \n", - "<mark class=\"entity\" style=\"background: #ff9561; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n", - " de France\n", - " <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">LOC</span>\n", - "</mark>\n", - ", en \n", - "<mark class=\"entity\" style=\"background: #ff9561; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n", - " Normandie\n", - " <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">LOC</span>\n", - "</mark>\n", - ", au \n", - "<mark class=\"entity\" style=\"background: #ff9561; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n", - " pays de Caux\n", - " <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 
0.5rem\">LOC</span>\n", - "</mark>\n", - ", sur la petite riviere d'\n", - "<mark class=\"entity\" style=\"background: #ff9561; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n", - " Arques\n", - " <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">LOC</span>\n", - "</mark>\n", - ". \n", - "<mark class=\"entity\" style=\"background: #ff9561; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n", - " Long\n", - " <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">LOC</span>\n", - "</mark>\n", - ". 18. 50. \n", - "<mark class=\"entity\" style=\"background: #ff9561; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n", - " lat\n", - " <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">LOC</span>\n", - "</mark>\n", - ". 49. 54.</div></span>" - ], - "text/plain": [ - "<IPython.core.display.HTML object>" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "displacy.render(arques_spacy, style=\"ent\", jupyter=True) " ] @@ -1257,7 +672,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "On remarque des différences entre les résultats de Stanza et de spaCy. En particulier spaCy repère trois entités à tord (faux positifs) : `Géog`, `Long` et `lat`." + "On remarque des différences entre les résultats de Stanza et de spaCy. En particulier spaCy repère trois entités à tort (faux positifs) : `Géog`, `Long` et `lat`, là où Stanza ne repérait à tort que `Géog)`. Et spaCy ne repère pas la première occurrence `ARQUES`, sans doute dû au fait que le mot est en majuscules." 
] }, { @@ -1283,7 +698,7 @@ }, { "cell_type": "code", - "execution_count": 34, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -1299,7 +714,7 @@ }, { "cell_type": "code", - "execution_count": 35, + "execution_count": null, "metadata": {}, "outputs": [], "source": [ @@ -1322,22 +737,9 @@ }, { "cell_type": "code", - "execution_count": 36, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "ARQUES place\n", - "France place\n", - "Normandie place\n", - "Caux place\n", - "Arques place\n", - "Long . 18 . 50 . lat . 49 . 54 . latlong\n" - ] - } - ], + "outputs": [], "source": [ "for ent in arques_perdido.named_entities:\n", " print(ent.text, ent.tag)" @@ -1352,52 +754,9 @@ }, { "cell_type": "code", - "execution_count": 37, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "<span class=\"tex2jax_ignore\"><div class=\"entities\" style=\"line-height: 2.5; direction: ltr\">* \n", - "<mark class=\"entity\" style=\"background: #ff9561; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n", - " ARQUES\n", - " <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">LOC</span>\n", - "</mark>\n", - " , ( Géog . 
) petite ville de \n", - "<mark class=\"entity\" style=\"background: #ff9561; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n", - " France\n", - " <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">LOC</span>\n", - "</mark>\n", - " , en \n", - "<mark class=\"entity\" style=\"background: #ff9561; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n", - " Normandie\n", - " <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">LOC</span>\n", - "</mark>\n", - " , au pays de \n", - "<mark class=\"entity\" style=\"background: #ff9561; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n", - " Caux\n", - " <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">LOC</span>\n", - "</mark>\n", - " , sur la petite riviere d' \n", - "<mark class=\"entity\" style=\"background: #ff9561; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n", - " Arques\n", - " <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">LOC</span>\n", - "</mark>\n", - " . \n", - "<mark class=\"entity\" style=\"background: #ddd; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n", - " Long . 18 . 50 . lat . 49 . 
54 .\n", - " <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">MISC</span>\n", - "</mark>\n", - " </div></span>" - ], - "text/plain": [ - "<IPython.core.display.HTML object>" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "displacy.render(arques_perdido.to_spacy_doc(), style=\"ent\", jupyter=True)" ] @@ -1411,347 +770,9 @@ }, { "cell_type": "code", - "execution_count": 38, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "<span class=\"tex2jax_ignore\"><div class=\"spans\" style=\"line-height: 2.5; direction: ltr\">\n", - "<span style=\"font-weight: bold; display: inline-block; position: relative;\">\n", - " *\n", - " \n", - "<span style=\"background: #ddd; top: 40px; height: 4px; left: -1px; width: calc(100% + 2px); position: absolute;\">\n", - "</span>\n", - "\n", - " \n", - "<span style=\"background: #ddd; top: 40px; height: 4px; border-top-left-radius: 3px; border-bottom-left-radius: 3px; left: -1px; width: calc(100% + 2px); position: absolute;\">\n", - " <span style=\"background: #ddd; z-index: 10; color: #000; top: -0.5em; padding: 2px 3px; position: absolute; font-size: 0.6em; font-weight: bold; line-height: 1; border-radius: 3px\">\n", - " MISC\n", - " </span>\n", - "</span>\n", - "\n", - "\n", - "</span>\n", - "\n", - "<span style=\"font-weight: bold; display: inline-block; position: relative;\">\n", - " ARQUES\n", - " \n", - "<span style=\"background: #ddd; top: 40px; height: 4px; left: -1px; width: calc(100% + 2px); position: absolute;\">\n", - "</span>\n", - "\n", - "<span style=\"background: #ff9561; top: 57px; height: 4px; left: -1px; width: calc(100% + 2px); position: absolute;\">\n", - "</span>\n", - "\n", - " \n", - "<span style=\"background: #ff9561; top: 57px; height: 4px; border-top-left-radius: 3px; border-bottom-left-radius: 3px; left: -1px; 
width: calc(100% + 2px); position: absolute;\">\n", - " <span style=\"background: #ff9561; z-index: 10; color: #000; top: -0.5em; padding: 2px 3px; position: absolute; font-size: 0.6em; font-weight: bold; line-height: 1; border-radius: 3px\">\n", - " LOC\n", - " </span>\n", - "</span>\n", - "\n", - "\n", - "</span>\n", - ", ( Géog . ) \n", - "<span style=\"font-weight: bold; display: inline-block; position: relative;\">\n", - " petite\n", - " \n", - "<span style=\"background: #ff9561; top: 40px; height: 4px; left: -1px; width: calc(100% + 2px); position: absolute;\">\n", - "</span>\n", - "\n", - " \n", - "<span style=\"background: #ff9561; top: 40px; height: 4px; border-top-left-radius: 3px; border-bottom-left-radius: 3px; left: -1px; width: calc(100% + 2px); position: absolute;\">\n", - " <span style=\"background: #ff9561; z-index: 10; color: #000; top: -0.5em; padding: 2px 3px; position: absolute; font-size: 0.6em; font-weight: bold; line-height: 1; border-radius: 3px\">\n", - " LOC\n", - " </span>\n", - "</span>\n", - "\n", - "\n", - "</span>\n", - "\n", - "<span style=\"font-weight: bold; display: inline-block; position: relative;\">\n", - " ville\n", - " \n", - "<span style=\"background: #ff9561; top: 40px; height: 4px; left: -1px; width: calc(100% + 2px); position: absolute;\">\n", - "</span>\n", - "\n", - " \n", - "</span>\n", - "\n", - "<span style=\"font-weight: bold; display: inline-block; position: relative;\">\n", - " de\n", - " \n", - "<span style=\"background: #ff9561; top: 40px; height: 4px; left: -1px; width: calc(100% + 2px); position: absolute;\">\n", - "</span>\n", - "\n", - " \n", - "</span>\n", - "\n", - "<span style=\"font-weight: bold; display: inline-block; position: relative;\">\n", - " France\n", - " \n", - "<span style=\"background: #ff9561; top: 40px; height: 4px; left: -1px; width: calc(100% + 2px); position: absolute;\">\n", - "</span>\n", - "\n", - "<span style=\"background: #ff9561; top: 57px; height: 4px; left: -1px; width: 
calc(100% + 2px); position: absolute;\">\n", - "</span>\n", - "\n", - " \n", - "<span style=\"background: #ff9561; top: 57px; height: 4px; border-top-left-radius: 3px; border-bottom-left-radius: 3px; left: -1px; width: calc(100% + 2px); position: absolute;\">\n", - " <span style=\"background: #ff9561; z-index: 10; color: #000; top: -0.5em; padding: 2px 3px; position: absolute; font-size: 0.6em; font-weight: bold; line-height: 1; border-radius: 3px\">\n", - " LOC\n", - " </span>\n", - "</span>\n", - "\n", - "\n", - "</span>\n", - ", en \n", - "<span style=\"font-weight: bold; display: inline-block; position: relative;\">\n", - " Normandie\n", - " \n", - "<span style=\"background: #ff9561; top: 40px; height: 4px; left: -1px; width: calc(100% + 2px); position: absolute;\">\n", - "</span>\n", - "\n", - " \n", - "<span style=\"background: #ff9561; top: 40px; height: 4px; border-top-left-radius: 3px; border-bottom-left-radius: 3px; left: -1px; width: calc(100% + 2px); position: absolute;\">\n", - " <span style=\"background: #ff9561; z-index: 10; color: #000; top: -0.5em; padding: 2px 3px; position: absolute; font-size: 0.6em; font-weight: bold; line-height: 1; border-radius: 3px\">\n", - " LOC\n", - " </span>\n", - "</span>\n", - "\n", - "\n", - "</span>\n", - ", au \n", - "<span style=\"font-weight: bold; display: inline-block; position: relative;\">\n", - " pays\n", - " \n", - "<span style=\"background: #ff9561; top: 40px; height: 4px; left: -1px; width: calc(100% + 2px); position: absolute;\">\n", - "</span>\n", - "\n", - " \n", - "<span style=\"background: #ff9561; top: 40px; height: 4px; border-top-left-radius: 3px; border-bottom-left-radius: 3px; left: -1px; width: calc(100% + 2px); position: absolute;\">\n", - " <span style=\"background: #ff9561; z-index: 10; color: #000; top: -0.5em; padding: 2px 3px; position: absolute; font-size: 0.6em; font-weight: bold; line-height: 1; border-radius: 3px\">\n", - " LOC\n", - " </span>\n", - "</span>\n", - "\n", - "\n", - 
"</span>\n", - "\n", - "<span style=\"font-weight: bold; display: inline-block; position: relative;\">\n", - " de\n", - " \n", - "<span style=\"background: #ff9561; top: 40px; height: 4px; left: -1px; width: calc(100% + 2px); position: absolute;\">\n", - "</span>\n", - "\n", - " \n", - "</span>\n", - "\n", - "<span style=\"font-weight: bold; display: inline-block; position: relative;\">\n", - " Caux\n", - " \n", - "<span style=\"background: #ff9561; top: 40px; height: 4px; left: -1px; width: calc(100% + 2px); position: absolute;\">\n", - "</span>\n", - "\n", - "<span style=\"background: #ff9561; top: 57px; height: 4px; left: -1px; width: calc(100% + 2px); position: absolute;\">\n", - "</span>\n", - "\n", - " \n", - "<span style=\"background: #ff9561; top: 57px; height: 4px; border-top-left-radius: 3px; border-bottom-left-radius: 3px; left: -1px; width: calc(100% + 2px); position: absolute;\">\n", - " <span style=\"background: #ff9561; z-index: 10; color: #000; top: -0.5em; padding: 2px 3px; position: absolute; font-size: 0.6em; font-weight: bold; line-height: 1; border-radius: 3px\">\n", - " LOC\n", - " </span>\n", - "</span>\n", - "\n", - "\n", - "</span>\n", - ", sur \n", - "<span style=\"font-weight: bold; display: inline-block; position: relative;\">\n", - " la\n", - " \n", - "<span style=\"background: #ff9561; top: 40px; height: 4px; left: -1px; width: calc(100% + 2px); position: absolute;\">\n", - "</span>\n", - "\n", - " \n", - "<span style=\"background: #ff9561; top: 40px; height: 4px; border-top-left-radius: 3px; border-bottom-left-radius: 3px; left: -1px; width: calc(100% + 2px); position: absolute;\">\n", - " <span style=\"background: #ff9561; z-index: 10; color: #000; top: -0.5em; padding: 2px 3px; position: absolute; font-size: 0.6em; font-weight: bold; line-height: 1; border-radius: 3px\">\n", - " LOC\n", - " </span>\n", - "</span>\n", - "\n", - "\n", - "</span>\n", - "\n", - "<span style=\"font-weight: bold; display: inline-block; position: 
relative;\">\n", - " petite\n", - " \n", - "<span style=\"background: #ff9561; top: 40px; height: 4px; left: -1px; width: calc(100% + 2px); position: absolute;\">\n", - "</span>\n", - "\n", - " \n", - "</span>\n", - "\n", - "<span style=\"font-weight: bold; display: inline-block; position: relative;\">\n", - " riviere\n", - " \n", - "<span style=\"background: #ff9561; top: 40px; height: 4px; left: -1px; width: calc(100% + 2px); position: absolute;\">\n", - "</span>\n", - "\n", - " \n", - "</span>\n", - "\n", - "<span style=\"font-weight: bold; display: inline-block; position: relative;\">\n", - " d'\n", - " \n", - "<span style=\"background: #ff9561; top: 40px; height: 4px; left: -1px; width: calc(100% + 2px); position: absolute;\">\n", - "</span>\n", - "\n", - " \n", - "</span>\n", - "\n", - "<span style=\"font-weight: bold; display: inline-block; position: relative;\">\n", - " Arques\n", - " \n", - "<span style=\"background: #ff9561; top: 40px; height: 4px; left: -1px; width: calc(100% + 2px); position: absolute;\">\n", - "</span>\n", - "\n", - "<span style=\"background: #ff9561; top: 57px; height: 4px; left: -1px; width: calc(100% + 2px); position: absolute;\">\n", - "</span>\n", - "\n", - " \n", - "<span style=\"background: #ff9561; top: 57px; height: 4px; border-top-left-radius: 3px; border-bottom-left-radius: 3px; left: -1px; width: calc(100% + 2px); position: absolute;\">\n", - " <span style=\"background: #ff9561; z-index: 10; color: #000; top: -0.5em; padding: 2px 3px; position: absolute; font-size: 0.6em; font-weight: bold; line-height: 1; border-radius: 3px\">\n", - " LOC\n", - " </span>\n", - "</span>\n", - "\n", - "\n", - "</span>\n", - ". 
\n", - "<span style=\"font-weight: bold; display: inline-block; position: relative;\">\n", - " Long\n", - " \n", - "<span style=\"background: #ddd; top: 40px; height: 4px; left: -1px; width: calc(100% + 2px); position: absolute;\">\n", - "</span>\n", - "\n", - " \n", - "<span style=\"background: #ddd; top: 40px; height: 4px; border-top-left-radius: 3px; border-bottom-left-radius: 3px; left: -1px; width: calc(100% + 2px); position: absolute;\">\n", - " <span style=\"background: #ddd; z-index: 10; color: #000; top: -0.5em; padding: 2px 3px; position: absolute; font-size: 0.6em; font-weight: bold; line-height: 1; border-radius: 3px\">\n", - " MISC\n", - " </span>\n", - "</span>\n", - "\n", - "\n", - "</span>\n", - "\n", - "<span style=\"font-weight: bold; display: inline-block; position: relative;\">\n", - " .\n", - " \n", - "<span style=\"background: #ddd; top: 40px; height: 4px; left: -1px; width: calc(100% + 2px); position: absolute;\">\n", - "</span>\n", - "\n", - " \n", - "</span>\n", - "\n", - "<span style=\"font-weight: bold; display: inline-block; position: relative;\">\n", - " 18\n", - " \n", - "<span style=\"background: #ddd; top: 40px; height: 4px; left: -1px; width: calc(100% + 2px); position: absolute;\">\n", - "</span>\n", - "\n", - " \n", - "</span>\n", - "\n", - "<span style=\"font-weight: bold; display: inline-block; position: relative;\">\n", - " .\n", - " \n", - "<span style=\"background: #ddd; top: 40px; height: 4px; left: -1px; width: calc(100% + 2px); position: absolute;\">\n", - "</span>\n", - "\n", - " \n", - "</span>\n", - "\n", - "<span style=\"font-weight: bold; display: inline-block; position: relative;\">\n", - " 50\n", - " \n", - "<span style=\"background: #ddd; top: 40px; height: 4px; left: -1px; width: calc(100% + 2px); position: absolute;\">\n", - "</span>\n", - "\n", - " \n", - "</span>\n", - "\n", - "<span style=\"font-weight: bold; display: inline-block; position: relative;\">\n", - " .\n", - " \n", - "<span style=\"background: 
#ddd; top: 40px; height: 4px; left: -1px; width: calc(100% + 2px); position: absolute;\">\n", - "</span>\n", - "\n", - " \n", - "</span>\n", - "\n", - "<span style=\"font-weight: bold; display: inline-block; position: relative;\">\n", - " lat\n", - " \n", - "<span style=\"background: #ddd; top: 40px; height: 4px; left: -1px; width: calc(100% + 2px); position: absolute;\">\n", - "</span>\n", - "\n", - " \n", - "</span>\n", - "\n", - "<span style=\"font-weight: bold; display: inline-block; position: relative;\">\n", - " .\n", - " \n", - "<span style=\"background: #ddd; top: 40px; height: 4px; left: -1px; width: calc(100% + 2px); position: absolute;\">\n", - "</span>\n", - "\n", - " \n", - "</span>\n", - "\n", - "<span style=\"font-weight: bold; display: inline-block; position: relative;\">\n", - " 49\n", - " \n", - "<span style=\"background: #ddd; top: 40px; height: 4px; left: -1px; width: calc(100% + 2px); position: absolute;\">\n", - "</span>\n", - "\n", - " \n", - "</span>\n", - "\n", - "<span style=\"font-weight: bold; display: inline-block; position: relative;\">\n", - " .\n", - " \n", - "<span style=\"background: #ddd; top: 40px; height: 4px; left: -1px; width: calc(100% + 2px); position: absolute;\">\n", - "</span>\n", - "\n", - " \n", - "</span>\n", - "\n", - "<span style=\"font-weight: bold; display: inline-block; position: relative;\">\n", - " 54\n", - " \n", - "<span style=\"background: #ddd; top: 40px; height: 4px; left: -1px; width: calc(100% + 2px); position: absolute;\">\n", - "</span>\n", - "\n", - " \n", - "</span>\n", - "\n", - "<span style=\"font-weight: bold; display: inline-block; position: relative;\">\n", - " .\n", - " \n", - "<span style=\"background: #ddd; top: 40px; height: 4px; left: -1px; width: calc(100% + 2px); position: absolute;\">\n", - "</span>\n", - "\n", - " \n", - "</span>\n", - "</div></span>" - ], - "text/plain": [ - "<IPython.core.display.HTML object>" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], + 
"execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "displacy.render(arques_perdido.to_spacy_doc(), style=\"span\", jupyter=True)" ] @@ -1779,20 +800,11 @@ }, { "cell_type": "code", - "execution_count": 9, + "execution_count": null, "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "* Beaufort, (Géog.) ville de Savoie, sur la riviere \n", - "d'Oron. Long. 24. 18. lat. 45. 40.\n" - ] - } - ], + "outputs": [], "source": [ - "beaufort = load_txt('data/volume02-1365.txt')\n", + "beaufort = load_txt('data/edda-volume02-1365.txt')\n", "\n", "print(beaufort)" ] @@ -1806,301 +818,9 @@ }, { "cell_type": "code", - "execution_count": 40, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "<span class=\"tex2jax_ignore\"><div class=\"entities\" style=\"line-height: 2.5; direction: ltr\">* \n", - "<mark class=\"entity\" style=\"background: #ff9561; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n", - " Beaufort\n", - " <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">LOC</span>\n", - "</mark>\n", - " , ( Géog . ) ville de \n", - "<mark class=\"entity\" style=\"background: #ff9561; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n", - " Savoie\n", - " <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">LOC</span>\n", - "</mark>\n", - " , sur la riviere d' \n", - "<mark class=\"entity\" style=\"background: #ff9561; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n", - " Oron\n", - " <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">LOC</span>\n", - "</mark>\n", - " . 
\n", - "<mark class=\"entity\" style=\"background: #ddd; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n", - " Long . 24 . 18 . lat . 45 . 40 .\n", - " <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">MISC</span>\n", - "</mark>\n", - " </div></span>" - ], - "text/plain": [ - "<IPython.core.display.HTML object>" - ] - }, - "metadata": {}, - "output_type": "display_data" - }, - { - "data": { - "text/html": [ - "<span class=\"tex2jax_ignore\"><div class=\"spans\" style=\"line-height: 2.5; direction: ltr\">\n", - "<span style=\"font-weight: bold; display: inline-block; position: relative;\">\n", - " *\n", - " \n", - "<span style=\"background: #ddd; top: 40px; height: 4px; left: -1px; width: calc(100% + 2px); position: absolute;\">\n", - "</span>\n", - "\n", - " \n", - "<span style=\"background: #ddd; top: 40px; height: 4px; border-top-left-radius: 3px; border-bottom-left-radius: 3px; left: -1px; width: calc(100% + 2px); position: absolute;\">\n", - " <span style=\"background: #ddd; z-index: 10; color: #000; top: -0.5em; padding: 2px 3px; position: absolute; font-size: 0.6em; font-weight: bold; line-height: 1; border-radius: 3px\">\n", - " MISC\n", - " </span>\n", - "</span>\n", - "\n", - "\n", - "</span>\n", - "\n", - "<span style=\"font-weight: bold; display: inline-block; position: relative;\">\n", - " Beaufort\n", - " \n", - "<span style=\"background: #ddd; top: 40px; height: 4px; left: -1px; width: calc(100% + 2px); position: absolute;\">\n", - "</span>\n", - "\n", - "<span style=\"background: #ff9561; top: 57px; height: 4px; left: -1px; width: calc(100% + 2px); position: absolute;\">\n", - "</span>\n", - "\n", - " \n", - "<span style=\"background: #ff9561; top: 57px; height: 4px; border-top-left-radius: 3px; border-bottom-left-radius: 3px; left: -1px; width: calc(100% + 2px); position: absolute;\">\n", - " <span style=\"background: #ff9561; 
z-index: 10; color: #000; top: -0.5em; padding: 2px 3px; position: absolute; font-size: 0.6em; font-weight: bold; line-height: 1; border-radius: 3px\">\n", - " LOC\n", - " </span>\n", - "</span>\n", - "\n", - "\n", - "</span>\n", - ", ( Géog . ) \n", - "<span style=\"font-weight: bold; display: inline-block; position: relative;\">\n", - " ville\n", - " \n", - "<span style=\"background: #ff9561; top: 40px; height: 4px; left: -1px; width: calc(100% + 2px); position: absolute;\">\n", - "</span>\n", - "\n", - " \n", - "<span style=\"background: #ff9561; top: 40px; height: 4px; border-top-left-radius: 3px; border-bottom-left-radius: 3px; left: -1px; width: calc(100% + 2px); position: absolute;\">\n", - " <span style=\"background: #ff9561; z-index: 10; color: #000; top: -0.5em; padding: 2px 3px; position: absolute; font-size: 0.6em; font-weight: bold; line-height: 1; border-radius: 3px\">\n", - " LOC\n", - " </span>\n", - "</span>\n", - "\n", - "\n", - "</span>\n", - "\n", - "<span style=\"font-weight: bold; display: inline-block; position: relative;\">\n", - " de\n", - " \n", - "<span style=\"background: #ff9561; top: 40px; height: 4px; left: -1px; width: calc(100% + 2px); position: absolute;\">\n", - "</span>\n", - "\n", - " \n", - "</span>\n", - "\n", - "<span style=\"font-weight: bold; display: inline-block; position: relative;\">\n", - " Savoie\n", - " \n", - "<span style=\"background: #ff9561; top: 40px; height: 4px; left: -1px; width: calc(100% + 2px); position: absolute;\">\n", - "</span>\n", - "\n", - "<span style=\"background: #ff9561; top: 57px; height: 4px; left: -1px; width: calc(100% + 2px); position: absolute;\">\n", - "</span>\n", - "\n", - " \n", - "<span style=\"background: #ff9561; top: 57px; height: 4px; border-top-left-radius: 3px; border-bottom-left-radius: 3px; left: -1px; width: calc(100% + 2px); position: absolute;\">\n", - " <span style=\"background: #ff9561; z-index: 10; color: #000; top: -0.5em; padding: 2px 3px; position: absolute; font-size: 
0.6em; font-weight: bold; line-height: 1; border-radius: 3px\">\n", - " LOC\n", - " </span>\n", - "</span>\n", - "\n", - "\n", - "</span>\n", - ", sur \n", - "<span style=\"font-weight: bold; display: inline-block; position: relative;\">\n", - " la\n", - " \n", - "<span style=\"background: #ff9561; top: 40px; height: 4px; left: -1px; width: calc(100% + 2px); position: absolute;\">\n", - "</span>\n", - "\n", - " \n", - "<span style=\"background: #ff9561; top: 40px; height: 4px; border-top-left-radius: 3px; border-bottom-left-radius: 3px; left: -1px; width: calc(100% + 2px); position: absolute;\">\n", - " <span style=\"background: #ff9561; z-index: 10; color: #000; top: -0.5em; padding: 2px 3px; position: absolute; font-size: 0.6em; font-weight: bold; line-height: 1; border-radius: 3px\">\n", - " LOC\n", - " </span>\n", - "</span>\n", - "\n", - "\n", - "</span>\n", - "\n", - "<span style=\"font-weight: bold; display: inline-block; position: relative;\">\n", - " riviere\n", - " \n", - "<span style=\"background: #ff9561; top: 40px; height: 4px; left: -1px; width: calc(100% + 2px); position: absolute;\">\n", - "</span>\n", - "\n", - " \n", - "</span>\n", - "\n", - "<span style=\"font-weight: bold; display: inline-block; position: relative;\">\n", - " d'\n", - " \n", - "<span style=\"background: #ff9561; top: 40px; height: 4px; left: -1px; width: calc(100% + 2px); position: absolute;\">\n", - "</span>\n", - "\n", - " \n", - "</span>\n", - "\n", - "<span style=\"font-weight: bold; display: inline-block; position: relative;\">\n", - " Oron\n", - " \n", - "<span style=\"background: #ff9561; top: 40px; height: 4px; left: -1px; width: calc(100% + 2px); position: absolute;\">\n", - "</span>\n", - "\n", - "<span style=\"background: #ff9561; top: 57px; height: 4px; left: -1px; width: calc(100% + 2px); position: absolute;\">\n", - "</span>\n", - "\n", - " \n", - "<span style=\"background: #ff9561; top: 57px; height: 4px; border-top-left-radius: 3px; border-bottom-left-radius: 
3px; left: -1px; width: calc(100% + 2px); position: absolute;\">\n", - " <span style=\"background: #ff9561; z-index: 10; color: #000; top: -0.5em; padding: 2px 3px; position: absolute; font-size: 0.6em; font-weight: bold; line-height: 1; border-radius: 3px\">\n", - " LOC\n", - " </span>\n", - "</span>\n", - "\n", - "\n", - "</span>\n", - ". \n", - "<span style=\"font-weight: bold; display: inline-block; position: relative;\">\n", - " Long\n", - " \n", - "<span style=\"background: #ddd; top: 40px; height: 4px; left: -1px; width: calc(100% + 2px); position: absolute;\">\n", - "</span>\n", - "\n", - " \n", - "<span style=\"background: #ddd; top: 40px; height: 4px; border-top-left-radius: 3px; border-bottom-left-radius: 3px; left: -1px; width: calc(100% + 2px); position: absolute;\">\n", - " <span style=\"background: #ddd; z-index: 10; color: #000; top: -0.5em; padding: 2px 3px; position: absolute; font-size: 0.6em; font-weight: bold; line-height: 1; border-radius: 3px\">\n", - " MISC\n", - " </span>\n", - "</span>\n", - "\n", - "\n", - "</span>\n", - "\n", - "<span style=\"font-weight: bold; display: inline-block; position: relative;\">\n", - " .\n", - " \n", - "<span style=\"background: #ddd; top: 40px; height: 4px; left: -1px; width: calc(100% + 2px); position: absolute;\">\n", - "</span>\n", - "\n", - " \n", - "</span>\n", - "\n", - "<span style=\"font-weight: bold; display: inline-block; position: relative;\">\n", - " 24\n", - " \n", - "<span style=\"background: #ddd; top: 40px; height: 4px; left: -1px; width: calc(100% + 2px); position: absolute;\">\n", - "</span>\n", - "\n", - " \n", - "</span>\n", - "\n", - "<span style=\"font-weight: bold; display: inline-block; position: relative;\">\n", - " .\n", - " \n", - "<span style=\"background: #ddd; top: 40px; height: 4px; left: -1px; width: calc(100% + 2px); position: absolute;\">\n", - "</span>\n", - "\n", - " \n", - "</span>\n", - "\n", - "<span style=\"font-weight: bold; display: inline-block; position: 
relative;\">\n", - " 18\n", - " \n", - "<span style=\"background: #ddd; top: 40px; height: 4px; left: -1px; width: calc(100% + 2px); position: absolute;\">\n", - "</span>\n", - "\n", - " \n", - "</span>\n", - "\n", - "<span style=\"font-weight: bold; display: inline-block; position: relative;\">\n", - " .\n", - " \n", - "<span style=\"background: #ddd; top: 40px; height: 4px; left: -1px; width: calc(100% + 2px); position: absolute;\">\n", - "</span>\n", - "\n", - " \n", - "</span>\n", - "\n", - "<span style=\"font-weight: bold; display: inline-block; position: relative;\">\n", - " lat\n", - " \n", - "<span style=\"background: #ddd; top: 40px; height: 4px; left: -1px; width: calc(100% + 2px); position: absolute;\">\n", - "</span>\n", - "\n", - " \n", - "</span>\n", - "\n", - "<span style=\"font-weight: bold; display: inline-block; position: relative;\">\n", - " .\n", - " \n", - "<span style=\"background: #ddd; top: 40px; height: 4px; left: -1px; width: calc(100% + 2px); position: absolute;\">\n", - "</span>\n", - "\n", - " \n", - "</span>\n", - "\n", - "<span style=\"font-weight: bold; display: inline-block; position: relative;\">\n", - " 45\n", - " \n", - "<span style=\"background: #ddd; top: 40px; height: 4px; left: -1px; width: calc(100% + 2px); position: absolute;\">\n", - "</span>\n", - "\n", - " \n", - "</span>\n", - "\n", - "<span style=\"font-weight: bold; display: inline-block; position: relative;\">\n", - " .\n", - " \n", - "<span style=\"background: #ddd; top: 40px; height: 4px; left: -1px; width: calc(100% + 2px); position: absolute;\">\n", - "</span>\n", - "\n", - " \n", - "</span>\n", - "\n", - "<span style=\"font-weight: bold; display: inline-block; position: relative;\">\n", - " 40\n", - " \n", - "<span style=\"background: #ddd; top: 40px; height: 4px; left: -1px; width: calc(100% + 2px); position: absolute;\">\n", - "</span>\n", - "\n", - " \n", - "</span>\n", - "\n", - "<span style=\"font-weight: bold; display: inline-block; position: 
relative;\">\n", - " .\n", - " \n", - "<span style=\"background: #ddd; top: 40px; height: 4px; left: -1px; width: calc(100% + 2px); position: absolute;\">\n", - "</span>\n", - "\n", - " \n", - "</span>\n", - "</div></span>" - ], - "text/plain": [ - "<IPython.core.display.HTML object>" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "beaufort_perdido = geoparser(beaufort)\n", "displacy.render(beaufort_perdido.to_spacy_doc(), style=\"ent\", jupyter=True)\n", @@ -2116,42 +836,9 @@ }, { "cell_type": "code", - "execution_count": 10, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "<span class=\"tex2jax_ignore\"><div class=\"entities\" style=\"line-height: 2.5; direction: ltr\">* \n", - "<mark class=\"entity\" style=\"background: #ff9561; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n", - " Beaufort\n", - " <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">LOC</span>\n", - "</mark>\n", - ", (\n", - "<mark class=\"entity\" style=\"background: #ff9561; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n", - " Géog\n", - " <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">LOC</span>\n", - "</mark>\n", - ".) ville de \n", - "<mark class=\"entity\" style=\"background: #ff9561; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n", - " Savoie\n", - " <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">LOC</span>\n", - "</mark>\n", - ", sur la riviere </br>d'Oron. Long. 24. 18. 
\n", - "<mark class=\"entity\" style=\"background: #ff9561; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n", - " lat\n", - " <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">LOC</span>\n", - "</mark>\n", - ". 45. 40.</div></span>" - ], - "text/plain": [ - "<IPython.core.display.HTML object>" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], + "execution_count": null, + "metadata": {}, + "outputs": [], "source": [ "beaufort_spacy = spacy_parser(beaufort)\n", "displacy.render(beaufort_spacy, style=\"ent\", jupyter=True) " @@ -2161,237 +848,17 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Le retour à la ligne entre `riviere` et `d'Oron` est due à la largeur de la colonne dans l'œuvre originale. \n", - "Ce retour semble perturber spaCy qui ne reconnait pas `Oron` comme une entité nommée.\n", - "\n", - "\n", - "\n", - "Pour vérifier cette hypothèse, modifions le texte en supprimant ce saut de ligne pour voir s'il est possible d'améliorer la reconnaissance." 
- ] - }, - { - "cell_type": "code", - "execution_count": 1, - "metadata": {}, - "outputs": [ - { - "name": "stdout", - "output_type": "stream", - "text": [ - "# packages in environment at /usr/local/Caskroom/miniforge/base/envs/tdm-geoparsing-py39:\n", - "#\n", - "# Name Version Build Channel\n", - "appnope 0.1.3 pypi_0 pypi\n", - "argon2-cffi 21.3.0 pypi_0 pypi\n", - "argon2-cffi-bindings 21.2.0 pypi_0 pypi\n", - "asttokens 2.0.8 pypi_0 pypi\n", - "attrs 22.1.0 pypi_0 pypi\n", - "backcall 0.2.0 pypi_0 pypi\n", - "beautifulsoup4 4.11.1 pypi_0 pypi\n", - "bleach 5.0.1 pypi_0 pypi\n", - "branca 0.5.0 pypi_0 pypi\n", - "brotlipy 0.7.0 py39h63b48b0_1004 conda-forge\n", - "bzip2 1.0.8 h0d85af4_4 conda-forge\n", - "ca-certificates 2022.9.24 h033912b_0 conda-forge\n", - "catalogue 2.0.8 py39h6e9494a_0 conda-forge\n", - "certifi 2022.9.14 pypi_0 pypi\n", - "cffi 1.15.1 py39hae9ecf2_0 conda-forge\n", - "charset-normalizer 2.1.1 pyhd8ed1ab_0 conda-forge\n", - "click 8.1.3 py39h6e9494a_0 conda-forge\n", - "click-plugins 1.1.1 pypi_0 pypi\n", - "cligj 0.7.2 pypi_0 pypi\n", - "colorama 0.4.5 pyhd8ed1ab_0 conda-forge\n", - "contourpy 1.0.5 pypi_0 pypi\n", - "cryptography 37.0.4 py39h9c2a9ce_0 conda-forge\n", - "cycler 0.11.0 pypi_0 pypi\n", - "cymem 2.0.6 py39hfd1d529_3 conda-forge\n", - "cython-blis 0.7.8 py39h15b18c7_0 conda-forge\n", - "dataclasses 0.8 pyhc8e2a94_3 conda-forge\n", - "debugpy 1.6.3 pypi_0 pypi\n", - "decorator 5.1.1 pypi_0 pypi\n", - "defusedxml 0.7.1 pypi_0 pypi\n", - "entrypoints 0.4 pypi_0 pypi\n", - "executing 1.0.0 pypi_0 pypi\n", - "fastjsonschema 2.16.2 pypi_0 pypi\n", - "fiona 1.8.21 pypi_0 pypi\n", - "folium 0.12.1.post1 pypi_0 pypi\n", - "fonttools 4.37.3 pypi_0 pypi\n", - "fr-core-news-sm 3.3.0 pypi_0 pypi\n", - "geojson 2.5.0 pypi_0 pypi\n", - "geopandas 0.11.1 pypi_0 pypi\n", - "gpxpy 1.5.0 pypi_0 pypi\n", - "idna 3.4 pyhd8ed1ab_0 conda-forge\n", - "importlib-metadata 4.12.0 pypi_0 pypi\n", - "ipykernel 6.15.3 pypi_0 pypi\n", - "ipython 8.5.0 
pypi_0 pypi\n", - "ipython-genutils 0.2.0 pypi_0 pypi\n", - "ipywidgets 8.0.2 pypi_0 pypi\n", - "jedi 0.18.1 pypi_0 pypi\n", - "jinja2 3.1.2 pyhd8ed1ab_1 conda-forge\n", - "joblib 1.2.0 pypi_0 pypi\n", - "jsonschema 4.16.0 pypi_0 pypi\n", - "jupyter 1.0.0 pypi_0 pypi\n", - "jupyter-client 7.3.5 pypi_0 pypi\n", - "jupyter-console 6.4.4 pypi_0 pypi\n", - "jupyter-core 4.11.1 pypi_0 pypi\n", - "jupyterlab-pygments 0.2.2 pypi_0 pypi\n", - "jupyterlab-widgets 3.0.3 pypi_0 pypi\n", - "kiwisolver 1.4.4 pypi_0 pypi\n", - "langcodes 3.3.0 pyhd8ed1ab_0 conda-forge\n", - "libblas 3.9.0 16_osx64_openblas conda-forge\n", - "libcblas 3.9.0 16_osx64_openblas conda-forge\n", - "libcxx 14.0.6 hccf4f1f_0 conda-forge\n", - "libffi 3.4.2 h0d85af4_5 conda-forge\n", - "libgfortran 5.0.0 10_4_0_h97931a8_25 conda-forge\n", - "libgfortran5 11.3.0 h082f757_25 conda-forge\n", - "liblapack 3.9.0 16_osx64_openblas conda-forge\n", - "libopenblas 0.3.21 openmp_h429af6e_3 conda-forge\n", - "libsqlite 3.39.3 ha978bb4_0 conda-forge\n", - "libzlib 1.2.12 hfd90126_3 conda-forge\n", - "llvm-openmp 14.0.4 ha654fa7_0 conda-forge\n", - "lxml 4.9.1 pypi_0 pypi\n", - "markupsafe 2.1.1 py39h63b48b0_1 conda-forge\n", - "matplotlib 3.6.0 pypi_0 pypi\n", - "matplotlib-inline 0.1.6 pypi_0 pypi\n", - "mistune 2.0.4 pypi_0 pypi\n", - "munch 2.5.0 pypi_0 pypi\n", - "murmurhash 1.0.8 py39hd91caee_0 conda-forge\n", - "nbclient 0.6.8 pypi_0 pypi\n", - "nbconvert 7.0.0 pypi_0 pypi\n", - "nbformat 5.5.0 pypi_0 pypi\n", - "ncurses 6.3 h96cf925_1 conda-forge\n", - "nest-asyncio 1.5.5 pypi_0 pypi\n", - "notebook 6.4.12 pypi_0 pypi\n", - "numpy 1.23.3 py39h34843a6_0 conda-forge\n", - "openssl 1.1.1q hfe4f2af_0 conda-forge\n", - "packaging 21.3 pyhd8ed1ab_0 conda-forge\n", - "pandas 1.5.0 pypi_0 pypi\n", - "pandocfilters 1.5.0 pypi_0 pypi\n", - "parso 0.8.3 pypi_0 pypi\n", - "pathy 0.6.2 pyhd8ed1ab_0 conda-forge\n", - "perdido 0.1.27 pypi_0 pypi\n", - "pexpect 4.8.0 pypi_0 pypi\n", - "pickleshare 0.7.5 pypi_0 pypi\n", - 
"pillow 9.2.0 pypi_0 pypi\n", - "pip 22.2.2 pyhd8ed1ab_0 conda-forge\n", - "preshed 3.0.7 py39hd91caee_0 conda-forge\n", - "prometheus-client 0.14.1 pypi_0 pypi\n", - "prompt-toolkit 3.0.31 pypi_0 pypi\n", - "protobuf 4.21.6 pypi_0 pypi\n", - "psutil 5.9.2 pypi_0 pypi\n", - "ptyprocess 0.7.0 pypi_0 pypi\n", - "pure-eval 0.2.2 pypi_0 pypi\n", - "pycparser 2.21 pyhd8ed1ab_0 conda-forge\n", - "pydantic 1.8.2 pypi_0 pypi\n", - "pygments 2.13.0 pypi_0 pypi\n", - "pyopenssl 22.0.0 pyhd8ed1ab_1 conda-forge\n", - "pyparsing 3.0.9 pyhd8ed1ab_0 conda-forge\n", - "pyproj 3.4.0 pypi_0 pypi\n", - "pyrsistent 0.18.1 pypi_0 pypi\n", - "pysocks 1.7.1 pyha2e5f31_6 conda-forge\n", - "python 3.9.13 h57e37ff_0_cpython conda-forge\n", - "python-dateutil 2.8.2 pypi_0 pypi\n", - "python_abi 3.9 2_cp39 conda-forge\n", - "pytz 2022.2.1 pypi_0 pypi\n", - "pyzmq 24.0.1 pypi_0 pypi\n", - "qtconsole 5.3.2 pypi_0 pypi\n", - "qtpy 2.2.0 pypi_0 pypi\n", - "readline 8.1.2 h3899abd_0 conda-forge\n", - "requests 2.28.1 pyhd8ed1ab_1 conda-forge\n", - "scikit-learn 1.1.2 pypi_0 pypi\n", - "scipy 1.9.1 pypi_0 pypi\n", - "send2trash 1.8.0 pypi_0 pypi\n", - "setuptools 65.3.0 pyhd8ed1ab_1 conda-forge\n", - "shapely 1.8.4 pypi_0 pypi\n", - "shellingham 1.5.0 pyhd8ed1ab_0 conda-forge\n", - "six 1.16.0 pypi_0 pypi\n", - "smart_open 5.2.1 pyhd8ed1ab_0 conda-forge\n", - "soupsieve 2.3.2.post1 pypi_0 pypi\n", - "spacy 3.3.1 pypi_0 pypi\n", - "spacy-legacy 3.0.10 pyhd8ed1ab_0 conda-forge\n", - "spacy-loggers 1.0.3 pyhd8ed1ab_0 conda-forge\n", - "sqlite 3.39.3 h9ae0607_0 conda-forge\n", - "srsly 2.4.4 py39hd408605_0 conda-forge\n", - "stack-data 0.5.0 pypi_0 pypi\n", - "stanza 1.2.3 pypi_0 pypi\n", - "terminado 0.15.0 pypi_0 pypi\n", - "thinc 8.0.17 pypi_0 pypi\n", - "threadpoolctl 3.1.0 pypi_0 pypi\n", - "tinycss2 1.1.1 pypi_0 pypi\n", - "tk 8.6.12 h5dbffcc_0 conda-forge\n", - "torch 1.12.1 pypi_0 pypi\n", - "tornado 6.2 pypi_0 pypi\n", - "tqdm 4.64.1 pyhd8ed1ab_0 conda-forge\n", - "traitlets 5.4.0 pypi_0 
pypi\n", - "typer 0.4.2 pyhd8ed1ab_0 conda-forge\n", - "typing-extensions 4.3.0 hd8ed1ab_0 conda-forge\n", - "typing_extensions 4.3.0 pyha770c72_0 conda-forge\n", - "tzdata 2022c h191b570_0 conda-forge\n", - "urllib3 1.26.12 pypi_0 pypi\n", - "wasabi 0.10.1 pypi_0 pypi\n", - "wcwidth 0.2.5 pypi_0 pypi\n", - "webencodings 0.5.1 pypi_0 pypi\n", - "wheel 0.37.1 pyhd8ed1ab_0 conda-forge\n", - "widgetsnbextension 4.0.3 pypi_0 pypi\n", - "xz 5.2.6 h775f41a_0 conda-forge\n", - "zipp 3.8.1 pypi_0 pypi\n" - ] - } - ], - "source": [ - "!conda list" - ] - }, - { - "cell_type": "code", - "execution_count": 12, - "metadata": {}, - "outputs": [ - { - "data": { - "text/html": [ - "<span class=\"tex2jax_ignore\"><div class=\"entities\" style=\"line-height: 2.5; direction: ltr\">* \n", - "<mark class=\"entity\" style=\"background: #ff9561; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n", - " Beaufort\n", - " <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">LOC</span>\n", - "</mark>\n", - ", (\n", - "<mark class=\"entity\" style=\"background: #ff9561; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n", - " Géog\n", - " <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">LOC</span>\n", - "</mark>\n", - ".) 
ville de \n", - "<mark class=\"entity\" style=\"background: #ff9561; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n", - " Savoie\n", - " <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">LOC</span>\n", - "</mark>\n", - ", sur la riviere d'\n", - "<mark class=\"entity\" style=\"background: #ff9561; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n", - " Oron\n", - " <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">LOC</span>\n", - "</mark>\n", - ". \n", - "<mark class=\"entity\" style=\"background: #ff9561; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n", - " Long\n", - " <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">LOC</span>\n", - "</mark>\n", - ". 24. 18. \n", - "<mark class=\"entity\" style=\"background: #ff9561; padding: 0.45em 0.6em; margin: 0 0.25em; line-height: 1; border-radius: 0.35em;\">\n", - " lat\n", - " <span style=\"font-size: 0.8em; font-weight: bold; line-height: 1; border-radius: 0.35em; vertical-align: middle; margin-left: 0.5rem\">LOC</span>\n", - "</mark>\n", - ". 45. 40.</div></span>" - ], - "text/plain": [ - "<IPython.core.display.HTML object>" - ] - }, - "metadata": {}, - "output_type": "display_data" - } - ], - "source": [ - "normalized_beaufort = beaufort.replace('\\n', '')\n", + "Dans cet exemple, `spaCy` repère le mot `Oron` comme une entité de personne alors que `Perdido` le repère comme un lieu.\n", + "On observe qu'il manque l'accent au mot «rivière». 
Corrigeons le texte pour voir s'il est possible d'améliorer la reconnaissance.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [ + "normalized_beaufort = beaufort.replace('riviere', 'rivière')\n", "\n", "normalized_beaufort_spacy = spacy_parser(normalized_beaufort)\n", "\n", @@ -2402,7 +869,15 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Apparemment ça n'améliore rien, mais il manque encore l'accent à «rivière»." + "\n", + "Ce changement ne corrige pas l'erreur d'annotation, au contraire l'entité n'est même plus repérée. Cependant, on observe également un saut de ligne entre les mots «rivière» et «d'Oron».\n", + "Ce retour à la ligne est dû à la largeur de la colonne dans l'œuvre originale. \n", + "\n", + "\n", + "\n", + "\n", + "\n", + "Pour vérifier l'hypothèse que ce retour perturbe le repérage par `spaCy`, corrigeons une nouvelle fois le texte.\n" ] }, { @@ -2411,8 +886,10 @@ "metadata": {}, "outputs": [], "source": [ - "normalized_beaufort = normalized_beaufort.replace('riviere', 'rivière')\n", + "normalized_beaufort = normalized_beaufort.replace('\\n', '')\n", + "\n", "normalized_beaufort_spacy = spacy_parser(normalized_beaufort)\n", + "\n", "displacy.render(normalized_beaufort_spacy, style=\"ent\", jupyter=True) " ] }, @@ -2420,9 +897,9 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Cette fois l'entité étendue incluant le nom commun «rivière» a été reconnu par SpaCy, qui a pu ainsi corriger le type de l'entité nommée et se rendre compte que l'Oron était un endroit et pas une personne.\n", + "Cette fois l'entité étendue incluant le nom commun «rivière» a été reconnue par `spaCy`, qui a pu ainsi corriger le type de l'entité nommée et se rendre compte que l'Oron était un lieu et pas une personne.\n", "\n", - "Essayons maintenant avec Stanza." + "Essayons maintenant avec `Stanza`."
] }, { @@ -2457,7 +934,7 @@ "metadata": {}, "outputs": [], "source": [ - "lge_beaufort = load('data/beaufort.txt')\n", + "lge_beaufort = load_txt('data/lge-beaufort.txt')\n", "print(lge_beaufort)" ] }, @@ -2465,7 +942,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "Cette fois l'article est un peu plus long et comporte des césures de lignes importantes, définissons donc une fonction pour recoller les morceaux:" + "Cette fois l'article est un peu plus long et comporte des césures de lignes importantes, définissons donc une fonction pour recoller les morceaux :" ] }, { @@ -2484,7 +961,15 @@ "metadata": {}, "outputs": [], "source": [ - "lge_beaufort_perdido = geoparser(join_lines(lge_beaufort))" + "normalized_lge_beaufort = join_lines(lge_beaufort)\n", + "normalized_lge_beaufort" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "* Perdido" ] }, { @@ -2493,16 +978,15 @@ "metadata": {}, "outputs": [], "source": [ + "lge_beaufort_perdido = geoparser(normalized_lge_beaufort)\n", "displacy.render(lge_beaufort_perdido.to_spacy_doc(), style=\"span\", jupyter=True)" ] }, { - "cell_type": "code", - "execution_count": null, + "cell_type": "markdown", "metadata": {}, - "outputs": [], "source": [ - "lge_beaufort_spacy = spacy_parser(join_lines(lge_beaufort))" + "* spaCy" ] }, { @@ -2511,16 +995,24 @@ "metadata": {}, "outputs": [], "source": [ + "lge_beaufort_spacy = spacy_parser(normalized_lge_beaufort)\n", "displacy.render(lge_beaufort_spacy, style=\"ent\", jupyter=True)" ] }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "* Stanza" + ] + }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ - "lge_beaufort_stanza = stanza_parser(lge_beaufort)\n", + "lge_beaufort_stanza = stanza_parser(normalized_lge_beaufort)\n", "show_ents(lge_beaufort_stanza)" ] }, @@ -2528,7 +1020,10 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "L'analyse prend plus de temps avec Stanza mais les 
résultats ont l'air un peu plus précis sur cet exemple. Il y a également une meilleure couverture: Henri IV et 1841 sont annotés, comme avec Perdido, jusqu'à Saint-Maxime-de-Bf.aufort qui a été identifié malgré l'erreur d'OCR, bien que mal classé." + "Quelques observations : \n", + "1. Seul Perdido repère la date (1841).\n", + "2. spaCy ne classe pas correctement Albertville (Personne) contrairement à Perdido et Stanza (Lieu), spaCy ne repère pas l'entité Henri IV contrairement à Perdido et Stanza.\n", + "3. Stanza repère et classe correctement l'entité \"Saint-Maximede-Bf.aufort\", Perdido la repère mais ne sait pas la classer et spaCy ne la repère pas." ] }, { @@ -2581,15 +1076,6 @@ "arques_perdido.get_folium_map()" ] }, - { - "cell_type": "code", - "execution_count": null, - "metadata": {}, - "outputs": [], - "source": [ - "displacy.render(doc.to_spacy_doc(), style=\"ent\", jupyter=True)" - ] - }, { "cell_type": "markdown", "metadata": {}, @@ -2605,7 +1091,7 @@ "source": [ "### 6.2 Perdido Geocoder\n", "\n", - "En complément du `Geoparser` qui prend en paramètre un texte et qui fait la reconnaissance d'entités nommées en amont de l'étape de geocoding, `Perdido`propose également une fonction de geocoding disctincte prenant en paramètre directement un nom de lieu (ou une liste de noms de lieux)." + "En complément du `Geoparser` qui prend en paramètre un texte et qui fait la reconnaissance d'entités nommées en amont de l'étape de geocoding, `Perdido` propose également une fonction de geocoding distincte prenant en paramètre directement un nom de lieu (ou une liste de noms de lieux)." ] }, { @@ -2648,7 +1134,7 @@ "cell_type": "markdown", "metadata": {}, "source": [ - "On remarque que par défaut, la localisation retournée pour le nom de lieu `Arques` n'est pas celle que l'on recherche.
En effet, le texte indique qu'il s'agit d'une ville de Normandie hors ici la localisation proposée est située dans le Pas-de-Calais !\n", + "On remarque que par défaut, la localisation retournée pour le nom de lieu `Arques` n'est pas celle que l'on recherche. En effet, le texte indique qu'il s'agit d'une ville de Normandie, or ici la localisation proposée est située dans le Pas-de-Calais !\n", "\n", "Changeons les paramètres du `Geocoder` (ces paramètres sont similaires pour le `Geoparser`) pour essayer de retrouver la bonne localisation.\n", "\n", @@ -2739,7 +1225,7 @@ "outputs": [], "source": [ "geoparser = Geoparser(sources=['ign'], max_rows=10)\n", - "doc = geoparser(content)\n", + "doc = geoparser(arques)\n", "doc.get_folium_map()" ] }, @@ -2783,7 +1269,6 @@ "dataset_choucas = load_choucas_perdido()\n", "data_choucas = dataset_choucas['data']\n", "\n", - "\n", "data_choucas.to_dataframe().head()" ] }, @@ -2961,12 +1446,31 @@ "doc_geocoded.get_folium_map()" ] }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, { "cell_type": "markdown", "metadata": {}, - "source": [ - "Utilisation du contexte (autres entités nommées repérées dans le texte, relations spatiales, etc...). Développées dans le cadre du projet [Perdido]() (add ref 2014 et 2016) mais pas encore intégré à la librairie Python Perdido. Cette librairie est toujours en cours de développement et d'amélioration. Vos remarques et retours seront les bienvenues." - ] + "source": [] } ], "metadata": { @@ -2977,7 +1481,7 @@ "toc_visible": true }, "kernelspec": { - "display_name": "Python 3.9.13 ('tdm-geoparsing-py39')", + "display_name": "Python 3 (ipykernel)", "language": "python", "name": "python3" },