diff --git a/README.md b/README.md index 9aa97b2f50319cb1edcafd7e1b8bca64f2cbf182..dba456c8031fd9db31e0c777528df9b821ead04b 100644 --- a/README.md +++ b/README.md @@ -1,2 +1,14 @@ -# Sampling Géo-FR EDdA-DUT +# Echantillonnage des articles de Géographie française dans l'Encyclopédie de Diderot et d'Alembert +Ce dépôt est proposé par **Ludovic Moncla** et **Denis Vigier** dans le cadre du [Projet GEODE](https://geode-project.github.io/). +Il contient le code développé pour la sélection de l'échantillon d'articles traitant de géographie française dans l'Encyclopédie de Diderot et d'Alembert (EDdA) et le Dictionnaire Universel de Trevoux (DUT) + +## Présentation + + + + + +## Remerciements + +Les auteurs remercient le [LABEX ASLAN](https://aslan.universite-lyon.fr/) (ANR-10-LABX-0081) de l'Université de Lyon pour son soutien financier dans le cadre du programme français "Investissements d'Avenir" géré par l'Agence Nationale de la Recherche (ANR). diff --git a/figures/schema.png b/figures/schema.png new file mode 100644 index 0000000000000000000000000000000000000000..cb60773fb89e93902f97b28cd3d1d74eed79b798 Binary files /dev/null and b/figures/schema.png differ diff --git a/samplingGeoFR-EDdA.ipynb b/samplingGeoFR-EDdA.ipynb new file mode 100644 index 0000000000000000000000000000000000000000..a67afbda00ce0bc1a865374af2884739e514ee98 --- /dev/null +++ b/samplingGeoFR-EDdA.ipynb @@ -0,0 +1,4719 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Filtrage des articles de géographies de l'EDDA\n", + "\n", + "Ce notebook est proposé par [L. Moncla](https://ludovicmoncla.github.io/) et [D. Vigier](http://www.icar.cnrs.fr/membre/dvigier/) dans le cadre du projet [GEODE](https://geode-project.github.io/).\n", + "\n", + "Pour la publication proposée pour le numéro de Langue Française, on souhaite filtrer les articles de l'EDDA qui décrivent un lieu localisé en France. 
Un sous-ensemble de ces articles sera sélectionné aléatoirement et comparé au Trevoux.\n", + "On propose de faire 4 sous-groupes d'articles en fonction de leur auteur :\n", + "1. Diderot\n", + "2. Jaucourt\n", + "3. Autre auteur\n", + "4. Non signé\n", + "\n", + "Une fois ces 4 sous-groupes sélectionnés, on fait une nouvelle sélection en fonction de la longueur de l'article (nombre de mots). On redécoupe chaque sous-groupe en 4 en fonction des quartiles.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 1. Import des librairies" + ] + }, + { + "cell_type": "code", + "execution_count": 1, + "metadata": {}, + "outputs": [], + "source": [ + "import random\n", + "import shutil\n", + "import lxml.etree as etree\n", + "from sentence_splitter import SentenceSplitter, split_text_into_sentences\n", + "import re\n", + "import pandas as pd\n", + "import numpy as np" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 2. Récupération des données issues de PERDIDO\n", + "\n", + "Les données sont issues du concordancier produit par Perdido." 
+ ] + }, + { + "cell_type": "code", + "execution_count": 52, + "metadata": {}, + "outputs": [], + "source": [ + "# on charge les données du csv dans un dataframe\n", + "# fichier TSV généré par le script /Users/lmoncla/Nextcloud/Recherche/Projets/2019-MSH_GéoDISCO/Scripts/parsers/concordancierPERDIDO.py\n", + "\n", + "#data = pd.read_csv('../Data/statsPERDIDO_EDDAGeo.tsv', sep='\\t')\n", + "data = pd.read_csv('../Data/statsPERDIDO_EDDA_21_10_11.tsv', sep='\\t')\n", + "data = data.sort_values(by=['volume', 'number'])" + ] + }, + { + "cell_type": "code", + "execution_count": 53, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "<div>\n", + "<style scoped>\n", + " .dataframe tbody tr th:only-of-type {\n", + " vertical-align: middle;\n", + " }\n", + "\n", + " .dataframe tbody tr th {\n", + " vertical-align: top;\n", + " }\n", + "\n", + " .dataframe thead th {\n", + " text-align: right;\n", + " }\n", + "</style>\n", + "<table border=\"1\" class=\"dataframe\">\n", + " <thead>\n", + " <tr style=\"text-align: right;\">\n", + " <th></th>\n", + " <th>filename</th>\n", + " <th>volume</th>\n", + " <th>number</th>\n", + " <th>content</th>\n", + " <th>headword</th>\n", + " <th>normClass</th>\n", + " <th>author</th>\n", + " <th>nb Words</th>\n", + " <th>nb EN</th>\n", + " <th>nb Name EDDA</th>\n", + " <th>nb Person</th>\n", + " <th>nb ENE</th>\n", + " <th>nb ENE Place</th>\n", + " <th>nb ENE Person</th>\n", + " <th>nb EN geocoded</th>\n", + " <th>nb EN EDDA geocoded</th>\n", + " <th>type</th>\n", + " <th>latlong</th>\n", + " <th>latlong value</th>\n", + " </tr>\n", + " </thead>\n", + " <tbody>\n", + " <tr>\n", + " <th>6861</th>\n", + " <td>volume01-1</td>\n", + " <td>1</td>\n", + " <td>1</td>\n", + " <td>ENCYCLOPÉDIE, DICTIONNAIRE RAISONNÉ DES SCIENC...</td>\n", + " <td>Title Page</td>\n", + " <td>unclassified</td>\n", + " <td>unsigned</td>\n", + " <td>129</td>\n", + " <td>24</td>\n", + " <td>10</td>\n", + " <td>4</td>\n", + " <td>8</td>\n", + " <td>4</td>\n", 
+ " <td>0</td>\n", + " <td>0</td>\n", + " <td>0</td>\n", + " <td>NaN</td>\n", + " <td>False</td>\n", + " <td>NaN</td>\n", + " </tr>\n", + " <tr>\n", + " <th>22907</th>\n", + " <td>volume01-2</td>\n", + " <td>1</td>\n", + " <td>2</td>\n", + " <td>A MONSEIGNEUR LE COMTE D'ARGENSON, MINISTRE ET...</td>\n", + " <td>A MONSEIGNEUR LE COMTE D'ARGENSON</td>\n", + " <td>unclassified</td>\n", + " <td>Diderot & d'Alembert</td>\n", + " <td>252</td>\n", + " <td>5</td>\n", + " <td>0</td>\n", + " <td>0</td>\n", + " <td>3</td>\n", + " <td>0</td>\n", + " <td>2</td>\n", + " <td>0</td>\n", + " <td>0</td>\n", + " <td>NaN</td>\n", + " <td>False</td>\n", + " <td>NaN</td>\n", + " </tr>\n", + " <tr>\n", + " <th>13935</th>\n", + " <td>volume01-3</td>\n", + " <td>1</td>\n", + " <td>3</td>\n", + " <td>DISCOURS PRÉLIMINAIRE DES EDITEURS. L'Encyclop...</td>\n", + " <td>DISCOURS PRÉLIMINAIRE DES EDITEURS</td>\n", + " <td>unclassified</td>\n", + " <td>d'Alembert</td>\n", + " <td>49007</td>\n", + " <td>1013</td>\n", + " <td>379</td>\n", + " <td>177</td>\n", + " <td>279</td>\n", + " <td>6</td>\n", + " <td>19</td>\n", + " <td>0</td>\n", + " <td>0</td>\n", + " <td>NaN</td>\n", + " <td>False</td>\n", + " <td>NaN</td>\n", + " </tr>\n", + " <tr>\n", + " <th>17096</th>\n", + " <td>volume01-4</td>\n", + " <td>1</td>\n", + " <td>4</td>\n", + " <td>ENCYCLOPÉDIE, DICTIONNAIRE RAISONNÉ DES SCIENC...</td>\n", + " <td>NaN</td>\n", + " <td>NaN</td>\n", + " <td>NaN</td>\n", + " <td>10</td>\n", + " <td>0</td>\n", + " <td>0</td>\n", + " <td>0</td>\n", + " <td>0</td>\n", + " <td>0</td>\n", + " <td>0</td>\n", + " <td>0</td>\n", + " <td>0</td>\n", + " <td>NaN</td>\n", + " <td>False</td>\n", + " <td>NaN</td>\n", + " </tr>\n", + " <tr>\n", + " <th>20692</th>\n", + " <td>volume01-5</td>\n", + " <td>1</td>\n", + " <td>5</td>\n", + " <td>A, a & a s.m. (ordre Encyclopéd. Entend. 
Scien...</td>\n", + " <td>A, a & a</td>\n", + " <td>Grammaire</td>\n", + " <td>Dumarsais5</td>\n", + " <td>856</td>\n", + " <td>28</td>\n", + " <td>8</td>\n", + " <td>11</td>\n", + " <td>7</td>\n", + " <td>1</td>\n", + " <td>1</td>\n", + " <td>0</td>\n", + " <td>0</td>\n", + " <td>NaN</td>\n", + " <td>False</td>\n", + " <td>NaN</td>\n", + " </tr>\n", + " </tbody>\n", + "</table>\n", + "</div>" + ], + "text/plain": [ + " filename volume number \\\n", + "6861 volume01-1 1 1 \n", + "22907 volume01-2 1 2 \n", + "13935 volume01-3 1 3 \n", + "17096 volume01-4 1 4 \n", + "20692 volume01-5 1 5 \n", + "\n", + " content \\\n", + "6861 ENCYCLOPÉDIE, DICTIONNAIRE RAISONNÉ DES SCIENC... \n", + "22907 A MONSEIGNEUR LE COMTE D'ARGENSON, MINISTRE ET... \n", + "13935 DISCOURS PRÉLIMINAIRE DES EDITEURS. L'Encyclop... \n", + "17096 ENCYCLOPÉDIE, DICTIONNAIRE RAISONNÉ DES SCIENC... \n", + "20692 A, a & a s.m. (ordre Encyclopéd. Entend. Scien... \n", + "\n", + " headword normClass author \\\n", + "6861 Title Page unclassified unsigned \n", + "22907 A MONSEIGNEUR LE COMTE D'ARGENSON unclassified Diderot & d'Alembert \n", + "13935 DISCOURS PRÉLIMINAIRE DES EDITEURS unclassified d'Alembert \n", + "17096 NaN NaN NaN \n", + "20692 A, a & a Grammaire Dumarsais5 \n", + "\n", + " nb Words nb EN nb Name EDDA nb Person nb ENE nb ENE Place \\\n", + "6861 129 24 10 4 8 4 \n", + "22907 252 5 0 0 3 0 \n", + "13935 49007 1013 379 177 279 6 \n", + "17096 10 0 0 0 0 0 \n", + "20692 856 28 8 11 7 1 \n", + "\n", + " nb ENE Person nb EN geocoded nb EN EDDA geocoded type latlong \\\n", + "6861 0 0 0 NaN False \n", + "22907 2 0 0 NaN False \n", + "13935 19 0 0 NaN False \n", + "17096 0 0 0 NaN False \n", + "20692 1 0 0 NaN False \n", + "\n", + " latlong value \n", + "6861 NaN \n", + "22907 NaN \n", + "13935 NaN \n", + "17096 NaN \n", + "20692 NaN " + ] + }, + "execution_count": 53, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# On affiche les premières lignes\n", + 
"data.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 54, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "74165" + ] + }, + "execution_count": 54, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "# Nombre d'articles présents dans ce jeu de données.\n", + "len(data)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 2.1. Calcul des quartiles (par rapport au nombre de mots) pour l'ensemble des articles de géographie" + ] + }, + { + "cell_type": "code", + "execution_count": 5, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "25.0 43.0 86.0\n" + ] + } + ], + "source": [ + "q1, q2, q3 = data['nb Words'].quantile([0.25, 0.5, 0.75])\n", + "\n", + "print(q1, q2, q3)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 3. Filtrage selon si la premiere phrase contient \"classifieur de France\"" + ] + }, + { + "cell_type": "code", + "execution_count": 6, + "metadata": {}, + "outputs": [], + "source": [ + "splitter = SentenceSplitter(language='fr')\n", + "\n", + "def filtreFrance(content):\n", + " # initialisation de la 1ere variable de sortie\n", + " found = False\n", + " classifieur = ''\n", + " \n", + " # liste des mots qui peuvent être classifieurs de \"de France\"\n", + " list_classifieurs = 
\"ville|Ville|riviere|rivieres|ile|Ile|île|isle|iles|îles|province|fleuve|bourg|Bourg|montagne|montagnes|lieu|royaume|Royaume|pays|village|port|bourgade|promontoire|Promontoire|comté|lac|lacs|forteresse|golfe|golphe|cap|capitale|canton|vallée|place|principauté|château|fauxbourg|fauxbourgs|fontaine|forêt|forêts|gouvernement|municipe|maison|nation|palatinat|Palatinat|campagne|duché|bailliage|bois|capitainerie|contrée|état|marais|cercle|district|eaux|écueil|écueils|paroisse|plaine|quartier|champ|endroit|forum|Forum|havre|passage|pont|ruisseau|terre|torrent|volcan|abbaye|baronie|capitainie|champ|champs|chef|chemin|cité|colline|désert|empire|détroit|entrepôt|fauxbourg|grotte|habitation|isthme|marquisat|mont|mur|palais|péninsule|préfecture|Province|rade|région|rocher|route|ruines|salines|seigneurie|station|territoire|hameau|mer|rue\"\n", + " \n", + " # on segmente le texte en phrases\n", + " sentences = splitter.split(text=content)\n", + " m = re.search(\"(\"+list_classifieurs+\") (\\w+\\s){0,3}de France\", sentences[0])\n", + " if m: \n", + " found = True\n", + " classifieur = m.group(1)\n", + " else:\n", + " pos = sentences[0].find('de France')\n", + " if pos > -1:\n", + " found = True\n", + " \n", + " return found, classifieur\n", + "\n", + "## On vectorise la fonction afin de l'appliquer de manière efficace (en terme de temps de calcul) sur le dataframe\n", + "v_filtreFrance = np.vectorize(filtreFrance)" + ] + }, + { + "cell_type": "code", + "execution_count": 7, + "metadata": {}, + "outputs": [], + "source": [ + "data['de France'], data['classifieur de France'] = v_filtreFrance(data.content)\n" + ] + }, + { + "cell_type": "code", + "execution_count": 8, + "metadata": {}, + "outputs": [], + "source": [ + "data_france = data[(data['de France'] == True)]" + ] + }, + { + "cell_type": "code", + "execution_count": 9, + "metadata": {}, + "outputs": [], + "source": [ + "data_france_cl = data_france[(data_france['classifieur de France'] != \"\")]" + ] + }, + { + "cell_type": 
"code", + "execution_count": 10, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(1450, 21)" + ] + }, + "execution_count": 10, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "data_france.shape" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(1415, 21)" + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "data_france_cl.shape" + ] + }, + { + "cell_type": "code", + "execution_count": 12, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Il y a 1415 articles avec \"de France\" et un classifieur et 35 sans classifieur\n" + ] + } + ], + "source": [ + "print('Il y a '+ str(len(data_france_cl)) + ' articles avec \"de France\" et un classifieur et ' +str(len(data_france)-len(data_france_cl))+ ' sans classifieur')\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 4. Filtrage des sous-groupes selon les auteurs\n", + "\n", + "### 4.1. 
Articles de géographie signés par Diderot" + ] + }, + { + "cell_type": "code", + "execution_count": 13, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(172, 21)" + ] + }, + "execution_count": 13, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "data_diderot = data_france_cl[(data_france_cl['author'] == 'Diderot')]\n", + "data_diderot.shape\n" + ] + }, + { + "cell_type": "code", + "execution_count": 14, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "<div>\n", + "<style scoped>\n", + " .dataframe tbody tr th:only-of-type {\n", + " vertical-align: middle;\n", + " }\n", + "\n", + " .dataframe tbody tr th {\n", + " vertical-align: top;\n", + " }\n", + "\n", + " .dataframe thead th {\n", + " text-align: right;\n", + " }\n", + "</style>\n", + "<table border=\"1\" class=\"dataframe\">\n", + " <thead>\n", + " <tr style=\"text-align: right;\">\n", + " <th></th>\n", + " <th>filename</th>\n", + " <th>volume</th>\n", + " <th>number</th>\n", + " <th>content</th>\n", + " <th>headword</th>\n", + " <th>normClass</th>\n", + " <th>author</th>\n", + " <th>nb Words</th>\n", + " <th>nb EN</th>\n", + " <th>nb Name EDDA</th>\n", + " <th>...</th>\n", + " <th>nb ENE</th>\n", + " <th>nb ENE Place</th>\n", + " <th>nb ENE Person</th>\n", + " <th>nb EN geocoded</th>\n", + " <th>nb EN EDDA geocoded</th>\n", + " <th>type</th>\n", + " <th>latlong</th>\n", + " <th>latlong value</th>\n", + " <th>de France</th>\n", + " <th>classifieur de France</th>\n", + " </tr>\n", + " </thead>\n", + " <tbody>\n", + " <tr>\n", + " <th>4962</th>\n", + " <td>volume01-890</td>\n", + " <td>1</td>\n", + " <td>890</td>\n", + " <td>* ADOUR, (Géog. mod.) 
riviere de France qui pr...</td>\n", + " <td>ADOUR</td>\n", + " <td>Géographie moderne</td>\n", + " <td>Diderot</td>\n", + " <td>42</td>\n", + " <td>4</td>\n", + " <td>4</td>\n", + " <td>...</td>\n", + " <td>2</td>\n", + " <td>2</td>\n", + " <td>0</td>\n", + " <td>3</td>\n", + " <td>3</td>\n", + " <td>hydronyme</td>\n", + " <td>False</td>\n", + " <td>NaN</td>\n", + " <td>True</td>\n", + " <td>riviere</td>\n", + " </tr>\n", + " <tr>\n", + " <th>3616</th>\n", + " <td>volume01-1065</td>\n", + " <td>1</td>\n", + " <td>1065</td>\n", + " <td>* Afrique, (Géog. mod.) petite ville de France...</td>\n", + " <td>Afrique</td>\n", + " <td>Géographie moderne</td>\n", + " <td>Diderot</td>\n", + " <td>12</td>\n", + " <td>4</td>\n", + " <td>4</td>\n", + " <td>...</td>\n", + " <td>2</td>\n", + " <td>2</td>\n", + " <td>0</td>\n", + " <td>3</td>\n", + " <td>3</td>\n", + " <td>ville</td>\n", + " <td>False</td>\n", + " <td>NaN</td>\n", + " <td>True</td>\n", + " <td>ville</td>\n", + " </tr>\n", + " <tr>\n", + " <th>13155</th>\n", + " <td>volume01-1087</td>\n", + " <td>1</td>\n", + " <td>1087</td>\n", + " <td>* AGDE, (Géog.) ville de France en Languedoc, ...</td>\n", + " <td>AGDE</td>\n", + " <td>Géographie</td>\n", + " <td>Diderot</td>\n", + " <td>34</td>\n", + " <td>6</td>\n", + " <td>4</td>\n", + " <td>...</td>\n", + " <td>3</td>\n", + " <td>3</td>\n", + " <td>0</td>\n", + " <td>4</td>\n", + " <td>3</td>\n", + " <td>ville</td>\n", + " <td>True</td>\n", + " <td>NaN</td>\n", + " <td>True</td>\n", + " <td>ville</td>\n", + " </tr>\n", + " <tr>\n", + " <th>8738</th>\n", + " <td>volume01-1103</td>\n", + " <td>1</td>\n", + " <td>1103</td>\n", + " <td>* AGEN, (Géog.) 
ancienne ville de France, capi...</td>\n", + " <td>AGEN</td>\n", + " <td>Géographie</td>\n", + " <td>Diderot</td>\n", + " <td>28</td>\n", + " <td>5</td>\n", + " <td>4</td>\n", + " <td>...</td>\n", + " <td>3</td>\n", + " <td>3</td>\n", + " <td>0</td>\n", + " <td>3</td>\n", + " <td>3</td>\n", + " <td>ville</td>\n", + " <td>True</td>\n", + " <td>NaN</td>\n", + " <td>True</td>\n", + " <td>ville</td>\n", + " </tr>\n", + " <tr>\n", + " <th>1678</th>\n", + " <td>volume01-1210</td>\n", + " <td>1</td>\n", + " <td>1210</td>\n", + " <td>* AGRERE (Géog.) petite ville de France dans l...</td>\n", + " <td>AGRERE</td>\n", + " <td>Géographie</td>\n", + " <td>Diderot</td>\n", + " <td>13</td>\n", + " <td>2</td>\n", + " <td>2</td>\n", + " <td>...</td>\n", + " <td>1</td>\n", + " <td>1</td>\n", + " <td>0</td>\n", + " <td>1</td>\n", + " <td>1</td>\n", + " <td>ville</td>\n", + " <td>False</td>\n", + " <td>NaN</td>\n", + " <td>True</td>\n", + " <td>ville</td>\n", + " </tr>\n", + " </tbody>\n", + "</table>\n", + "<p>5 rows × 21 columns</p>\n", + "</div>" + ], + "text/plain": [ + " filename volume number \\\n", + "4962 volume01-890 1 890 \n", + "3616 volume01-1065 1 1065 \n", + "13155 volume01-1087 1 1087 \n", + "8738 volume01-1103 1 1103 \n", + "1678 volume01-1210 1 1210 \n", + "\n", + " content headword \\\n", + "4962 * ADOUR, (Géog. mod.) riviere de France qui pr... ADOUR \n", + "3616 * Afrique, (Géog. mod.) petite ville de France... Afrique \n", + "13155 * AGDE, (Géog.) ville de France en Languedoc, ... AGDE \n", + "8738 * AGEN, (Géog.) ancienne ville de France, capi... AGEN \n", + "1678 * AGRERE (Géog.) petite ville de France dans l... AGRERE \n", + "\n", + " normClass author nb Words nb EN nb Name EDDA ... \\\n", + "4962 Géographie moderne Diderot 42 4 4 ... \n", + "3616 Géographie moderne Diderot 12 4 4 ... \n", + "13155 Géographie Diderot 34 6 4 ... \n", + "8738 Géographie Diderot 28 5 4 ... \n", + "1678 Géographie Diderot 13 2 2 ... 
\n", + "\n", + " nb ENE nb ENE Place nb ENE Person nb EN geocoded \\\n", + "4962 2 2 0 3 \n", + "3616 2 2 0 3 \n", + "13155 3 3 0 4 \n", + "8738 3 3 0 3 \n", + "1678 1 1 0 1 \n", + "\n", + " nb EN EDDA geocoded type latlong latlong value de France \\\n", + "4962 3 hydronyme False NaN True \n", + "3616 3 ville False NaN True \n", + "13155 3 ville True NaN True \n", + "8738 3 ville True NaN True \n", + "1678 1 ville False NaN True \n", + "\n", + " classifieur de France \n", + "4962 riviere \n", + "3616 ville \n", + "13155 ville \n", + "8738 ville \n", + "1678 ville \n", + "\n", + "[5 rows x 21 columns]" + ] + }, + "execution_count": 14, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "data_diderot.head()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 4.2. Articles de géographie signés par Jaucourt" + ] + }, + { + "cell_type": "code", + "execution_count": 15, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(716, 21)" + ] + }, + "execution_count": 15, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "data_jaucourt = data_france_cl[(data_france_cl['author'] == 'Jaucourt')]\n", + "data_jaucourt.shape\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 4.3. 
Articles de géographie signés par un autre auteur" + ] + }, + { + "cell_type": "code", + "execution_count": 16, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(2, 21)" + ] + }, + "execution_count": 16, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "data_other = data_france_cl[(data_france_cl['author'] != 'Diderot') & (data_france_cl['author'] != 'Jaucourt') & (data_france_cl['author'] != 'unsigned') ]\n", + "data_other.shape" + ] + }, + { + "cell_type": "code", + "execution_count": 17, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "<div>\n", + "<style scoped>\n", + " .dataframe tbody tr th:only-of-type {\n", + " vertical-align: middle;\n", + " }\n", + "\n", + " .dataframe tbody tr th {\n", + " vertical-align: top;\n", + " }\n", + "\n", + " .dataframe thead th {\n", + " text-align: right;\n", + " }\n", + "</style>\n", + "<table border=\"1\" class=\"dataframe\">\n", + " <thead>\n", + " <tr style=\"text-align: right;\">\n", + " <th></th>\n", + " <th>filename</th>\n", + " <th>volume</th>\n", + " <th>number</th>\n", + " <th>content</th>\n", + " <th>headword</th>\n", + " <th>normClass</th>\n", + " <th>author</th>\n", + " <th>nb Words</th>\n", + " <th>nb EN</th>\n", + " <th>nb Name EDDA</th>\n", + " <th>...</th>\n", + " <th>nb ENE</th>\n", + " <th>nb ENE Place</th>\n", + " <th>nb ENE Person</th>\n", + " <th>nb EN geocoded</th>\n", + " <th>nb EN EDDA geocoded</th>\n", + " <th>type</th>\n", + " <th>latlong</th>\n", + " <th>latlong value</th>\n", + " <th>de France</th>\n", + " <th>classifieur de France</th>\n", + " </tr>\n", + " </thead>\n", + " <tbody>\n", + " <tr>\n", + " <th>9134</th>\n", + " <td>volume13-126</td>\n", + " <td>13</td>\n", + " <td>126</td>\n", + " <td>PONS, (Géog. mod.) 
en latin Pontes, petite vil...</td>\n", + " <td>PONS</td>\n", + " <td>Géographie moderne</td>\n", + " <td>Jaucourt & Jaucourt</td>\n", + " <td>870</td>\n", + " <td>74</td>\n", + " <td>37</td>\n", + " <td>...</td>\n", + " <td>34</td>\n", + " <td>18</td>\n", + " <td>4</td>\n", + " <td>18</td>\n", + " <td>17</td>\n", + " <td>NaN</td>\n", + " <td>True</td>\n", + " <td>NaN</td>\n", + " <td>True</td>\n", + " <td>ville</td>\n", + " </tr>\n", + " <tr>\n", + " <th>6188</th>\n", + " <td>volume13-218</td>\n", + " <td>13</td>\n", + " <td>218</td>\n", + " <td>PONT-SUR-SEINE, (Géog. mod.) en latin moderne ...</td>\n", + " <td>PONT-SUR-SEINE</td>\n", + " <td>Géographie moderne</td>\n", + " <td>Jaucourt5</td>\n", + " <td>70</td>\n", + " <td>9</td>\n", + " <td>5</td>\n", + " <td>...</td>\n", + " <td>4</td>\n", + " <td>2</td>\n", + " <td>0</td>\n", + " <td>4</td>\n", + " <td>3</td>\n", + " <td>NaN</td>\n", + " <td>True</td>\n", + " <td>NaN</td>\n", + " <td>True</td>\n", + " <td>ville</td>\n", + " </tr>\n", + " </tbody>\n", + "</table>\n", + "<p>2 rows × 21 columns</p>\n", + "</div>" + ], + "text/plain": [ + " filename volume number \\\n", + "9134 volume13-126 13 126 \n", + "6188 volume13-218 13 218 \n", + "\n", + " content headword \\\n", + "9134 PONS, (Géog. mod.) en latin Pontes, petite vil... PONS \n", + "6188 PONT-SUR-SEINE, (Géog. mod.) en latin moderne ... PONT-SUR-SEINE \n", + "\n", + " normClass author nb Words nb EN nb Name EDDA \\\n", + "9134 Géographie moderne Jaucourt & Jaucourt 870 74 37 \n", + "6188 Géographie moderne Jaucourt5 70 9 5 \n", + "\n", + " ... nb ENE nb ENE Place nb ENE Person nb EN geocoded \\\n", + "9134 ... 34 18 4 18 \n", + "6188 ... 
4 2 0 4 \n", + "\n", + " nb EN EDDA geocoded type latlong latlong value de France \\\n", + "9134 17 NaN True NaN True \n", + "6188 3 NaN True NaN True \n", + "\n", + " classifieur de France \n", + "9134 ville \n", + "6188 ville \n", + "\n", + "[2 rows x 21 columns]" + ] + }, + "execution_count": 17, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "data_other" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 4.4. Articles de géographie non signés" + ] + }, + { + "cell_type": "code", + "execution_count": 27, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(525, 21)" + ] + }, + "execution_count": 27, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "data_unsigned = data_france_cl[(data_france_cl['author'] == 'unsigned')]\n", + "data_unsigned.shape" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 5. Echantillonnage aléatoire\n", + "\n", + "### 5.1 Calcul des quartiles" + ] + }, + { + "cell_type": "code", + "execution_count": 28, + "metadata": {}, + "outputs": [], + "source": [ + "d_q1, d_q2, d_q3 = data_diderot['nb Words'].quantile([0.25, 0.5, 0.75])\n", + "j_q1, j_q2, j_q3 = data_jaucourt['nb Words'].quantile([0.25, 0.5, 0.75])\n", + "u_q1, u_q2, u_q3 = data_unsigned['nb Words'].quantile([0.25, 0.5, 0.75])" + ] + }, + { + "cell_type": "code", + "execution_count": 29, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Diderot (172 articles) : q1 : 15.0 - q2 : 19.0 - q3 : 24.0\n", + "Jaucourt (716 articles) : q1 : 42.0 - q2 : 71.0 - q3 : 165.0\n", + "Unsigned (525 articles) : q1 : 15.0 - q2 : 21.0 - q3 : 32.0\n" + ] + } + ], + "source": [ + "print('Diderot ('+str(len(data_diderot))+' articles) : q1 : '+str(d_q1)+ ' - q2 : '+str(d_q2)+ ' - q3 : '+str(d_q3))\n", + "print('Jaucourt ('+str(len(data_jaucourt))+' articles) : q1 : '+str(j_q1)+ ' - q2 : '+str(j_q2)+ ' - q3 : '+str(j_q3))\n", + 
"print('Unsigned ('+str(len(data_unsigned))+' articles) : q1 : '+str(u_q1)+ ' - q2 : '+str(u_q2)+ ' - q3 : '+str(u_q3))\n" + ] + }, + { + "cell_type": "code", + "execution_count": 30, + "metadata": {}, + "outputs": [], + "source": [ + "data_diderot_q1 = data_diderot[(data_diderot['nb Words'] < d_q1)]\n", + "data_diderot_q2 = data_diderot[(data_diderot['nb Words'] >= d_q1) & (data_diderot['nb Words'] < d_q2)]\n", + "data_diderot_q3 = data_diderot[(data_diderot['nb Words'] >= d_q2) & (data_diderot['nb Words'] < d_q3)]\n", + "data_diderot_q4 = data_diderot[(data_diderot['nb Words'] >= d_q3)]" + ] + }, + { + "cell_type": "code", + "execution_count": 31, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "36 47 40 49\n" + ] + } + ], + "source": [ + "print(str(len(data_diderot_q1)) +\" \"+ str(len(data_diderot_q2)) +\" \"+ str(len(data_diderot_q3))+\" \"+ str(len(data_diderot_q4)))\n" + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "metadata": {}, + "outputs": [], + "source": [ + "data_jaucourt_q1 = data_jaucourt[(data_jaucourt['nb Words'] < j_q1)]\n", + "data_jaucourt_q2 = data_jaucourt[(data_jaucourt['nb Words'] >= j_q1) & (data_jaucourt['nb Words'] < j_q2)]\n", + "data_jaucourt_q3 = data_jaucourt[(data_jaucourt['nb Words'] >= j_q2) & (data_jaucourt['nb Words'] < j_q3)]\n", + "data_jaucourt_q4 = data_jaucourt[(data_jaucourt['nb Words'] >= j_q3)]" + ] + }, + { + "cell_type": "code", + "execution_count": 33, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "176 178 182 180\n" + ] + } + ], + "source": [ + "print(str(len(data_jaucourt_q1)) +\" \"+ str(len(data_jaucourt_q2)) +\" \"+ str(len(data_jaucourt_q3))+\" \"+ str(len(data_jaucourt_q4)))\n" + ] + }, + { + "cell_type": "code", + "execution_count": 34, + "metadata": {}, + "outputs": [], + "source": [ + "data_unsigned_q1 = data_unsigned[(data_unsigned['nb Words'] <= u_q1)]\n", + "data_unsigned_q2 = 
data_unsigned[(data_unsigned['nb Words'] >= u_q1) & (data_unsigned['nb Words'] < u_q2)]\n", + "data_unsigned_q3 = data_unsigned[(data_unsigned['nb Words'] >= u_q2) & (data_unsigned['nb Words'] < u_q3)]\n", + "data_unsigned_q4 = data_unsigned[(data_unsigned['nb Words'] >= u_q3)]" + ] + }, + { + "cell_type": "code", + "execution_count": 35, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "143 132 134 134\n" + ] + } + ], + "source": [ + "print(str(len(data_unsigned_q1)) +\" \"+ str(len(data_unsigned_q2)) +\" \"+ str(len(data_unsigned_q3))+\" \"+ str(len(data_unsigned_q4)))\n" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 5.2 Sélection aléatoire par sous-groupe" + ] + }, + { + "cell_type": "code", + "execution_count": 36, + "metadata": {}, + "outputs": [], + "source": [ + "s_small = 4\n", + "s_big = 10\n", + "\n", + "sample10_diderot_q1 = data_diderot_q1.sample(10)\n", + "sample10_diderot_q2 = data_diderot_q2.sample(10)\n", + "sample10_diderot_q3 = data_diderot_q3.sample(10)\n", + "sample10_diderot_q4 = data_diderot_q4.sample(10)\n", + "\n", + "sample5_diderot_q1 = sample10_diderot_q1.sample(s_small)\n", + "sample5_diderot_q2 = sample10_diderot_q2.sample(s_small)\n", + "sample5_diderot_q3 = sample10_diderot_q3.sample(s_small)\n", + "sample5_diderot_q4 = sample10_diderot_q4.sample(s_small)\n", + "\n", + "sample10_jaucourt_q1 = data_jaucourt_q1.sample(10)\n", + "sample10_jaucourt_q2 = data_jaucourt_q2.sample(10)\n", + "sample10_jaucourt_q3 = data_jaucourt_q3.sample(10)\n", + "sample10_jaucourt_q4 = data_jaucourt_q4.sample(10)\n", + "\n", + "sample5_jaucourt_q1 = sample10_jaucourt_q1.sample(s_small)\n", + "sample5_jaucourt_q2 = sample10_jaucourt_q2.sample(s_small)\n", + "sample5_jaucourt_q3 = sample10_jaucourt_q3.sample(s_small)\n", + "sample5_jaucourt_q4 = sample10_jaucourt_q4.sample(s_small)\n", + "\n", + "sample10_unsigned_q1 = data_unsigned_q1.sample(10)\n", + "sample10_unsigned_q2 = 
data_unsigned_q2.sample(10)\n", + "sample10_unsigned_q3 = data_unsigned_q3.sample(10)\n", + "sample10_unsigned_q4 = data_unsigned_q4.sample(10)\n", + "\n", + "sample5_unsigned_q1 = sample10_unsigned_q1.sample(s_small)\n", + "sample5_unsigned_q2 = sample10_unsigned_q2.sample(s_small)\n", + "sample5_unsigned_q3 = sample10_unsigned_q3.sample(s_small)\n", + "sample5_unsigned_q4 = sample10_unsigned_q4.sample(s_small)" + ] + }, + { + "cell_type": "code", + "execution_count": 37, + "metadata": {}, + "outputs": [], + "source": [ + "sample10_diderot = pd.concat([sample10_diderot_q1,sample10_diderot_q2,sample10_diderot_q3,sample10_diderot_q4], ignore_index=True)\n", + "sample10_jaucourt = pd.concat([sample10_jaucourt_q1,sample10_jaucourt_q2,sample10_jaucourt_q3,sample10_jaucourt_q4], ignore_index=True)\n", + "sample10_unsigned = pd.concat([sample10_unsigned_q1,sample10_unsigned_q2,sample10_unsigned_q3,sample10_unsigned_q4], ignore_index=True)\n", + "\n", + "sample10 = pd.concat([sample10_diderot, sample10_jaucourt, sample10_unsigned], ignore_index=True)\n" + ] + }, + { + "cell_type": "code", + "execution_count": 38, + "metadata": {}, + "outputs": [], + "source": [ + "sample5_diderot = pd.concat([sample5_diderot_q1,sample5_diderot_q2,sample5_diderot_q3,sample5_diderot_q4], ignore_index=True)\n", + "sample5_jaucourt = pd.concat([sample5_jaucourt_q1,sample5_jaucourt_q2,sample5_jaucourt_q3,sample5_jaucourt_q4], ignore_index=True)\n", + "sample5_unsigned = pd.concat([sample5_unsigned_q1,sample5_unsigned_q2,sample5_unsigned_q3,sample5_unsigned_q4], ignore_index=True)\n", + "\n", + "sample5 = pd.concat([sample5_diderot, sample5_jaucourt, sample5_unsigned], ignore_index=True)\n" + ] + }, + { + "cell_type": "code", + "execution_count": 39, + "metadata": { + "scrolled": true + }, + "outputs": [ + { + "data": { + "text/html": [ + "<div>\n", + "<style scoped>\n", + " .dataframe tbody tr th:only-of-type {\n", + " vertical-align: middle;\n", + " }\n", + "\n", + " .dataframe tbody tr th 
{\n", + " vertical-align: top;\n", + " }\n", + "\n", + " .dataframe thead th {\n", + " text-align: right;\n", + " }\n", + "</style>\n", + "<table border=\"1\" class=\"dataframe\">\n", + " <thead>\n", + " <tr style=\"text-align: right;\">\n", + " <th></th>\n", + " <th>filename</th>\n", + " <th>volume</th>\n", + " <th>number</th>\n", + " <th>content</th>\n", + " <th>headword</th>\n", + " <th>normClass</th>\n", + " <th>author</th>\n", + " <th>nb Words</th>\n", + " <th>nb EN</th>\n", + " <th>nb Name EDDA</th>\n", + " <th>...</th>\n", + " <th>nb ENE</th>\n", + " <th>nb ENE Place</th>\n", + " <th>nb ENE Person</th>\n", + " <th>nb EN geocoded</th>\n", + " <th>nb EN EDDA geocoded</th>\n", + " <th>type</th>\n", + " <th>latlong</th>\n", + " <th>latlong value</th>\n", + " <th>de France</th>\n", + " <th>classifieur de France</th>\n", + " </tr>\n", + " </thead>\n", + " <tbody>\n", + " <tr>\n", + " <th>0</th>\n", + " <td>volume02-1504</td>\n", + " <td>2</td>\n", + " <td>1504</td>\n", + " <td>* BEIRE, (Géog.) petite ville de France, en Bo...</td>\n", + " <td>BEIRE</td>\n", + " <td>Géographie</td>\n", + " <td>Diderot</td>\n", + " <td>12</td>\n", + " <td>4</td>\n", + " <td>4</td>\n", + " <td>...</td>\n", + " <td>2</td>\n", + " <td>2</td>\n", + " <td>0</td>\n", + " <td>2</td>\n", + " <td>2</td>\n", + " <td>ville</td>\n", + " <td>False</td>\n", + " <td>NaN</td>\n", + " <td>True</td>\n", + " <td>ville</td>\n", + " </tr>\n", + " <tr>\n", + " <th>1</th>\n", + " <td>volume01-2599</td>\n", + " <td>1</td>\n", + " <td>2599</td>\n", + " <td>* ANDONVILLE, (Géog. mod.) 
ville de France, gé...</td>\n", + " <td>ANDONVILLE</td>\n", + " <td>Géographie moderne</td>\n", + " <td>Diderot</td>\n", + " <td>12</td>\n", + " <td>4</td>\n", + " <td>4</td>\n", + " <td>...</td>\n", + " <td>3</td>\n", + " <td>2</td>\n", + " <td>0</td>\n", + " <td>3</td>\n", + " <td>3</td>\n", + " <td>ville</td>\n", + " <td>False</td>\n", + " <td>NaN</td>\n", + " <td>True</td>\n", + " <td>ville</td>\n", + " </tr>\n", + " <tr>\n", + " <th>2</th>\n", + " <td>volume01-1065</td>\n", + " <td>1</td>\n", + " <td>1065</td>\n", + " <td>* Afrique, (Géog. mod.) petite ville de France...</td>\n", + " <td>Afrique</td>\n", + " <td>Géographie moderne</td>\n", + " <td>Diderot</td>\n", + " <td>12</td>\n", + " <td>4</td>\n", + " <td>4</td>\n", + " <td>...</td>\n", + " <td>2</td>\n", + " <td>2</td>\n", + " <td>0</td>\n", + " <td>3</td>\n", + " <td>3</td>\n", + " <td>ville</td>\n", + " <td>False</td>\n", + " <td>NaN</td>\n", + " <td>True</td>\n", + " <td>ville</td>\n", + " </tr>\n", + " <tr>\n", + " <th>3</th>\n", + " <td>volume02-1391</td>\n", + " <td>2</td>\n", + " <td>1391</td>\n", + " <td>* BEAUVOISIS ou BEAUVAISIS, (Géog.) petit pays...</td>\n", + " <td>BEAUVOISIS ou BEAUVAISIS</td>\n", + " <td>Géographie</td>\n", + " <td>Diderot</td>\n", + " <td>13</td>\n", + " <td>4</td>\n", + " <td>4</td>\n", + " <td>...</td>\n", + " <td>1</td>\n", + " <td>1</td>\n", + " <td>0</td>\n", + " <td>2</td>\n", + " <td>2</td>\n", + " <td>pays</td>\n", + " <td>False</td>\n", + " <td>NaN</td>\n", + " <td>True</td>\n", + " <td>pays</td>\n", + " </tr>\n", + " <tr>\n", + " <th>4</th>\n", + " <td>volume01-4843</td>\n", + " <td>1</td>\n", + " <td>4843</td>\n", + " <td>* AUBETERRE (Géog.) 
ville de France, dans l'An...</td>\n", + " <td>AUBETERRE</td>\n", + " <td>Géographie</td>\n", + " <td>Diderot</td>\n", + " <td>17</td>\n", + " <td>4</td>\n", + " <td>2</td>\n", + " <td>...</td>\n", + " <td>1</td>\n", + " <td>1</td>\n", + " <td>0</td>\n", + " <td>3</td>\n", + " <td>2</td>\n", + " <td>ville</td>\n", + " <td>True</td>\n", + " <td>NaN</td>\n", + " <td>True</td>\n", + " <td>ville</td>\n", + " </tr>\n", + " <tr>\n", + " <th>5</th>\n", + " <td>volume01-3831</td>\n", + " <td>1</td>\n", + " <td>3831</td>\n", + " <td>* ARGENCES, (Géog.) bourg de France en basse N...</td>\n", + " <td>ARGENCES</td>\n", + " <td>Géographie</td>\n", + " <td>Diderot</td>\n", + " <td>17</td>\n", + " <td>3</td>\n", + " <td>3</td>\n", + " <td>...</td>\n", + " <td>1</td>\n", + " <td>1</td>\n", + " <td>0</td>\n", + " <td>2</td>\n", + " <td>2</td>\n", + " <td>ville</td>\n", + " <td>True</td>\n", + " <td>NaN</td>\n", + " <td>True</td>\n", + " <td>bourg</td>\n", + " </tr>\n", + " <tr>\n", + " <th>6</th>\n", + " <td>volume01-5034</td>\n", + " <td>1</td>\n", + " <td>5034</td>\n", + " <td>* AUNEAU (Géographie.) petite ville de France,...</td>\n", + " <td>AUNEAU</td>\n", + " <td>Géographie</td>\n", + " <td>Diderot</td>\n", + " <td>16</td>\n", + " <td>4</td>\n", + " <td>4</td>\n", + " <td>...</td>\n", + " <td>2</td>\n", + " <td>2</td>\n", + " <td>0</td>\n", + " <td>4</td>\n", + " <td>4</td>\n", + " <td>ville</td>\n", + " <td>False</td>\n", + " <td>NaN</td>\n", + " <td>True</td>\n", + " <td>ville</td>\n", + " </tr>\n", + " <tr>\n", + " <th>7</th>\n", + " <td>volume01-3079</td>\n", + " <td>1</td>\n", + " <td>3079</td>\n", + " <td>* ANTRAIN ou ENTRAINS, (Géog. mod.) 
petite vil...</td>\n", + " <td>ANTRAIN ou ENTRAINS</td>\n", + " <td>Géographie moderne</td>\n", + " <td>Diderot</td>\n", + " <td>15</td>\n", + " <td>5</td>\n", + " <td>5</td>\n", + " <td>...</td>\n", + " <td>2</td>\n", + " <td>2</td>\n", + " <td>0</td>\n", + " <td>3</td>\n", + " <td>3</td>\n", + " <td>ville</td>\n", + " <td>False</td>\n", + " <td>NaN</td>\n", + " <td>True</td>\n", + " <td>ville</td>\n", + " </tr>\n", + " <tr>\n", + " <th>8</th>\n", + " <td>volume02-363</td>\n", + " <td>2</td>\n", + " <td>363</td>\n", + " <td>* BALLON (Géog.) ville de France, au diocese d...</td>\n", + " <td>BALLON</td>\n", + " <td>Géographie</td>\n", + " <td>Diderot</td>\n", + " <td>22</td>\n", + " <td>4</td>\n", + " <td>4</td>\n", + " <td>...</td>\n", + " <td>3</td>\n", + " <td>2</td>\n", + " <td>0</td>\n", + " <td>1</td>\n", + " <td>1</td>\n", + " <td>ville</td>\n", + " <td>True</td>\n", + " <td>NaN</td>\n", + " <td>True</td>\n", + " <td>ville</td>\n", + " </tr>\n", + " <tr>\n", + " <th>9</th>\n", + " <td>volume01-1279</td>\n", + " <td>1</td>\n", + " <td>1279</td>\n", + " <td>* Aigle, (Géog.) petite ville de France dans l...</td>\n", + " <td>Aigle</td>\n", + " <td>Géographie</td>\n", + " <td>Diderot</td>\n", + " <td>19</td>\n", + " <td>5</td>\n", + " <td>5</td>\n", + " <td>...</td>\n", + " <td>2</td>\n", + " <td>2</td>\n", + " <td>0</td>\n", + " <td>4</td>\n", + " <td>4</td>\n", + " <td>ville</td>\n", + " <td>False</td>\n", + " <td>NaN</td>\n", + " <td>True</td>\n", + " <td>ville</td>\n", + " </tr>\n", + " <tr>\n", + " <th>10</th>\n", + " <td>volume02-705</td>\n", + " <td>2</td>\n", + " <td>705</td>\n", + " <td>* BARENTON (Géog.) 
petite ville de France, dan...</td>\n", + " <td>BARENTON</td>\n", + " <td>Géographie</td>\n", + " <td>Diderot</td>\n", + " <td>20</td>\n", + " <td>5</td>\n", + " <td>5</td>\n", + " <td>...</td>\n", + " <td>3</td>\n", + " <td>3</td>\n", + " <td>0</td>\n", + " <td>5</td>\n", + " <td>5</td>\n", + " <td>ville</td>\n", + " <td>False</td>\n", + " <td>NaN</td>\n", + " <td>True</td>\n", + " <td>ville</td>\n", + " </tr>\n", + " <tr>\n", + " <th>11</th>\n", + " <td>volume01-2810</td>\n", + " <td>1</td>\n", + " <td>2810</td>\n", + " <td>* ANNONAY, (Géog. mod.) petite ville de France...</td>\n", + " <td>ANNONAY</td>\n", + " <td>Géographie moderne</td>\n", + " <td>Diderot</td>\n", + " <td>20</td>\n", + " <td>3</td>\n", + " <td>2</td>\n", + " <td>...</td>\n", + " <td>1</td>\n", + " <td>1</td>\n", + " <td>0</td>\n", + " <td>2</td>\n", + " <td>2</td>\n", + " <td>ville</td>\n", + " <td>True</td>\n", + " <td>NaN</td>\n", + " <td>True</td>\n", + " <td>ville</td>\n", + " </tr>\n", + " <tr>\n", + " <th>12</th>\n", + " <td>volume02-1614</td>\n", + " <td>2</td>\n", + " <td>1614</td>\n", + " <td>* BENAUGE, (Géog.) petite contrée de la Guienn...</td>\n", + " <td>BENAUGE</td>\n", + " <td>Géographie</td>\n", + " <td>Diderot</td>\n", + " <td>24</td>\n", + " <td>5</td>\n", + " <td>5</td>\n", + " <td>...</td>\n", + " <td>4</td>\n", + " <td>4</td>\n", + " <td>0</td>\n", + " <td>3</td>\n", + " <td>3</td>\n", + " <td>pays</td>\n", + " <td>False</td>\n", + " <td>NaN</td>\n", + " <td>True</td>\n", + " <td>province</td>\n", + " </tr>\n", + " <tr>\n", + " <th>13</th>\n", + " <td>volume01-1565</td>\n", + " <td>1</td>\n", + " <td>1565</td>\n", + " <td>* ALBI, (Géog.) 
ville de France, capitale de ...</td>\n", + " <td>ALBI</td>\n", + " <td>Géographie</td>\n", + " <td>Diderot</td>\n", + " <td>25</td>\n", + " <td>4</td>\n", + " <td>4</td>\n", + " <td>...</td>\n", + " <td>1</td>\n", + " <td>1</td>\n", + " <td>0</td>\n", + " <td>2</td>\n", + " <td>2</td>\n", + " <td>ville</td>\n", + " <td>True</td>\n", + " <td>NaN</td>\n", + " <td>True</td>\n", + " <td>ville</td>\n", + " </tr>\n", + " <tr>\n", + " <th>14</th>\n", + " <td>volume01-5144</td>\n", + " <td>1</td>\n", + " <td>5144</td>\n", + " <td>* AUTUN, (Géog.) ville de France au duché de B...</td>\n", + " <td>AUTUN</td>\n", + " <td>Géographie</td>\n", + " <td>Diderot</td>\n", + " <td>27</td>\n", + " <td>4</td>\n", + " <td>3</td>\n", + " <td>...</td>\n", + " <td>2</td>\n", + " <td>2</td>\n", + " <td>0</td>\n", + " <td>2</td>\n", + " <td>2</td>\n", + " <td>ville</td>\n", + " <td>True</td>\n", + " <td>NaN</td>\n", + " <td>True</td>\n", + " <td>ville</td>\n", + " </tr>\n", + " <tr>\n", + " <th>15</th>\n", + " <td>volume02-1564</td>\n", + " <td>2</td>\n", + " <td>1564</td>\n", + " <td>* BELLE-ISLE, (Géog.) île de France à six lieu...</td>\n", + " <td>BELLE-ISLE</td>\n", + " <td>Géographie</td>\n", + " <td>Diderot</td>\n", + " <td>28</td>\n", + " <td>4</td>\n", + " <td>3</td>\n", + " <td>...</td>\n", + " <td>3</td>\n", + " <td>3</td>\n", + " <td>0</td>\n", + " <td>3</td>\n", + " <td>3</td>\n", + " <td>île</td>\n", + " <td>False</td>\n", + " <td>NaN</td>\n", + " <td>True</td>\n", + " <td>île</td>\n", + " </tr>\n", + " <tr>\n", + " <th>16</th>\n", + " <td>volume15-2668</td>\n", + " <td>15</td>\n", + " <td>2668</td>\n", + " <td>STRENGENBACH ou STRENGBACH, le, (Géog. mod.) 
r...</td>\n", + " <td>STRENGENBACH ou STRENGBACH, le</td>\n", + " <td>Géographie moderne</td>\n", + " <td>Jaucourt</td>\n", + " <td>30</td>\n", + " <td>6</td>\n", + " <td>3</td>\n", + " <td>...</td>\n", + " <td>1</td>\n", + " <td>1</td>\n", + " <td>0</td>\n", + " <td>4</td>\n", + " <td>2</td>\n", + " <td>hydronyme</td>\n", + " <td>False</td>\n", + " <td>NaN</td>\n", + " <td>True</td>\n", + " <td>riviere</td>\n", + " </tr>\n", + " <tr>\n", + " <th>17</th>\n", + " <td>volume15-4368</td>\n", + " <td>15</td>\n", + " <td>4368</td>\n", + " <td>TARDÉNOIS, le (Géog. mod.) en latin du moyen â...</td>\n", + " <td>TARDÉNOIS, le</td>\n", + " <td>Géographie moderne</td>\n", + " <td>Jaucourt</td>\n", + " <td>34</td>\n", + " <td>7</td>\n", + " <td>6</td>\n", + " <td>...</td>\n", + " <td>2</td>\n", + " <td>1</td>\n", + " <td>1</td>\n", + " <td>2</td>\n", + " <td>2</td>\n", + " <td>NaN</td>\n", + " <td>False</td>\n", + " <td>NaN</td>\n", + " <td>True</td>\n", + " <td>pays</td>\n", + " </tr>\n", + " <tr>\n", + " <th>18</th>\n", + " <td>volume07-2017</td>\n", + " <td>7</td>\n", + " <td>2017</td>\n", + " <td>Germain-Laval, (Saint-) Géog. ville de France ...</td>\n", + " <td>Germain-Laval, (Saint-)</td>\n", + " <td>Géographie</td>\n", + " <td>Jaucourt</td>\n", + " <td>38</td>\n", + " <td>4</td>\n", + " <td>2</td>\n", + " <td>...</td>\n", + " <td>2</td>\n", + " <td>2</td>\n", + " <td>0</td>\n", + " <td>2</td>\n", + " <td>2</td>\n", + " <td>ville</td>\n", + " <td>True</td>\n", + " <td>NaN</td>\n", + " <td>True</td>\n", + " <td>ville</td>\n", + " </tr>\n", + " <tr>\n", + " <th>19</th>\n", + " <td>volume16-4274</td>\n", + " <td>16</td>\n", + " <td>4274</td>\n", + " <td>Valence, (Géog. mod.) 
nos géographes disent pe...</td>\n", + " <td>Valence</td>\n", + " <td>Géographie moderne</td>\n", + " <td>Jaucourt</td>\n", + " <td>32</td>\n", + " <td>5</td>\n", + " <td>3</td>\n", + " <td>...</td>\n", + " <td>2</td>\n", + " <td>2</td>\n", + " <td>0</td>\n", + " <td>2</td>\n", + " <td>2</td>\n", + " <td>NaN</td>\n", + " <td>False</td>\n", + " <td>NaN</td>\n", + " <td>True</td>\n", + " <td>ville</td>\n", + " </tr>\n", + " <tr>\n", + " <th>20</th>\n", + " <td>volume10-381</td>\n", + " <td>10</td>\n", + " <td>381</td>\n", + " <td>MARCELLIN, S. (Géog.) petite ville de France e...</td>\n", + " <td>MARCELLIN</td>\n", + " <td>Géographie</td>\n", + " <td>Jaucourt</td>\n", + " <td>55</td>\n", + " <td>9</td>\n", + " <td>5</td>\n", + " <td>...</td>\n", + " <td>3</td>\n", + " <td>2</td>\n", + " <td>1</td>\n", + " <td>2</td>\n", + " <td>2</td>\n", + " <td>ville</td>\n", + " <td>True</td>\n", + " <td>NaN</td>\n", + " <td>True</td>\n", + " <td>ville</td>\n", + " </tr>\n", + " <tr>\n", + " <th>21</th>\n", + " <td>volume13-514</td>\n", + " <td>13</td>\n", + " <td>514</td>\n", + " <td>PORTO-CROS, (Géog. mod.) petite île de France ...</td>\n", + " <td>PORTO-CROS</td>\n", + " <td>Géographie moderne</td>\n", + " <td>Jaucourt</td>\n", + " <td>48</td>\n", + " <td>5</td>\n", + " <td>4</td>\n", + " <td>...</td>\n", + " <td>3</td>\n", + " <td>2</td>\n", + " <td>0</td>\n", + " <td>3</td>\n", + " <td>2</td>\n", + " <td>île</td>\n", + " <td>False</td>\n", + " <td>NaN</td>\n", + " <td>True</td>\n", + " <td>île</td>\n", + " </tr>\n", + " <tr>\n", + " <th>22</th>\n", + " <td>volume12-3457</td>\n", + " <td>12</td>\n", + " <td>3457</td>\n", + " <td>PLOERMEL, (Géog. mod.) 
petite ville de France ...</td>\n", + " <td>PLOERMEL</td>\n", + " <td>Géographie moderne</td>\n", + " <td>Jaucourt</td>\n", + " <td>46</td>\n", + " <td>5</td>\n", + " <td>4</td>\n", + " <td>...</td>\n", + " <td>2</td>\n", + " <td>2</td>\n", + " <td>0</td>\n", + " <td>4</td>\n", + " <td>4</td>\n", + " <td>ville</td>\n", + " <td>True</td>\n", + " <td>NaN</td>\n", + " <td>True</td>\n", + " <td>ville</td>\n", + " </tr>\n", + " <tr>\n", + " <th>23</th>\n", + " <td>volume14-2693</td>\n", + " <td>14</td>\n", + " <td>2693</td>\n", + " <td>RUFFEC, (Géog. mod.) petite ville de France, d...</td>\n", + " <td>RUFFEC</td>\n", + " <td>Géographie moderne</td>\n", + " <td>Jaucourt</td>\n", + " <td>47</td>\n", + " <td>4</td>\n", + " <td>3</td>\n", + " <td>...</td>\n", + " <td>1</td>\n", + " <td>1</td>\n", + " <td>0</td>\n", + " <td>2</td>\n", + " <td>2</td>\n", + " <td>ville</td>\n", + " <td>True</td>\n", + " <td>NaN</td>\n", + " <td>True</td>\n", + " <td>ville</td>\n", + " </tr>\n", + " <tr>\n", + " <th>24</th>\n", + " <td>volume17-1439</td>\n", + " <td>17</td>\n", + " <td>1439</td>\n", + " <td>VODABLE, (Géog. mod.) bourg de France dans l'A...</td>\n", + " <td>VODABLE</td>\n", + " <td>Géographie moderne</td>\n", + " <td>Jaucourt</td>\n", + " <td>76</td>\n", + " <td>8</td>\n", + " <td>7</td>\n", + " <td>...</td>\n", + " <td>4</td>\n", + " <td>3</td>\n", + " <td>0</td>\n", + " <td>6</td>\n", + " <td>6</td>\n", + " <td>ville</td>\n", + " <td>True</td>\n", + " <td>NaN</td>\n", + " <td>True</td>\n", + " <td>bourg</td>\n", + " </tr>\n", + " <tr>\n", + " <th>25</th>\n", + " <td>volume13-208</td>\n", + " <td>13</td>\n", + " <td>208</td>\n", + " <td>PONTIVY, (Géog. mod.) 
petite ville de France, ...</td>\n", + " <td>PONTIVY</td>\n", + " <td>Géographie moderne</td>\n", + " <td>Jaucourt</td>\n", + " <td>145</td>\n", + " <td>15</td>\n", + " <td>10</td>\n", + " <td>...</td>\n", + " <td>3</td>\n", + " <td>3</td>\n", + " <td>0</td>\n", + " <td>11</td>\n", + " <td>9</td>\n", + " <td>ville</td>\n", + " <td>True</td>\n", + " <td>NaN</td>\n", + " <td>True</td>\n", + " <td>ville</td>\n", + " </tr>\n", + " <tr>\n", + " <th>26</th>\n", + " <td>volume09-524</td>\n", + " <td>9</td>\n", + " <td>524</td>\n", + " <td>KAYSERBERG, (Géog.) c'est-à -dire mont de l'emp...</td>\n", + " <td>KAYSERBERG</td>\n", + " <td>Géographie</td>\n", + " <td>Jaucourt</td>\n", + " <td>130</td>\n", + " <td>17</td>\n", + " <td>7</td>\n", + " <td>...</td>\n", + " <td>7</td>\n", + " <td>2</td>\n", + " <td>2</td>\n", + " <td>6</td>\n", + " <td>4</td>\n", + " <td>NaN</td>\n", + " <td>True</td>\n", + " <td>NaN</td>\n", + " <td>True</td>\n", + " <td>ville</td>\n", + " </tr>\n", + " <tr>\n", + " <th>27</th>\n", + " <td>volume11-1060</td>\n", + " <td>11</td>\n", + " <td>1060</td>\n", + " <td>Nogent-le-Rotrou, (Géog.) gros bourg de France...</td>\n", + " <td>Nogent-le-Rotrou</td>\n", + " <td>Géographie</td>\n", + " <td>Jaucourt</td>\n", + " <td>123</td>\n", + " <td>17</td>\n", + " <td>7</td>\n", + " <td>...</td>\n", + " <td>9</td>\n", + " <td>2</td>\n", + " <td>2</td>\n", + " <td>4</td>\n", + " <td>3</td>\n", + " <td>ville</td>\n", + " <td>True</td>\n", + " <td>NaN</td>\n", + " <td>True</td>\n", + " <td>bourg</td>\n", + " </tr>\n", + " <tr>\n", + " <th>28</th>\n", + " <td>volume12-2242</td>\n", + " <td>12</td>\n", + " <td>2242</td>\n", + " <td>PICARDIE, la, (Géog. mod.) 
province de France,...</td>\n", + " <td>PICARDIE, la</td>\n", + " <td>Géographie moderne</td>\n", + " <td>Jaucourt</td>\n", + " <td>432</td>\n", + " <td>60</td>\n", + " <td>35</td>\n", + " <td>...</td>\n", + " <td>14</td>\n", + " <td>5</td>\n", + " <td>3</td>\n", + " <td>16</td>\n", + " <td>11</td>\n", + " <td>pays</td>\n", + " <td>False</td>\n", + " <td>NaN</td>\n", + " <td>True</td>\n", + " <td>province</td>\n", + " </tr>\n", + " <tr>\n", + " <th>29</th>\n", + " <td>volume11-3735</td>\n", + " <td>11</td>\n", + " <td>3735</td>\n", + " <td>Palais, (Géograph. mod.) petite place forte de...</td>\n", + " <td>Palais</td>\n", + " <td>Géographie moderne</td>\n", + " <td>Jaucourt</td>\n", + " <td>515</td>\n", + " <td>42</td>\n", + " <td>17</td>\n", + " <td>...</td>\n", + " <td>16</td>\n", + " <td>6</td>\n", + " <td>2</td>\n", + " <td>8</td>\n", + " <td>7</td>\n", + " <td>NaN</td>\n", + " <td>True</td>\n", + " <td>NaN</td>\n", + " <td>True</td>\n", + " <td>place</td>\n", + " </tr>\n", + " <tr>\n", + " <th>30</th>\n", + " <td>volume14-2568</td>\n", + " <td>14</td>\n", + " <td>2568</td>\n", + " <td>ROUSSILLON, le, (Géog. mod.) en latin Ruscinon...</td>\n", + " <td>ROUSSILLON, le</td>\n", + " <td>Géographie moderne</td>\n", + " <td>Jaucourt</td>\n", + " <td>554</td>\n", + " <td>54</td>\n", + " <td>35</td>\n", + " <td>...</td>\n", + " <td>22</td>\n", + " <td>9</td>\n", + " <td>3</td>\n", + " <td>16</td>\n", + " <td>14</td>\n", + " <td>NaN</td>\n", + " <td>False</td>\n", + " <td>NaN</td>\n", + " <td>True</td>\n", + " <td>province</td>\n", + " </tr>\n", + " <tr>\n", + " <th>31</th>\n", + " <td>volume17-636</td>\n", + " <td>17</td>\n", + " <td>636</td>\n", + " <td>Vic-le-comte, (Géog. mod.) 
petite ville de Fra...</td>\n", + " <td>Vic-le-comte</td>\n", + " <td>Géographie moderne</td>\n", + " <td>Jaucourt</td>\n", + " <td>274</td>\n", + " <td>19</td>\n", + " <td>9</td>\n", + " <td>...</td>\n", + " <td>9</td>\n", + " <td>3</td>\n", + " <td>1</td>\n", + " <td>8</td>\n", + " <td>4</td>\n", + " <td>ville</td>\n", + " <td>True</td>\n", + " <td>NaN</td>\n", + " <td>True</td>\n", + " <td>ville</td>\n", + " </tr>\n", + " <tr>\n", + " <th>32</th>\n", + " <td>volume02-6579</td>\n", + " <td>2</td>\n", + " <td>6579</td>\n", + " <td>CERNIN, (Saint) Géog. petite ville de France, ...</td>\n", + " <td>CERNIN (Saint)</td>\n", + " <td>Géographie</td>\n", + " <td>unsigned</td>\n", + " <td>10</td>\n", + " <td>3</td>\n", + " <td>3</td>\n", + " <td>...</td>\n", + " <td>1</td>\n", + " <td>1</td>\n", + " <td>0</td>\n", + " <td>2</td>\n", + " <td>2</td>\n", + " <td>ville</td>\n", + " <td>False</td>\n", + " <td>NaN</td>\n", + " <td>True</td>\n", + " <td>ville</td>\n", + " </tr>\n", + " <tr>\n", + " <th>33</th>\n", + " <td>volume02-4075</td>\n", + " <td>2</td>\n", + " <td>4075</td>\n", + " <td>Bruges, (Géog.) petite ville de France, dans l...</td>\n", + " <td>Bruges</td>\n", + " <td>Géographie</td>\n", + " <td>unsigned</td>\n", + " <td>14</td>\n", + " <td>4</td>\n", + " <td>2</td>\n", + " <td>...</td>\n", + " <td>2</td>\n", + " <td>1</td>\n", + " <td>0</td>\n", + " <td>3</td>\n", + " <td>2</td>\n", + " <td>ville</td>\n", + " <td>False</td>\n", + " <td>NaN</td>\n", + " <td>True</td>\n", + " <td>ville</td>\n", + " </tr>\n", + " <tr>\n", + " <th>34</th>\n", + " <td>volume02-4564</td>\n", + " <td>2</td>\n", + " <td>4564</td>\n", + " <td>CADENAC, (Géog.) 
petite ville de France dans l...</td>\n", + " <td>CADENAC</td>\n", + " <td>Géographie</td>\n", + " <td>unsigned</td>\n", + " <td>14</td>\n", + " <td>4</td>\n", + " <td>3</td>\n", + " <td>...</td>\n", + " <td>2</td>\n", + " <td>2</td>\n", + " <td>0</td>\n", + " <td>1</td>\n", + " <td>1</td>\n", + " <td>ville</td>\n", + " <td>False</td>\n", + " <td>NaN</td>\n", + " <td>True</td>\n", + " <td>ville</td>\n", + " </tr>\n", + " <tr>\n", + " <th>35</th>\n", + " <td>volume02-6305</td>\n", + " <td>2</td>\n", + " <td>6305</td>\n", + " <td>CAYLAR, (le) Géog. petite ville de France, dan...</td>\n", + " <td>CAYLAR (le)</td>\n", + " <td>Géographie</td>\n", + " <td>unsigned</td>\n", + " <td>12</td>\n", + " <td>3</td>\n", + " <td>3</td>\n", + " <td>...</td>\n", + " <td>2</td>\n", + " <td>2</td>\n", + " <td>0</td>\n", + " <td>1</td>\n", + " <td>1</td>\n", + " <td>ville</td>\n", + " <td>False</td>\n", + " <td>NaN</td>\n", + " <td>True</td>\n", + " <td>ville</td>\n", + " </tr>\n", + " <tr>\n", + " <th>36</th>\n", + " <td>volume03-3769</td>\n", + " <td>3</td>\n", + " <td>3769</td>\n", + " <td>CONDOM, (Géog. mod.) ville de France en Gascog...</td>\n", + " <td>CONDOM</td>\n", + " <td>Géographie moderne</td>\n", + " <td>unsigned</td>\n", + " <td>19</td>\n", + " <td>5</td>\n", + " <td>4</td>\n", + " <td>...</td>\n", + " <td>2</td>\n", + " <td>2</td>\n", + " <td>0</td>\n", + " <td>3</td>\n", + " <td>2</td>\n", + " <td>ville</td>\n", + " <td>True</td>\n", + " <td>NaN</td>\n", + " <td>True</td>\n", + " <td>ville</td>\n", + " </tr>\n", + " <tr>\n", + " <th>37</th>\n", + " <td>volume17-2104</td>\n", + " <td>17</td>\n", + " <td>2104</td>\n", + " <td>WASSELONNE, (Géog. mod.) 
bourg ou petite ville...</td>\n", + " <td>WASSELONNE</td>\n", + " <td>Géographie moderne</td>\n", + " <td>unsigned</td>\n", + " <td>19</td>\n", + " <td>4</td>\n", + " <td>2</td>\n", + " <td>...</td>\n", + " <td>1</td>\n", + " <td>1</td>\n", + " <td>0</td>\n", + " <td>3</td>\n", + " <td>2</td>\n", + " <td>ville</td>\n", + " <td>False</td>\n", + " <td>NaN</td>\n", + " <td>True</td>\n", + " <td>bourg</td>\n", + " </tr>\n", + " <tr>\n", + " <th>38</th>\n", + " <td>volume04-2909</td>\n", + " <td>4</td>\n", + " <td>2909</td>\n", + " <td>CUSSET, (Géog. mod.) petite ville de France en...</td>\n", + " <td>CUSSET</td>\n", + " <td>Géographie moderne</td>\n", + " <td>unsigned</td>\n", + " <td>15</td>\n", + " <td>3</td>\n", + " <td>3</td>\n", + " <td>...</td>\n", + " <td>1</td>\n", + " <td>1</td>\n", + " <td>0</td>\n", + " <td>2</td>\n", + " <td>2</td>\n", + " <td>ville</td>\n", + " <td>True</td>\n", + " <td>NaN</td>\n", + " <td>True</td>\n", + " <td>ville</td>\n", + " </tr>\n", + " <tr>\n", + " <th>39</th>\n", + " <td>volume02-3396</td>\n", + " <td>2</td>\n", + " <td>3396</td>\n", + " <td>BOUTONNE, (Géog.) riviere de France, qui prend...</td>\n", + " <td>BOUTONNE</td>\n", + " <td>Géographie</td>\n", + " <td>unsigned</td>\n", + " <td>18</td>\n", + " <td>4</td>\n", + " <td>3</td>\n", + " <td>...</td>\n", + " <td>1</td>\n", + " <td>1</td>\n", + " <td>0</td>\n", + " <td>2</td>\n", + " <td>2</td>\n", + " <td>hydronyme</td>\n", + " <td>False</td>\n", + " <td>NaN</td>\n", + " <td>True</td>\n", + " <td>riviere</td>\n", + " </tr>\n", + " <tr>\n", + " <th>40</th>\n", + " <td>volume03-2391</td>\n", + " <td>3</td>\n", + " <td>2391</td>\n", + " <td>Clermont, (Géog. mod.) 
petite ville de France,...</td>\n", + " <td>Clermont</td>\n", + " <td>Géographie moderne</td>\n", + " <td>unsigned</td>\n", + " <td>28</td>\n", + " <td>7</td>\n", + " <td>5</td>\n", + " <td>...</td>\n", + " <td>1</td>\n", + " <td>1</td>\n", + " <td>0</td>\n", + " <td>4</td>\n", + " <td>3</td>\n", + " <td>ville</td>\n", + " <td>False</td>\n", + " <td>NaN</td>\n", + " <td>True</td>\n", + " <td>ville</td>\n", + " </tr>\n", + " <tr>\n", + " <th>41</th>\n", + " <td>volume02-3250</td>\n", + " <td>2</td>\n", + " <td>3250</td>\n", + " <td>Bourg-en-Bresse, (Géog.) ville de France, capi...</td>\n", + " <td>Bourg-en-Bresse</td>\n", + " <td>Géographie</td>\n", + " <td>unsigned</td>\n", + " <td>27</td>\n", + " <td>4</td>\n", + " <td>2</td>\n", + " <td>...</td>\n", + " <td>3</td>\n", + " <td>3</td>\n", + " <td>0</td>\n", + " <td>3</td>\n", + " <td>2</td>\n", + " <td>ville</td>\n", + " <td>True</td>\n", + " <td>NaN</td>\n", + " <td>True</td>\n", + " <td>ville</td>\n", + " </tr>\n", + " <tr>\n", + " <th>42</th>\n", + " <td>volume09-2586</td>\n", + " <td>9</td>\n", + " <td>2586</td>\n", + " <td>LIMOURS, (Géog.) petite ville de France dans l...</td>\n", + " <td>LIMOURS</td>\n", + " <td>Géographie</td>\n", + " <td>unsigned</td>\n", + " <td>26</td>\n", + " <td>5</td>\n", + " <td>4</td>\n", + " <td>...</td>\n", + " <td>3</td>\n", + " <td>2</td>\n", + " <td>1</td>\n", + " <td>3</td>\n", + " <td>3</td>\n", + " <td>ville</td>\n", + " <td>True</td>\n", + " <td>NaN</td>\n", + " <td>True</td>\n", + " <td>ville</td>\n", + " </tr>\n", + " <tr>\n", + " <th>43</th>\n", + " <td>volume17-575</td>\n", + " <td>17</td>\n", + " <td>575</td>\n", + " <td>VEUDRE, (Géog. mod.) 
petite ville ou bourg de ...</td>\n", + " <td>VEUDRE</td>\n", + " <td>Géographie moderne</td>\n", + " <td>unsigned</td>\n", + " <td>23</td>\n", + " <td>4</td>\n", + " <td>4</td>\n", + " <td>...</td>\n", + " <td>1</td>\n", + " <td>1</td>\n", + " <td>0</td>\n", + " <td>1</td>\n", + " <td>1</td>\n", + " <td>ville</td>\n", + " <td>False</td>\n", + " <td>NaN</td>\n", + " <td>True</td>\n", + " <td>ville</td>\n", + " </tr>\n", + " <tr>\n", + " <th>44</th>\n", + " <td>volume10-3018</td>\n", + " <td>10</td>\n", + " <td>3018</td>\n", + " <td>MONT-TRICHARD, (Géog.) ancienne petite ville d...</td>\n", + " <td>MONT-TRICHARD</td>\n", + " <td>Géographie</td>\n", + " <td>unsigned</td>\n", + " <td>42</td>\n", + " <td>7</td>\n", + " <td>3</td>\n", + " <td>...</td>\n", + " <td>2</td>\n", + " <td>1</td>\n", + " <td>1</td>\n", + " <td>2</td>\n", + " <td>2</td>\n", + " <td>NaN</td>\n", + " <td>True</td>\n", + " <td>NaN</td>\n", + " <td>True</td>\n", + " <td>ville</td>\n", + " </tr>\n", + " <tr>\n", + " <th>45</th>\n", + " <td>volume14-4758</td>\n", + " <td>14</td>\n", + " <td>4758</td>\n", + " <td>SECLIN, (Géog. mod.) en latin moderne Sacilium...</td>\n", + " <td>SECLIN</td>\n", + " <td>Géographie moderne</td>\n", + " <td>unsigned</td>\n", + " <td>46</td>\n", + " <td>7</td>\n", + " <td>3</td>\n", + " <td>...</td>\n", + " <td>4</td>\n", + " <td>2</td>\n", + " <td>0</td>\n", + " <td>4</td>\n", + " <td>2</td>\n", + " <td>NaN</td>\n", + " <td>False</td>\n", + " <td>NaN</td>\n", + " <td>True</td>\n", + " <td>bourg</td>\n", + " </tr>\n", + " <tr>\n", + " <th>46</th>\n", + " <td>volume10-1805</td>\n", + " <td>10</td>\n", + " <td>1805</td>\n", + " <td>MERY-SUR-SEINE, (Géog.) 
petite ville de France...</td>\n", + " <td>MERY-SUR-SEINE</td>\n", + " <td>Géographie</td>\n", + " <td>unsigned</td>\n", + " <td>36</td>\n", + " <td>6</td>\n", + " <td>4</td>\n", + " <td>...</td>\n", + " <td>2</td>\n", + " <td>1</td>\n", + " <td>0</td>\n", + " <td>4</td>\n", + " <td>3</td>\n", + " <td>ville</td>\n", + " <td>True</td>\n", + " <td>NaN</td>\n", + " <td>True</td>\n", + " <td>ville</td>\n", + " </tr>\n", + " <tr>\n", + " <th>47</th>\n", + " <td>volume11-3255</td>\n", + " <td>11</td>\n", + " <td>3255</td>\n", + " <td>OUESSANT, (Géog. mod.) île de France dans l'Oc...</td>\n", + " <td>OUESSANT</td>\n", + " <td>Géographie moderne</td>\n", + " <td>unsigned</td>\n", + " <td>420</td>\n", + " <td>8</td>\n", + " <td>6</td>\n", + " <td>...</td>\n", + " <td>4</td>\n", + " <td>3</td>\n", + " <td>0</td>\n", + " <td>5</td>\n", + " <td>5</td>\n", + " <td>île</td>\n", + " <td>True</td>\n", + " <td>NaN</td>\n", + " <td>True</td>\n", + " <td>île</td>\n", + " </tr>\n", + " </tbody>\n", + "</table>\n", + "<p>48 rows × 21 columns</p>\n", + "</div>" + ], + "text/plain": [ + " filename volume number \\\n", + "0 volume02-1504 2 1504 \n", + "1 volume01-2599 1 2599 \n", + "2 volume01-1065 1 1065 \n", + "3 volume02-1391 2 1391 \n", + "4 volume01-4843 1 4843 \n", + "5 volume01-3831 1 3831 \n", + "6 volume01-5034 1 5034 \n", + "7 volume01-3079 1 3079 \n", + "8 volume02-363 2 363 \n", + "9 volume01-1279 1 1279 \n", + "10 volume02-705 2 705 \n", + "11 volume01-2810 1 2810 \n", + "12 volume02-1614 2 1614 \n", + "13 volume01-1565 1 1565 \n", + "14 volume01-5144 1 5144 \n", + "15 volume02-1564 2 1564 \n", + "16 volume15-2668 15 2668 \n", + "17 volume15-4368 15 4368 \n", + "18 volume07-2017 7 2017 \n", + "19 volume16-4274 16 4274 \n", + "20 volume10-381 10 381 \n", + "21 volume13-514 13 514 \n", + "22 volume12-3457 12 3457 \n", + "23 volume14-2693 14 2693 \n", + "24 volume17-1439 17 1439 \n", + "25 volume13-208 13 208 \n", + "26 volume09-524 9 524 \n", + "27 volume11-1060 11 1060 \n", + 
"28 volume12-2242 12 2242 \n", + "29 volume11-3735 11 3735 \n", + "30 volume14-2568 14 2568 \n", + "31 volume17-636 17 636 \n", + "32 volume02-6579 2 6579 \n", + "33 volume02-4075 2 4075 \n", + "34 volume02-4564 2 4564 \n", + "35 volume02-6305 2 6305 \n", + "36 volume03-3769 3 3769 \n", + "37 volume17-2104 17 2104 \n", + "38 volume04-2909 4 2909 \n", + "39 volume02-3396 2 3396 \n", + "40 volume03-2391 3 2391 \n", + "41 volume02-3250 2 3250 \n", + "42 volume09-2586 9 2586 \n", + "43 volume17-575 17 575 \n", + "44 volume10-3018 10 3018 \n", + "45 volume14-4758 14 4758 \n", + "46 volume10-1805 10 1805 \n", + "47 volume11-3255 11 3255 \n", + "\n", + " content \\\n", + "0 * BEIRE, (Géog.) petite ville de France, en Bo... \n", + "1 * ANDONVILLE, (Géog. mod.) ville de France, gé... \n", + "2 * Afrique, (Géog. mod.) petite ville de France... \n", + "3 * BEAUVOISIS ou BEAUVAISIS, (Géog.) petit pays... \n", + "4 * AUBETERRE (Géog.) ville de France, dans l'An... \n", + "5 * ARGENCES, (Géog.) bourg de France en basse N... \n", + "6 * AUNEAU (Géographie.) petite ville de France,... \n", + "7 * ANTRAIN ou ENTRAINS, (Géog. mod.) petite vil... \n", + "8 * BALLON (Géog.) ville de France, au diocese d... \n", + "9 * Aigle, (Géog.) petite ville de France dans l... \n", + "10 * BARENTON (Géog.) petite ville de France, dan... \n", + "11 * ANNONAY, (Géog. mod.) petite ville de France... \n", + "12 * BENAUGE, (Géog.) petite contrée de la Guienn... \n", + "13 * ALBI, (Géog.) ville de France, capitale de ... \n", + "14 * AUTUN, (Géog.) ville de France au duché de B... \n", + "15 * BELLE-ISLE, (Géog.) île de France à six lieu... \n", + "16 STRENGENBACH ou STRENGBACH, le, (Géog. mod.) r... \n", + "17 TARDÉNOIS, le (Géog. mod.) en latin du moyen â... \n", + "18 Germain-Laval, (Saint-) Géog. ville de France ... \n", + "19 Valence, (Géog. mod.) nos géographes disent pe... \n", + "20 MARCELLIN, S. (Géog.) petite ville de France e... \n", + "21 PORTO-CROS, (Géog. mod.) petite île de France ... 
\n", + "22 PLOERMEL, (Géog. mod.) petite ville de France ... \n", + "23 RUFFEC, (Géog. mod.) petite ville de France, d... \n", + "24 VODABLE, (Géog. mod.) bourg de France dans l'A... \n", + "25 PONTIVY, (Géog. mod.) petite ville de France, ... \n", + "26 KAYSERBERG, (Géog.) c'est-à -dire mont de l'emp... \n", + "27 Nogent-le-Rotrou, (Géog.) gros bourg de France... \n", + "28 PICARDIE, la, (Géog. mod.) province de France,... \n", + "29 Palais, (Géograph. mod.) petite place forte de... \n", + "30 ROUSSILLON, le, (Géog. mod.) en latin Ruscinon... \n", + "31 Vic-le-comte, (Géog. mod.) petite ville de Fra... \n", + "32 CERNIN, (Saint) Géog. petite ville de France, ... \n", + "33 Bruges, (Géog.) petite ville de France, dans l... \n", + "34 CADENAC, (Géog.) petite ville de France dans l... \n", + "35 CAYLAR, (le) Géog. petite ville de France, dan... \n", + "36 CONDOM, (Géog. mod.) ville de France en Gascog... \n", + "37 WASSELONNE, (Géog. mod.) bourg ou petite ville... \n", + "38 CUSSET, (Géog. mod.) petite ville de France en... \n", + "39 BOUTONNE, (Géog.) riviere de France, qui prend... \n", + "40 Clermont, (Géog. mod.) petite ville de France,... \n", + "41 Bourg-en-Bresse, (Géog.) ville de France, capi... \n", + "42 LIMOURS, (Géog.) petite ville de France dans l... \n", + "43 VEUDRE, (Géog. mod.) petite ville ou bourg de ... \n", + "44 MONT-TRICHARD, (Géog.) ancienne petite ville d... \n", + "45 SECLIN, (Géog. mod.) en latin moderne Sacilium... \n", + "46 MERY-SUR-SEINE, (Géog.) petite ville de France... \n", + "47 OUESSANT, (Géog. mod.) île de France dans l'Oc... 
\n", + "\n", + " headword normClass author nb Words \\\n", + "0 BEIRE Géographie Diderot 12 \n", + "1 ANDONVILLE Géographie moderne Diderot 12 \n", + "2 Afrique Géographie moderne Diderot 12 \n", + "3 BEAUVOISIS ou BEAUVAISIS Géographie Diderot 13 \n", + "4 AUBETERRE Géographie Diderot 17 \n", + "5 ARGENCES Géographie Diderot 17 \n", + "6 AUNEAU Géographie Diderot 16 \n", + "7 ANTRAIN ou ENTRAINS Géographie moderne Diderot 15 \n", + "8 BALLON Géographie Diderot 22 \n", + "9 Aigle Géographie Diderot 19 \n", + "10 BARENTON Géographie Diderot 20 \n", + "11 ANNONAY Géographie moderne Diderot 20 \n", + "12 BENAUGE Géographie Diderot 24 \n", + "13 ALBI Géographie Diderot 25 \n", + "14 AUTUN Géographie Diderot 27 \n", + "15 BELLE-ISLE Géographie Diderot 28 \n", + "16 STRENGENBACH ou STRENGBACH, le Géographie moderne Jaucourt 30 \n", + "17 TARDÉNOIS, le Géographie moderne Jaucourt 34 \n", + "18 Germain-Laval, (Saint-) Géographie Jaucourt 38 \n", + "19 Valence Géographie moderne Jaucourt 32 \n", + "20 MARCELLIN Géographie Jaucourt 55 \n", + "21 PORTO-CROS Géographie moderne Jaucourt 48 \n", + "22 PLOERMEL Géographie moderne Jaucourt 46 \n", + "23 RUFFEC Géographie moderne Jaucourt 47 \n", + "24 VODABLE Géographie moderne Jaucourt 76 \n", + "25 PONTIVY Géographie moderne Jaucourt 145 \n", + "26 KAYSERBERG Géographie Jaucourt 130 \n", + "27 Nogent-le-Rotrou Géographie Jaucourt 123 \n", + "28 PICARDIE, la Géographie moderne Jaucourt 432 \n", + "29 Palais Géographie moderne Jaucourt 515 \n", + "30 ROUSSILLON, le Géographie moderne Jaucourt 554 \n", + "31 Vic-le-comte Géographie moderne Jaucourt 274 \n", + "32 CERNIN (Saint) Géographie unsigned 10 \n", + "33 Bruges Géographie unsigned 14 \n", + "34 CADENAC Géographie unsigned 14 \n", + "35 CAYLAR (le) Géographie unsigned 12 \n", + "36 CONDOM Géographie moderne unsigned 19 \n", + "37 WASSELONNE Géographie moderne unsigned 19 \n", + "38 CUSSET Géographie moderne unsigned 15 \n", + "39 BOUTONNE Géographie unsigned 18 \n", + "40 
Clermont Géographie moderne unsigned 28 \n", + "41 Bourg-en-Bresse Géographie unsigned 27 \n", + "42 LIMOURS Géographie unsigned 26 \n", + "43 VEUDRE Géographie moderne unsigned 23 \n", + "44 MONT-TRICHARD Géographie unsigned 42 \n", + "45 SECLIN Géographie moderne unsigned 46 \n", + "46 MERY-SUR-SEINE Géographie unsigned 36 \n", + "47 OUESSANT Géographie moderne unsigned 420 \n", + "\n", + " nb EN nb Name EDDA ... nb ENE nb ENE Place nb ENE Person \\\n", + "0 4 4 ... 2 2 0 \n", + "1 4 4 ... 3 2 0 \n", + "2 4 4 ... 2 2 0 \n", + "3 4 4 ... 1 1 0 \n", + "4 4 2 ... 1 1 0 \n", + "5 3 3 ... 1 1 0 \n", + "6 4 4 ... 2 2 0 \n", + "7 5 5 ... 2 2 0 \n", + "8 4 4 ... 3 2 0 \n", + "9 5 5 ... 2 2 0 \n", + "10 5 5 ... 3 3 0 \n", + "11 3 2 ... 1 1 0 \n", + "12 5 5 ... 4 4 0 \n", + "13 4 4 ... 1 1 0 \n", + "14 4 3 ... 2 2 0 \n", + "15 4 3 ... 3 3 0 \n", + "16 6 3 ... 1 1 0 \n", + "17 7 6 ... 2 1 1 \n", + "18 4 2 ... 2 2 0 \n", + "19 5 3 ... 2 2 0 \n", + "20 9 5 ... 3 2 1 \n", + "21 5 4 ... 3 2 0 \n", + "22 5 4 ... 2 2 0 \n", + "23 4 3 ... 1 1 0 \n", + "24 8 7 ... 4 3 0 \n", + "25 15 10 ... 3 3 0 \n", + "26 17 7 ... 7 2 2 \n", + "27 17 7 ... 9 2 2 \n", + "28 60 35 ... 14 5 3 \n", + "29 42 17 ... 16 6 2 \n", + "30 54 35 ... 22 9 3 \n", + "31 19 9 ... 9 3 1 \n", + "32 3 3 ... 1 1 0 \n", + "33 4 2 ... 2 1 0 \n", + "34 4 3 ... 2 2 0 \n", + "35 3 3 ... 2 2 0 \n", + "36 5 4 ... 2 2 0 \n", + "37 4 2 ... 1 1 0 \n", + "38 3 3 ... 1 1 0 \n", + "39 4 3 ... 1 1 0 \n", + "40 7 5 ... 1 1 0 \n", + "41 4 2 ... 3 3 0 \n", + "42 5 4 ... 3 2 1 \n", + "43 4 4 ... 1 1 0 \n", + "44 7 3 ... 2 1 1 \n", + "45 7 3 ... 4 2 0 \n", + "46 6 4 ... 2 1 0 \n", + "47 8 6 ... 
4 3 0 \n", + "\n", + " nb EN geocoded nb EN EDDA geocoded type latlong latlong value \\\n", + "0 2 2 ville False NaN \n", + "1 3 3 ville False NaN \n", + "2 3 3 ville False NaN \n", + "3 2 2 pays False NaN \n", + "4 3 2 ville True NaN \n", + "5 2 2 ville True NaN \n", + "6 4 4 ville False NaN \n", + "7 3 3 ville False NaN \n", + "8 1 1 ville True NaN \n", + "9 4 4 ville False NaN \n", + "10 5 5 ville False NaN \n", + "11 2 2 ville True NaN \n", + "12 3 3 pays False NaN \n", + "13 2 2 ville True NaN \n", + "14 2 2 ville True NaN \n", + "15 3 3 île False NaN \n", + "16 4 2 hydronyme False NaN \n", + "17 2 2 NaN False NaN \n", + "18 2 2 ville True NaN \n", + "19 2 2 NaN False NaN \n", + "20 2 2 ville True NaN \n", + "21 3 2 île False NaN \n", + "22 4 4 ville True NaN \n", + "23 2 2 ville True NaN \n", + "24 6 6 ville True NaN \n", + "25 11 9 ville True NaN \n", + "26 6 4 NaN True NaN \n", + "27 4 3 ville True NaN \n", + "28 16 11 pays False NaN \n", + "29 8 7 NaN True NaN \n", + "30 16 14 NaN False NaN \n", + "31 8 4 ville True NaN \n", + "32 2 2 ville False NaN \n", + "33 3 2 ville False NaN \n", + "34 1 1 ville False NaN \n", + "35 1 1 ville False NaN \n", + "36 3 2 ville True NaN \n", + "37 3 2 ville False NaN \n", + "38 2 2 ville True NaN \n", + "39 2 2 hydronyme False NaN \n", + "40 4 3 ville False NaN \n", + "41 3 2 ville True NaN \n", + "42 3 3 ville True NaN \n", + "43 1 1 ville False NaN \n", + "44 2 2 NaN True NaN \n", + "45 4 2 NaN False NaN \n", + "46 4 3 ville True NaN \n", + "47 5 5 île True NaN \n", + "\n", + " de France classifieur de France \n", + "0 True ville \n", + "1 True ville \n", + "2 True ville \n", + "3 True pays \n", + "4 True ville \n", + "5 True bourg \n", + "6 True ville \n", + "7 True ville \n", + "8 True ville \n", + "9 True ville \n", + "10 True ville \n", + "11 True ville \n", + "12 True province \n", + "13 True ville \n", + "14 True ville \n", + "15 True île \n", + "16 True riviere \n", + "17 True pays \n", + "18 True ville \n", + 
"19 True ville \n", + "20 True ville \n", + "21 True île \n", + "22 True ville \n", + "23 True ville \n", + "24 True bourg \n", + "25 True ville \n", + "26 True ville \n", + "27 True bourg \n", + "28 True province \n", + "29 True place \n", + "30 True province \n", + "31 True ville \n", + "32 True ville \n", + "33 True ville \n", + "34 True ville \n", + "35 True ville \n", + "36 True ville \n", + "37 True bourg \n", + "38 True ville \n", + "39 True riviere \n", + "40 True ville \n", + "41 True ville \n", + "42 True ville \n", + "43 True ville \n", + "44 True ville \n", + "45 True bourg \n", + "46 True ville \n", + "47 True île \n", + "\n", + "[48 rows x 21 columns]" + ] + }, + "execution_count": 39, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "sample5" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### 5.3 Saving the results" + ] + }, + { + "cell_type": "code", + "execution_count": 40, + "metadata": {}, + "outputs": [], + "source": [ + "sample5.to_csv('../Data/FranceGEOArticles-Sample4-23.08.17.tsv', sep='\\t', index=False)\n", + "#sample10.to_csv('../Data/FranceGEOArticles-Sample10-21.08.17.tsv', sep='\\t', index=False)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## The 100 longest geography articles by Diderot and Jaucourt" + ] + }, + { + "cell_type": "code", + "execution_count": 55, + "metadata": {}, + "outputs": [], + "source": [ + "domaines_geographie = ['Géographie', 'Géographie moderne',\n", + " 'Géographie ancienne', 'Géographie moderne | Géographie ancienne',\n", + " 'Géographie ancienne | Géographie moderne', 'Géographie sacrée', 'Géographie sainte',\n", + " 'Géographie | Histoire ancienne', 'Géographie historique', 'Géographie | Histoire',\n", + " 'Histoire | Géographie', 'Géographie | Histoire naturelle', 'Géographie | Mythologie',\n", + " 
'Géographie ancienne | Mythologie', 'Histoire moderne | Géographie',\n", + " 'Géographie ancienne | Géographie sainte', 'Géographie ancienne | Géographie sacrée',\n", + " 'Géographie sacrée | Géographie ancienne', 'Géographie du moyen âge', 'Géographie des Arabes',\n", + " 'Géographie | Commerce', 'Histoire | Géographie ancienne',\n", + " 'Géographie | Histoire ancienne | Histoire moderne', 'Géographie ancienne | Littérature | Histoire',\n", + " 'Histoire naturelle | Géographie', 'Géographie | Histoire ancienne | Mythologie',\n", + " 'Géographie moderne | Commerce', 'Géographie ancienne | Géographie antique',\n", + " 'Géographie moderne | Histoire', 'Géographie | Histoire monastique',\n", + " 'Géographie ancienne | Géographie moderne | Mythologie', 'Géographie ancienne | Histoire',\n", + " 'Géographie ancienne | Littérature | Mythologie', 'Géographie ancienne | Médailles'\n", + " ]" + ] + }, + { + "cell_type": "code", + "execution_count": 56, + "metadata": {}, + "outputs": [], + "source": [ + "boolean_series = data.normClass.isin(domaines_geographie)\n", + "filtered_df = data[boolean_series]" + ] + }, + { + "cell_type": "code", + "execution_count": 57, + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "(14452, 19)" + ] + }, + "execution_count": 57, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "filtered_df.shape" + ] + }, + { + "cell_type": "code", + "execution_count": 58, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "<div>\n", + "<style scoped>\n", + " .dataframe tbody tr th:only-of-type {\n", + " vertical-align: middle;\n", + " }\n", + "\n", + " .dataframe tbody tr th {\n", + " vertical-align: top;\n", + " }\n", + "\n", + " .dataframe thead th {\n", + " text-align: right;\n", + " }\n", + "</style>\n", + "<table border=\"1\" class=\"dataframe\">\n", + " <thead>\n", + " <tr style=\"text-align: right;\">\n", + " <th></th>\n", + " <th>filename</th>\n", + " <th>volume</th>\n", + " 
<th>number</th>\n", + " <th>content</th>\n", + " <th>headword</th>\n", + " <th>normClass</th>\n", + " <th>author</th>\n", + " <th>nb Words</th>\n", + " <th>nb EN</th>\n", + " <th>nb Name EDDA</th>\n", + " <th>nb Person</th>\n", + " <th>nb ENE</th>\n", + " <th>nb ENE Place</th>\n", + " <th>nb ENE Person</th>\n", + " <th>nb EN geocoded</th>\n", + " <th>nb EN EDDA geocoded</th>\n", + " <th>type</th>\n", + " <th>latlong</th>\n", + " <th>latlong value</th>\n", + " </tr>\n", + " </thead>\n", + " <tbody>\n", + " <tr>\n", + " <th>1578</th>\n", + " <td>volume01-350</td>\n", + " <td>1</td>\n", + " <td>350</td>\n", + " <td>* Abyde, (Géog. anc.) ville d'Egypte.</td>\n", + " <td>Abyde</td>\n", + " <td>Géographie ancienne</td>\n", + " <td>Diderot</td>\n", + " <td>6</td>\n", + " <td>2</td>\n", + " <td>2</td>\n", + " <td>0</td>\n", + " <td>1</td>\n", + " <td>1</td>\n", + " <td>0</td>\n", + " <td>0</td>\n", + " <td>0</td>\n", + " <td>ville</td>\n", + " <td>False</td>\n", + " <td>NaN</td>\n", + " </tr>\n", + " <tr>\n", + " <th>12894</th>\n", + " <td>volume01-539</td>\n", + " <td>1</td>\n", + " <td>539</td>\n", + " <td>ACÉ, s. f. (Geog. anc.) ville de Phénicie. Voy...</td>\n", + " <td>ACÉ</td>\n", + " <td>Géographie ancienne</td>\n", + " <td>unsigned</td>\n", + " <td>10</td>\n", + " <td>2</td>\n", + " <td>1</td>\n", + " <td>0</td>\n", + " <td>1</td>\n", + " <td>1</td>\n", + " <td>0</td>\n", + " <td>0</td>\n", + " <td>0</td>\n", + " <td>ville</td>\n", + " <td>False</td>\n", + " <td>NaN</td>\n", + " </tr>\n", + " <tr>\n", + " <th>20659</th>\n", + " <td>volume01-561</td>\n", + " <td>1</td>\n", + " <td>561</td>\n", + " <td>* ACHAIE, s. m. (Geog. anc.) 
C'est le nom d'un...</td>\n", + " <td>ACHAIE</td>\n", + " <td>Géographie ancienne</td>\n", + " <td>Diderot</td>\n", + " <td>46</td>\n", + " <td>7</td>\n", + " <td>5</td>\n", + " <td>0</td>\n", + " <td>3</td>\n", + " <td>3</td>\n", + " <td>0</td>\n", + " <td>0</td>\n", + " <td>0</td>\n", + " <td>NaN</td>\n", + " <td>False</td>\n", + " <td>NaN</td>\n", + " </tr>\n", + " <tr>\n", + " <th>17606</th>\n", + " <td>volume01-581</td>\n", + " <td>1</td>\n", + " <td>581</td>\n", + " <td>* ACHERON, s. m. (Géog. anc. & Myth.) C'étoit ...</td>\n", + " <td>ACHERON</td>\n", + " <td>Géographie ancienne | Mythologie</td>\n", + " <td>Diderot</td>\n", + " <td>49</td>\n", + " <td>7</td>\n", + " <td>6</td>\n", + " <td>0</td>\n", + " <td>3</td>\n", + " <td>3</td>\n", + " <td>0</td>\n", + " <td>0</td>\n", + " <td>0</td>\n", + " <td>NaN</td>\n", + " <td>False</td>\n", + " <td>NaN</td>\n", + " </tr>\n", + " <tr>\n", + " <th>2407</th>\n", + " <td>volume01-582</td>\n", + " <td>1</td>\n", + " <td>582</td>\n", + " <td>* ACHERUSE, s. f. (Géog. Hist. anc. & Myth.) l...</td>\n", + " <td>ACHERUSE</td>\n", + " <td>Géographie | Histoire ancienne | Mythologie</td>\n", + " <td>Diderot</td>\n", + " <td>112</td>\n", + " <td>7</td>\n", + " <td>4</td>\n", + " <td>1</td>\n", + " <td>1</td>\n", + " <td>1</td>\n", + " <td>0</td>\n", + " <td>0</td>\n", + " <td>0</td>\n", + " <td>hydronyme</td>\n", + " <td>False</td>\n", + " <td>NaN</td>\n", + " </tr>\n", + " </tbody>\n", + "</table>\n", + "</div>" + ], + "text/plain": [ + " filename volume number \\\n", + "1578 volume01-350 1 350 \n", + "12894 volume01-539 1 539 \n", + "20659 volume01-561 1 561 \n", + "17606 volume01-581 1 581 \n", + "2407 volume01-582 1 582 \n", + "\n", + " content headword \\\n", + "1578 * Abyde, (Géog. anc.) ville d'Egypte. Abyde \n", + "12894 ACÉ, s. f. (Geog. anc.) ville de Phénicie. Voy... ACÉ \n", + "20659 * ACHAIE, s. m. (Geog. anc.) C'est le nom d'un... ACHAIE \n", + "17606 * ACHERON, s. m. (Géog. anc. & Myth.) C'étoit ... 
ACHERON \n", + "2407 * ACHERUSE, s. f. (Géog. Hist. anc. & Myth.) l... ACHERUSE \n", + "\n", + " normClass author nb Words nb EN \\\n", + "1578 Géographie ancienne Diderot 6 2 \n", + "12894 Géographie ancienne unsigned 10 2 \n", + "20659 Géographie ancienne Diderot 46 7 \n", + "17606 Géographie ancienne | Mythologie Diderot 49 7 \n", + "2407 Géographie | Histoire ancienne | Mythologie Diderot 112 7 \n", + "\n", + " nb Name EDDA nb Person nb ENE nb ENE Place nb ENE Person \\\n", + "1578 2 0 1 1 0 \n", + "12894 1 0 1 1 0 \n", + "20659 5 0 3 3 0 \n", + "17606 6 0 3 3 0 \n", + "2407 4 1 1 1 0 \n", + "\n", + " nb EN geocoded nb EN EDDA geocoded type latlong latlong value \n", + "1578 0 0 ville False NaN \n", + "12894 0 0 ville False NaN \n", + "20659 0 0 NaN False NaN \n", + "17606 0 0 NaN False NaN \n", + "2407 0 0 hydronyme False NaN " + ] + }, + "execution_count": 58, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "filtered_df.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 59, + "metadata": {}, + "outputs": [], + "source": [ + "data_diderot = filtered_df[(filtered_df['author'] == 'Diderot')]\n", + "data_jaucourt = filtered_df[(filtered_df['author'] == 'Jaucourt')]" + ] + }, + { + "cell_type": "code", + "execution_count": 60, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "(1250, 19)\n", + "(8287, 19)\n" + ] + } + ], + "source": [ + "print(data_diderot.shape)\n", + "print(data_jaucourt.shape)" + ] + }, + { + "cell_type": "code", + "execution_count": 11, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "<div>\n", + "<style scoped>\n", + " .dataframe tbody tr th:only-of-type {\n", + " vertical-align: middle;\n", + " }\n", + "\n", + " .dataframe tbody tr th {\n", + " vertical-align: top;\n", + " }\n", + "\n", + " .dataframe thead th {\n", + " text-align: right;\n", + " }\n", + "</style>\n", + "<table border=\"1\" class=\"dataframe\">\n", + " <thead>\n", + 
" <tr style=\"text-align: right;\">\n", + " <th></th>\n", + " <th>filename</th>\n", + " <th>volume</th>\n", + " <th>number</th>\n", + " <th>content</th>\n", + " <th>headword</th>\n", + " <th>normClass</th>\n", + " <th>author</th>\n", + " <th>nb Words</th>\n", + " <th>nb EN</th>\n", + " <th>nb Name EDDA</th>\n", + " <th>nb Person</th>\n", + " <th>nb ENE</th>\n", + " <th>nb ENE Place</th>\n", + " <th>nb ENE Person</th>\n", + " <th>nb EN geocoded</th>\n", + " <th>nb EN EDDA geocoded</th>\n", + " <th>type</th>\n", + " <th>latlong</th>\n", + " <th>latlong value</th>\n", + " </tr>\n", + " </thead>\n", + " <tbody>\n", + " <tr>\n", + " <th>913</th>\n", + " <td>volume01-350</td>\n", + " <td>1</td>\n", + " <td>350</td>\n", + " <td>* Abyde, (Géog. anc.) ville d'Egypte.</td>\n", + " <td>Abyde</td>\n", + " <td>Géographie ancienne</td>\n", + " <td>Diderot</td>\n", + " <td>6</td>\n", + " <td>2</td>\n", + " <td>2</td>\n", + " <td>0</td>\n", + " <td>1</td>\n", + " <td>1</td>\n", + " <td>0</td>\n", + " <td>1</td>\n", + " <td>1</td>\n", + " <td>ville</td>\n", + " <td>False</td>\n", + " <td>NaN</td>\n", + " </tr>\n", + " <tr>\n", + " <th>11935</th>\n", + " <td>volume01-561</td>\n", + " <td>1</td>\n", + " <td>561</td>\n", + " <td>* ACHAIE, s. m. (Geog. anc.) C'est le nom d'un...</td>\n", + " <td>ACHAIE</td>\n", + " <td>Géographie ancienne</td>\n", + " <td>Diderot</td>\n", + " <td>46</td>\n", + " <td>7</td>\n", + " <td>5</td>\n", + " <td>0</td>\n", + " <td>3</td>\n", + " <td>3</td>\n", + " <td>0</td>\n", + " <td>1</td>\n", + " <td>1</td>\n", + " <td>NaN</td>\n", + " <td>False</td>\n", + " <td>NaN</td>\n", + " </tr>\n", + " <tr>\n", + " <th>10142</th>\n", + " <td>volume01-581</td>\n", + " <td>1</td>\n", + " <td>581</td>\n", + " <td>* ACHERON, s. m. (Géog. anc. & Myth.) 
C'étoit ...</td>\n", + " <td>ACHERON</td>\n", + " <td>Géographie ancienne | Mythologie</td>\n", + " <td>Diderot</td>\n", + " <td>49</td>\n", + " <td>7</td>\n", + " <td>6</td>\n", + " <td>0</td>\n", + " <td>3</td>\n", + " <td>3</td>\n", + " <td>0</td>\n", + " <td>3</td>\n", + " <td>2</td>\n", + " <td>NaN</td>\n", + " <td>False</td>\n", + " <td>NaN</td>\n", + " </tr>\n", + " <tr>\n", + " <th>1400</th>\n", + " <td>volume01-582</td>\n", + " <td>1</td>\n", + " <td>582</td>\n", + " <td>* ACHERUSE, s. f. (Géog. Hist. anc. & Myth.) l...</td>\n", + " <td>ACHERUSE</td>\n", + " <td>Géographie | Histoire ancienne | Mythologie</td>\n", + " <td>Diderot</td>\n", + " <td>112</td>\n", + " <td>7</td>\n", + " <td>4</td>\n", + " <td>1</td>\n", + " <td>1</td>\n", + " <td>1</td>\n", + " <td>0</td>\n", + " <td>1</td>\n", + " <td>1</td>\n", + " <td>hydronyme</td>\n", + " <td>False</td>\n", + " <td>NaN</td>\n", + " </tr>\n", + " <tr>\n", + " <th>14193</th>\n", + " <td>volume01-591</td>\n", + " <td>1</td>\n", + " <td>591</td>\n", + " <td>* ACHILLEA, s. f. (Géog. anc.) Isle du Pont-Eu...</td>\n", + " <td>ACHILLEA</td>\n", + " <td>Géographie ancienne</td>\n", + " <td>Diderot</td>\n", + " <td>19</td>\n", + " <td>4</td>\n", + " <td>2</td>\n", + " <td>1</td>\n", + " <td>0</td>\n", + " <td>0</td>\n", + " <td>0</td>\n", + " <td>0</td>\n", + " <td>0</td>\n", + " <td>NaN</td>\n", + " <td>False</td>\n", + " <td>NaN</td>\n", + " </tr>\n", + " </tbody>\n", + "</table>\n", + "</div>" + ], + "text/plain": [ + " filename volume number \\\n", + "913 volume01-350 1 350 \n", + "11935 volume01-561 1 561 \n", + "10142 volume01-581 1 581 \n", + "1400 volume01-582 1 582 \n", + "14193 volume01-591 1 591 \n", + "\n", + " content headword \\\n", + "913 * Abyde, (Géog. anc.) ville d'Egypte. Abyde \n", + "11935 * ACHAIE, s. m. (Geog. anc.) C'est le nom d'un... ACHAIE \n", + "10142 * ACHERON, s. m. (Géog. anc. & Myth.) C'étoit ... ACHERON \n", + "1400 * ACHERUSE, s. f. (Géog. Hist. anc. & Myth.) l... 
ACHERUSE \n", + "14193 * ACHILLEA, s. f. (Géog. anc.) Isle du Pont-Eu... ACHILLEA \n", + "\n", + " normClass author nb Words nb EN \\\n", + "913 Géographie ancienne Diderot 6 2 \n", + "11935 Géographie ancienne Diderot 46 7 \n", + "10142 Géographie ancienne | Mythologie Diderot 49 7 \n", + "1400 Géographie | Histoire ancienne | Mythologie Diderot 112 7 \n", + "14193 Géographie ancienne Diderot 19 4 \n", + "\n", + " nb Name EDDA nb Person nb ENE nb ENE Place nb ENE Person \\\n", + "913 2 0 1 1 0 \n", + "11935 5 0 3 3 0 \n", + "10142 6 0 3 3 0 \n", + "1400 4 1 1 1 0 \n", + "14193 2 1 0 0 0 \n", + "\n", + " nb EN geocoded nb EN EDDA geocoded type latlong latlong value \n", + "913 1 1 ville False NaN \n", + "11935 1 1 NaN False NaN \n", + "10142 3 2 NaN False NaN \n", + "1400 1 1 hydronyme False NaN \n", + "14193 0 0 NaN False NaN " + ] + }, + "execution_count": 11, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "data_diderot.head()" + ] + }, + { + "cell_type": "code", + "execution_count": 32, + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "# Sort by article length (word count), longest first\n", + "\n", + "sample_long_diderot = data_diderot.sort_values(by=['nb Words'], ascending=False)\n", + "\n", + "sample_long_jaucourt = data_jaucourt.sort_values(by=['nb Words'], ascending=False)\n" + ] + }, + { + "cell_type": "code", + "execution_count": 34, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "<div>\n", + "<style scoped>\n", + " .dataframe tbody tr th:only-of-type {\n", + " vertical-align: middle;\n", + " }\n", + "\n", + " .dataframe tbody tr th {\n", + " vertical-align: top;\n", + " }\n", + "\n", + " .dataframe thead th {\n", + " text-align: right;\n", + " }\n", + "</style>\n", + "<table border=\"1\" class=\"dataframe\">\n", + " <thead>\n", + " <tr style=\"text-align: right;\">\n", + " <th></th>\n", + " <th>filename</th>\n", + " <th>volume</th>\n", + " <th>number</th>\n", + " <th>content</th>\n", + " <th>headword</th>\n", + " 
<th>normClass</th>\n", + " <th>author</th>\n", + " <th>nb Words</th>\n", + " <th>nb EN</th>\n", + " <th>nb Name EDDA</th>\n", + " <th>nb Person</th>\n", + " <th>nb ENE</th>\n", + " <th>nb ENE Place</th>\n", + " <th>nb ENE Person</th>\n", + " <th>nb EN geocoded</th>\n", + " <th>nb EN EDDA geocoded</th>\n", + " <th>type</th>\n", + " <th>latlong</th>\n", + " <th>latlong value</th>\n", + " </tr>\n", + " </thead>\n", + " <tbody>\n", + " <tr>\n", + " <th>9498</th>\n", + " <td>volume02-2049</td>\n", + " <td>2</td>\n", + " <td>2049</td>\n", + " <td>* BILINLOKA, (Géog.) ville de Moldavie.</td>\n", + " <td>BILINLOKA</td>\n", + " <td>Géographie</td>\n", + " <td>Diderot</td>\n", + " <td>5</td>\n", + " <td>2</td>\n", + " <td>2</td>\n", + " <td>0</td>\n", + " <td>1</td>\n", + " <td>1</td>\n", + " <td>0</td>\n", + " <td>1</td>\n", + " <td>1</td>\n", + " <td>ville</td>\n", + " <td>False</td>\n", + " <td>NaN</td>\n", + " </tr>\n", + " <tr>\n", + " <th>8363</th>\n", + " <td>volume01-1638</td>\n", + " <td>1</td>\n", + " <td>1638</td>\n", + " <td>* ALEGRE, (Géog.) Voyez Allegre.</td>\n", + " <td>ALEGRE</td>\n", + " <td>Géographie</td>\n", + " <td>Diderot</td>\n", + " <td>4</td>\n", + " <td>2</td>\n", + " <td>1</td>\n", + " <td>0</td>\n", + " <td>0</td>\n", + " <td>0</td>\n", + " <td>0</td>\n", + " <td>2</td>\n", + " <td>1</td>\n", + " <td>NaN</td>\n", + " <td>False</td>\n", + " <td>NaN</td>\n", + " </tr>\n", + " <tr>\n", + " <th>7947</th>\n", + " <td>volume01-1637</td>\n", + " <td>1</td>\n", + " <td>1637</td>\n", + " <td>* ALEGRANIA, (Géog.) 
Voyez Allegrania.</td>\n", + " <td>ALEGRANIA</td>\n", + " <td>Géographie</td>\n", + " <td>Diderot</td>\n", + " <td>4</td>\n", + " <td>2</td>\n", + " <td>2</td>\n", + " <td>0</td>\n", + " <td>0</td>\n", + " <td>0</td>\n", + " <td>0</td>\n", + " <td>0</td>\n", + " <td>0</td>\n", + " <td>NaN</td>\n", + " <td>False</td>\n", + " <td>NaN</td>\n", + " </tr>\n", + " <tr>\n", + " <th>5877</th>\n", + " <td>volume01-3653</td>\n", + " <td>1</td>\n", + " <td>3653</td>\n", + " <td>* ARCÉE, (Géog.) Voyez Petra.</td>\n", + " <td>ARCÉE</td>\n", + " <td>Géographie</td>\n", + " <td>Diderot</td>\n", + " <td>4</td>\n", + " <td>2</td>\n", + " <td>2</td>\n", + " <td>0</td>\n", + " <td>0</td>\n", + " <td>0</td>\n", + " <td>0</td>\n", + " <td>1</td>\n", + " <td>1</td>\n", + " <td>NaN</td>\n", + " <td>False</td>\n", + " <td>NaN</td>\n", + " </tr>\n", + " <tr>\n", + " <th>10080</th>\n", + " <td>volume01-3501</td>\n", + " <td>1</td>\n", + " <td>3501</td>\n", + " <td>* ARACLEA. (Géog.) Voyez Héraclée.</td>\n", + " <td>ARACLEA</td>\n", + " <td>Géographie</td>\n", + " <td>Diderot</td>\n", + " <td>4</td>\n", + " <td>2</td>\n", + " <td>2</td>\n", + " <td>0</td>\n", + " <td>0</td>\n", + " <td>0</td>\n", + " <td>0</td>\n", + " <td>0</td>\n", + " <td>0</td>\n", + " <td>NaN</td>\n", + " <td>False</td>\n", + " <td>NaN</td>\n", + " </tr>\n", + " </tbody>\n", + "</table>\n", + "</div>" + ], + "text/plain": [ + " filename volume number content \\\n", + "9498 volume02-2049 2 2049 * BILINLOKA, (Géog.) ville de Moldavie. \n", + "8363 volume01-1638 1 1638 * ALEGRE, (Géog.) Voyez Allegre. \n", + "7947 volume01-1637 1 1637 * ALEGRANIA, (Géog.) Voyez Allegrania. \n", + "5877 volume01-3653 1 3653 * ARCÉE, (Géog.) Voyez Petra. \n", + "10080 volume01-3501 1 3501 * ARACLEA. (Géog.) Voyez Héraclée. 
\n", + "\n", + " headword normClass author nb Words nb EN nb Name EDDA \\\n", + "9498 BILINLOKA Géographie Diderot 5 2 2 \n", + "8363 ALEGRE Géographie Diderot 4 2 1 \n", + "7947 ALEGRANIA Géographie Diderot 4 2 2 \n", + "5877 ARCÉE Géographie Diderot 4 2 2 \n", + "10080 ARACLEA Géographie Diderot 4 2 2 \n", + "\n", + " nb Person nb ENE nb ENE Place nb ENE Person nb EN geocoded \\\n", + "9498 0 1 1 0 1 \n", + "8363 0 0 0 0 2 \n", + "7947 0 0 0 0 0 \n", + "5877 0 0 0 0 1 \n", + "10080 0 0 0 0 0 \n", + "\n", + " nb EN EDDA geocoded type latlong latlong value \n", + "9498 1 ville False NaN \n", + "8363 1 NaN False NaN \n", + "7947 0 NaN False NaN \n", + "5877 1 NaN False NaN \n", + "10080 0 NaN False NaN " + ] + }, + "execution_count": 34, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "sample_long_diderot.tail()" + ] + }, + { + "cell_type": "code", + "execution_count": 35, + "metadata": {}, + "outputs": [], + "source": [ + "\n", + "sample_long_diderot = sample_long_diderot.head(100)\n", + "sample_long_jaucourt = sample_long_jaucourt.head(100)" + ] + }, + { + "cell_type": "code", + "execution_count": 36, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "<div>\n", + "<style scoped>\n", + " .dataframe tbody tr th:only-of-type {\n", + " vertical-align: middle;\n", + " }\n", + "\n", + " .dataframe tbody tr th {\n", + " vertical-align: top;\n", + " }\n", + "\n", + " .dataframe thead th {\n", + " text-align: right;\n", + " }\n", + "</style>\n", + "<table border=\"1\" class=\"dataframe\">\n", + " <thead>\n", + " <tr style=\"text-align: right;\">\n", + " <th></th>\n", + " <th>filename</th>\n", + " <th>volume</th>\n", + " <th>number</th>\n", + " <th>content</th>\n", + " <th>headword</th>\n", + " <th>normClass</th>\n", + " <th>author</th>\n", + " <th>nb Words</th>\n", + " <th>nb EN</th>\n", + " <th>nb Name EDDA</th>\n", + " <th>nb Person</th>\n", + " <th>nb ENE</th>\n", + " <th>nb ENE Place</th>\n", + " <th>nb ENE Person</th>\n", 
+ " <th>nb EN geocoded</th>\n", + " <th>nb EN EDDA geocoded</th>\n", + " <th>type</th>\n", + " <th>latlong</th>\n", + " <th>latlong value</th>\n", + " </tr>\n", + " </thead>\n", + " <tbody>\n", + " <tr>\n", + " <th>11752</th>\n", + " <td>volume02-1783</td>\n", + " <td>2</td>\n", + " <td>1783</td>\n", + " <td>* BESANÇON, (Géog.) ville de France, capitale ...</td>\n", + " <td>BESANÇON</td>\n", + " <td>Géographie</td>\n", + " <td>Diderot</td>\n", + " <td>708</td>\n", + " <td>15</td>\n", + " <td>12</td>\n", + " <td>1</td>\n", + " <td>5</td>\n", + " <td>2</td>\n", + " <td>0</td>\n", + " <td>5</td>\n", + " <td>4</td>\n", + " <td>ville</td>\n", + " <td>True</td>\n", + " <td>NaN</td>\n", + " </tr>\n", + " <tr>\n", + " <th>2377</th>\n", + " <td>volume02-609</td>\n", + " <td>2</td>\n", + " <td>609</td>\n", + " <td>* BARBARIE, s. f. (Géog.) grande contrée d'Afr...</td>\n", + " <td>BARBARIE</td>\n", + " <td>Géographie</td>\n", + " <td>Diderot</td>\n", + " <td>570</td>\n", + " <td>44</td>\n", + " <td>36</td>\n", + " <td>1</td>\n", + " <td>7</td>\n", + " <td>6</td>\n", + " <td>0</td>\n", + " <td>14</td>\n", + " <td>11</td>\n", + " <td>pays</td>\n", + " <td>False</td>\n", + " <td>NaN</td>\n", + " </tr>\n", + " <tr>\n", + " <th>13331</th>\n", + " <td>volume01-5149</td>\n", + " <td>1</td>\n", + " <td>5149</td>\n", + " <td>* AUVERGNE (Géographie.) province de France d'...</td>\n", + " <td>AUVERGNE</td>\n", + " <td>Géographie</td>\n", + " <td>Diderot</td>\n", + " <td>459</td>\n", + " <td>44</td>\n", + " <td>29</td>\n", + " <td>0</td>\n", + " <td>5</td>\n", + " <td>5</td>\n", + " <td>0</td>\n", + " <td>24</td>\n", + " <td>16</td>\n", + " <td>pays</td>\n", + " <td>False</td>\n", + " <td>NaN</td>\n", + " </tr>\n", + " <tr>\n", + " <th>6697</th>\n", + " <td>volume01-3489</td>\n", + " <td>1</td>\n", + " <td>3489</td>\n", + " <td>* ARABIE, (Géog. anc. & mod.) 
pays considérabl...</td>\n", + " <td>ARABIE</td>\n", + " <td>Géographie moderne | Géographie ancienne</td>\n", + " <td>Diderot</td>\n", + " <td>459</td>\n", + " <td>54</td>\n", + " <td>35</td>\n", + " <td>2</td>\n", + " <td>10</td>\n", + " <td>5</td>\n", + " <td>0</td>\n", + " <td>18</td>\n", + " <td>16</td>\n", + " <td>pays</td>\n", + " <td>True</td>\n", + " <td>NaN</td>\n", + " </tr>\n", + " <tr>\n", + " <th>13426</th>\n", + " <td>volume02-1650</td>\n", + " <td>2</td>\n", + " <td>1650</td>\n", + " <td>* Benin, (Géog.) capitale du royaume de même n...</td>\n", + " <td>Benin</td>\n", + " <td>Géographie</td>\n", + " <td>Diderot</td>\n", + " <td>458</td>\n", + " <td>13</td>\n", + " <td>10</td>\n", + " <td>1</td>\n", + " <td>5</td>\n", + " <td>2</td>\n", + " <td>2</td>\n", + " <td>5</td>\n", + " <td>5</td>\n", + " <td>ville</td>\n", + " <td>True</td>\n", + " <td>NaN</td>\n", + " </tr>\n", + " <tr>\n", + " <th>...</th>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " </tr>\n", + " <tr>\n", + " <th>7824</th>\n", + " <td>volume02-1517</td>\n", + " <td>2</td>\n", + " <td>1517</td>\n", + " <td>* BELCASTRO, (Géog. anc. & mod.) 
ville d'Itali...</td>\n", + " <td>BELCASTRO</td>\n", + " <td>Géographie moderne | Géographie ancienne</td>\n", + " <td>Diderot</td>\n", + " <td>62</td>\n", + " <td>9</td>\n", + " <td>6</td>\n", + " <td>3</td>\n", + " <td>3</td>\n", + " <td>3</td>\n", + " <td>0</td>\n", + " <td>3</td>\n", + " <td>3</td>\n", + " <td>ville</td>\n", + " <td>True</td>\n", + " <td>NaN</td>\n", + " </tr>\n", + " <tr>\n", + " <th>11257</th>\n", + " <td>volume05-1431</td>\n", + " <td>5</td>\n", + " <td>1431</td>\n", + " <td>* EDESSE, s. f. (Géog. anc. & mod.) ville de l...</td>\n", + " <td>EDESSE</td>\n", + " <td>Géographie moderne | Géographie ancienne</td>\n", + " <td>Diderot</td>\n", + " <td>62</td>\n", + " <td>11</td>\n", + " <td>5</td>\n", + " <td>2</td>\n", + " <td>3</td>\n", + " <td>1</td>\n", + " <td>2</td>\n", + " <td>2</td>\n", + " <td>1</td>\n", + " <td>ville</td>\n", + " <td>False</td>\n", + " <td>NaN</td>\n", + " </tr>\n", + " <tr>\n", + " <th>6631</th>\n", + " <td>volume01-3754</td>\n", + " <td>1</td>\n", + " <td>3754</td>\n", + " <td>* ARDÉE, (Géog. anc. & Myth.) ville capitale d...</td>\n", + " <td>ARDÉE</td>\n", + " <td>Géographie ancienne | Mythologie</td>\n", + " <td>Diderot</td>\n", + " <td>61</td>\n", + " <td>6</td>\n", + " <td>4</td>\n", + " <td>1</td>\n", + " <td>2</td>\n", + " <td>1</td>\n", + " <td>0</td>\n", + " <td>2</td>\n", + " <td>2</td>\n", + " <td>ville</td>\n", + " <td>False</td>\n", + " <td>NaN</td>\n", + " </tr>\n", + " <tr>\n", + " <th>5248</th>\n", + " <td>volume02-1769</td>\n", + " <td>2</td>\n", + " <td>1769</td>\n", + " <td>* BERRI, (Géog.) 
province de France, avec titr...</td>\n", + " <td>BERRI</td>\n", + " <td>Géographie</td>\n", + " <td>Diderot</td>\n", + " <td>60</td>\n", + " <td>13</td>\n", + " <td>11</td>\n", + " <td>0</td>\n", + " <td>1</td>\n", + " <td>1</td>\n", + " <td>0</td>\n", + " <td>3</td>\n", + " <td>3</td>\n", + " <td>pays</td>\n", + " <td>False</td>\n", + " <td>NaN</td>\n", + " </tr>\n", + " <tr>\n", + " <th>8692</th>\n", + " <td>volume01-4577</td>\n", + " <td>1</td>\n", + " <td>4577</td>\n", + " <td>* ATACAMA, (Géog. mod.) port de mer, dans l'Am...</td>\n", + " <td>ATACAMA</td>\n", + " <td>Géographie moderne</td>\n", + " <td>Diderot</td>\n", + " <td>60</td>\n", + " <td>5</td>\n", + " <td>4</td>\n", + " <td>0</td>\n", + " <td>0</td>\n", + " <td>0</td>\n", + " <td>0</td>\n", + " <td>2</td>\n", + " <td>1</td>\n", + " <td>NaN</td>\n", + " <td>True</td>\n", + " <td>NaN</td>\n", + " </tr>\n", + " </tbody>\n", + "</table>\n", + "<p>100 rows × 19 columns</p>\n", + "</div>" + ], + "text/plain": [ + " filename volume number \\\n", + "11752 volume02-1783 2 1783 \n", + "2377 volume02-609 2 609 \n", + "13331 volume01-5149 1 5149 \n", + "6697 volume01-3489 1 3489 \n", + "13426 volume02-1650 2 1650 \n", + "... ... ... ... \n", + "7824 volume02-1517 2 1517 \n", + "11257 volume05-1431 5 1431 \n", + "6631 volume01-3754 1 3754 \n", + "5248 volume02-1769 2 1769 \n", + "8692 volume01-4577 1 4577 \n", + "\n", + " content headword \\\n", + "11752 * BESANÇON, (Géog.) ville de France, capitale ... BESANÇON \n", + "2377 * BARBARIE, s. f. (Géog.) grande contrée d'Afr... BARBARIE \n", + "13331 * AUVERGNE (Géographie.) province de France d'... AUVERGNE \n", + "6697 * ARABIE, (Géog. anc. & mod.) pays considérabl... ARABIE \n", + "13426 * Benin, (Géog.) capitale du royaume de même n... Benin \n", + "... ... ... \n", + "7824 * BELCASTRO, (Géog. anc. & mod.) ville d'Itali... BELCASTRO \n", + "11257 * EDESSE, s. f. (Géog. anc. & mod.) ville de l... EDESSE \n", + "6631 * ARDÉE, (Géog. anc. & Myth.) 
ville capitale d... ARDÉE \n", + "5248 * BERRI, (Géog.) province de France, avec titr... BERRI \n", + "8692 * ATACAMA, (Géog. mod.) port de mer, dans l'Am... ATACAMA \n", + "\n", + " normClass author nb Words nb EN \\\n", + "11752 Géographie Diderot 708 15 \n", + "2377 Géographie Diderot 570 44 \n", + "13331 Géographie Diderot 459 44 \n", + "6697 Géographie moderne | Géographie ancienne Diderot 459 54 \n", + "13426 Géographie Diderot 458 13 \n", + "... ... ... ... ... \n", + "7824 Géographie moderne | Géographie ancienne Diderot 62 9 \n", + "11257 Géographie moderne | Géographie ancienne Diderot 62 11 \n", + "6631 Géographie ancienne | Mythologie Diderot 61 6 \n", + "5248 Géographie Diderot 60 13 \n", + "8692 Géographie moderne Diderot 60 5 \n", + "\n", + " nb Name EDDA nb Person nb ENE nb ENE Place nb ENE Person \\\n", + "11752 12 1 5 2 0 \n", + "2377 36 1 7 6 0 \n", + "13331 29 0 5 5 0 \n", + "6697 35 2 10 5 0 \n", + "13426 10 1 5 2 2 \n", + "... ... ... ... ... ... \n", + "7824 6 3 3 3 0 \n", + "11257 5 2 3 1 2 \n", + "6631 4 1 2 1 0 \n", + "5248 11 0 1 1 0 \n", + "8692 4 0 0 0 0 \n", + "\n", + " nb EN geocoded nb EN EDDA geocoded type latlong latlong value \n", + "11752 5 4 ville True NaN \n", + "2377 14 11 pays False NaN \n", + "13331 24 16 pays False NaN \n", + "6697 18 16 pays True NaN \n", + "13426 5 5 ville True NaN \n", + "... ... ... ... ... ... 
\n", + "7824 3 3 ville True NaN \n", + "11257 2 1 ville False NaN \n", + "6631 2 2 ville False NaN \n", + "5248 3 3 pays False NaN \n", + "8692 2 1 NaN True NaN \n", + "\n", + "[100 rows x 19 columns]" + ] + }, + "execution_count": 36, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "sample_long_diderot.head(100)" + ] + }, + { + "cell_type": "code", + "execution_count": 61, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "174.03752865934595\n", + "33.66\n", + "64.0\n", + "21.0\n" + ] + } + ], + "source": [ + "print(data_jaucourt['nb Words'].mean())\n", + "print(data_diderot['nb Words'].mean())\n", + "\n", + "print(data_jaucourt['nb Words'].median())\n", + "print(data_diderot['nb Words'].median())" + ] + }, + { + "cell_type": "code", + "execution_count": 42, + "metadata": {}, + "outputs": [], + "source": [ + "sample_long_diderot.to_csv('../Data/ArticlesLongGeoDiderot-Sample100-21.10.11.tsv', sep='\\t', index=False)\n", + "sample_long_jaucourt.to_csv('../Data/ArticlesLongGeoJaucourt-Sample100-21.10.11.tsv', sep='\\t', index=False)\n", + "#sample10.to_csv('../Data/FranceGEOArticles-Sample10-21.08.17.tsv', sep='\\t', index=False)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + }, + { + "cell_type": "code", + "execution_count": 48, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "1\n", + "2\n", + "3\n" + ] + } + ], + "source": [] + }, + { + "cell_type": "code", + "execution_count": 62, + "metadata": {}, + "outputs": [], + "source": [ + "data_d = data[(data['author'] == 'Diderot')]\n", + "data_j = data[(data['author'] == 'Jaucourt')]" + ] + }, + { + "cell_type": "code", + "execution_count": 64, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "(5509, 19)\n", + "(17266, 19)\n" + ] + } + ], + "source": [ + "print(data_d.shape)\n", 
+ "print(data_j.shape)" + ] + }, + { + "cell_type": "code", + "execution_count": 65, + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "282.0755820688058\n", + "205.50172445089854\n", + "90.0\n", + "43.0\n" + ] + } + ], + "source": [ + "print(data_j['nb Words'].mean())\n", + "print(data_d['nb Words'].mean())\n", + "\n", + "print(data_j['nb Words'].median())\n", + "print(data_d['nb Words'].median())" + ] + }, + { + "cell_type": "code", + "execution_count": 69, + "metadata": {}, + "outputs": [], + "source": [ + "sample_long_d = data_d.sort_values(by=['nb Words'], ascending=False)\n", + "sample_long_j = data_j.sort_values(by=['nb Words'], ascending=False)" + ] + }, + { + "cell_type": "code", + "execution_count": 70, + "metadata": {}, + "outputs": [], + "source": [ + "sample_long_d = sample_long_d.head(100)\n", + "sample_long_j = sample_long_j.head(100)" + ] + }, + { + "cell_type": "code", + "execution_count": 72, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "<div>\n", + "<style scoped>\n", + " .dataframe tbody tr th:only-of-type {\n", + " vertical-align: middle;\n", + " }\n", + "\n", + " .dataframe tbody tr th {\n", + " vertical-align: top;\n", + " }\n", + "\n", + " .dataframe thead th {\n", + " text-align: right;\n", + " }\n", + "</style>\n", + "<table border=\"1\" class=\"dataframe\">\n", + " <thead>\n", + " <tr style=\"text-align: right;\">\n", + " <th></th>\n", + " <th>filename</th>\n", + " <th>volume</th>\n", + " <th>number</th>\n", + " <th>content</th>\n", + " <th>headword</th>\n", + " <th>normClass</th>\n", + " <th>author</th>\n", + " <th>nb Words</th>\n", + " <th>nb EN</th>\n", + " <th>nb Name EDDA</th>\n", + " <th>nb Person</th>\n", + " <th>nb ENE</th>\n", + " <th>nb ENE Place</th>\n", + " <th>nb ENE Person</th>\n", + " <th>nb EN geocoded</th>\n", + " <th>nb EN EDDA geocoded</th>\n", + " <th>type</th>\n", + " <th>latlong</th>\n", + " <th>latlong value</th>\n", + " </tr>\n", + " 
</thead>\n", + " <tbody>\n", + " <tr>\n", + " <th>14484</th>\n", + " <td>volume05-2355</td>\n", + " <td>5</td>\n", + " <td>2355</td>\n", + " <td>* ENCYCLOPÉDIE, s. f. (Philosoph.) Ce mot sign...</td>\n", + " <td>ENCYCLOPÉDIE</td>\n", + " <td>Philosophie</td>\n", + " <td>Diderot</td>\n", + " <td>36933</td>\n", + " <td>472</td>\n", + " <td>333</td>\n", + " <td>65</td>\n", + " <td>109</td>\n", + " <td>5</td>\n", + " <td>3</td>\n", + " <td>0</td>\n", + " <td>0</td>\n", + " <td>NaN</td>\n", + " <td>False</td>\n", + " <td>NaN</td>\n", + " </tr>\n", + " <tr>\n", + " <th>27187</th>\n", + " <td>volume09-118</td>\n", + " <td>9</td>\n", + " <td>118</td>\n", + " <td>* Juifs, Philosophie des, (Hist. de la Philoso...</td>\n", + " <td>Juifs, Philosophie des</td>\n", + " <td>Histoire de la philosophie</td>\n", + " <td>Diderot</td>\n", + " <td>34746</td>\n", + " <td>1295</td>\n", + " <td>623</td>\n", + " <td>228</td>\n", + " <td>349</td>\n", + " <td>30</td>\n", + " <td>18</td>\n", + " <td>0</td>\n", + " <td>0</td>\n", + " <td>NaN</td>\n", + " <td>False</td>\n", + " <td>NaN</td>\n", + " </tr>\n", + " <tr>\n", + " <th>15784</th>\n", + " <td>volume04-950</td>\n", + " <td>4</td>\n", + " <td>950</td>\n", + " <td>* Corderie, (Ord. encyclop. Entend. Mémoire. H...</td>\n", + " <td>Corderie</td>\n", + " <td>Corderie</td>\n", + " <td>Diderot</td>\n", + " <td>32333</td>\n", + " <td>434</td>\n", + " <td>334</td>\n", + " <td>18</td>\n", + " <td>120</td>\n", + " <td>4</td>\n", + " <td>1</td>\n", + " <td>0</td>\n", + " <td>0</td>\n", + " <td>NaN</td>\n", + " <td>False</td>\n", + " <td>NaN</td>\n", + " </tr>\n", + " <tr>\n", + " <th>14131</th>\n", + " <td>volume05-1220</td>\n", + " <td>5</td>\n", + " <td>1220</td>\n", + " <td>* ECLECTISME, s. m. (Hist. 
de la Philosophie a...</td>\n", + " <td>ECLECTISME</td>\n", + " <td>Histoire de la philosophie ancienne | Histoire...</td>\n", + " <td>Diderot</td>\n", + " <td>30178</td>\n", + " <td>735</td>\n", + " <td>356</td>\n", + " <td>70</td>\n", + " <td>212</td>\n", + " <td>16</td>\n", + " <td>6</td>\n", + " <td>0</td>\n", + " <td>0</td>\n", + " <td>NaN</td>\n", + " <td>False</td>\n", + " <td>NaN</td>\n", + " </tr>\n", + " <tr>\n", + " <th>130</th>\n", + " <td>volume03-709</td>\n", + " <td>3</td>\n", + " <td>709</td>\n", + " <td>* CHAPEAU, s. m. (Art méchan.) ce terme 2 deux...</td>\n", + " <td>CHAPEAU</td>\n", + " <td>Art méchanique</td>\n", + " <td>Diderot</td>\n", + " <td>19399</td>\n", + " <td>252</td>\n", + " <td>213</td>\n", + " <td>7</td>\n", + " <td>65</td>\n", + " <td>6</td>\n", + " <td>2</td>\n", + " <td>0</td>\n", + " <td>0</td>\n", + " <td>NaN</td>\n", + " <td>False</td>\n", + " <td>NaN</td>\n", + " </tr>\n", + " <tr>\n", + " <th>...</th>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " </tr>\n", + " <tr>\n", + " <th>24730</th>\n", + " <td>volume02-5076</td>\n", + " <td>2</td>\n", + " <td>5076</td>\n", + " <td>* CANAL ARTIFICIEL, (Hist. & Architecture.) 
li...</td>\n", + " <td>CANAL ARTIFICIEL</td>\n", + " <td>Histoire | Architecture</td>\n", + " <td>Diderot</td>\n", + " <td>1934</td>\n", + " <td>153</td>\n", + " <td>109</td>\n", + " <td>13</td>\n", + " <td>41</td>\n", + " <td>10</td>\n", + " <td>4</td>\n", + " <td>0</td>\n", + " <td>0</td>\n", + " <td>NaN</td>\n", + " <td>False</td>\n", + " <td>NaN</td>\n", + " </tr>\n", + " <tr>\n", + " <th>25356</th>\n", + " <td>volume08-1114</td>\n", + " <td>8</td>\n", + " <td>1114</td>\n", + " <td>* HIÉRARCHIE, s. f. (Hist. ecclésiast.) il se ...</td>\n", + " <td>HIÉRARCHIE</td>\n", + " <td>Histoire ecclésiastique</td>\n", + " <td>Diderot</td>\n", + " <td>1912</td>\n", + " <td>47</td>\n", + " <td>11</td>\n", + " <td>3</td>\n", + " <td>16</td>\n", + " <td>0</td>\n", + " <td>1</td>\n", + " <td>0</td>\n", + " <td>0</td>\n", + " <td>NaN</td>\n", + " <td>False</td>\n", + " <td>NaN</td>\n", + " </tr>\n", + " <tr>\n", + " <th>22363</th>\n", + " <td>volume02-4565</td>\n", + " <td>2</td>\n", + " <td>4565</td>\n", + " <td>* CADENAT, s. m. est une espece de petite serr...</td>\n", + " <td>CADENAT</td>\n", + " <td>unclassified</td>\n", + " <td>Diderot</td>\n", + " <td>1839</td>\n", + " <td>58</td>\n", + " <td>39</td>\n", + " <td>1</td>\n", + " <td>27</td>\n", + " <td>0</td>\n", + " <td>0</td>\n", + " <td>0</td>\n", + " <td>0</td>\n", + " <td>NaN</td>\n", + " <td>False</td>\n", + " <td>NaN</td>\n", + " </tr>\n", + " <tr>\n", + " <th>5752</th>\n", + " <td>volume03-774</td>\n", + " <td>3</td>\n", + " <td>774</td>\n", + " <td>* CHAR, s. m. (Hist. anc. & mod.) 
On donnoit a...</td>\n", + " <td>CHAR</td>\n", + " <td>Histoire ancienne | Histoire moderne</td>\n", + " <td>Diderot</td>\n", + " <td>1810</td>\n", + " <td>71</td>\n", + " <td>31</td>\n", + " <td>14</td>\n", + " <td>17</td>\n", + " <td>0</td>\n", + " <td>2</td>\n", + " <td>0</td>\n", + " <td>0</td>\n", + " <td>NaN</td>\n", + " <td>False</td>\n", + " <td>NaN</td>\n", + " </tr>\n", + " <tr>\n", + " <th>4041</th>\n", + " <td>volume03-2171</td>\n", + " <td>3</td>\n", + " <td>2171</td>\n", + " <td>* CITOYEN, s. m. (Hist. anc. mod. Droit publ.)...</td>\n", + " <td>CITOYEN</td>\n", + " <td>Droit public | Histoire moderne | Histoire anc...</td>\n", + " <td>Diderot</td>\n", + " <td>1758</td>\n", + " <td>48</td>\n", + " <td>22</td>\n", + " <td>9</td>\n", + " <td>13</td>\n", + " <td>2</td>\n", + " <td>0</td>\n", + " <td>0</td>\n", + " <td>0</td>\n", + " <td>NaN</td>\n", + " <td>False</td>\n", + " <td>NaN</td>\n", + " </tr>\n", + " </tbody>\n", + "</table>\n", + "<p>100 rows × 19 columns</p>\n", + "</div>" + ], + "text/plain": [ + " filename volume number \\\n", + "14484 volume05-2355 5 2355 \n", + "27187 volume09-118 9 118 \n", + "15784 volume04-950 4 950 \n", + "14131 volume05-1220 5 1220 \n", + "130 volume03-709 3 709 \n", + "... ... ... ... \n", + "24730 volume02-5076 2 5076 \n", + "25356 volume08-1114 8 1114 \n", + "22363 volume02-4565 2 4565 \n", + "5752 volume03-774 3 774 \n", + "4041 volume03-2171 3 2171 \n", + "\n", + " content \\\n", + "14484 * ENCYCLOPÉDIE, s. f. (Philosoph.) Ce mot sign... \n", + "27187 * Juifs, Philosophie des, (Hist. de la Philoso... \n", + "15784 * Corderie, (Ord. encyclop. Entend. Mémoire. H... \n", + "14131 * ECLECTISME, s. m. (Hist. de la Philosophie a... \n", + "130 * CHAPEAU, s. m. (Art méchan.) ce terme 2 deux... \n", + "... ... \n", + "24730 * CANAL ARTIFICIEL, (Hist. & Architecture.) li... \n", + "25356 * HIÉRARCHIE, s. f. (Hist. ecclésiast.) il se ... \n", + "22363 * CADENAT, s. m. est une espece de petite serr... 
\n", + "5752 * CHAR, s. m. (Hist. anc. & mod.) On donnoit a... \n", + "4041 * CITOYEN, s. m. (Hist. anc. mod. Droit publ.)... \n", + "\n", + " headword \\\n", + "14484 ENCYCLOPÉDIE \n", + "27187 Juifs, Philosophie des \n", + "15784 Corderie \n", + "14131 ECLECTISME \n", + "130 CHAPEAU \n", + "... ... \n", + "24730 CANAL ARTIFICIEL \n", + "25356 HIÉRARCHIE \n", + "22363 CADENAT \n", + "5752 CHAR \n", + "4041 CITOYEN \n", + "\n", + " normClass author nb Words \\\n", + "14484 Philosophie Diderot 36933 \n", + "27187 Histoire de la philosophie Diderot 34746 \n", + "15784 Corderie Diderot 32333 \n", + "14131 Histoire de la philosophie ancienne | Histoire... Diderot 30178 \n", + "130 Art méchanique Diderot 19399 \n", + "... ... ... ... \n", + "24730 Histoire | Architecture Diderot 1934 \n", + "25356 Histoire ecclésiastique Diderot 1912 \n", + "22363 unclassified Diderot 1839 \n", + "5752 Histoire ancienne | Histoire moderne Diderot 1810 \n", + "4041 Droit public | Histoire moderne | Histoire anc... Diderot 1758 \n", + "\n", + " nb EN nb Name EDDA nb Person nb ENE nb ENE Place nb ENE Person \\\n", + "14484 472 333 65 109 5 3 \n", + "27187 1295 623 228 349 30 18 \n", + "15784 434 334 18 120 4 1 \n", + "14131 735 356 70 212 16 6 \n", + "130 252 213 7 65 6 2 \n", + "... ... ... ... ... ... ... \n", + "24730 153 109 13 41 10 4 \n", + "25356 47 11 3 16 0 1 \n", + "22363 58 39 1 27 0 0 \n", + "5752 71 31 14 17 0 2 \n", + "4041 48 22 9 13 2 0 \n", + "\n", + " nb EN geocoded nb EN EDDA geocoded type latlong latlong value \n", + "14484 0 0 NaN False NaN \n", + "27187 0 0 NaN False NaN \n", + "15784 0 0 NaN False NaN \n", + "14131 0 0 NaN False NaN \n", + "130 0 0 NaN False NaN \n", + "... ... ... ... ... ... 
\n", + "24730 0 0 NaN False NaN \n", + "25356 0 0 NaN False NaN \n", + "22363 0 0 NaN False NaN \n", + "5752 0 0 NaN False NaN \n", + "4041 0 0 NaN False NaN \n", + "\n", + "[100 rows x 19 columns]" + ] + }, + "execution_count": 72, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "sample_long_d.head(100)" + ] + }, + { + "cell_type": "code", + "execution_count": 73, + "metadata": {}, + "outputs": [ + { + "data": { + "text/html": [ + "<div>\n", + "<style scoped>\n", + " .dataframe tbody tr th:only-of-type {\n", + " vertical-align: middle;\n", + " }\n", + "\n", + " .dataframe tbody tr th {\n", + " vertical-align: top;\n", + " }\n", + "\n", + " .dataframe thead th {\n", + " text-align: right;\n", + " }\n", + "</style>\n", + "<table border=\"1\" class=\"dataframe\">\n", + " <thead>\n", + " <tr style=\"text-align: right;\">\n", + " <th></th>\n", + " <th>filename</th>\n", + " <th>volume</th>\n", + " <th>number</th>\n", + " <th>content</th>\n", + " <th>headword</th>\n", + " <th>normClass</th>\n", + " <th>author</th>\n", + " <th>nb Words</th>\n", + " <th>nb EN</th>\n", + " <th>nb Name EDDA</th>\n", + " <th>nb Person</th>\n", + " <th>nb ENE</th>\n", + " <th>nb ENE Place</th>\n", + " <th>nb ENE Person</th>\n", + " <th>nb EN geocoded</th>\n", + " <th>nb EN EDDA geocoded</th>\n", + " <th>type</th>\n", + " <th>latlong</th>\n", + " <th>latlong value</th>\n", + " </tr>\n", + " </thead>\n", + " <tbody>\n", + " <tr>\n", + " <th>48722</th>\n", + " <td>volume11-4606</td>\n", + " <td>11</td>\n", + " <td>4606</td>\n", + " <td>PARIS, (Géog. mod.) 
ville capitale du royaume ...</td>\n", + " <td>PARIS</td>\n", + " <td>Géographie moderne</td>\n", + " <td>Jaucourt</td>\n", + " <td>20530</td>\n", + " <td>1195</td>\n", + " <td>435</td>\n", + " <td>284</td>\n", + " <td>547</td>\n", + " <td>151</td>\n", + " <td>47</td>\n", + " <td>0</td>\n", + " <td>0</td>\n", + " <td>ville</td>\n", + " <td>True</td>\n", + " <td>NaN</td>\n", + " </tr>\n", + " <tr>\n", + " <th>45431</th>\n", + " <td>volume10-1265</td>\n", + " <td>10</td>\n", + " <td>1265</td>\n", + " <td>MÉDECINE, s. f. (Art & Science.) La Médecine e...</td>\n", + " <td>MÉDECINE</td>\n", + " <td>Arts | Science</td>\n", + " <td>Jaucourt</td>\n", + " <td>19687</td>\n", + " <td>610</td>\n", + " <td>232</td>\n", + " <td>175</td>\n", + " <td>166</td>\n", + " <td>14</td>\n", + " <td>7</td>\n", + " <td>0</td>\n", + " <td>0</td>\n", + " <td>NaN</td>\n", + " <td>False</td>\n", + " <td>NaN</td>\n", + " </tr>\n", + " <tr>\n", + " <th>53811</th>\n", + " <td>volume16-3477</td>\n", + " <td>16</td>\n", + " <td>3477</td>\n", + " <td>TRIUMVIRAT, s. m. (Hist. rom.) c'est le nom la...</td>\n", + " <td>TRIUMVIRAT</td>\n", + " <td>Histoire romaine</td>\n", + " <td>Jaucourt</td>\n", + " <td>17669</td>\n", + " <td>865</td>\n", + " <td>269</td>\n", + " <td>259</td>\n", + " <td>255</td>\n", + " <td>25</td>\n", + " <td>17</td>\n", + " <td>0</td>\n", + " <td>0</td>\n", + " <td>NaN</td>\n", + " <td>False</td>\n", + " <td>NaN</td>\n", + " </tr>\n", + " <tr>\n", + " <th>52290</th>\n", + " <td>volume14-4631</td>\n", + " <td>14</td>\n", + " <td>4631</td>\n", + " <td>Sculpteurs anciens, (Sculpt. antiq.) 
comme les...</td>\n", + " <td>Sculpteurs anciens</td>\n", + " <td>Sculpture antique</td>\n", + " <td>Jaucourt</td>\n", + " <td>16692</td>\n", + " <td>1121</td>\n", + " <td>294</td>\n", + " <td>282</td>\n", + " <td>349</td>\n", + " <td>51</td>\n", + " <td>24</td>\n", + " <td>0</td>\n", + " <td>0</td>\n", + " <td>NaN</td>\n", + " <td>False</td>\n", + " <td>NaN</td>\n", + " </tr>\n", + " <tr>\n", + " <th>34775</th>\n", + " <td>volume12-1392</td>\n", + " <td>12</td>\n", + " <td>1392</td>\n", + " <td>Pere de l'Église, (Hist. ecclésiast.) on nomme...</td>\n", + " <td>Pere de l'Église</td>\n", + " <td>Histoire ecclésiastique</td>\n", + " <td>Jaucourt</td>\n", + " <td>13862</td>\n", + " <td>760</td>\n", + " <td>283</td>\n", + " <td>167</td>\n", + " <td>231</td>\n", + " <td>15</td>\n", + " <td>20</td>\n", + " <td>0</td>\n", + " <td>0</td>\n", + " <td>NaN</td>\n", + " <td>False</td>\n", + " <td>NaN</td>\n", + " </tr>\n", + " <tr>\n", + " <th>...</th>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " <td>...</td>\n", + " </tr>\n", + " <tr>\n", + " <th>70369</th>\n", + " <td>volume17-2067</td>\n", + " <td>17</td>\n", + " <td>2067</td>\n", + " <td>WANTAGE, (Géog. mod.) 
bourg à marché d'Anglete...</td>\n", + " <td>WANTAGE</td>\n", + " <td>Géographie moderne</td>\n", + " <td>Jaucourt</td>\n", + " <td>4143</td>\n", + " <td>171</td>\n", + " <td>66</td>\n", + " <td>36</td>\n", + " <td>51</td>\n", + " <td>7</td>\n", + " <td>5</td>\n", + " <td>0</td>\n", + " <td>0</td>\n", + " <td>ville</td>\n", + " <td>False</td>\n", + " <td>NaN</td>\n", + " </tr>\n", + " <tr>\n", + " <th>64126</th>\n", + " <td>volume17-1690</td>\n", + " <td>17</td>\n", + " <td>1690</td>\n", + " <td>VOORHOUT, (Géog. mod.) village de Hollande, su...</td>\n", + " <td>VOORHOUT</td>\n", + " <td>Géographie moderne</td>\n", + " <td>Jaucourt</td>\n", + " <td>4117</td>\n", + " <td>130</td>\n", + " <td>71</td>\n", + " <td>40</td>\n", + " <td>42</td>\n", + " <td>3</td>\n", + " <td>4</td>\n", + " <td>0</td>\n", + " <td>0</td>\n", + " <td>ville</td>\n", + " <td>False</td>\n", + " <td>NaN</td>\n", + " </tr>\n", + " <tr>\n", + " <th>54728</th>\n", + " <td>volume16-3376</td>\n", + " <td>16</td>\n", + " <td>3376</td>\n", + " <td>TRIOMPHE, (Hist. rom.) cérémonie & honneur ex...</td>\n", + " <td>TRIOMPHE</td>\n", + " <td>Histoire romaine</td>\n", + " <td>Jaucourt</td>\n", + " <td>4083</td>\n", + " <td>209</td>\n", + " <td>80</td>\n", + " <td>45</td>\n", + " <td>68</td>\n", + " <td>5</td>\n", + " <td>6</td>\n", + " <td>0</td>\n", + " <td>0</td>\n", + " <td>NaN</td>\n", + " <td>False</td>\n", + " <td>NaN</td>\n", + " </tr>\n", + " <tr>\n", + " <th>32813</th>\n", + " <td>volume10-880</td>\n", + " <td>10</td>\n", + " <td>880</td>\n", + " <td>MASQUE de théatre, (Hist. 
du théatre des ancie...</td>\n", + " <td>MASQUE de théatre</td>\n", + " <td>Histoire du théatre des anciens</td>\n", + " <td>Jaucourt</td>\n", + " <td>4079</td>\n", + " <td>107</td>\n", + " <td>24</td>\n", + " <td>17</td>\n", + " <td>25</td>\n", + " <td>4</td>\n", + " <td>2</td>\n", + " <td>0</td>\n", + " <td>0</td>\n", + " <td>NaN</td>\n", + " <td>False</td>\n", + " <td>NaN</td>\n", + " </tr>\n", + " <tr>\n", + " <th>58403</th>\n", + " <td>volume15-75</td>\n", + " <td>15</td>\n", + " <td>75</td>\n", + " <td>SENSITIVE, (Botan.) plante fort connue par la ...</td>\n", + " <td>SENSITIVE</td>\n", + " <td>Botanique</td>\n", + " <td>Jaucourt</td>\n", + " <td>4050</td>\n", + " <td>66</td>\n", + " <td>41</td>\n", + " <td>5</td>\n", + " <td>17</td>\n", + " <td>2</td>\n", + " <td>1</td>\n", + " <td>0</td>\n", + " <td>0</td>\n", + " <td>NaN</td>\n", + " <td>False</td>\n", + " <td>NaN</td>\n", + " </tr>\n", + " </tbody>\n", + "</table>\n", + "<p>100 rows × 19 columns</p>\n", + "</div>" + ], + "text/plain": [ + " filename volume number \\\n", + "48722 volume11-4606 11 4606 \n", + "45431 volume10-1265 10 1265 \n", + "53811 volume16-3477 16 3477 \n", + "52290 volume14-4631 14 4631 \n", + "34775 volume12-1392 12 1392 \n", + "... ... ... ... \n", + "70369 volume17-2067 17 2067 \n", + "64126 volume17-1690 17 1690 \n", + "54728 volume16-3376 16 3376 \n", + "32813 volume10-880 10 880 \n", + "58403 volume15-75 15 75 \n", + "\n", + " content headword \\\n", + "48722 PARIS, (Géog. mod.) ville capitale du royaume ... PARIS \n", + "45431 MÉDECINE, s. f. (Art & Science.) La Médecine e... MÉDECINE \n", + "53811 TRIUMVIRAT, s. m. (Hist. rom.) c'est le nom la... TRIUMVIRAT \n", + "52290 Sculpteurs anciens, (Sculpt. antiq.) comme les... Sculpteurs anciens \n", + "34775 Pere de l'Église, (Hist. ecclésiast.) on nomme... Pere de l'Église \n", + "... ... ... \n", + "70369 WANTAGE, (Géog. mod.) bourg à marché d'Anglete... WANTAGE \n", + "64126 VOORHOUT, (Géog. mod.) village de Hollande, su... 
VOORHOUT \n", + "54728 TRIOMPHE, (Hist. rom.) cérémonie & honneur ex... TRIOMPHE \n", + "32813 MASQUE de théatre, (Hist. du théatre des ancie... MASQUE de théatre \n", + "58403 SENSITIVE, (Botan.) plante fort connue par la ... SENSITIVE \n", + "\n", + " normClass author nb Words nb EN \\\n", + "48722 Géographie moderne Jaucourt 20530 1195 \n", + "45431 Arts | Science Jaucourt 19687 610 \n", + "53811 Histoire romaine Jaucourt 17669 865 \n", + "52290 Sculpture antique Jaucourt 16692 1121 \n", + "34775 Histoire ecclésiastique Jaucourt 13862 760 \n", + "... ... ... ... ... \n", + "70369 Géographie moderne Jaucourt 4143 171 \n", + "64126 Géographie moderne Jaucourt 4117 130 \n", + "54728 Histoire romaine Jaucourt 4083 209 \n", + "32813 Histoire du théatre des anciens Jaucourt 4079 107 \n", + "58403 Botanique Jaucourt 4050 66 \n", + "\n", + " nb Name EDDA nb Person nb ENE nb ENE Place nb ENE Person \\\n", + "48722 435 284 547 151 47 \n", + "45431 232 175 166 14 7 \n", + "53811 269 259 255 25 17 \n", + "52290 294 282 349 51 24 \n", + "34775 283 167 231 15 20 \n", + "... ... ... ... ... ... \n", + "70369 66 36 51 7 5 \n", + "64126 71 40 42 3 4 \n", + "54728 80 45 68 5 6 \n", + "32813 24 17 25 4 2 \n", + "58403 41 5 17 2 1 \n", + "\n", + " nb EN geocoded nb EN EDDA geocoded type latlong latlong value \n", + "48722 0 0 ville True NaN \n", + "45431 0 0 NaN False NaN \n", + "53811 0 0 NaN False NaN \n", + "52290 0 0 NaN False NaN \n", + "34775 0 0 NaN False NaN \n", + "... ... ... ... ... ... 
\n", + "70369 0 0 ville False NaN \n", + "64126 0 0 ville False NaN \n", + "54728 0 0 NaN False NaN \n", + "32813 0 0 NaN False NaN \n", + "58403 0 0 NaN False NaN \n", + "\n", + "[100 rows x 19 columns]" + ] + }, + "execution_count": 73, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "sample_long_j.head(100)" + ] + }, + { + "cell_type": "code", + "execution_count": 71, + "metadata": {}, + "outputs": [], + "source": [ + "sample_long_d.to_csv('../Data/ArticlesLongDiderot-Sample100-21.10.11.tsv', sep='\\t', index=False)\n", + "sample_long_j.to_csv('../Data/ArticlesLongJaucourt-Sample100-21.10.11.tsv', sep='\\t', index=False)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": {}, + "outputs": [], + "source": [] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.7.3" + } + }, + "nbformat": 4, + "nbformat_minor": 4 +}
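Note on the last cells of the notebook: the `sample_long_d` and `sample_long_j` frames are not random samples. They are the 100 longest articles per author, obtained by filtering on `author`, sorting on `nb Words` in descending order, and keeping the first 100 rows. The sketch below condenses those cells into one helper. The function name `longest_articles` and the toy frame are hypothetical and only for illustration; the column names `author`, `nb Words`, and `headword` are the ones shown in the notebook outputs.

```python
import pandas as pd


def longest_articles(data: pd.DataFrame, author: str, n: int = 100) -> pd.DataFrame:
    """Return the n longest articles (by 'nb Words') signed by `author`.

    Hypothetical helper: it mirrors the filter / sort_values / head
    sequence used in the notebook cells, nothing more.
    """
    by_author = data[data["author"] == author]
    return by_author.sort_values(by="nb Words", ascending=False).head(n)


# Toy frame standing in for the full `data` table (assumption: the real
# table has the same 'headword', 'author' and 'nb Words' columns).
toy = pd.DataFrame(
    {
        "headword": ["PARIS", "MÉDECINE", "BERRI", "ATACAMA", "ENCYCLOPÉDIE"],
        "author": ["Jaucourt", "Jaucourt", "Diderot", "Diderot", "Diderot"],
        "nb Words": [20530, 19687, 60, 60, 36933],
    }
)

# Longest Diderot articles in the toy frame (n kept small on purpose).
top_diderot = longest_articles(toy, "Diderot", n=2)

# Mean and median length per author, as printed in the notebook.
summary = toy.groupby("author")["nb Words"].agg(["mean", "median"])
```

On the real table, the equivalent of the final export cell would be `longest_articles(data, "Diderot").to_csv(path, sep="\t", index=False)`, with `path` chosen by the caller.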