title: Encoding the Specificities of Encyclopedias
author: Alice [Brenon]{.smallcaps} ^1,2^
institute:
	- ICAR, CNRS, UMR5191, 69342
	- Univ Lyon, INSA Lyon, CNRS, UCBL, LIRIS, UMR5205, F-69621
numbersections: True
documentclass: article
classoption:
	- english
	- a4paper
	- 12pt
mainfont: "Libertinus Serif"
header-includes:
	- \usepackage{textalpha}
	- \usepackage{hyperref}
	- \usepackage{geometry}
	- \geometry{margin=25.4mm}
	- \hypersetup{
	        colorlinks,
	        linkcolor = blue,
	        urlcolor = blue
	    }

\begin{center} {\small \textsuperscript{1} ICAR, CNRS, UMR5191, 69342}\\ {\small \textsuperscript{2} Univ Lyon, INSA Lyon, CNRS, UCBL, LIRIS, UMR5205, F-69621}\\ \end{center}

Abstract This chapter illustrates the fundamental differences between dictionaries and encyclopedias by documenting the process of devising an encoding scheme and applying it to a late-19^th^ century encyclopedia, "La Grande Encyclopédie" (henceforth LGE). The effort, made in the context of project DISCO-LGE, consisted in working from an OCRised version of the pages in XML-ALTO to produce a fully XML-TEI-compliant encoding of the individual articles. Although the TEI guidelines include a specialised module for dictionaries which was identified as a promising tool for the task, systematic traversal of the schema using graph search methods revealed some limitations when used to encode this text. These shortcomings are reviewed and illustrated on a series of examples. An alternative encoding remaining within the core module of TEI is then proposed and demonstrated on articles from LGE containing key features. Finally, different strategies followed by other projects are discussed.

Keywords digital humanities, XML-TEI, dictionaries, encyclopedias

Introduction

Although both terms have been used rather interchangeably over the past few centuries, a dichotomy is now commonly made between dictionaries and encyclopedias. A simple opposition can easily justify this distinction: dictionaries define words and tell one how to use them, while encyclopedias usually go into longer developments to give a more comprehensive and scientific understanding of the concept being defined. This common intuition links back to the entry written in the Encyclopédie ou Dictionnaire raisonné des sciences des arts et des métiers (henceforth EDdA) by @dalembert_dictionnaire_2022 [article DICTIONNAIRE, volume 4], who distinguishes three kinds of dictionaries: one to define words, the second to define facts and the last one to define things, corresponding respectively to language, history, and science and arts dictionaries. The first type corresponds to modern dictionaries while the two others are similar to what one expects to find in an encyclopedia.

However, d'Alembert himself doesn't think of these boundaries as very strict and he hints at the extreme difficulty in merely defining words without going into semantics and philosophical considerations:

un dictionnaire de langues, qui paroît n'être qu'un dictionnaire de mots, doit être souvent un dictionnaire de choses quand il est bien fait

("a language dictionary, which appears to be only a word dictionary, must often be a thing dictionary when it is made properly"). A similar criticism is made by @haiman_dictionaries_1980 [p. 331], who attacks no fewer than six criteria on which dictionaries and encyclopedias are generally opposed to reach the conclusion that there is no distinction between them because "dictionaries are encyclopedias". Regardless of the validity of his reasoning, it only proves one inclusion: that dictionaries may be a special case of encyclopedias. This, as will be shown, by no means implies that conversely encyclopedias are dictionaries.

XML-TEI is a set of guidelines, tools and training resources collectively developed by the @tei_consortium_tei_2023 to represent text in a highly structured and machine-readable format. Its toolbox has a modular structure consisting of optional parts each covering specific needs such as the physical features of a source document, the transcription of oral corpora or particular requirements for textual domains like poetry, or, in the case at hand, dictionaries. The intrinsic complexity of dictionaries has been well identified since the inception of the project [@tei_vault] and @ide_encoding_1995 underline the amount of work which went into the third version of the guidelines (P3) to provide a toolbox both general and expressive enough to account for the variety of conventions found in dictionaries. This module has been successfully used to encode both historical [@williams2017; @bohbot2018] and digitally native dictionaries [@bowers_bridging_2018]. In addition, a specific set of guidelines tailored to encoding dictionaries, named TEI-Lex0, has also been published [@banski_tei_lex0_2017].

The TEI effort is described by @ide_background_1998 as "first steps" to reach a standard to encode corpora and lay a common basis for corpora comparison and reuse. They point out some slight inconsistencies in the design, remark that there is generally more than one way to encode a given text in XML-TEI and identify nine criteria to design a sound standard. Their claims are backed by concrete examples of encoding situations but give no idea of the prevalence of the issues reported. In fact, the sheer complexity of the guidelines can make it hard to ascertain whether a particular element structure is impossible to represent (not finding a suitable encoding is not a proof that there is none). This chapter will use results from graph theory to make a systematic study of the possibilities and shortcomings of the TEI dictionaries module, hence providing an additional proof that encyclopedias are not dictionaries and that the inclusion claimed by Haiman is a strict one.

Context of the study

To give a better understanding of this research, this section describes the aims of the project from which it stems before giving a short history of the term encyclopedia and underlining the known differences between dictionaries and encyclopedias which constitute the starting point of this investigation.

CollEx-Persée Project DISCO-LGE

The project (https://www.collexpersee.eu/projet/disco-lge/) set out to study La Grande Encyclopédie, Inventaire raisonné des Sciences, des Lettres et des Arts par une Société de savants et de gens de lettres (henceforth LGE), an encyclopedia published in France between 1885 and 1902 by an organised team of over two hundred specialists divided into eleven sections. This text comprises 31 tomes of about 1200 pages each and, according to @jacquet-pfau2015 [pp. 88 et seq.], was the last major French encyclopedic endeavour directly inheriting from the prestigious ancestor that was the EDdA published by Diderot and d'Alembert 130 years earlier, between 1751 and 1772.

The aim of the project was to digitise LGE and make it available to the scientific community as well as the general public. A previous version of this encyclopedia was partially available on Gallica (https://gallica.bnf.fr/services/engine/search/sru?operation=searchRetrieve&collapsing=disabled&query=dc.relation%20all%20%22cb377013071%22) but was lacking in quality and its text had not been fully extracted from the pictures with an Optical Character Recognition (OCR) system. This prevented an exhaustive study of the text with textometry tools such as TXM [@heiden2010]. As a prelude to project GEODE (https://geode-project.github.io/), the goal of DISCO-LGE was to produce a digital version of LGE with a quality comparable to that of the EDdA provided by the ARTFL (http://artfl-project.uchicago.edu/) project in order to conduct a diachronic study of both encyclopedias.

Encyclopedia

Although the word "encyclopedia" is now part of everyday vocabulary and has a slightly different meaning from "dictionary", it was much more unusual and in fact controversial when Diderot and d'Alembert decided to use it in the title of their book, where they still had to coordinate both words. The EDdA is probably the most famous work of the genre and a symbol of the Age of Enlightenment.

The definition given by Furetière in his Dictionnaire Universel in 1690 is still close to its Greek etymology: a "ring of all knowledges", from κύκλος, "circle", and παιδεία, "knowledge". This meaning is the one used for instance by Rabelais in Pantagruel, when he has Thaumaste declare that Panurge opened to him "le vray puys et abisme de Encyclopedie" ("the true well and abyss of Encyclopedia"). At the time the word still mostly refers to the abstract concept of mastering all knowledges at once. Furetière adds that it is a quality one is unlikely to possess, and even seems to condemn its pursuit as a form of hubris: "C'est une témérité à un homme de vouloir posséder l'Encyclopédie" ("it is a recklessness for a man to want to possess Encyclopedia").

Beyond this moral reproach, the concept that pleased Rabelais was somewhat dated at the end of the 17^th^ century and attacked in the Dictionnaire Universel François et Latin, commonly referred to as the Dictionnaire de Trevoux, as utterly "burlesque" ("parodic"). The entry for "Encyclopédie" remained unchanged in the four editions issued between 1721 and 1752, mocking the use of the word and discouraging its readers from pursuing it. To that end, it quotes a poem from Pibrac encouraging people to specialise in only one discipline lest they should not reach perfection, based on an argumentation that resembles the saying "Jack of all trades, master of none". It is all the more interesting that the definition remained unaltered until 1752, one year after the publication of the first volume of the EDdA. The Jesuits who edited the Dictionnaire de Trevoux frowned upon the project of the EDdA, which they managed to get banned the same year by the Council of State on the charge of attempting to destroy the royal authority, inspiring rebellion and corrupting morality in general. There is much more at stake than words here, but the attempt to deprecate the word itself is part of their fight against the philosophers of the Enlightenment.

The attacks are not ignored by Diderot, who starts the very definition of the word "Encyclopédie" in the EDdA itself with a strong rebuttal. He directly dismisses the concerns expressed in the Dictionnaire de Trevoux as mere self-doubt that their authors should not generalise to anyone, then leaves the main point to a Latin quote by Chancellor Bacon [@lojkine2013, p. 5], who argues that a collaborative work can achieve much more than any talented man could: what may not be within reach of a single man within a single lifetime may be achieved by a common effort throughout generations.

History hints that Diderot's opponents took his defence of the feasibility of the project quite seriously, considering the fact that they got the EDdA's privileges revoked again six years after its publication was resumed [@moureau2001]. As a consequence, the remaining ten volumes containing the text of the articles had to be published illegally until 1765, thanks to the secret protection of Malesherbes who, despite being head of royal censorship, saved the manuscripts from destruction. They were printed secretly outside of Paris and the books were (falsely) labeled as coming from "Neufchâtel" (sic). Following the high demand from the booksellers who feared they would lose the money they had invested in the project, a special privilege was issued for the volumes containing the plates, which were released publicly from 1762 to 1772.

In any case, in their last edition in 1771 the authors of the Dictionnaire de Trevoux had no choice but to acknowledge the success of the encyclopedic projects of the 18^th^ century. In this version, the definition was entirely reworked, mildly stating that good encyclopedias are difficult to make because of the amount of knowledge necessary and work needed to keep up with scientific progress instead of calling the effort a parody. It credits Chambers' Cyclopædia for being a decent attempt before referring anonymously though quite explicitly to Diderot and d'Alembert's project by naming the collective "Une Société de gens de Lettres" and writing that it started in 1751. Even more importantly, two new entries were added after it: one for the adjective "encyclopédique" and another one for the noun "encyclopédiste", silently admitting how the project had changed its time and the relation to knowledge itself.

A different approach

If encyclopedias are thus historically more recent than dictionaries, they also depart from the latter in their approach. The purpose of dictionaries from their origin is to collect words, to make an exhaustive inventory of the terms used in a domain or in a language in order to associate a definition with them, be it a phrase explaining it or a translation in another language for a foreign language dictionary. As such, they are collections of signs and are more concerned with the linguistic level of things. Entries in a dictionary often feature information such as the part of speech, the pronunciation or the etymology of the word they define.

In the full title of the EDdA, the concept of encyclopedia is more or less equated by means of the coordinating conjunction "ou" to a Dictionnaire raisonné, "reasoned dictionary", introducing the idea that encyclopedias are dictionaries with some additional structure and a philosophical dimension.

Back to the "Encyclopédie" article one can read that a dictionary remaining strictly at the language level, a vocabulary, can be seen as the empty frame required for an encyclopedic dictionary which will fill it with additional depth. Given how d'Alembert insists on the importance of brevity for a clear definition in the "Dictionnaire de Langues" entry, it is clear that the encyclopédistes did not consider encyclopedias superior to dictionaries but really as a new subgenre departing from them in terms of purpose.

The dictionaries TEI module {#sec:dictionaries-module}

One of the main motivations behind project DISCO-LGE was to produce data useful to future scientific projects, which in particular requires it to be interoperable and reusable. These are the last two key aspects of the FAIR (https://www.go-fair.org/fair-principles/) principles (findability, accessibility, interoperability and reusability) which are important guidelines for efficient, high-quality research. This section starts by describing the existing toolset provided by the XML-TEI guidelines to achieve this goal, before introducing some notations and tools from graph theory which will be used to browse the guidelines in a systematic and thorough way in section @sec:new-standard.

A good starting point {#sec:starting-point}

The dictionaries module has been leveraged in projects NENUFAR (https://cahier.hypotheses.org/nenufar) and BASNUM (https://anr.fr/Projet-ANR-18-CE38-0003) to encode respectively the Petit Larousse Illustré published by Pierre Larousse in 1905 [@bohbot2018, p. 1], roughly contemporary to LGE, and the Dictionnaire Universel by Furetière, or rather its second version edited by Henri Basnage de Beauval, an encyclopedic dictionary from the very early 18^th^ century [@williams2017, p. 1]. These successes suggested that it would be a useful tool to encode encyclopedias, but a few differences remained between both projects and DISCO-LGE: the text studied by NENUFAR does not have the encyclopedic dimension LGE has, and BASNUM studies a much older text which had a tremendous influence on the European encyclopedic effort of the 18^th^ century but is not as clearly separated from the dictionaric stem as LGE is. For these reasons, the encoding schemes used in these projects could not be reused directly, prompting a systematic exploration of the XML-TEI schema to devise a new one.

This chapter discusses XML elements and hence needs to name and manipulate them. They will be represented in a monospace font, in the standard XML self-closing form within angle brackets and with a slash following the element name like <div/> for a div element (https://tei-c.org/release/doc/tei-p5-doc/en/html/ref-div.html). This notation does not mean to imply that they cannot contain raw text or other XML elements; it merely denotes such an element, without any additional assumption. In the context of a concrete document instance this can refer to the markup with all the subtree that possibly spans from it, but the same notation will be used when considering the abstract element and the rules that govern its use in relation to other elements or its attributes.

A graph problem

The XML-TEI specification contains 590 elements, which are each documented on the consortium's website in the online reference pages. With an average of almost 80 possible child elements (79.91) within any given element, manually browsing such a massive network can prove quite difficult as the number of combinations sharply increases with each step.
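To give a rough order of magnitude, assume (as a simplification, since the actual number of children varies from one element to another) that every element admits exactly the average number of children $\bar{d} = 79.91$. The number $N_k$ of nesting chains of depth $k$ starting from a given element would then grow as

$$N_k \approx \bar{d}^{\,k}, \qquad N_2 \approx 6.4 \times 10^{3}, \quad N_3 \approx 5.1 \times 10^{5}, \quad N_4 \approx 4.1 \times 10^{7}.$$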

The problem can be advantageously transformed to benefit from the results of graph theory by representing the network of the XML elements as a directed graph whose nodes are connected or not depending on the inclusion rules of the guidelines. Classical, well-known traversal techniques such as Dijkstra's algorithm [@dijkstra59], which computes the shortest path between two nodes in a graph and reports when they are not connected, can then be applied to compute systematically all the possible ways to nest a given element under another without any risk of forgetting a route because of human error.
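The idea can be sketched on a toy graph. The snippet below is a minimal illustration and not the tooling used in the project: the `CHILDREN` table is a hand-written, hypothetical excerpt rather than the actual TEI content model, and the function name `shortest_path` is chosen for this example only. Since every edge has the same cost, a breadth-first search is used here; on such a graph it finds paths of the same length as Dijkstra's algorithm would.

```python
from collections import deque

# Hypothetical, hand-written excerpt of inclusion rules (parent -> allowed
# direct children). It is NOT the real TEI schema, only a toy example.
CHILDREN = {
    "body": ["div", "entry"],
    "div": ["div", "p", "entry"],
    "entry": ["form", "sense"],
    "sense": ["def", "sense"],
    "form": ["orth"],
    "p": [],
    "def": [],
    "orth": [],
}


def shortest_path(graph, source, target):
    """Return a shortest list of nodes from source to target, or None.

    All edges cost the same, so breadth-first search finds a path of
    minimal length, as Dijkstra's algorithm would on this graph.
    """
    previous = {source: None}  # visited nodes and the node they were reached from
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:  # walk back up to the source
                path.append(node)
                node = previous[node]
            return path[::-1]
        for child in graph.get(node, []):
            if child not in previous:
                previous[child] = node
                queue.append(child)
    return None  # target cannot be nested under source


print(shortest_path(CHILDREN, "body", "def"))
print(shortest_path(CHILDREN, "entry", "p"))
```

In this toy graph the first call finds a chain going through `entry` and `sense`, while the second reports that no nesting exists by returning `None`.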

Although the results provided by this algorithm should be treated with caution because there is no guarantee that the shortest path is meaningful in general, it at least provides an efficient way to check whether a given element may or may not be nested at all under another one and gives a lower bound on the length of a meaningful path if it exists. The accuracy of this heuristic decreases as the length of the intended, meaningful path that a human specialist of the TEI framework could build between the two nodes increases.

The XML-TEI guidelines graph will hence be defined as follows. One node is created for each one of the 590 elements found in the specification. Then, an edge is placed between source node A and destination B if the schema states that the element represented by B can be contained directly by the element represented by A. That is, the edges in the graph represent the relation "is an admissible direct parent of" (written infix, as in "A is connected to B" if and only if "A is an admissible direct parent of B"). Please note that the word "element" is here used with the same meaning as in the TEI documentation to refer to the conceptual device characterised by a given tag name such as p or div and not to a particular instance of them that may occur in a given document. Figure @fig:dictionaries-subgraph, by using this transformation to display only the dictionaries module, hints at the overall complexity of the whole specification.
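This construction can be sketched in a few lines. The example below is only an illustration under stated assumptions: `CONTENT_RULES` is a hypothetical, hand-written table and not the TEI schema, and the names `is_admissible_direct_parent` and `average_children` are introduced for this example. It builds one node per element, one edge per admissible inclusion, and computes the average number of possible children, the same kind of statistic as the 79.91 figure quoted earlier.

```python
# Hypothetical content rules: each element name maps to the elements it may
# directly contain. Illustrative only, NOT taken from the TEI schema.
CONTENT_RULES = {
    "div": {"div", "p", "entry"},
    "entry": {"form", "sense"},
    "sense": {"def", "sense"},
    "form": {"orth"},
    "p": set(),
    "def": set(),
    "orth": set(),
}

# One node per element, and one directed edge (A, B) whenever the rules
# state that A may directly contain B.
nodes = set(CONTENT_RULES)
edges = {
    (parent, child)
    for parent, children in CONTENT_RULES.items()
    for child in children
}


def is_admissible_direct_parent(a, b):
    """True if and only if there is an edge from a to b."""
    return (a, b) in edges


# Average number of possible direct children per element (mean out-degree).
average_children = len(edges) / len(nodes)

print(is_admissible_direct_parent("entry", "sense"))
print(is_admissible_direct_parent("sense", "entry"))
print(round(average_children, 2))
```

Note that the relation is not symmetric: in this toy table `entry` may contain `sense`, but not the other way round, which is why the graph has to be directed.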

The subgraph of the dictionaries module

With this definition, moving from one node to another on the graph has an XML-TEI counterpart. Following an edge from A to B can be understood as preparing an XML structure of an <A/> element containing a <B/> element like this:

<A>
    <B/>
</A>
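The same correspondence extends to whole paths: a path found in the graph can be turned mechanically into a nested XML skeleton. The following sketch uses Python's standard `xml.etree.ElementTree` module; the helper name `path_to_xml` is hypothetical and the element names are the placeholders used above, not actual TEI elements.

```python
import xml.etree.ElementTree as ET


def path_to_xml(path):
    """Serialise a path of element names as nested XML elements.

    Each element of the path is nested inside the previous one, so that
    following the edge from A to B yields an A element containing a B.
    """
    if not path:
        raise ValueError("a path needs at least one element")
    root = ET.Element(path[0])
    current = root
    for name in path[1:]:
        current = ET.SubElement(current, name)  # go one level deeper
    return ET.tostring(root, encoding="unicode")


print(path_to_xml(["A", "B"]))
print(path_to_xml(["A", "B", "C"]))
```

The output is only a skeleton: whether such a structure is meaningful for a given text remains for a human encoder to decide.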