title: The specificities of encoding encyclopedias: towards a new standard ?
author: Alice BRENON
numbersections: True
header-includes:
	\usepackage{textalpha}
	\usepackage{hyperref}
	\hypersetup{
		colorlinks,
		urlcolor = blue
	}

Dictionaries and encyclopedias

In common parlance, the terms "dictionaries" and "encyclopedias" are used as near synonyms to refer to books compiling vast amounts of knowledge into lists of definitions ordered alphabetically. Their similarity is even visible in the way they are coordinated in the full title of the Encyclopédie ou Dictionnaire raisonné des sciences, des arts et des métiers, published by Diderot and d'Alembert between 1751 and 1772, which is probably the most famous work of the genre and a symbol of the Age of Enlightenment.

"Encyclopedia"

If the word "encyclopedia" is nowadays part of our vocabulary, it was much more unusual and in fact controversial when Diderot and d'Alembert decided to use it in the title of their book.

The definition given by Furetière in his Dictionnaire Universel in 1690 is still close to its Greek etymology: a "ring of all knowledges", from κύκλος, "circle", and παιδεία, "knowledge". This meaning is the one used for instance by Rabelais in Pantagruel, when he has Thaumaste declare that Panurge opened to him "le vray puys et abisme de Encyclopedie" ("the true well and abyss of Encyclopedia"). At the time the word still mostly refers to the abstract concept of mastering all knowledge at once. Furetière adds that it is a quality one is unlikely to possess, and even seems to condemn its pursuit as a form of hubris: "C'est une témérité à un homme de vouloir posséder l'Encyclopédie" ("it is a recklessness for a man to want to possess Encyclopedia").

Beyond this moral reproach, the concept that pleased Rabelais was somewhat dated at the end of the 17\textsuperscript{th} century and attacked in the Dictionnaire Universel François et Latin, commonly referred to as the Dictionnaire de Trevoux, as utterly "burlesque" ("parodic"). The entry for "Encyclopédie" remained unchanged in the four editions issued between 1721 and 1752, mocking the use of the word and discouraging readers from pursuing it. To that end, it quotes a poem by Pibrac encouraging people to specialise in only one discipline lest they fail to reach perfection, an argument that resembles the saying "Jack of all trades, master of none". It is all the more interesting that the definition remained unaltered until 1752, one year after the publication of the first volume of the Encyclopédie. The Jesuits who edited the Dictionnaire de Trevoux frowned upon the project of the Encyclopédie, which they managed to get banned that same year by the Council of State on the charge of attempting to destroy royal authority, inspiring rebellion and corrupting morality in general. There is much more at stake than words here, but the attempt to deprecate the word itself is part of their fight against the philosophers of the Enlightenment.

Diderot does not ignore these attacks: he opens the very definition of the word "Encyclopédie" in the Encyclopédie itself with a strong rebuttal. He directly dismisses the concerns expressed in the Dictionnaire de Trevoux as mere self-doubt that their authors should not generalise to mankind, then leaves the main point to a Latin quote by Chancellor Bacon, who argues that a collaborative work can achieve much more than any talented man could: what may not be within reach of a single man within a single lifetime may be achieved by a common effort throughout generations.

History suggests that Diderot's opponents took his defense of the feasibility of the project quite seriously: they had the Encyclopédie's privileges revoked again six years after its publication was resumed, and its remaining volumes had to be published illegally until its completion in 1772.

However, in their last edition in 1771 the authors of the Dictionnaire de Trevoux had no choice but to acknowledge the success of the encyclopedic projects of the 18\textsuperscript{th} century. In this version, the definition was entirely reworked: instead of calling the effort a parody, it mildly states that good encyclopedias are difficult to make because of the amount of knowledge required and the work needed to keep up with scientific progress. It credits Chambers's Cyclopædia for being a decent attempt before referring anonymously though quite explicitly to Diderot and d'Alembert's project by naming the collective "Une Société de gens de Lettres" and writing that it started in 1751. Even more importantly, two new entries were added after it: one for the adjective "encyclopédique" and another one for the noun "encyclopédiste", silently admitting how the project had changed its time and the relation to knowledge.

A different approach

If encyclopedias are thus historically more recent than dictionaries, they also depart from the latter in their approach. The purpose of dictionaries from their origin is to collect words, to make an exhaustive inventory of the terms used in a domain or in a language in order to associate a definition with them, be it a translation into another language for a foreign language dictionary or a phrase explaining it for other dictionaries. As such, they are collections of signs and remain within the linguistic level of things. Entries in a dictionary often feature information such as the part of speech, the pronunciation or the etymology of the word they define.

The entry for "Dictionnaire" in the Encyclopédie distinguishes between three types of dictionaries: one to define words, the second to define facts and the last one to define things, corresponding to the distinction between language, history, and science and arts dictionaries although according to its author, d'Alembert, each has to be of more than just one kind to be really good. In the full title of the Encyclopédie, the concept is more or less equated by means of the coordinating conjunction "ou" to a Dictionnaire raisonné, "reasoned dictionary", introducing the idea of encyclopedias as dictionaries with additional structure and a philosophical dimension.

Returning to the "Encyclopédie" article, we read that a dictionary remaining strictly at the language level, a vocabulary, can be seen as the empty frame required for an encyclopedic dictionary that will fill it with additional depth. Given how d'Alembert insists on the importance of brevity for a clear definition in the "Dictionnaire de Langues" entry, it is clear that for the encyclopédistes, encyclopedias are not superior to dictionaries but really depart from them in terms of purpose.

La Grande Encyclopédie

After emerging from dictionaries during the 18\textsuperscript{th} century, encyclopedias became a fertile subgenre in themselves which kept evolving over the following centuries. One of the offspring of the Encyclopédie from the 19\textsuperscript{th} century is entitled La Grande Encyclopédie, Inventaire raisonné des Sciences, des Lettres et des Arts par une Société de savants et de gens de lettres and was published between 1885 and 1902 by an organised team of over two hundred specialists divided into eleven sections. The aim of the CollEx-Persée project DISCO-LGE was to digitise La Grande Encyclopédie and make it available to the scientific community as well as the general public. A previous version was partially available on Gallica but lacked in quality, and its text had not been fully extracted from the pictures with an Optical Character Recognition (OCR) system.

The dictionaries TEI module

Producing data useful to other future scientific projects cannot be achieved unless it is interoperable and reusable. These are the last two key aspects of the FAIR principles (findability, accessibility, interoperability and reusability) which we strive to follow as a guideline for efficient and quality research. It entails using standard formats, and a standard for encoding historical texts in the context of digital humanities is XML-TEI, collectively developed by the Text Encoding Initiative consortium. It consists of a set of technical specifications in the form of XML schemas, along with a range of tools to handle them and training resources.

The XML-TEI standard has a modular structure consisting of optional parts each covering specific needs such as the physical features of a source document, the transcription of oral corpora or particular requirements for textual domains like poetry, or, in our case, dictionaries.

In what follows, we need to name and manipulate XML elements. We choose to represent them in a monospace font, in the standard XML self-closing form within angle brackets and with a slash following the element name, like <div/> for a div element. We do not mean by this notation that they cannot contain raw text or other XML elements, merely that we are referring to such an element: either with the whole subtree that spans from it, in the context of a concrete document instance, or as an empty structure, when we are considering the abstract element and the rules that govern its use in relation to other elements or its attributes.

Content

A graph problem

The XML-TEI specification contains 590 elements, which are each documented on the consortium's website in the online reference pages. With an average of almost 80 possible child elements (79.91) within any given element, manually browsing such a massive network can prove quite difficult as the number of combinations sharply increases with each step. We transform the problem by representing this network as a directed graph, using elements of XML-TEI as nodes and placing an edge from a source node to a destination node whenever the destination may be contained within the source according to the schema.
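As a minimal sketch of this representation, the directed graph can be stored as an adjacency mapping from each element to the set of elements it may directly contain. The excerpt below is a toy graph: it only lists the few inclusion relations mentioned in this text, whereas the real schema has 590 elements, and the names `TOY_GRAPH`, `may_contain` and `average_out_degree` are illustrative choices, not part of any TEI tool.

```python
# Toy excerpt of the inclusion graph (an assumption for illustration only).
# An edge parent -> child means "child may appear directly inside parent".
TOY_GRAPH = {
    "body": {"entry", "entryFree", "superEntry"},
    "superEntry": {"entry"},
    "entryFree": {"entry", "pos"},
    "entry": {"form", "gramGrp"},
    "form": {"pos"},
    "gramGrp": {"pos"},
    "pos": set(),
}


def may_contain(graph, parent, child):
    """Return True if `child` may be a direct child of `parent`."""
    return child in graph.get(parent, set())


def average_out_degree(graph):
    """Mean number of possible direct children per element."""
    return sum(len(children) for children in graph.values()) / len(graph)


if __name__ == "__main__":
    print(may_contain(TOY_GRAPH, "entry", "pos"))
    print(may_contain(TOY_GRAPH, "entryFree", "pos"))
    print(average_out_degree(TOY_GRAPH))
```

On the full schema, the same average would be computed over all 590 nodes; the toy value has no meaning beyond checking the code.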

The subgraph of the dictionaries module

By iterating several times the operation of moving on that graph along one edge, that is, by considering the transitive closure of the relation "be connected by an edge" we define inclusion paths which allow us to explore which elements may be nested under one another. The nodes visited along the way represent the intermediate XML elements to construct a valid XML tree according to the TEI schema. Given the top-down semantics of those trees, we call the length of an inclusion path its depth.
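The transitive closure can be sketched as a plain graph traversal: starting from one element, we collect every element reachable by following one or more edges. The graph below is a toy excerpt restricted to relations mentioned in this text (an assumption for illustration, far smaller than the real schema), and `descendants` is a hypothetical helper name.

```python
# Toy excerpt of the inclusion graph (illustrative assumption, not the schema).
TOY_GRAPH = {
    "body": {"entry", "entryFree", "superEntry"},
    "superEntry": {"entry"},
    "entryFree": {"entry", "pos"},
    "entry": {"form", "gramGrp"},
    "form": {"pos"},
    "gramGrp": {"pos"},
    "pos": set(),
}


def descendants(graph, start):
    """All elements that may be nested, at any depth, under `start`."""
    seen = set()
    stack = list(graph.get(start, ()))
    while stack:
        node = stack.pop()
        if node not in seen:
            seen.add(node)
            # Follow the edges leaving this node in turn.
            stack.extend(graph.get(node, ()))
    return seen


if __name__ == "__main__":
    print(sorted(descendants(TOY_GRAPH, "entry")))
```

Because the toy graph omits most of the schema, an absence of path here says nothing about the real TEI graph; only the traversal itself carries over.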

Using classical, well-known methods such as Dijkstra's algorithm (Dijkstra, 1959) allows us to explore the shortest inclusion paths that exist between elements. Although particular caution is required, because there is no guarantee that the shortest path is meaningful in general, it at least provides us with an efficient way to check whether a given element may or may not be nested at all under another one, and it gives an order of magnitude for the length of the path to expect. Of course, the accuracy of this heuristic decreases as the length of the intended, meaningful path that a human specialist of the TEI framework would build between two nodes increases. This is still very useful when taking into account the fact that TEI modules are merely "bags" that group elements and provide hints to human encoders about the tools they might need, but have no implication on the inclusion paths between elements, which cross module boundaries freely. The general graph formalism enables us to describe complex filtering patterns and to implement queries to look for them among the elements exhaustively by algorithmic means, even when the shortest-path approach is not enough.
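A shortest inclusion path query could look like the sketch below. Since every edge of the inclusion graph has the same cost, a breadth-first search returns the same shortest paths as Dijkstra's algorithm, so that simpler technique is used here in its place. The graph is a toy excerpt limited to relations mentioned in this text, and `shortest_inclusion_path` is a hypothetical name.

```python
from collections import deque

# Toy excerpt of the inclusion graph (illustrative assumption, not the schema).
TOY_GRAPH = {
    "body": {"entry", "entryFree", "superEntry"},
    "superEntry": {"entry"},
    "entryFree": {"entry", "pos"},
    "entry": {"form", "gramGrp"},
    "form": {"pos"},
    "gramGrp": {"pos"},
    "pos": set(),
}


def shortest_inclusion_path(graph, source, target):
    """Breadth-first search; returns a list of elements or None if unreachable."""
    if source == target:
        return [source]
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        # Sorting only makes the traversal order deterministic.
        for child in sorted(graph.get(node, ())):
            if child in parents:
                continue
            parents[child] = node
            if child == target:
                # Walk back through the recorded parents to rebuild the path.
                path = [child]
                while parents[path[-1]] is not None:
                    path.append(parents[path[-1]])
                return path[::-1]
            queue.append(child)
    return None


if __name__ == "__main__":
    print(shortest_inclusion_path(TOY_GRAPH, "body", "pos"))
```

The depth of the returned path is its number of edges, that is `len(path) - 1`.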

For instance, it lets one find that <pos/> may not be directly included within <entry/> elements to carry information about the part of speech of the word that an article defines: the correct way to do so is through a <form/> or a <gramGrp/>. On the other hand, trying to discover the shortest inclusion path to <pos/> from the <TEI/> root of the document yields a path through <standOff/>, an element dedicated to storing contextual data that accompanies but is not part of the text, not unlike an annex, and largely unrelated to the context of encoding an encyclopedia. A last relevant example of the use of these methods can be given by querying the shortest inclusion path of a <pos/> under the <body/> of the document: it yields an inclusion directly through <entryFree/> (with an inclusion path of length 2), which unlike <entry/> accepts it as a direct child node. This is possibly not what we want, depending on the regularity of the articles we are encoding and on the occurrence of other grammatical information such as <case/> or <gen/> to justify the use of the <gramGrp/>, but searching exhaustively for paths up to length 3 returns, as expected, the path through <entry/>, among others. Overall, we get a good general idea: <pos/> does not need to be nested very deep, it can appear quite near the "surface" of article entries.
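The exhaustive search up to a given depth can be sketched as a bounded depth-first enumeration. As before, the graph is a toy excerpt that only contains relations mentioned in this text (so the real schema would return many more paths), and `inclusion_paths` is a hypothetical helper name.

```python
# Toy excerpt of the inclusion graph (illustrative assumption, not the schema).
TOY_GRAPH = {
    "body": {"entry", "entryFree", "superEntry"},
    "superEntry": {"entry"},
    "entryFree": {"entry", "pos"},
    "entry": {"form", "gramGrp"},
    "form": {"pos"},
    "gramGrp": {"pos"},
    "pos": set(),
}


def inclusion_paths(graph, source, target, max_depth):
    """Yield every path from source to target using at most max_depth edges."""

    def walk(path):
        node = path[-1]
        if node == target and len(path) > 1:
            yield list(path)
            return
        if len(path) - 1 == max_depth:
            return  # depth budget exhausted without reaching the target
        for child in sorted(graph.get(node, ())):
            if child not in path:  # never visit the same element twice
                yield from walk(path + [child])

    yield from walk([source])


if __name__ == "__main__":
    for path in inclusion_paths(TOY_GRAPH, "body", "pos", max_depth=3):
        print(" > ".join(path))
```

With a depth limit of 2, only the path through <entryFree/> remains in this toy graph; raising the limit to 3 also returns the paths through <entry/>.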

The <entry/> element

The central element of the dictionaries module is the <entry/> element, meant to encode one single entry in a dictionary, that is to say a headword associated with its definition. It is the natural way in from the <body/> element to the dictionary module: indeed, although <body/> may also contain <entryFree/> or <superEntry/> elements, the former is a relaxed version of <entry/> while the latter is a device to group several related entries together. Both can contain an <entry/> directly while no obvious inclusion exists the other way around: most (> 96.2%) of the inclusion paths of "reasonable" depth (which we define as strictly less than 5, that is twice the average shortest depth between any two nodes) include either <figure/> or <castList/>, two very specific elements which should not need to appear in an article in general, showing that the purpose of <entry/> is not to contain an <entryFree/> or <superEntry/>. Hence, not only the semantics conveyed by the documentation but also the structure of the elements graph point to <entry/> as the natural top-most element for an article. This somewhat contrived example is meant to further demonstrate the application of a graph-centred approach to understanding the inner workings of the XML-TEI schema.

Information about the headword itself