---
title: "The specificities of encoding encyclopedias: towards a new standard?"
author: Alice BRENON
numbersections: True
header-includes:
  \usepackage{textalpha}
  \usepackage{hyperref}
  \hypersetup{
    colorlinks,
    urlcolor = blue
  }
---

# Dictionaries and encyclopedias

In common parlance, the terms "dictionaries" and "encyclopedias" are used as near synonyms to refer to books compiling vast amounts of knowledge into lists of definitions ordered alphabetically. Their similarity is even visible in the way they are coordinated in the full title of the *Encyclopédie ou Dictionnaire raisonné des sciences, des arts et des métiers*, published by Diderot and d'Alembert between 1751 and 1772, which is probably the most famous work of the genre and a symbol of the Age of Enlightenment.

## "Encyclopedia"

If the word "encyclopedia" is nowadays part of our vocabulary, it was much more unusual and in fact controversial when Diderot and d'Alembert decided to use it in the title of their book. The definition given by Furetière in his *Dictionnaire Universel* in 1690 is still close to its Greek etymology: a "ring of all knowledge", from *κύκλος*, "circle", and *παιδεία*, "knowledge". This meaning is the one used for instance by Rabelais in *Pantagruel*, when he has Thaumaste declare that Panurge opened to him "le vray puys et abisme de Encyclopedie" ("the true well and abyss of Encyclopedia"). At the time the word still mostly refers to the abstract concept of mastering all knowledge at once. Furetière adds that it is a quality one is unlikely to possess, and even seems to condemn its pursuit as a form of hubris: "C'est une témérité à un homme de vouloir posséder l'Encyclopédie" ("it is a recklessness for a man to want to possess Encyclopedia").
Beyond this moral reproach, the concept that pleased Rabelais was somewhat dated at the end of the 17\textsuperscript{th} century and attacked in the *Dictionnaire Universel François et Latin*, commonly referred to as the *Dictionnaire de Trevoux*, as utterly "burlesque" ("parodic"). The entry for "Encyclopédie" remained unchanged in the four editions issued between 1721 and 1752, mocking the use of the word and discouraging its readers from pursuing it. To that end, it quotes a poem by Pibrac encouraging people to specialise in only one discipline lest they should not reach perfection, based on an argument that resembles the saying "Jack of all trades, master of none". It is all the more interesting that the definition remains unaltered until 1752, one year after the publication of the first volume of the *Encyclopédie*. The Jesuits who edited the *Dictionnaire de Trevoux* frowned upon the project of the *Encyclopédie*, which they managed to get banned the same year by the Council of State on the charge of attempting to destroy the royal authority, inspiring rebellion and corrupting morality in general. There is much more at stake than words here, but the attempt to deprecate the word itself is part of their fight against the philosophers of the Enlightenment.

The attacks do not remain ignored by Diderot, who starts the very definition of the word "Encyclopédie" in the *Encyclopédie* itself with a strong rebuttal. He directly dismisses the concerns expressed in the *Dictionnaire de Trevoux* as mere self-doubt that their authors shouldn't generalise to mankind, then leaves the main point to a Latin quote by Chancellor Bacon, who argues that a collaborative work can achieve much more than any talented man could: what may not be within reach of a single man within a single lifetime may be achieved by a common effort throughout generations.
History hints that Diderot's opponents took his defense of the feasibility of the project quite seriously, considering the fact that they got the *Encyclopédie*'s privileges revoked again six years after its publication was resumed and that its remaining volumes had to be published illegally until its end in 1772. However, in their last edition in 1771 the authors of the *Dictionnaire de Trevoux* had no choice but to acknowledge the success of the encyclopedic projects of the 18\textsuperscript{th} century. In this version, the definition was entirely reworked, mildly stating that good encyclopedias are difficult to make because of the amount of knowledge necessary and the work needed to keep up with scientific progress, instead of calling the effort a parody. It credits Chambers's *Cyclopædia* for being a decent attempt before referring anonymously though quite explicitly to Diderot and d'Alembert's project by naming the collective "Une Société de gens de Lettres" and writing that it started in 1751. Even more importantly, two new entries were added after it: one for the adjective "encyclopédique" and another one for the noun "encyclopédiste", tacitly admitting how the project had changed its time and the relation to knowledge.

## A different approach

If encyclopedias are thus historically more recent than dictionaries, they also depart from the latter in their approach. The purpose of dictionaries from their origin is to collect words, to make an exhaustive inventory of the terms used in a domain or in a language in order to associate a *definition* with them, be it a translation in another language for a foreign language dictionary or a phrase explaining it for other dictionaries. As such, they are collections of *signs* and remain within the linguistic level of things. Entries in a dictionary often feature information such as the part of speech, the pronunciation or the etymology of the word they define.
The entry for "Dictionnaire" in the *Encyclopédie* distinguishes between three types of dictionaries: one to define *words*, the second to define *facts* and the last one to define *things*, corresponding to the distinction between language, history, and science and arts dictionaries, although according to its author, d'Alembert, each has to be of more than just one kind to be really good. In the full title of the *Encyclopédie*, the concept is more or less equated by means of the coordinating conjunction "ou" to a *Dictionnaire raisonné*, "reasoned dictionary", introducing the idea of encyclopedias as dictionaries with additional structure and a philosophical dimension. Back to the "Encyclopédie" article, we read that a dictionary remaining strictly at the language level, a vocabulary, can be seen as the empty frame required for an encyclopedic dictionary that will fill it with additional depth. Given how d'Alembert insists on the importance of brevity for a clear definition in the "Dictionnaire de Langues" entry, it is clear that for the *encyclopédistes*, encyclopedias aren't superior to dictionaries but really depart from them in terms of purpose.

## La Grande Encyclopédie

After emerging from dictionaries during the 18\textsuperscript{th} century, encyclopedias became a fertile subgenre in themselves which kept evolving over the following centuries. One of the offspring of the *Encyclopédie* from the 19\textsuperscript{th} century is entitled *La Grande Encyclopédie, Inventaire raisonné des Sciences, des Lettres et des Arts par une Société de savants et de gens de lettres* and was published between 1885 and 1902 by an organised team of over two hundred specialists divided into eleven sections.

The aim of the [CollEx-Persée project DISCO-LGE](https://www.collexpersee.eu/projet/disco-lge/) was to digitise *La Grande Encyclopédie* and make it available to the scientific community as well as the general public.
A previous version was partially available on [Gallica](https://gallica.bnf.fr/services/engine/search/sru?operation=searchRetrieve&version=1.2&collapsing=disabled&query=%28dc.title%20all%20%22La%20Grande%20encyclop%C3%A9die%22%29%20and%20dc.relation%20all%20%22cb377013071%22&rk=42918;4#) but lacked in quality, and its text had not been fully extracted from the pictures with an Optical Character Recognition (OCR) system.

# The *dictionaries* TEI module

Producing data useful to other future scientific projects cannot be achieved unless the data are *interoperable* and *reusable*. These are the last two key aspects of the [FAIR](https://www.go-fair.org/fair-principles/) principles (*findability*, *accessibility*, *interoperability* and *reusability*) which we strive to follow as a guideline for efficient and quality research. This entails using standard formats, and a standard for encoding historical texts in the context of digital humanities is XML-TEI, collectively developed by the *Text Encoding Initiative* consortium. It consists of a set of technical specifications in the form of XML schemas, along with a range of tools to handle them and training resources.

The XML-TEI standard has a modular structure consisting of optional parts each covering specific needs such as the physical features of a source document, the transcription of oral corpora or particular requirements for textual domains like poetry, or, in our case, dictionaries.

In what follows, we need to name and manipulate XML elements. We choose to represent them in a monospace font, in the standard XML self-closing form within angle brackets and with a slash following the element name, like `<div/>` for a [`div` element](https://tei-c.org/release/doc/tei-p5-doc/en/html/ref-div.html).
We do not mean by this notation that they cannot contain raw text or other XML elements, merely that we are referring to such an element, with all the subtree that spans from it in the context of a concrete document instance, or as an empty structure when we are considering the abstract element and the rules that govern its use in relation to other elements or its attributes.

## Content

## A graph problem

The XML-TEI specification contains 590 elements, which are each documented on the consortium's website in the online reference pages. With an average of almost 80 possible child elements (79.91) within any given element, manually browsing such a massive network can prove quite difficult as the number of combinations sharply increases with each step. We transform the problem by representing this network as a directed graph, using elements of XML-TEI as nodes and placing edges if the destination node may be contained within the source node according to the schema.

By iterating several times the operation of moving on that graph along one edge, that is, by considering the transitive closure of the relation "be connected by an edge", we define *inclusion paths* which allow us to explore which elements may be nested under one another. The nodes visited along the way represent the intermediate XML elements needed to construct a valid XML tree according to the TEI schema. Given the top-down semantics of those trees, we call the length of an inclusion path its *depth*.

Using classical, well-known methods such as Dijkstra's algorithm (Dijkstra, 1959) allows us to explore the shortest inclusion paths that exist between elements. Though particular caution should be applied because there is no guarantee that the shortest path is meaningful in general, it at least provides us with an efficient way to check whether a given element may or may not be nested at all under another one and gives an order of magnitude of the length of the path to expect.
Of course, the accuracy of this heuristic decreases as the length of the intended, meaningful path between two nodes (the one a human specialist of the TEI framework would build) increases. This is still very useful when taking into account the fact that TEI modules are merely "bags" to group the elements and provide hints to human encoders about the tools they might need, but have no implication on the inclusion paths between elements, which cross module boundaries freely.

The general graph formalism enables us to describe complex filtering patterns and to implement queries to look for them among the elements exhaustively by algorithmic means even when the shortest-path approach is not enough. For instance, it lets one find that although `<pos/>` may not be directly included within `<entry/>` elements to include information about the part of speech of the word that an article defines, the correct way to do so is through a `<form/>` or a `<gramGrp/>`. On the other hand, trying to discover the shortest inclusion path to `<pos/>` from the `<TEI/>` root of the document yields a path through `<standOff/>`, an element dedicated to storing contextual data that accompanies but is not part of the text, not unlike an annex, and largely unrelated to the context of encoding an encyclopedia. A last relevant example of the use of these methods can be given by querying the shortest inclusion path of a `<pos/>` under the `<body/>` of the document: it yields an inclusion directly through `<entryFree/>` (with an inclusion path of length 2), which unlike `<entry/>` accepts it as a direct child node. This is possibly not what we want, depending on the regularity of the articles we are encoding and the occurrence of other grammatical information such as `<case/>` or `<gen/>` to justify the use of the `<gramGrp/>`, but searching exhaustively for paths up to length 3 returns, as expected, the path through `<entry/>`, among others.
Overall, we get a good general idea: `<pos/>` does not need to be nested very deep; it can appear quite near the "surface" of article entries.

### The `<entry/>` element

The central element of the *dictionaries* module is the `<entry/>` element, meant to encode one single entry in a dictionary, that is to say a head word associated with its definition. It is the natural way in from the `<body/>` element to the *dictionaries* module: indeed, although `<body/>` may also contain `<entryFree/>` or `<superEntry/>` elements, the former is a relaxed version of `<entry/>` while the latter is a device to group several related entries together. Both can contain an `<entry/>` directly while no obvious inclusion exists the other way around: most (> 96.2%) of the inclusion paths of "reasonable" depth (which we define as strictly less than 5, that is twice the average shortest depth between any two nodes) include either `<figure/>` or `<castList/>`, two very specific elements which should not need to appear in an article in general, showing that the purpose of `<entry/>` is not to contain an `<entryFree/>` or `<superEntry/>`.

Hence, not only the semantics conveyed by the documentation but also the structure of the elements graph point to `<entry/>` as the natural top-most element for an article. This somewhat contrived example is meant to further demonstrate the application of a graph-centred approach to understand the inner workings of the XML-TEI schema.
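
The queries described in the last two sections can be sketched on a toy excerpt of the inclusion graph. This is only a minimal illustration under stated assumptions: the `INCLUSIONS` table lists just the handful of inclusions mentioned in the text (the real schema has 590 elements and far more edges), the function names are hypothetical, and a breadth-first search stands in for Dijkstra's algorithm, which returns the same shortest paths when every edge has the same cost.

```python
from collections import deque

# Toy excerpt of the inclusion graph (an assumption for illustration):
# an edge parent -> child means `child` may appear directly inside `parent`.
INCLUSIONS = {
    "body": ["entry", "entryFree", "superEntry"],
    "superEntry": ["entry"],
    "entryFree": ["entry", "pos"],
    "entry": ["form", "gramGrp"],
    "form": ["pos"],
    "gramGrp": ["pos", "case", "gen"],
}


def shortest_inclusion_path(graph, source, target):
    """Breadth-first search: one shortest inclusion path, or None."""
    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for child in graph.get(path[-1], []):
            if child not in visited:
                visited.add(child)
                queue.append(path + [child])
    return None


def inclusion_paths(graph, source, target, max_depth):
    """Exhaustive search: every inclusion path of depth <= max_depth.

    Elements may repeat along a path, since the schema allows cycles;
    the depth bound is what guarantees termination.
    """
    found = []
    stack = [[source]]
    while stack:
        path = stack.pop()
        if len(path) > 1 and path[-1] == target:
            found.append(path)
            continue
        if len(path) - 1 < max_depth:
            for child in graph.get(path[-1], []):
                stack.append(path + [child])
    return found


# The shortest path from <body/> to <pos/> goes through <entryFree/> ...
print(shortest_inclusion_path(INCLUSIONS, "body", "pos"))
# ... while the exhaustive search up to depth 3 also finds the paths
# going through <entry/>.
for path in inclusion_paths(INCLUSIONS, "body", "pos", 3):
    print(" > ".join(path))
```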

### Information about the headword itself

Once a block for an article is created, it may contain elements useful to represent features such as

- its written and spoken forms: `<form/>`
- a group of grammatical information: `<gramGrp/>`, that may itself contain, as we've seen above, `<case/>`, `<gen/>`, `<number/>` or `<per/>` to describe the form itself for instance, but also information about the categories it belongs to like `<iType/>` for its inflection class in languages with a declension system or `<pos/>` for its part of speech
- its etymology: `<etym/>`
- its variants if there is a different spelling in a variety of the language or if it has changed through time: `<usg/>` (though it is not its only purpose)

All these are examples and by no means an exhaustive list; the complete set provides the encoder with a toolbox to describe all the information related to the form under which the entry is found, and seems general enough to accommodate the structure of any book indexing entries by words.

### Cross-references

A common feature shared by dictionaries and encyclopedias is the ability to connect entries together by using a word or short phrase as the link, referring the reader to the related concept. These links are known as cross-references and can appear either when the definition of a term is adjacent to another one or to catch alternative spellings where some readers might expect the word to appear and redirect them to the form chosen as the reference.

In XML-TEI, this is done with the `<xr/>` element. It usually contains the whole phrase performing the redirection, with an imperative locution like "please see […]". The "active" part of the cross-reference, that is the very word within the `<xr/>` that is considered to be the link or, to make a modern-day HTML metaphor, the region that would be clickable, is represented by a `<ref/>` element.
Though it is not specific to the *dictionaries* module, we include it in this description of the toolbox because it is particularly useful in the context of dictionaries. This element may have a `target` attribute which points to the other resource to be accessed by the interested reader.

### Content

The remaining part of entries is also usually the largest and represents the content associated with the headword by the entry. In a dictionary, that is its meaning. The `<sense/>` element is a valid child for `<entry/>` and groups together a definition of the term with `<def/>`, usage examples with `<usg/>` (another use of this versatile element) and other high-level information such as translations in other languages. Both `<def/>` and `<usg/>` elements may appear directly under the `<entry/>`.

### Structural remarks

Before concluding this description of the *dictionaries* module from the perspective of someone trying to concretely encode a particular dictionary or encyclopedia, we make use of the graph approach again to highlight some of its aspects in terms of inclusion structure.

First, it is remarkable that all elements in the *dictionaries* module have a cyclic inclusion path, that is to say, there is an inclusion path from each element of this module to itself. Although such cycles are widespread in the remainder of XML-TEI, where 73.8% of the elements have one (411 out of the 557 elements in the other modules), the fact that all 33 elements of the *dictionaries* module do is far above this average. In addition, the cycles appear to be rather short, with an average length of 2.00 versus 2.50 in the rest of the population. This observation is all the more surprising considering the fact that the *dictionaries* module contains short "leaf" elements like `<pos/>` which should not obviously need to admit cycles since one rather expects them to contain only one word, like `<pos>adj</pos>` in the example given in the official documentation.
Among those (shortest) cycles, 20 include the `<cit/>` element, made to group quotations with a bibliographic reference to their source, which should clearly be unnecessary to encode an article in the general case.

Secondly, although we have seen examples of connections from this module to the rest of the XML-TEI, especially to the *core* module (see the case of the `<ref/>` element above), the *dictionaries* module appears somewhat isolated from important structural elements like `<head/>` or `<div/>`. Indeed, computing all the paths of length at most 5 from either `<entry/>` or `<sense/>` elements to the latter by a systematic traversal of the graph yields exclusively paths (respectively 9042 and 39093 of them) containing either a `<floatingText/>` or an `<app/>` element. The first one, as its name aptly suggests, is used to encode text that doesn't quite fit the regular flow of the document, as for example in the context of an embedded narrative. Both examples displayed in the online documentation feature a `<body/>` as direct child of `<floatingText/>`, neatly separating its content as independent. The purpose of the second one, whose name (short for apparatus) is less transparent, is to wrap together several versions of the same excerpt, for instance when there are several possible readings of an unclear group of words in a manuscript, or when the encoder is trying to compile a single version of a piece of work from several sources which disagree over some passage. In both cases, it appears obvious that it is not something that is expected to occur naturally in the course of an article in the general case.

Thus, despite a rather dense internal connectivity, the *dictionaries* module fails to provide encoders with a device to represent recursively nesting structures like `<div/>`.

# A new standard?
Studying the content of *La Grande Encyclopédie* and considering several articles in particular, we identify structures which are specific to encyclopedias and not compatible with the *dictionaries* module presented above. We hence conclude that this module is not able to encode arbitrary encyclopedic content and propose a new fully TEI-compliant encoding scheme remaining outside of it.

## Idiosyncrasies of encyclopedias

Browsing through the pages of an encyclopedia reveals a certain number of noticeable differences with dictionaries. It is difficult to make a precise list because the editorial choices may vary greatly between encyclopedias, but we discuss some of the most obvious.

### Organised knowledge

The first immediately visible feature that sets encyclopedias apart from dictionaries, and which can be found in the *Encyclopédie* as well as in *La Grande Encyclopédie*, is the presence of subject indicators at the beginning of articles, right after the headword, which organise them into a domain classification system. Those generally cover a broad range of subjects from scientific disciplines to literature, and extending to political subjects and law.

No element in the *dictionaries* module is explicitly designed for the purpose of encoding these indicators. As we have seen above, the element set is geared towards the words themselves instead of the concepts they represent. The closest tool for what we need is found in the `<usg/>` element used with a specific `type` attribute set to `dom` for "domain". Indeed several examples from the documentation encode subject indicators very similar to the ones found in encyclopedias within this element, but the match is not perfect either: all appear within one of multiple senses, as if to clarify each context in which the word can be used, as expected from the element's name, "usage".
In encyclopedias, if the domain indicator does in certain cases help to distinguish between several entries sharing the same headword, the concept itself has evolved beyond this mere distinction. Looking back at the *Encyclopédie*, the adjective *raisonné* in the rest of the title directly introduces a notion of structure that links back to the "Systême figuré des connoissances humaines". The authors have devised a branching system to classify all knowledge, and the occurrence of the indicator at the beginning of articles, more than a tool to clear up possible ambiguities, also points the reader to the correct place in this mind map.

The situation regarding subject indicators is hardly better outside of the module. The `<domain/>` element, despite its name, belongs exclusively in the header of a document and focuses on the social context of the text, not on the knowledge area it covers. The `<interp/>` element, despite its name, is not so much about labelling something as about an interpretation to give to a context (which subject indicators could be if you consider that, placed at the beginning, they are used to orient the frame of mind of the readers towards a particular subject). However, the documentation clearly demonstrates it as a tool for annotators of a document, whose text content is not part of the original document but some additional result of an analysis performed in the context of the encoding, used only through references in XML attributes.

This point, although not the most concerning, still remains the hardest to address, but all things considered the `<usg/>` element stands out as the most relevant.

### The notion of meaning

### Nested structures

### Candidates in the *dictionaries* module

- `<sense/>`
- `<entryFree/>`
- `<note/>`
- `<dictScrap/>` / `<floatingText/>`

## Encoding within the *core* module

The above remarks explain why the *dictionaries* module by itself is unable to represent encyclopedias, where discourse with nested structures of arbitrary depth can occur.
Since the *core* module of course accommodates these structures by means of the `<div/>`, `<head/>` and `<p/>` elements, we devise an encoding scheme using them which we recommend using for other projects aiming at representing encyclopedias.

To remain consistent with the above remarks we will only concern ourselves with what happens at the level of each article, right under the `<body/>` element. Everything related to metadata happens as expected in the file's `<teiHeader/>`, which is well enough equipped to handle them.

In order to present our scheme throughout the following section we will be progressively encoding a reference article, "Cathète" from tome 9.

### The scheme

Each article is represented by a `<div/>`. We suggest setting an `xml:id` attribute on it, whose value is the head word of the entry, normalised to lowercase, stripping spaces and replacing all non-alphanumerical characters by a dash `'-'` to avoid issues with the XML encoding. This value must be unique, or made so by suffixing a number representing its rank among the various occurrences of the head word (even when there is only one, for the sake of regularity).

Inside this element should be a `<head/>` enclosing the headword of the article. The usual sub-`<hi/>` elements are available within `<head/>` if the headword is highlighted by any special typographic means such as bold, small capitals, etc. This element should also contain the optional subject indicator in parentheses that sometimes accompanies the headword, with the appropriate standard elements like `<persName/>` occurring in biographical articles or `<interp/>` with a `theme` attribute if the article is given a specific domain in a taxonomy.

We propose to then wrap each different meaning in a separate `<div/>` with the `type` attribute set to `sense` to refer to the `<sense/>` element that would have been used within the *dictionaries* module. Each sense should be numbered with the `n` attribute.
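
The normalisation suggested for `xml:id` values can be sketched as follows. This is a minimal illustration rather than the behaviour of any existing tool: the function name `article_id` is hypothetical, and several details left open by the description are assumptions here, namely that surrounding whitespace is trimmed (inner spaces then become dashes like any other non-alphanumerical character), that accented letters count as alphanumerical and are kept, that the rank starts at 0 and that it is joined to the headword with a dash.

```python
import unicodedata


def article_id(headword: str, rank: int = 0) -> str:
    """Build an xml:id value from a headword and its rank among homographs.

    Assumptions (not fixed by the scheme): ranks start at 0 and are
    appended after a dash; accented letters are kept as they are.
    """
    # Compose accents first so that "e" + combining grave becomes "è",
    # which str.isalnum() accepts as a letter.
    text = unicodedata.normalize("NFC", headword).strip().lower()
    # Replace every non-alphanumerical character by a dash.
    slug = "".join(char if char.isalnum() else "-" for char in text)
    # Always suffix the rank, even for a single occurrence.
    return f"{slug}-{rank}"


print(article_id("Cathète"))           # cathète-0
print(article_id("Saint-Étienne", 1))  # saint-étienne-1
```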

In addition, each line within the article must start with a `<lb/>` to mark its beginning, including before the `<head/>` element, which, although a surprising setup, underlines the fact that in the dense layout of encyclopedias, the carriage return separating two articles is meaningful. Stating each new line explicitly keeps enough information to reconstruct a faithful facsimile, but it also has the advantage of highlighting the fact that even though the definition is cut from the headword by being in a separate XML element, they still occur on the same line, which is a typographic choice usually made both in encyclopedias and dictionaries where space is at a premium.

Finally, the various sections and sub-sections occurring within the article body may be nested as usual with `<div/>` and sub-`<div/>`s, filled with `<p/>` for paragraphs, and can each be titled with `<head/>` elements local to each `<div/>`.

But a typical page of an encyclopedia also features peritext elements, giving information to the reader about the current page number along with the headwords of the first and last articles appearing on the page. Depending Moreover, the layout is often

### Currently implemented

The reference implementation for this encoding scheme is the program soprano[^soprano] developed within the scope of project DISCO-LGE to automatically identify individual articles in the flow of raw text from the columns and to encode them into XML-TEI files. Though this software has already been used to produce the first TEI version of *La Grande Encyclopédie*, it doesn't yet follow the above specification perfectly. Here is for instance the encoded version of article "Cathète" it currently produces:

[^soprano]: [https://gitlab.huma-num.fr/disco-lge/soprano](https://gitlab.huma-num.fr/disco-lge/soprano)

The headword detection system is not able to capture the subject indicators yet, so they appear outside of the `<head/>` element.
Likewise, since the detection of titles at the beginning of each section isn't complete, no structure analysis is performed on the content of the article.

## The constraints of automated processing

## Comparison to other approaches

### Bend the semantics

### Custom schema

# Conclusion

Despite long discussions and interesting proposals each with strong arguments both in favour of and against them, no consensus could be reached. For one thing, each project has specific constraints depending on the type of study it intends to carry out, the volume of text, or the condition of the physical source documents.

Beyond the technical need for encodings generic enough to share the corpora within the community and compare the results across various projects, the above results highlight one aspect of a well-known fact within the community of lexicography: encyclopedias and dictionaries differ on several key aspects