---
title: "The specificities of encoding encyclopedias: towards a new standard?"
author: Alice BRENON
numbersections: True
header-includes:
	\usepackage{textalpha}
	\usepackage{hyperref}
	\hypersetup{
		colorlinks,
		urlcolor = blue
	}
---

# Dictionaries and encyclopedias

In common parlance, the terms "dictionaries" and "encyclopedias" are used as
near synonyms to refer to books compiling vast amounts of knowledge into lists
of definitions ordered alphabetically. Their similarity is even visible in the
way they are coordinated in the full title of the *Encyclopédie ou Dictionnaire
raisonné des sciences, des arts et des métiers*, published by Diderot and
d'Alembert between 1751 and 1772, which is probably the most famous work of
the genre and a symbol of the Age of Enlightenment.

## "Encyclopedia"

If the word "encyclopedia" is nowadays part of our vocabulary, it was much more
unusual and in fact controversial when Diderot and d'Alembert decided to use it
in the title of their book.

The definition given by Furetière in his *Dictionnaire Universel* in 1690 is
still close to its Greek etymology: a "ring of all knowledge", from *κύκλος*,
"circle", and *παιδεία*, "knowledge". This meaning is the one used for instance
by Rabelais in *Pantagruel*, when he has Thaumaste declare that Panurge opened
to him "le vray puys et abisme de Encyclopedie" ("the true well and abyss of
Encyclopedia"). At the time the word still mostly referred to the abstract
concept of mastering all knowledge at once. Furetière adds that it is a quality
one is unlikely to possess, and even seems to condemn its pursuit as a form of
hubris: "C'est une témérité à un homme de vouloir posséder l'Encyclopédie"
("it is reckless for a man to want to possess Encyclopedia").

Beyond this moral reproach, the concept that pleased Rabelais was somewhat dated
at the end of the 17\textsuperscript{th} century and attacked in the
*Dictionnaire Universel François et Latin*, commonly referred to as the
*Dictionnaire de Trevoux*, as utterly "burlesque" ("parodic"). The entry for
"Encyclopédie" remained unchanged in the four editions issued between 1721 and
1752, mocking the use of the word and discouraging readers from pursuing it. To
that end, it quotes a poem by Pibrac encouraging people to specialise in
only one discipline lest they fail to reach perfection, an argument that
resembles the saying "Jack of all trades, master of none". It
is all the more interesting that the definition remained unaltered until 1752,
one year after the publication of the first volume of the *Encyclopédie*. The
Jesuits who edited the *Dictionnaire de Trevoux* frowned upon the project of the
*Encyclopédie*, which they managed to get banned the same year by the Council of
State on the charge of attempting to destroy royal authority, inspiring
rebellion and corrupting morality in general. There is much more at stake than
words here, but the attempt to deprecate the word itself is part of their fight
against the philosophers of the Enlightenment.

The attacks did not go unanswered: Diderot opens the very definition of
the word "Encyclopédie" in the *Encyclopédie* itself with a strong rebuttal. He
directly dismisses the concerns expressed in the *Dictionnaire de Trevoux* as
mere self-doubt that their authors shouldn't generalise to mankind, then leaves
the main point to a Latin quote by Chancellor Bacon, who argues that a
collaborative work can achieve much more than any talented man could: what may
not be within reach of a single man within a single lifetime may be
achieved by a common effort across generations.

History hints that Diderot's opponents took his defense of the feasibility of
the project quite seriously, considering the fact that they got the
*Encyclopédie*'s privileges revoked again six years after its publication
was resumed and that its remaining volumes had to be published illegally until
its completion in 1772.

However, in their last edition in 1771 the authors of the *Dictionnaire de
Trevoux* had no choice but to acknowledge the success of the encyclopedic
projects of the 18\textsuperscript{th} century. In this version, the definition
was entirely reworked, mildly stating that good encyclopedias are difficult to
make because of the amount of knowledge necessary and work needed to keep up
with scientific progress instead of calling the effort a parody. It credits
Chambers's *Cyclopædia* for being a decent attempt before referring anonymously
though quite explicitly to Diderot and d'Alembert's project by naming the
collective "Une Société de gens de Lettres" and writing that it started in 1751.
Even more importantly, two new entries were added after it: one for the adjective
"encyclopédique" and another one for the noun "encyclopédiste", silently admitting
how the project had changed its time and the relation to knowledge.

## A different approach

If encyclopedias are thus historically more recent than dictionaries, they also
depart from the latter in their approach. The purpose of dictionaries from their
origin is to collect words, to make an exhaustive inventory of the terms
used in a domain or in a language in order to associate a *definition* with them,
be it a translation into another language for a foreign language dictionary or a
phrase explaining it for other dictionaries. As such, they are collections of
*signs* and remain within the linguistic level of things. Entries in a dictionary
often feature information such as the part of speech, the pronunciation or the
etymology of the word they define.

The entry for "Dictionnaire" in the *Encyclopédie* distinguishes between three
types of dictionaries: one to define *words*, the second to define *facts* and
the last one to define *things*, corresponding to the distinction between
language, history, and science and arts dictionaries although according to its
author, d'Alembert, each has to be of more than just one kind to be really good.
In the full title of the *Encyclopédie*, the concept is more or less equated by
means of the coordinating conjunction "ou" to a *Dictionnaire raisonné*,
"reasoned dictionary", introducing the idea of encyclopedias as dictionaries
with additional structure and a philosophical dimension.

Returning to the "Encyclopédie" article, we read that a dictionary remaining
strictly at the language level, a vocabulary, can be seen as the empty frame
required for an encyclopedic dictionary that will fill it with additional depth.
Given how d'Alembert insists on the importance of brevity for a clear definition
in the "Dictionnaire de Langues" entry, it is clear that for the
*encyclopédistes*, encyclopedias aren't superior to dictionaries but really
depart from them in terms of purpose.

## La Grande Encyclopédie

After emerging from dictionaries during the 18\textsuperscript{th} century,
encyclopedias became a fertile subgenre in themselves which kept evolving over
the following centuries. One of the offspring of the *Encyclopédie* from the
19\textsuperscript{th} century is entitled *La Grande Encyclopédie, Inventaire
raisonné des Sciences, des Lettres et des Arts par une Société de savants et de
gens de lettres* and was published between 1885 and 1902 by an organised team of
over two hundred specialists divided into eleven sections. The aim of the
[CollEx-Persée project DISCO-LGE](https://www.collexpersee.eu/projet/disco-lge/)
was to digitise and make *La Grande Encyclopédie* available to the scientific
community as well as the general public. A previous version was partially
available on
[Gallica](https://gallica.bnf.fr/services/engine/search/sru?operation=searchRetrieve&version=1.2&collapsing=disabled&query=%28dc.title%20all%20%22La%20Grande%20encyclop%C3%A9die%22%29%20and%20dc.relation%20all%20%22cb377013071%22&rk=42918;4#)
but lacked quality, and its text had not been fully extracted from the
pictures with an Optical Character Recognition (OCR) system.

# The *dictionaries* TEI module

Data produced by a project cannot be useful to other future scientific projects
unless it is *interoperable* and *reusable*. These are the last two key aspects
of the [FAIR](https://www.go-fair.org/fair-principles/) principles
(*findability*, *accessibility*, *interoperability* and *reusability*) which we
strive to follow as a guideline for efficient and quality research. This entails
using standard formats, and a standard for encoding historical texts in the
context of digital humanities is XML-TEI, collectively developed by the *Text
Encoding Initiative* consortium. It consists of a set of technical
specifications in the form of XML schemas, along with a range of tools to
handle them and training resources.

The XML-TEI standard has a modular structure consisting of optional parts each
covering specific needs such as the physical features of a source document, the
transcription of oral corpora or particular requirements for textual domains
like poetry, or, in our case, dictionaries.

In what follows, we need to name and manipulate XML elements. We choose to
represent them in a monospace font, in the standard XML self-closing form within
angle brackets and with a slash following the element name like `<div/>` for a
[`div` element](https://tei-c.org/release/doc/tei-p5-doc/en/html/ref-div.html).
We do not mean by this notation that they cannot contain raw text or other XML
elements, merely that we are referring to such an element, with all the subtree
that spans from it in the context of a concrete document instance or as an empty
structure when we are considering the abstract element and the rules that govern
its use in relation to other elements or its attributes.

## Content

## A graph problem

The XML-TEI specification contains 590 elements, which are each documented on
the consortium's website in the online reference pages. With an average of
almost 80 possible child elements (79.91) within any given element, manually
browsing such a massive network can prove quite difficult as the number of
combinations sharply increases with each step. We transform the problem by
representing this network as a directed graph, using elements of XML-TEI as
nodes and placing edges if the destination node may be contained within the
source node according to the schema.

![The subgraph of the *dictionaries* module](ressources/dictionaries.png)
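
A minimal sketch of this representation, assuming the schema has already been
parsed into parent/child pairs: the graph is held in a plain adjacency mapping.
The dictionary below is a hand-written excerpt restricted to a few inclusions
mentioned in this document, not the full schema, and the names `INCLUSIONS` and
`average_children` are only illustrative.

```python
# Hand-written excerpt of the inclusion graph (an assumption for this sketch):
# each key may directly contain every element listed in its value.
# Edges that are not listed are simply omitted, not forbidden by the schema.
INCLUSIONS = {
    "body": ["entry", "entryFree", "superEntry"],
    "superEntry": ["entry"],
    "entryFree": ["entry", "pos"],
    "entry": ["form", "gramGrp", "sense"],
    "form": ["pos"],
    "gramGrp": ["pos"],
    "sense": [],
    "pos": [],
}


def average_children(graph):
    """Mean number of possible direct children per element (mean out-degree)."""
    return sum(len(children) for children in graph.values()) / len(graph)


print(average_children(INCLUSIONS))
```

Applied to a graph built from the complete schema, the same computation should
correspond to the average number of possible child elements quoted above.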

By iterating several times the operation of moving on that graph along one edge,
that is, by considering the transitive closure of the relation "be connected by
an edge" we define *inclusion paths* which allow us to explore which elements
may be nested under one another. The nodes visited along the way represent the
intermediate XML elements to construct a valid XML tree according to the TEI
schema. Given the top-down semantics of those trees, we call the length of an
inclusion path its *depth*.

Using classical, well-known methods such as Dijkstra's algorithm (Dijkstra, 1959)
allows us to explore the shortest inclusion paths that exist between elements.
Although caution is required because there is no guarantee that
the shortest path is meaningful in general, it at least provides us with an
efficient way to check whether a given element may or may not be nested at all
under another one and gives an order of magnitude for the length of the path to
expect. Of course the accuracy of this heuristic decreases as the intended,
meaningful path between two nodes, the one that a human specialist of the TEI
framework would build, gets longer. This
is still very useful when taking into account the fact that TEI modules are
merely "bags" to group the elements and provide hints to human encoders about
the tools they might need but have no bearing on the inclusion paths between
elements, which cross module boundaries freely. The general graph formalism
enables us to describe complex filtering patterns and to implement queries to
look for them among the elements exhaustively by algorithmic means even when the
shortest-path approach is not enough.
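
The shortest-path exploration can be sketched as follows. Because every edge of
the inclusion graph has the same cost, a breadth-first search returns paths of
the same length as Dijkstra's algorithm and is used here for brevity; this is a
substitution made for the example. The graph is a hand-written excerpt of
inclusions discussed in this section, not the full schema, and
`shortest_inclusion_path` is a hypothetical helper.

```python
from collections import deque

# Hand-written excerpt of the inclusion graph (an assumption for this sketch).
INCLUSIONS = {
    "body": ["entry", "entryFree", "superEntry"],
    "superEntry": ["entry"],
    "entryFree": ["entry", "pos"],
    "entry": ["form", "gramGrp", "sense"],
    "form": ["pos"],
    "gramGrp": ["pos"],
}


def shortest_inclusion_path(graph, source, target):
    """Return one shortest inclusion path from source to target, or None."""
    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        for child in graph.get(path[-1], []):
            if child == target:
                return path + [child]
            if child not in visited:
                visited.add(child)
                queue.append(path + [child])
    return None


print(shortest_inclusion_path(INCLUSIONS, "body", "pos"))
print(shortest_inclusion_path(INCLUSIONS, "pos", "body"))
```

A result of `None` means that the target cannot be nested under the source at
any depth, at least within the edges that were loaded into the graph.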

For instance, it lets one find that `<pos/>` may not be directly included
within `<entry/>` elements to record the part of speech of the word that an
article defines: the correct way to do so is through a `<form/>` or a
`<gramGrp/>`. On the other hand, trying to discover the shortest inclusion path
to `<pos/>` from the `<TEI/>` root of the document yields a path through
`<standOff/>`, an element dedicated to storing contextual data that accompanies
but is not part of the text, not unlike an annex, and largely unrelated to the
context of encoding an encyclopedia. A last relevant example of the use of these
methods is given by querying the shortest inclusion path of a `<pos/>` under the
`<body/>` of the document: it yields an inclusion directly through
`<entryFree/>` (with an inclusion path of length 2), which unlike `<entry/>`
accepts it as a direct child node. This is possibly not what we want, depending
on the regularity of the articles we are encoding and on the occurrence of other
grammatical information such as `<case/>` or `<gen/>` to justify the use of
`<gramGrp/>`, but searching exhaustively for paths up to length 3 returns, as
expected, the path through `<entry/>`, among others. Overall, we get a good
general idea: `<pos/>` does not need to be nested very deep, it can appear quite
near the "surface" of article entries.
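
The exhaustive search bounded by a maximum length can be sketched as a
depth-limited traversal. As before, the graph is a hand-written excerpt of the
inclusions named in this section rather than the full schema, and
`inclusion_paths` is a hypothetical helper, not the tool used for this study.

```python
# Hand-written excerpt of the inclusion graph (an assumption for this sketch).
INCLUSIONS = {
    "body": ["entry", "entryFree", "superEntry"],
    "superEntry": ["entry"],
    "entryFree": ["entry", "pos"],
    "entry": ["form", "gramGrp", "sense"],
    "form": ["pos"],
    "gramGrp": ["pos"],
}


def inclusion_paths(graph, source, target, max_depth):
    """Yield every path from source to target made of at most max_depth edges."""
    stack = [[source]]
    while stack:
        path = stack.pop()
        if len(path) - 1 >= max_depth:
            continue  # extending this path would exceed the depth bound
        for child in graph.get(path[-1], []):
            extended = path + [child]
            if child == target:
                yield extended
            # Keep exploring: a longer path may reach the target again.
            stack.append(extended)


for path in sorted(inclusion_paths(INCLUSIONS, "body", "pos", 3), key=len):
    print(" > ".join(path))
```

Since the depth bound guarantees termination even on cyclic graphs, no set of
visited nodes is kept, which lets the same element appear several times along
a path.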

### The `<entry/>` element

The central element of the *dictionaries* module is the `<entry/>` element, meant
to encode one single entry in a dictionary, that is to say a headword
associated with its definition. It is the natural way in from the `<body/>`
element to the *dictionaries* module: indeed, although `<body/>` may also contain
`<entryFree/>` or `<superEntry/>` elements, the former is a relaxed version of
`<entry/>` while the latter is a device to group several related entries
together. Both can contain an `<entry/>` directly while no obvious inclusion
exists the other way around: most (> 96.2%) of the inclusion paths of
"reasonable" depth (which we define as strictly less than 5, that is twice the
average shortest depth between any two nodes) include either `<figure/>` or
`<castList/>`, two very specific elements which should not need to appear in an
article in general, showing that the purpose of `<entry/>` is not to contain an
`<entryFree/>` or `<superEntry/>`. Hence, not only the semantics conveyed by the
documentation but also the structure of the elements graph point to `<entry/>`
as the natural top-most element for an article. This somewhat contrived example
further demonstrates how a graph-centred approach helps to understand the inner
workings of the XML-TEI schema.

### Information about the headword itself

Once a block for an article is created, it may contain elements useful to
represent features such as

- its written and spoken forms: `<form/>`
- a group of grammatical information: `<gramGrp/>`, that may itself contain as
  we've seen above `<case/>`, `<gen/>`, `<number/>` or `<pers/>` to describe the
  form itself for instance, but also information about the categories it belongs
  to like `<iType/>` for its inflection class in languages with a declension
  system or `<pos/>` for its part-of-speech
- its etymology: `<etym/>`
- its variants if there is a different spelling in a variety of the language or
  if it has changed through time: `<usg/>` (though it is not its only purpose)

All these are examples and by no means an exhaustive list; the complete set
provides the encoder with a toolbox to describe all the information related to
the form under which the entry is found and seems general enough to accommodate
the structure of any book indexing entries by words.
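
As an illustration of how these elements combine, here is a minimal sketch that
builds such a block with Python's standard `xml.etree.ElementTree`. The
headword is a placeholder, the `<orth/>` element (the orthographic form in the
TEI Guidelines) is not discussed in this document, the TEI namespace is left
out for brevity, and `build_entry` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET


def build_entry(headword, part_of_speech):
    """Build an <entry/> holding a written form and a part of speech."""
    entry = ET.Element("entry")
    form = ET.SubElement(entry, "form")
    ET.SubElement(form, "orth").text = headword  # written form of the headword
    gram_grp = ET.SubElement(entry, "gramGrp")
    ET.SubElement(gram_grp, "pos").text = part_of_speech
    return entry


# "example" is a placeholder headword, not taken from any source document.
print(ET.tostring(build_entry("example", "adj"), encoding="unicode"))
```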

### Cross-references

A common feature shared by dictionaries and encyclopedias is the ability to
connect entries together by using a word or short phrase as the link, referring
the reader to the related concept. These links are known as cross-references and
can appear either when the definition of a term is closely related to another
one or to catch alternative spellings where some readers might expect the word
to appear and redirect them to the form chosen as the reference. In XML-TEI,
this is done with the `<xr/>` element. It usually contains the whole phrase
performing the redirection, with an imperative phrase like "please see […]".

The "active" part of the cross-reference, that is the very word within the
`<xr/>` that is considered to be the link or, to make a modern-day HTML
metaphor, the region that would be clickable, is represented by a `<ref/>`
element. Though it is not specific to the *dictionaries* module, we include it
in this description of the toolbox because it is particularly useful in the
context of dictionaries. This element may have a `target` attribute which points
to the other resource to be accessed by the interested reader.
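
To make the role of the `target` attribute concrete, the following sketch
collects the targets of all cross-references found in an entry. The sample
entry, its headwords and the `#other-0` identifier are invented for the
example, and `cross_reference_targets` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Invented sample entry: the words and the target identifier are placeholders.
SAMPLE = (
    "<entry><form><orth>example</orth></form>"
    '<xr>see <ref target="#other-0">Other</ref></xr></entry>'
)


def cross_reference_targets(entry):
    """List the target of every <ref/> placed directly inside an <xr/>."""
    return [
        ref.get("target")
        for ref in entry.iterfind(".//xr/ref")
        if ref.get("target") is not None
    ]


print(cross_reference_targets(ET.fromstring(SAMPLE)))
```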

### Content

The remaining part of an entry is usually also the largest and represents the
content associated with the headword by the entry. In a dictionary, that is its
meaning.

The `<sense/>` element is a valid child for `<entry/>` and groups together a
definition of the term with `<def/>`, usage examples with `<usg/>` (another use
of this versatile element) and other high-level information such as translations
in other languages. Both `<def/>` and `<usg/>` elements may appear directly
under the `<entry/>`.

### Structural remarks

Before concluding this description of the *dictionaries* module from the
perspective of someone trying to concretely encode a particular dictionary or
encyclopedia, we make use of the graph approach again to highlight some of its
aspects in terms of inclusion structure.

First, it is remarkable that all elements in the *dictionaries* module have a
cyclic inclusion path, that is to say, there is an inclusion path from each
element of this module to itself. Although having such a cycle is a widespread
property in the remainder of XML-TEI elements shared by 73.8% of them (411 out
of the 557 elements in the other modules), all 33 elements of the *dictionaries*
module having one is far above this average. In addition, the cycles appear to
be rather short, with an average length of 2.00 versus 2.50 in the rest of the
population. This observation is all the more surprising considering the fact
that the *dictionaries* module contains short "leaf" elements like `<pos/>`
which should not obviously need to admit cycles since one rather expects them to
contain only one word, like `<pos>adj</pos>` in the example given in the
official documentation. Among those (shortest) cycles, 20 include the `<cit/>`
element made to group quotations with a bibliographic reference to their source
which should clearly be unnecessary to encode an article in the general case.
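
The cycle statistics above can be obtained with a breadth-first search from
each element back to itself. The sketch below runs on an abstract toy graph
whose nodes `a` to `d` do not stand for TEI elements, and
`shortest_cycle_length` is a hypothetical helper.

```python
from collections import deque

# Abstract toy graph: the nodes are NOT TEI elements.
TOY = {"a": ["b"], "b": ["a", "c"], "c": ["c"], "d": ["a"]}


def shortest_cycle_length(graph, node):
    """Number of edges of the shortest path from node back to itself, or None."""
    queue = deque([(node, 0)])
    visited = set()
    while queue:
        current, depth = queue.popleft()
        for child in graph.get(current, []):
            if child == node:
                return depth + 1
            if child not in visited:
                visited.add(child)
                queue.append((child, depth + 1))
    return None


lengths = {node: shortest_cycle_length(TOY, node) for node in TOY}
cyclic = [length for length in lengths.values() if length is not None]
print(lengths)
print(len(cyclic) / len(TOY), sum(cyclic) / len(cyclic))
```

The last line prints the share of nodes admitting a cycle and the mean length
of their shortest cycles, the two quantities compared in the paragraph above.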

Secondly, although we have seen examples of connections from this module to the
rest of the XML-TEI, especially to the *core* module (see the case of the
`<ref/>` element above), the *dictionaries* module appears somewhat isolated
from important structural elements like `<head/>` or `<div/>`. Indeed, computing
all the paths of length less than or equal to 5 from either `<entry/>` or
`<sense/>` elements to `<div/>` by a systematic traversal of the graph yields
exclusively paths (respectively 9042 and 39093 of them) containing either a
`<floatingText/>` or an `<app/>` element. The first one, as its name aptly
suggests, is used to encode text that doesn't quite fit the regular flow of the
document, as for example in the context of an embedded narrative. Both examples
displayed in the online documentation feature a `<body/>` as direct child of
`<floatingText/>`, neatly separating its content as independent. The purpose of
the second one, although its name (short for apparatus) is less clear, is to
wrap together several versions of the same excerpt, for instance when there are
several possible readings of an unclear group of words in a manuscript, or when
the encoder is trying to compile a single version of a piece of work from
several sources which disagree over some passage. In both cases, it appears
obvious that it is not something that is expected to occur naturally in the
course of an article in the general case.

Thus, despite a rather dense internal connectivity, the *dictionaries* module
fails to provide encoders with a device to represent recursively nesting
structures like `<div/>`.

# A new standard?

Studying the content of *La Grande Encyclopédie* and considering several
articles in particular, we identify structures which are specific to
encyclopedias and not compatible with the *dictionaries* module presented above.
We hence conclude that this module is not able to encode arbitrary encyclopedic
content and propose a new fully TEI-compliant encoding scheme remaining outside
of it.

## Idiosyncrasies of encyclopedias

Browsing through the pages of an encyclopedia reveals a certain number of
noticeable differences with dictionaries. It is difficult to draw up a precise
list because editorial choices may vary greatly between encyclopedias, but we
discuss some of the most obvious.

### Organised knowledge

The first immediately visible feature that sets encyclopedias apart from
dictionaries, found in the *Encyclopédie* as well as in *La Grande
Encyclopédie*, is the presence of subject indicators at the beginning of
articles, right after the headword, which organise them into a domain
classification system. Those generally cover a broad range of subjects from
scientific disciplines to literature, extending to political subjects and law.

No element in the *dictionaries* module is explicitly designed for the purpose
of encoding these indicators. As we have seen above, the element set is geared
towards the words themselves instead of the concepts they represent. The closest
tool for what we need is found in the `<usg/>` element used with a specific
`type` attribute set to `dom` for "domain". Indeed several examples from the
documentation encode subject indicators very similar to the ones found in
encyclopedias within this element, but the match is not perfect either: all
appear within one of multiple senses, as if to clarify each context in which the
word can be used, as expected from the element's name, "usage". In encyclopedias,
if the domain indicator does in certain cases help to distinguish between
several entries sharing the same headword, the concept itself has evolved beyond
this mere distinction. Looking back at the *Encyclopédie*, the adjective
*raisonné* in the rest of the title directly introduces a notion of structure
that links back to the "Systême figuré des connoissances humaines". The authors
devised a branching system to classify all knowledge, and the indicator at
the beginning of articles, more than a tool to clear up possible ambiguities,
also points the reader to the correct place in this mind map.

!["Systême figuré des connoissances humaines", the taxonomy at the heart of the Encyclopédie](ressources/arbre.png){width=200px}

The situation regarding subject indicators is hardly better outside of the
module. The `<domain/>` element, despite its name, belongs exclusively in the
header of a document and focuses on the social context of the text, not on the
knowledge area it covers. The `<interp/>` element could seem relevant if one
considers that subject indicators, placed at the beginning of articles, orient
the reader's frame of mind towards a particular subject and thus label an
interpretation to give to a context. However, the documentation clearly presents
it as a tool for annotators of a document: its text content is not part of the
original document but some additional result of an analysis performed in the
context of the encoding, used only through references in XML attributes.

This point, although not the most concerning, still remains the hardest to
address but all things considered the `<usg/>` element stands out as the most
relevant.

### The notion of meaning

### Nested structures

### Candidates in the *dictionaries* module

- `<sense/>`
- `<entryFree/>`
- `<note/>`
- `<dictScrap/>` / `<floatingText/>`

## Encoding within the *core* module

The above remarks explain why the *dictionaries* module by itself is unable to
represent encyclopedias, where discourse with nested structures of arbitrary
depth can occur. Since the *core* module of course accommodates these structures
by means of the `<div/>`, `<head/>` and `<p/>` elements, we devise an encoding
scheme based on them, which we recommend to other projects aiming at
representing encyclopedias.

To remain consistent with the above remarks we will only concern ourselves with
what happens at the level of each article, right under the `<body/>` element.
Everything related to metadata happens as expected in the file's `<teiHeader/>`
which is well enough equipped to handle them. In order to present our scheme
throughout the following section we will be progressively encoding a reference
article, "Cathète" from tome 9.

![La Grande Encyclopédie, tome 9, article "Cathète"](ressources/cathète_t9.png)

### The scheme

Each article is represented by a `<div/>`. We suggest setting an `xml:id`
attribute on it whose value is the head word of the entry, normalised to
lowercase, stripped of spaces and with all non-alphanumerical characters
replaced by a dash `'-'` to avoid issues with the XML encoding. To make this
value unique, a number representing the rank of the article among the various
occurrences of the head word is appended, even when there is only one, for the
sake of regularity.

![](snippets/cathète_0.png)
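
The identifier rule above can be sketched as a small function. Several details
are assumptions made for the example and are not stated in the specification:
"stripping spaces" is read as removing leading and trailing whitespace, the
rank is joined with a dash, and ranks start at 0. The name `article_id` is
hypothetical.

```python
def article_id(headword, rank=0):
    """Derive an xml:id value from a head word and its rank.

    Assumptions of this sketch: surrounding whitespace is removed, ranks
    start at 0 and are appended after a dash.
    """
    normalised = headword.strip().lower()
    # Every character that is neither a letter nor a digit becomes a dash.
    slug = "".join(char if char.isalnum() else "-" for char in normalised)
    return f"{slug}-{rank}"


print(article_id("Cathète"))
```

Accented letters are kept as they are, since `str.isalnum` treats them as
letters and they are allowed in XML names.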

Inside this element should be a `<head/>` enclosing the headword of the article.
The usual sub-`<hi/>` elements are available within `<head/>` if the headword is
highlighted by any special typographic means such as bold, small capitals, etc.
This element should also contain the optional subject indicator in
parentheses that sometimes accompanies the headword, with the appropriate standard
elements like `<persName/>` occurring in biographical articles or `<interp/>`
with a `theme` attribute if the article is given a specific domain in a
taxonomy.

![](snippets/cathète_1.png)

We propose to then wrap each different meaning in a separate `<div/>` with the
`type` attribute set to `sense` to refer to the `<sense/>` element that would
have been used within the *dictionaries* module. Each sense should be numbered
with the `n` attribute.

![](snippets/cathète_2.png)

In addition, each line within the article must start with a `<lb/>` to mark its
beginning, including before the `<head/>` element, which, although a surprising
setup, underlines the fact that in the dense layout of encyclopedias, the
carriage return separating two articles is meaningful. Stating each new line
explicitly keeps enough information to reconstruct a faithful facsimile but it
also has the advantage of highlighting the fact that even though the definition
is cut from the headword by being in a separate XML element, they still occur on
the same line, which is a typographic choice usually made both in encyclopedias
and dictionaries where space is at a premium.

Finally, the various sections and sub-sections occurring within the article body
may be nested as usual with `<div/>` and sub-`<div/>`s, filled with `<p/>` for
paragraphs which can each be titled with `<head/>` elements local to each
`<div/>`.

![](snippets/cathète_3.png)
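
The skeleton described in this section can also be generated programmatically.
The sketch below uses Python's standard `xml.etree.ElementTree`; the paragraph
texts are placeholders rather than the content of the article, line breaks
inside paragraphs and the TEI namespace are left out for brevity, numbering
senses from 1 is an assumption, and `build_article` is a hypothetical helper
unrelated to the reference implementation.

```python
import xml.etree.ElementTree as ET

# Qualified name that ElementTree serialises as the xml:id attribute.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def build_article(identifier, headword, senses):
    """Build the <div/> of an article; senses is a list of paragraph lists."""
    article = ET.Element("div", {XML_ID: identifier})
    ET.SubElement(article, "lb")  # the line break preceding the headword
    ET.SubElement(article, "head").text = headword
    for number, paragraphs in enumerate(senses, start=1):
        sense = ET.SubElement(article, "div", {"type": "sense", "n": str(number)})
        for text in paragraphs:
            ET.SubElement(sense, "p").text = text
    return article


# The paragraph texts below are placeholders, not the content of the article.
article = build_article("cathète-0", "CATHÈTE", [["(first sense)"], ["(second sense)"]])
print(ET.tostring(article, encoding="unicode"))
```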

But a typical page of an encyclopedia also features peritext elements, giving
information to the reader about the current page number along with the headwords
of the first and last articles appearing on the page.

Depending

Moreover, the layout is
often 

### Currently implemented

The reference implementation for this encoding scheme is the program
soprano[^soprano] developed within the scope of project DISCO-LGE to
automatically identify individual articles in the flow of raw text from the
column and to encode them into XML-TEI files. Though this software has already
been used to produce the first TEI version of *La Grande Encyclopédie*, it
doesn't yet follow the above specification perfectly. Here is for instance the
encoded version of article "Cathète" it currently produces:

[^soprano]:
  [https://gitlab.huma-num.fr/disco-lge/soprano](https://gitlab.huma-num.fr/disco-lge/soprano)

![](snippets/cathète_current.png)

The headword detection system is not yet able to capture the subject indicators,
so they appear outside of the `<head/>` element. Likewise, since the detection of
titles at the beginning of each section isn't complete, no structure analysis is
performed on the content of the article.

## The constraints of automated processing

## Comparison to other approaches

### Bend the semantics

### Custom schema

# Conclusion

Despite long discussions and interesting proposals, each with strong arguments
both in favour of and against them, no consensus could be reached. For one
thing, each project has specific constraints depending on the type of study it
intends to carry out, the volume of text, or the condition of the physical
source documents.

Beyond the technical need for encodings generic enough to share the corpora
within the community and compare the results across various projects, the above
results highlight one aspect of a well-known fact within the lexicography
community: encyclopedias and dictionaries differ on several key aspects.