title: The specificities of encoding encyclopedias: towards a new standard ?
author: Alice BRENON
numbersections: True
header-includes:
\usepackage{textalpha}
\usepackage{hyperref}
\hypersetup{
colorlinks,
urlcolor = blue
}
Dictionaries and encyclopedias
In common parlance, the terms "dictionaries" and "encyclopedias" are used as near synonyms to refer to books compiling vast amounts of knowledge into lists of definitions ordered alphabetically. Their similarity is even visible in the way they are coordinated in the full title of the Encyclopédie ou Dictionnaire raisonné des sciences des arts et des métiers published by Diderot and d'Alembert between 1751 and 1772 and which is probably the most famous work of the genre and a symbol of the Age of Enlightenment.
"Encyclopedia"
If the word "encyclopedia" is nowadays part of our vocabulary, it was much more unusual and in fact controversial when Diderot and d'Alembert decided to use it in the title of their book.
The definition given by Furetière in his Dictionnaire Universel in 1690 is still close to its Greek etymology: a "ring of all knowledge", from κύκλος, "circle", and παιδεία, "knowledge". This meaning is the one used for instance by Rabelais in Pantagruel, when he has Thaumaste declare that Panurge opened to him "le vray puys et abisme de Encyclopedie" ("the true well and abyss of Encyclopedia"). At the time the word still mostly refers to the abstract concept of mastering all knowledge at once. Furetière adds that it is a quality one is unlikely to possess, and even seems to condemn its pursuit as a form of hubris: "C'est une témérité à un homme de vouloir posséder l'Encyclopédie" ("it is reckless for a man to want to possess Encyclopedia").
Beyond this moral reproach, the concept that pleased Rabelais was somewhat dated at the end of the 17\textsuperscript{th} century and attacked in the Dictionnaire Universel François et Latin, commonly referred to as the Dictionnaire de Trevoux, as utterly "burlesque" ("parodic"). The entry for "Encyclopédie" remained unchanged in the four editions issued between 1721 and 1752, mocking the use of the word and discouraging its readers from pursuing it. To that end, it quotes a poem by Pibrac encouraging people to specialize in only one discipline lest they fail to reach perfection, an argument that resembles the saying "Jack of all trades, master of none". It is all the more interesting that the definition remains unaltered until 1752, one year after the publication of the first volume of the Encyclopédie. The Jesuits who edited the Dictionnaire de Trevoux frowned upon the project of the Encyclopédie, which they managed to get banned the same year by the Council of State on the charge of attempting to destroy the royal authority, inspiring rebellion and corrupting morality in general. There is much more at stake than words here, but the attempt to deprecate the word itself is part of their fight against the philosophers of the Enlightenment.
The attacks do not remain ignored by Diderot, who starts the very definition of the word "Encyclopédie" in the Encyclopédie itself with a strong rebuttal. He directly dismisses the concerns expressed in the Dictionnaire de Trevoux as mere self-doubt that their authors shouldn't generalize to mankind, then leaves the main point to a Latin quote by Chancellor Bacon, who argues that a collaborative work can achieve much more than any talented man could: what may not be within reach of a single man within a single lifetime may be achieved by a common effort throughout generations.
History hints that Diderot's opponents took his defense of the feasibility of the project quite seriously, considering that they got the Encyclopédie's privileges revoked again six years after its publication was resumed and that its remaining volumes had to be published illegally until its end in 1772.
However, in their last edition in 1771 the authors of the Dictionnaire de Trevoux had no choice but to acknowledge the success of the encyclopedic projects of the 18\textsuperscript{th} century. In this version, the definition was entirely reworked, mildly stating that good encyclopedias are difficult to make because of the amount of knowledge necessary and the work needed to keep up with scientific progress, instead of calling the effort a parody. It credits Chambers's Cyclopædia for being a decent attempt before referring anonymously though quite explicitly to Diderot and d'Alembert's project by naming the collective "Une Société de gens de Lettres" and writing that it started in 1751. Even more importantly, two new entries were added after it: one for the adjective "encyclopédique" and another one for the noun "encyclopédiste", silently admitting how the project had changed its time and the relation to knowledge.
A different approach
If encyclopedias are thus historically more recent than dictionaries, they also depart from the latter in their approach. The purpose of dictionaries from their origin is to collect words, to make an exhaustive inventory of the terms used in a domain or in a language in order to associate a definition with them, be it a translation in another language for a foreign language dictionary or a phrase explaining it for other dictionaries. As such, they are collections of signs and remain within the linguistic level of things. Entries in a dictionary often feature information such as the part of speech, the pronunciation or the etymology of the word they define.
The entry for "Dictionnaire" in the Encyclopédie distinguishes between three types of dictionaries: one to define words, the second to define facts and the last one to define things, corresponding to the distinction between language, history, and science and arts dictionaries although according to its author, d'Alembert, each has to be of more than just one kind to be really good. In the full title of the Encyclopédie, the concept is more or less equated by means of the coordinating conjunction "ou" to a Dictionnaire raisonné, "reasoned dictionary", introducing the idea of encyclopedias as dictionaries with additional structure and a philosophical dimension.
Back to the "Encyclopédie" article we read that a dictionary remaining strictly at the language level, a vocabulary, can be seen as the empty frame required for an encyclopedic dictionary that will fill it with additional depth. Given how d'Alembert insists on the importance of brevity for a clear definition in the "Dictionnaire de Langues" entry, it is clear that for the encyclopédistes, encyclopedias aren't superior to dictionaries but really depart from them in terms of purpose.
La Grande Encyclopédie
After emerging from dictionaries during the 18\textsuperscript{th} century, encyclopedias became a fertile subgenre in themselves which kept evolving over the following centuries. One of the offspring of the Encyclopédie from the 19\textsuperscript{th} century is entitled La Grande Encyclopédie, Inventaire raisonné des Sciences, des Lettres et des Arts par une Société de savants et de gens de lettres and was published between 1885 and 1902 by an organized team of over two hundred specialists divided into eleven sections. The aim of the CollEx-Persée project DISCO-LGE was to digitize La Grande Encyclopédie and make it available to the scientific community as well as the general public. A previous version was partially available on Gallica but lacked in quality, and its text had not been fully extracted from the pictures with an Optical Character Recognition (OCR) system.
The dictionaries TEI module
Data cannot be useful to other, future scientific projects unless it is interoperable and reusable. These are the last two key aspects of the FAIR principles (findability, accessibility, interoperability and reusability) which we strive to follow as a guideline for efficient and quality research. This entails using standard formats, and a standard for encoding historical texts in the context of digital humanities is XML-TEI, collectively developed by the Text Encoding Initiative consortium. It consists of a set of technical specifications in the form of XML schemas, along with a range of tools to handle them and training resources.
The XML-TEI standard has a modular structure consisting of optional parts, each covering specific needs such as the physical features of a source document, the transcription of oral corpora or particular requirements for textual domains like poetry, or, in our case, dictionaries.
In what follows, we need to name and manipulate XML elements. We choose to
represent them in a monospace font, in the standard XML autoclosing form within
angle brackets and with a slash following the element name, like <div/> for a
div element. We do not mean by this notation that they cannot contain raw text
or other XML elements, merely that we are referring to such an element, with all
the subtree that spans from it in the context of a concrete document instance or
as an empty structure when we are considering the abstract element and the rules
that govern its use in relation to other elements or its attributes.
Content
A graph problem
The XML-TEI specification contains 590 elements, which are each documented on the consortium's website in the online reference pages. With an average of almost 80 possible child elements (79.91) within any given element, manually browsing such a massive network can prove quite difficult as the number of combinations sharply increases with each step. We transform the problem by representing this network as a directed graph, using elements of XML-TEI as nodes and placing an edge whenever the destination node may be contained within the source node according to the schema.
By iterating several times the operation of moving on that graph along one edge, that is, by considering the transitive closure of the relation "be connected by an edge" we define inclusion paths which allow us to explore which elements may be nested under one another. The nodes visited along the way represent the intermediate XML elements to construct a valid XML tree according to the TEI schema. Given the top-down semantics of those trees, we call the length of an inclusion path its depth.
Using classical, well-known methods such as Dijkstra's algorithm (Dijkstra, 1959) allows us to explore the shortest inclusion paths that exist between elements. Though particular caution should be applied because there is no guarantee that the shortest path is meaningful in general, it at least provides us with an efficient way to check whether a given element may or may not be nested at all under another one and gives an order of magnitude of the path length to expect. Of course the accuracy of this heuristic decreases as the length of the intended, meaningful path between two nodes increases, but the general graph formalism enables us to extend the results produced by the shortest-path approach and consider element combinations rationally and exhaustively by algorithmic means should the need arise.
For instance, it lets one find that <pos/> may not be directly included within
<entry/> elements to give the part-of-speech of the word that an article
defines, and that the correct way to do so is through a <form/> or a <gramGrp/>.
On the other hand, trying to discover the shortest inclusion path to <pos/> from
the <TEI/> root of the document yields a <standOff/>, an element dedicated to
storing contextual data that accompanies but is not part of the text, not unlike
an annex, and largely unrelated to the context of encoding an encyclopedia. A
last relevant example of the use of these methods can be given by querying the
shortest inclusion path of a <pos/> under the <body/> of the document: it yields
an inclusion directly through <entryFree/> (with an inclusion path of length 2),
which unlike <entry/> accepts it as a direct child node. This is possibly not
what we want, depending on the regularity of the articles we are encoding and on
whether other grammatical information such as <case/> or <gen/> occurs to
justify the use of <gramGrp/>, but searching exhaustively for paths up to length
3 returns as expected the path through <entry/>, among others. Overall, we get a
good general idea: <pos/> does not need to be nested very deep, it can appear
quite near the "surface" of article entries.
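The two queries used in this example, the shortest inclusion path and the exhaustive search up to a given depth, can be sketched as follows. This is only a minimal illustration: the `INCLUSION` dictionary is a hand-written excerpt limited to the edges mentioned in the text, not the graph extracted from the schema, and the function names are hypothetical. Since every edge has the same weight, a breadth-first search is used in place of Dijkstra's algorithm and finds the same shortest paths.

```python
from collections import deque

# Hand-written excerpt of the inclusion graph (an assumption made for this
# sketch): an edge parent -> child means that `child` may appear directly
# inside `parent`. The real graph has 590 nodes.
INCLUSION = {
    "body": ["entry", "entryFree", "superEntry"],
    "superEntry": ["entry"],
    "entryFree": ["entry", "pos"],
    "entry": ["form", "gramGrp", "sense"],
    "form": ["pos"],
    "gramGrp": ["pos"],
}

def shortest_inclusion_path(graph, source, target):
    """Breadth-first search for a shortest inclusion path, or None."""
    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        for child in graph.get(path[-1], []):
            if child == target:
                return path + [child]
            if child not in visited:
                visited.add(child)
                queue.append(path + [child])
    return None

def inclusion_paths(graph, source, target, max_depth):
    """Yield every inclusion path from source to target of depth <= max_depth."""
    def extend(path):
        if len(path) > 1 and path[-1] == target:
            yield path
            return
        if len(path) - 1 < max_depth:
            for child in graph.get(path[-1], []):
                yield from extend(path + [child])
    yield from extend([source])

print(shortest_inclusion_path(INCLUSION, "body", "pos"))
for path in inclusion_paths(INCLUSION, "body", "pos", 3):
    print(" > ".join(path))
```

On this excerpt the shortest path goes through entryFree, while the bounded search also returns the two paths through entry, which mirrors the observation made above.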
The <entry/> element
The central element of the dictionaries module is the <entry/> element, meant
to encode one single entry in a dictionary, that is to say a headword associated
with its definition. It is the natural way in from the <body/> element to the
dictionaries module: indeed, although <body/> may also contain <entryFree/> or
<superEntry/> elements, the former is a relaxed version of <entry/> while the
latter is a device to group several related entries together. Both can contain
an <entry/> directly while no obvious inclusion exists the other way around:
most (> 96.2%) of the inclusion paths of "reasonable" depth (which we define as
strictly less than 5, that is twice the average shortest depth between any two
nodes) include either <figure/> or <castList/>, two very specific elements which
should not need to appear in an article in general, showing that the purpose of
<entry/> is not to contain an <entryFree/> or <superEntry/>. Hence, not only the
semantics conveyed by the documentation but also the structure of the elements
graph point to <entry/> as the natural top-most element for an article. This
somewhat contrived example is meant to further demonstrate the application of a
graph-centered approach to understanding the inner workings of the XML-TEI
schema.
Information about the headword itself
Once a block for an article is created, it may contain elements useful to represent features such as
- its written and spoken forms: <form/>
- a group of grammatical information: <gramGrp/>, that may itself contain, as we've seen above, <case/>, <gen/>, <number/> or <per/> to describe the form itself for instance, but also information about the categories it belongs to like <iType/> for its inflection class in languages with a declension system or <pos/> for its part-of-speech
- its etymology: <etym/>
- its variants if there is a different spelling in a variety of the language or if it has changed through time: <usg/> (though it is not its only purpose)
All these are examples and by no means an exhaustive list; the complete set provides the encoder with a toolbox to describe all the information related to the form under which the entry is found and seems general enough to accommodate the structure of any book indexing entries by words.
Cross-references
A common feature shared by dictionaries and encyclopedias is the ability to
connect entries together by using a word or short phrase as the link, referring
the reader to the related concept. These links are known as cross-references and
can appear either when the definition of a term is adjacent to another one or to
catch alternative spellings where some readers might expect the word to appear
and redirect them to the form chosen as the reference. In XML-TEI, this is done
with the <xr/> element. It usually contains the whole phrase performing the
redirection, with an imperative locution like "please see […]".
The "active" part of the cross-reference, that is the very word within the
<xr/> that is considered to be the link or, to make a modern-day HTML metaphor,
the region that would be clickable, is represented by a <ref/> element. Though
it is not specific to the dictionaries module, we include it in this description
of the toolbox because it is particularly useful in the context of dictionaries.
This element may have a target attribute which points to the other resource to
be accessed by the interested reader.
Content
The remaining part of entries is also usually the largest and represents the content associated to the headword by the entry. In a dictionary, that is its meaning.
The <sense/> element is a valid child for <entry/> and groups together a
definition of the term with <def/>, usage examples with <usg/> (another use of
this versatile element) and other high-level information such as translations
in other languages. Both <def/> and <usg/> elements may appear directly under
the <entry/>.
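To make this toolbox more concrete, the sketch below assembles a dictionary entry from the elements described so far using Python's standard xml.etree.ElementTree module. It is only an illustration under several assumptions: the function name and its parameters are hypothetical, the texts are placeholders rather than content from an actual dictionary, and the <orth/> child of <form/> is not discussed above but is assumed here to hold the written form.

```python
import xml.etree.ElementTree as ET

def build_entry(headword, part_of_speech, definition, see_also=None):
    """Assemble a minimal <entry/> (hypothetical helper, for illustration)."""
    entry = ET.Element("entry")

    # Information about the headword itself.
    form = ET.SubElement(entry, "form")
    ET.SubElement(form, "orth").text = headword  # <orth/> is an assumption
    gram_grp = ET.SubElement(entry, "gramGrp")
    ET.SubElement(gram_grp, "pos").text = part_of_speech

    # Content: one sense wrapping one definition.
    sense = ET.SubElement(entry, "sense")
    ET.SubElement(sense, "def").text = definition

    # Optional cross-reference: the whole phrase goes in <xr/>, the "active"
    # part in a <ref/> whose target attribute points to the other entry.
    if see_also is not None:
        label, target = see_also
        xr = ET.SubElement(entry, "xr")
        xr.text = "please see "
        ET.SubElement(xr, "ref", {"target": target}).text = label
    return entry

entry = build_entry(
    "headword", "adj", "placeholder definition",
    see_also=("other headword", "#other-headword"),
)
print(ET.tostring(entry, encoding="unicode"))
```

The resulting tree is not validated against the TEI schema; a real project would check it with the schema files distributed by the consortium.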
Structural remarks
Before concluding this description of the dictionaries module from the perspective of someone trying to concretely encode a particular dictionary or encyclopedia, we make use of the graph approach again to highlight some of its aspects in terms of inclusion structure.
First, it is remarkable that all elements in the dictionaries module have a
cyclic inclusion path, that is to say, there is an inclusion path from each
element of this module to itself. Although having such a cycle is a widespread
property in the remainder of XML-TEI elements, shared by 73.8% of them (411 out
of the 557 elements in the other modules), all 33 elements of the dictionaries
module having one is far above this average. In addition, the cycles appear to
be rather short, with an average length of 2.00 versus 2.50 in the rest of the
population. This observation is all the more surprising considering the fact
that the dictionaries module contains short "leaf" elements like <pos/> which
should not obviously need to admit cycles since one rather expects them to
contain only one word, like <pos>adj</pos> in the example given in the official
documentation. Among those (shortest) cycles, 20 include the <cit/> element,
made to group quotations with a bibliographic reference to their source, which
should clearly be unnecessary to encode an article in the general case.
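Figures of this kind (the share of elements admitting a cycle and the average length of their shortest cycle) can be obtained with a breadth-first search started from each element. The sketch below shows the idea on a four-node toy graph; its edges are illustrative assumptions written by hand, not data extracted from the schema, and the function names are hypothetical.

```python
from collections import deque

def shortest_cycle_length(graph, element):
    """Number of edges of the shortest path from element back to itself."""
    distance = {}
    queue = deque()
    for child in graph.get(element, []):
        if child == element:
            return 1
        if child not in distance:
            distance[child] = 1
            queue.append(child)
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child == element:
                return distance[node] + 1
            if child not in distance:
                distance[child] = distance[node] + 1
                queue.append(child)
    return None  # no inclusion path leads back to the element

def cycle_statistics(graph):
    """Return (elements with a cycle, total elements, mean cycle length)."""
    elements = set(graph) | {c for children in graph.values() for c in children}
    lengths = [
        length
        for length in (shortest_cycle_length(graph, e) for e in elements)
        if length is not None
    ]
    mean = sum(lengths) / len(lengths) if lengths else None
    return len(lengths), len(elements), mean

# Toy graph, assumed for the sake of the example only.
TOY = {
    "sense": ["sense", "cit"],
    "cit": ["pos"],
    "pos": ["cit", "lb"],
    "lb": [],
}
print(cycle_statistics(TOY))
```

Here three of the four toy elements admit a cycle, two of them only through cit, which is the kind of pattern reported above for the dictionaries module.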
Secondly, although we have seen examples of connections from this module to the
rest of the XML-TEI, especially to the core module (see the case of the <ref/>
element above), the dictionaries module appears somewhat isolated from important
structural elements like <head/> or <div/>. Indeed, computing all the paths from
either <entry/> or <sense/> elements to the latter of length less than or equal
to 5 by a systematic traversal of the graph yields exclusively paths
(respectively 9042 and 39093 of them) containing either a <floatingText/> or an
<app/> element. The first one, as its name aptly suggests, is used to encode
text that doesn't quite fit the regular flow of the document, as for example in
the context of an embedded narrative. Both examples displayed in the online
documentation feature a <body/> as direct child of <floatingText/>, neatly
separating its content as independent. The purpose of the second one, although
its name (short for apparatus) is less clear, is to wrap together several
versions of the same excerpt, for instance when there are several possible
readings of an unclear group of words in a manuscript, or when the encoder is
trying to compile a single version of a piece of work from several sources which
disagree over some passage. In both cases, it appears obvious that it is not
something that is expected to occur naturally in the course of an article in the
general case.
Thus, despite a rather dense internal connectivity, the dictionaries module
fails to provide encoders with a device to represent recursively nesting
structures like <div/>.
A new standard ?
Studying the content of La Grande Encyclopédie and considering several articles in particular, we identify structures specific to encyclopedias which are not covered by the dictionaries module presented above. We hence conclude that this module is not able to encode arbitrary encyclopedic content and propose a new encoding scheme.
Idiosyncrasies of encyclopedias
The notion of meaning
Nested structures
Candidates in the dictionaries module
<sense/>
<entryFree/>
<note/>
<dictScrap/>/<floatingText/>
Encoding within the core module
The above remarks explain why the dictionaries module by itself is unable to
represent encyclopedias, where discourse with nested structures of arbitrary
depth can occur. Since the core module of course accommodates these structures
by means of the <div/>, <head/> and <p/> elements, we devise an encoding scheme
using them which we recommend for other projects aiming at representing
encyclopedias.
To remain consistent with the above remarks we will only concern ourselves with
what happens at the level of each article, right under the <body/> element.
Everything related to metadata happens as expected in the file's <teiHeader/>,
which is well enough equipped to handle them. In order to present our scheme
throughout the following section we will be progressively encoding a reference
article, "Cathète" from tome 9.
The scheme
Each article is represented by a <div/>. We suggest setting an xml:id attribute
on it whose value is the headword of the entry, normalized to lowercase,
stripping spaces and replacing all non-alphanumerical characters by a dash '-'
to avoid issues with the XML encoding. This value is made unique by suffixing a
number representing its rank among the various occurrences of the headword, even
when there is only one, for the sake of regularity.
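The identifier rule above can be sketched as a small function. Several details are assumptions made for this illustration and are not fixed by the scheme: the names are hypothetical, "stripping spaces" is read as trimming the surrounding whitespace (inner spaces then become dashes like any other non-alphanumerical character), the rank is appended after a dash, and the headword is normalized to Unicode NFC first so that accented letters count as single alphanumerical characters.

```python
import unicodedata
from collections import defaultdict

def make_article_id_factory():
    """Return a function turning successive headwords into unique identifiers."""
    occurrences = defaultdict(int)  # how many times each normalized form was seen

    def article_id(headword):
        # Compose accents (assumption), trim and lowercase the headword.
        text = unicodedata.normalize("NFC", headword).strip().lower()
        # Replace every non-alphanumerical character by a dash.
        normalized = "".join(c if c.isalnum() else "-" for c in text)
        # Always suffix the rank, even for the first occurrence.
        occurrences[normalized] += 1
        return f"{normalized}-{occurrences[normalized]}"

    return article_id

article_id = make_article_id_factory()
print(article_id("Cathète"))
print(article_id("CATHÈTE"))
```

A headword starting with a digit or a punctuation mark would still produce a value that is not a valid XML name, a case this sketch does not handle.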
Inside this element should be a <head/> enclosing the headword of the article.
The usual sub-<hi/> elements are available within <head/> if the headword is
highlighted by any special typographic means such as bold, small capitals, etc.
This element should also contain the optional subject indicator within
parentheses that sometimes accompanies the headword, with the appropriate
standard elements like <persName/> occurring in biographical articles or
<interp/> with a theme attribute if the article is given a specific domain in a
taxonomy.
We propose to then wrap each different meaning in a separate <div/> with the
type attribute set to sense to refer to the <sense/> element that would have
been used within the dictionaries module. Each sense should be numbered with the
n attribute.
In addition, each line within the article must start with a <lb/> to mark its
beginning, including before the <head/> element, which, although a surprising
setup, underlines the fact that in the dense layout of encyclopedias, the
carriage return separating two articles is meaningful. Stating each new line
explicitly keeps enough information to reconstruct a faithful facsimile but it
also has the advantage of highlighting the fact that even though the definition
is cut from the headword by being in a separate XML element, they still occur on
the same line, which is a typographic choice usually made both in encyclopedias
and dictionaries where space is at a premium.
Finally, the various sections and sub-sections occurring within the article body
may be nested as usual with <div/> and sub-<div/>s, filled with <p/> for
paragraphs which can each be titled with <head/> elements local to each <div/>.
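The overall shape of an article under this scheme can be sketched with Python's standard xml.etree.ElementTree module. This is a simplified illustration rather than the reference implementation: the function name and the identifier are hypothetical, the line texts are placeholders and not the actual content of "Cathète", subject indicators and nested sections are left out, and the first line of the first sense is assumed to continue the line of the headword.

```python
import xml.etree.ElementTree as ET

# ElementTree's notation for xml:id (the xml prefix is predefined in XML).
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

def encode_article(identifier, headword, senses):
    """Build the <div/> of one article.

    `senses` is a list of senses, each given as the list of its printed lines.
    """
    article = ET.Element("div", {XML_ID: identifier})
    ET.SubElement(article, "lb")  # the line on which the article starts
    ET.SubElement(article, "head").text = headword
    for rank, lines in enumerate(senses, start=1):
        sense = ET.SubElement(article, "div", {"type": "sense", "n": str(rank)})
        paragraph = ET.SubElement(sense, "p")
        for index, line in enumerate(lines):
            if rank == 1 and index == 0:
                # Same printed line as the headword: no <lb/> (assumption).
                paragraph.text = line
            else:
                # Every other line starts with an explicit <lb/>.
                ET.SubElement(paragraph, "lb").tail = line
    return article

article = encode_article(
    "cathete-1",  # hypothetical identifier
    "Cathète",
    [["[first line]", "[second line]"], ["[line of another sense]"]],
)
print(ET.tostring(article, encoding="unicode"))
```

Nested sections would be added as further <div/> elements carrying their own <head/>, following the same pattern.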
But a typical page of an encyclopedia also features peritext elements, giving information to the reader about the current page number along with the headwords of the first and last articles appearing on the page.
Depending
Moreover, the layout is often
Currently implemented
The reference implementation for this encoding scheme is the program soprano (https://gitlab.huma-num.fr/disco-lge/soprano), developed within the scope of project DISCO-LGE to automatically identify individual articles in the flow of raw text from the columns and to encode them into XML-TEI files. Though this software has already been used to produce the first TEI version of La Grande Encyclopédie, it doesn't yet follow the above specification perfectly. Here is for instance the encoded version of article "Cathète" it currently produces:
The headword detection system is not able to capture the subject indicators yet
so they appear outside of the <head/> element. Likewise, since the detection of
titles at the beginning of each section isn't complete, no structure analysis is
performed on the content of the article.
The constraints of automated processing
Comparison to other approaches
Bend the semantics
Custom schema
Conclusion
Despite long discussions and interesting proposals, each with strong arguments both in favour of and against them, no consensus could be reached. For one thing, each project has specific constraints depending on the type of study it intends to carry out, the volume of text, or the condition of the physical source documents.
Beyond the technical need for encodings generic enough to share the corpora within the community and compare the results across various projects, the above results highlight one aspect of a well-known fact within the community of lexicography: encyclopedias and dictionaries differ on several key aspects.