
title: The specificities of encoding encyclopedias: towards a new standard?
author: Alice BRENON
numbersections: True
header-includes:
	\usepackage{textalpha}
	\usepackage{hyperref}
	\hypersetup{
		colorlinks,
		urlcolor = blue
	}

Dictionaries and encyclopedias

In common parlance, the terms "dictionary" and "encyclopedia" are used as near synonyms to refer to books compiling vast amounts of knowledge into alphabetically ordered lists of definitions. Their similarity is even visible in the way they are coordinated in the full title of the Encyclopédie ou Dictionnaire raisonné des sciences, des arts et des métiers, published by Diderot and d'Alembert between 1751 and 1772, which is probably the most famous work of the genre and a symbol of the Age of Enlightenment.

"Encyclopedia"

Although the word "encyclopedia" is nowadays part of our vocabulary, it was much more unusual, and in fact controversial, when Diderot and d'Alembert decided to use it in the title of their book.

The definition given by Furetière in his Dictionnaire Universel in 1690 is still close to its Greek etymology: a "ring of all knowledge", from κύκλος, "circle", and παιδεία, "education". This meaning is the one used for instance by Rabelais in Pantagruel, when he has Thaumaste declare that Panurge opened to him "le vray puys et abisme de Encyclopedie" ("the true well and abyss of Encyclopedia"). At the time, the word still mostly referred to the abstract concept of mastering all knowledge at once. Furetière adds that it is a quality one is unlikely to possess, and even seems to condemn its pursuit as a form of hubris: "C'est une témérité à un homme de vouloir posséder l'Encyclopédie" ("it is a recklessness for a man to want to possess Encyclopedia").

Beyond this moral reproach, the concept that pleased Rabelais was somewhat dated by the end of the 17\textsuperscript{th} century, and it was attacked in the Dictionnaire Universel François et Latin, commonly referred to as the Dictionnaire de Trévoux, as utterly "burlesque" ("parodic"). The entry for "Encyclopédie" remained unchanged in the four editions issued between 1721 and 1752, mocking the use of the word and discouraging readers from pursuing it. To that end, it quotes a poem by Pibrac encouraging people to specialize in a single discipline lest they fail to reach perfection, an argument that resembles the saying "Jack of all trades, master of none". It is all the more interesting that the definition remained unaltered until 1752, one year after the publication of the first volume of the Encyclopédie. The Jesuits who edited the Dictionnaire de Trévoux frowned upon the project of the Encyclopédie, which they managed to get banned that same year by the Council of State on the charge of attempting to destroy royal authority, inspiring rebellion and corrupting morality in general. There is much more at stake than words here, but the attempt to deprecate the word itself is part of their fight against the philosophers of the Enlightenment.

The attacks did not go unnoticed by Diderot, who opens the very definition of the word "Encyclopédie" in the Encyclopédie itself with a strong rebuttal. He directly dismisses the concerns expressed in the Dictionnaire de Trévoux as mere self-doubt that its authors should not generalize to mankind, then leaves the main point to a Latin quote from Chancellor Bacon, who argues that a collaborative work can achieve much more than any talented man could: what may not be within reach of a single man within a single lifetime may be achieved by a common effort over generations.

History suggests that Diderot's opponents took his defense of the feasibility of the project quite seriously: they had the Encyclopédie's privileges revoked again six years after its publication was resumed, and its remaining volumes had to be published illegally until its completion in 1772.

However, in their last edition in 1771, the authors of the Dictionnaire de Trévoux had no choice but to acknowledge the success of the encyclopedic projects of the 18\textsuperscript{th} century. In this version, the definition was entirely reworked: instead of calling the effort a parody, it mildly states that good encyclopedias are difficult to make because of the amount of knowledge required and the work needed to keep up with scientific progress. It credits Chambers's Cyclopædia for being a decent attempt before referring anonymously, though quite explicitly, to Diderot and d'Alembert's project by naming the collective "Une Société de gens de Lettres" and writing that it started in 1751. Even more importantly, two new entries were added after it: one for the adjective "encyclopédique" and another one for the noun "encyclopédiste", silently admitting how the project had changed its time and the relation to knowledge.

A different approach

If encyclopedias are thus historically more recent than dictionaries, they also depart from the latter in their approach. The purpose of dictionaries from their origin is to collect words, to make an exhaustive inventory of the terms used in a domain or in a language in order to associate a definition with them, be it a translation into another language for a foreign language dictionary or a phrase explaining them for other dictionaries. As such, they are collections of signs and remain within the linguistic level of things. Entries in a dictionary often feature information such as the part of speech, the pronunciation or the etymology of the word they define.

The entry for "Dictionnaire" in the Encyclopédie distinguishes between three types of dictionaries: one to define words, a second to define facts and a last one to define things, corresponding to the distinction between language, history, and science and arts dictionaries, although according to its author, d'Alembert, each has to be of more than one kind to be really good. In the full title of the Encyclopédie, the concept is more or less equated, by means of the coordinating conjunction "ou", with a Dictionnaire raisonné ("reasoned dictionary"), introducing the idea of encyclopedias as dictionaries with additional structure and a philosophical dimension.

Returning to the "Encyclopédie" article, we read that a dictionary remaining strictly at the language level, a vocabulary, can be seen as the empty frame required for an encyclopedic dictionary that will fill it with additional depth. Given how d'Alembert insists on the importance of brevity for a clear definition in the "Dictionnaire de Langues" entry, it is clear that for the encyclopédistes, encyclopedias are not superior to dictionaries but really depart from them in terms of purpose.

La Grande Encyclopédie

After emerging from dictionaries during the 18\textsuperscript{th} century, encyclopedias became a fertile subgenre in themselves which kept evolving over the following centuries. One of the offspring of the Encyclopédie in the 19\textsuperscript{th} century is entitled La Grande Encyclopédie, Inventaire raisonné des Sciences, des Lettres et des Arts par une Société de savants et de gens de lettres and was published between 1885 and 1902 by an organized team of over two hundred specialists divided into eleven sections. The aim of the CollEx-Persée project DISCO-LGE was to digitize La Grande Encyclopédie and make it available to the scientific community as well as the general public. A previous version was partially available on Gallica, but it lacked in quality and its text had not been fully extracted from the page images with an Optical Character Recognition (OCR) system.

The dictionaries TEI module

Data produced for other future scientific projects cannot be useful unless it is interoperable and reusable. These are the last two key aspects of the FAIR principles (findability, accessibility, interoperability and reusability), which we strive to follow as a guideline for efficient, high-quality research. This entails using standard formats, and a standard for encoding historical texts in the context of digital humanities is XML-TEI, collectively developed by the Text Encoding Initiative consortium. It consists of a set of technical specifications in the form of XML schemas, along with a range of tools to handle them and training resources.

The XML-TEI standard has a modular structure consisting of optional parts, each covering specific needs such as the physical features of a source document, the transcription of oral corpora or particular requirements for textual domains like poetry or, in our case, dictionaries.

In what follows, we need to name and manipulate XML elements. We choose to represent them in a monospace font, in the standard XML self-closing form within angle brackets and with a slash following the element name, like <div/> for a div element. We do not mean by this notation that they cannot contain raw text or other XML elements, merely that we are referring to such an element, either with the whole subtree that spans from it in the context of a concrete document instance, or as an empty structure when we are considering the abstract element and the rules that govern its use in relation to other elements or its attributes.

Content

A graph problem

The XML-TEI specification contains 590 elements, each of which is documented on the consortium's website in the online reference pages. With an average of almost 80 possible child elements (79.91) within any given element, manually browsing such a massive network can prove quite difficult, as the number of combinations increases sharply with each step. We transform the problem by representing this network as a directed graph, using the elements of XML-TEI as nodes and placing an edge whenever the destination node may be contained within the source node according to the schema.

By iterating the operation of moving along one edge of that graph, that is, by considering the transitive closure of the relation "be connected by an edge", we define inclusion paths, which allow us to explore which elements may be nested under one another. The nodes visited along the way represent the intermediate XML elements needed to construct a valid XML tree according to the TEI schema. Given the top-down semantics of those trees, we call the length of an inclusion path its depth.

Using classical, well-known methods such as Dijkstra's algorithm (Dijkstra, 1959) allows us to explore the shortest inclusion paths that exist between elements. Some caution is required, because there is no guarantee that the shortest path is meaningful in general, but it at least provides an efficient way to check whether a given element may be nested at all under another one, and it gives an order of magnitude for the length of the path to expect. Of course, the accuracy of this heuristic decreases as the intended, meaningful path between two nodes gets longer, but the general graph formalism enables us to extend the results produced by the shortest-path approach and to consider element combinations rationally and exhaustively by algorithmic means should the need arise.
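The shortest-path query can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `INCLUSIONS` dictionary is a hand-written toy excerpt limited to the inclusion relations mentioned in this article, not the graph extracted from the schema, and the function name `shortest_inclusion_path` is ours. Since every edge has the same weight, a breadth-first search is used here; on such a graph it returns the same depths as Dijkstra's algorithm.

```python
from collections import deque

# Toy excerpt of the inclusion graph: an edge parent -> child means that
# <child/> may appear directly inside <parent/>. Hand-written for illustration;
# the real graph has 590 nodes.
INCLUSIONS = {
    "body": ["entry", "entryFree", "superEntry"],
    "superEntry": ["entry"],
    "entryFree": ["entry", "pos"],
    "entry": ["form", "gramGrp"],
    "form": ["pos"],
    "gramGrp": ["pos"],
    "pos": [],
}


def shortest_inclusion_path(graph, source, target):
    """Return one shortest path from source to target, or None if unreachable."""
    previous = {source: None}  # also serves as the set of visited nodes
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk back through the predecessors to rebuild the path.
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for child in graph.get(node, []):
            if child not in previous:
                previous[child] = node
                queue.append(child)
    return None


path = shortest_inclusion_path(INCLUSIONS, "body", "pos")
depth = len(path) - 1  # the depth of a path is its number of edges
```

On this toy graph, the path found from `body` to `pos` goes through `entryFree`, while no path exists from `pos` back to `entry`.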

For instance, it lets one find that <pos/> may not be directly included within <entry/> elements to encode the part of speech of the word that an article defines, and that the correct way to do so is through a <form/> or a <gramGrp/>. On the other hand, searching for the shortest inclusion path to <pos/> from the <TEI/> root of the document yields a path through <standOff/>, an element dedicated to storing contextual data that accompanies but is not part of the text, not unlike an annex, and largely unrelated to the task of encoding an encyclopedia. A last relevant example of the use of these methods is to query the shortest inclusion path of a <pos/> under the <body/> of the document: it yields an inclusion directly through <entryFree/> (an inclusion path of length 2), which unlike <entry/> accepts it as a direct child node. This is possibly not what we want, depending on the regularity of the articles we are encoding and on whether other grammatical information such as <case/> or <gen/> occurs to justify the use of <gramGrp/>, but searching exhaustively for paths up to length 3 returns, as expected, the path through <entry/>, among others. Overall, we get a good general idea: <pos/> does not need to be nested very deep, it can appear quite near the "surface" of article entries.
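The exhaustive search for paths up to a given depth can be sketched as a bounded depth-first traversal. As before, this is a minimal illustration and not the code used for the figures reported here: the `INCLUSIONS` dictionary is a hand-written toy excerpt restricted to relations mentioned in this article, and `inclusion_paths` is a name of our own. We assume that a path stops the first time it reaches the target and never visits the same intermediate element twice.

```python
# Toy excerpt of the inclusion graph (hand-written, for illustration only).
INCLUSIONS = {
    "body": ["entry", "entryFree", "superEntry"],
    "superEntry": ["entry"],
    "entryFree": ["entry", "pos"],
    "entry": ["form", "gramGrp"],
    "form": ["pos"],
    "gramGrp": ["pos"],
    "pos": [],
}


def inclusion_paths(graph, source, target, max_depth):
    """Yield every path from source to target made of at most max_depth edges."""

    def extend(path):
        if len(path) - 1 >= max_depth:
            return  # one more edge would exceed the allowed depth
        for child in graph.get(path[-1], []):
            if child == target:
                yield path + [child]
            elif child not in path:  # do not loop through the same element
                yield from extend(path + [child])

    yield from extend([source])


paths = list(inclusion_paths(INCLUSIONS, "body", "pos", max_depth=3))
```

On the toy graph, raising the bound from 2 to 3 adds the two paths through `entry` (via `form` and via `gramGrp`) to the single path through `entryFree`.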

The `<entry/>` element

The central element of the dictionaries module is the <entry/> element, meant to encode one single entry in a dictionary, that is to say a headword associated with its definition. It is the natural way in from the <body/> element to the dictionaries module: indeed, although <body/> may also contain <entryFree/> or <superEntry/> elements, the former is a relaxed version of <entry/> while the latter is a device to group several related entries together. Both can contain an <entry/> directly, while no obvious inclusion exists the other way around: most (> 96.2%) of the inclusion paths of "reasonable" depth in that direction (which we define as strictly less than 5, that is twice the average shortest depth between any two nodes) include either <figure/> or <castList/>, two very specific elements which should not need to appear in an article in general, showing that the purpose of <entry/> is not to contain an <entryFree/> or a <superEntry/>. Hence, not only the semantics conveyed by the documentation but also the structure of the elements graph point to <entry/> as the natural top-most element for an article. This somewhat contrived example is meant to further demonstrate the application of a graph-centered approach to understanding the inner workings of the XML-TEI schema.

Information about the headword itself

Once a block for an article is created, it may contain elements useful to represent features such as

- its written and spoken forms: <form/>
- a group of grammatical information: <gramGrp/>, which may itself contain, as we have seen above, <case/>, <gen/>, <number/> or <pers/> to describe the form itself for instance, but also information about the categories it belongs to, like <iType/> for its inflection class in languages with a declension system or <pos/> for its part of speech
- its etymology: <etym/>
- its variants if there is a different spelling in a variety of the language or if it has changed through time: <usg/> (though it is not its only purpose)

All these are examples and by no means an exhaustive list; the complete set provides the encoder with a toolbox to describe all the information related to the form under which the entry is found, and seems general enough to accommodate the structure of any book indexing entries by words.

Cross-references

A common feature shared by dictionaries and encyclopedias is the ability to connect entries together by using a word or short phrase as a link referring the reader to a related concept. These links are known as cross-references and can appear either when the definition of a term is closely related to another one, or to catch alternative spellings where some readers might expect the word to appear and redirect them to the form chosen as the reference. In XML-TEI, this is done with the <xr/> element. It usually contains the whole phrase performing the redirection, with an imperative locution like "please see […]".

The "active" part of the cross-reference, that is the very word within the <xr/> that is considered to be the link or, to make a modern-day HTML metaphor, the region that would be clickable, is represented by a <ref/> element. Though it is not specific to the dictionaries module, we include it in this description of the toolbox because it is particularly useful in the context of dictionaries. This element may have a target attribute which points to the other resource to be accessed by the interested reader.

Content

The remaining part of an entry is usually also the largest and represents the content associated with the headword by the entry. In a dictionary, that is its meaning.

The <sense/> element is a valid child for <entry/> and groups together a definition of the term with <def/>, usage examples with <usg/> (another use of this versatile element) and other high-level information such as translations in other languages. Both <def/> and <usg/> elements may appear directly under the <entry/>.

Structural remarks

Before concluding this description of the dictionaries module from the perspective of someone trying to concretely encode a particular dictionary or encyclopedia, we make use of the graph approach again to highlight some of its aspects in terms of inclusion structure.

First, it is remarkable that all elements in the dictionaries module have a cyclic inclusion path, that is to say, there is an inclusion path from each element of this module to itself. Although having such a cycle is a widespread property in the remainder of XML-TEI elements shared by 73.8% of them (411 out of the 557 elements in the other modules), all 33 elements of the dictionaries module having one is far above this average. In addition, the cycles appear to be rather short, with an average length of 2.00 versus 2.50 in the rest of the population. This observation is all the more surprising considering the fact that the dictionaries module contains short "leaf" elements like <pos/> which should not obviously need to admit cycles since one rather expects them to contain only one word, like <pos>adj</pos> in the example given in the official documentation. Among those (shortest) cycles, 20 include the <cit/> element made to group quotations with a bibliographic reference to their source which should clearly be unnecessary to encode an article in the general case.
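The two statistics used above, the share of elements admitting a cycle and the average length of their shortest cycle, can be computed with a breadth-first search started from the children of each node. The sketch below is hedged accordingly: the graph `TOY` uses placeholder node names rather than TEI elements, its edges are invented for the example, and we assume that the length of a cycle is its number of edges (so an element that may directly contain itself has a cycle of length 1).

```python
from collections import deque

# Placeholder graph: node names are NOT TEI elements, edges are made up.
TOY = {
    "a": ["a"],  # may directly contain itself
    "b": ["c"],
    "c": ["b"],
    "d": ["b"],  # reaches a cycle but is not part of one
}


def shortest_cycle_length(graph, node):
    """Return the number of edges of the shortest cycle through node, or None."""
    distance = {}
    queue = deque()
    for child in graph.get(node, []):
        if child not in distance:
            distance[child] = 1
            queue.append(child)
    while queue:
        current = queue.popleft()
        if current == node:
            return distance[current]
        for child in graph.get(current, []):
            if child not in distance:
                distance[child] = distance[current] + 1
                queue.append(child)
    return None


lengths = {node: shortest_cycle_length(TOY, node) for node in TOY}
cyclic = [length for length in lengths.values() if length is not None]
share = len(cyclic) / len(TOY)        # proportion of nodes with a cycle
average = sum(cyclic) / len(cyclic)   # mean length of the shortest cycles
```

Applying the same two lines to the nodes of a single module, and then to all the others, would give the kind of comparison discussed in this section.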

Secondly, although we have seen examples of connections from this module to the rest of XML-TEI, especially to the core module (see the case of the <ref/> element above), the dictionaries module appears somewhat isolated from important structural elements like <head/> or <div/>. Indeed, computing all the paths of length at most 5 from either <entry/> or <sense/> to these structural elements by a systematic traversal of the graph yields exclusively paths (respectively 9042 and 39093 of them) containing either a <floatingText/> or an <app/> element. The first one, as its name aptly suggests, is used to encode text that does not quite fit the regular flow of the document, for example an embedded narrative. Both examples displayed in the online documentation feature a <body/> as a direct child of <floatingText/>, neatly separating its content as independent. The purpose of the second one is less clear from its name, which is short for "apparatus": it wraps together several versions of the same excerpt, for instance when there are several possible readings of an unclear group of words in a manuscript, or when the encoder is trying to compile a single version of a work from several sources which disagree over some passage. In both cases, it appears obvious that this is not something expected to occur naturally in the course of an article in the general case.

Thus, despite a rather dense internal connectivity, the dictionaries module fails to provide encoders with a device to represent recursively nesting structures like <div/>.

A new standard?

Studying the content of La Grande Encyclopédie and considering several articles in particular, we identify structures specific to encyclopedias which are not covered by the dictionaries module presented above. We hence conclude that this module is not able to encode arbitrary encyclopedic content and propose a new encoding scheme.

Idiosyncrasies of encyclopedias

The notion of meaning

Nested structures

Candidates in the dictionaries module

<sense/>
<entryFree/>
<note/>
<dictScrap/> / <floatingText/>

Encoding within the core module

The above remarks explain why the dictionaries module by itself is unable to represent encyclopedias, where discourse with nested structures of arbitrary depth can occur. Since the core module of course accommodates these structures by means of the <div/>, <head/> and <p/> elements, we devise an encoding scheme based on them, which we recommend to other projects aiming at representing encyclopedias.

To remain consistent with the above remarks, we will only concern ourselves with what happens at the level of each article, right under the <body/> element. Everything related to metadata happens as expected in the file's <teiHeader/>, which is well enough equipped to handle it. In order to present our scheme throughout the following section, we will progressively encode a reference article, "Cathète", from tome 9.

The scheme

Each article is represented by a <div/>. We suggest setting an xml:id attribute on it whose value is the headword of the entry, normalized to lowercase, with spaces stripped and all non-alphanumerical characters replaced by a dash '-' to avoid issues with the XML encoding. To make this value unique, it is suffixed with a number representing the rank of the article among the various occurrences of the headword, even when there is only one, for the sake of regularity.
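The normalization of the identifier can be sketched as follows. This is a minimal illustration and not the code of any existing tool: the function name `article_id` is hypothetical, and several points left open by the description are settled by assumption. We assume that "stripping spaces" applies to leading and trailing whitespace, that accented letters count as alphanumerical and are kept, and that the rank is joined to the headword with a dash.

```python
def article_id(headword: str, rank: int = 1) -> str:
    """Build an xml:id value from a headword and its rank among homographs."""
    # Assumption: only surrounding whitespace is stripped; inner spaces are
    # non-alphanumerical and therefore become dashes below.
    cleaned = headword.strip().lower()
    # str.isalnum() accepts accented letters such as "è" (assumed desirable).
    dashed = "".join(char if char.isalnum() else "-" for char in cleaned)
    # Assumption: the rank is always appended, separated by a dash.
    return f"{dashed}-{rank}"


identifier = article_id("Cathète")
second_homograph = article_id("  Saint-Denis ", rank=2)
```

One caveat: an XML name cannot start with a digit, so a headword beginning with a number would need a prefix that this sketch does not add.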

Inside this element should be a <head/> enclosing the headword of the article. The usual <hi/> sub-elements are available within <head/> if the headword is highlighted by any special typographic means such as bold, small capitals, etc. This element should also contain the optional subject indicator in parentheses that sometimes accompanies the headword, with the appropriate standard elements such as <persName/> in biographical articles, or <interp/> with a theme attribute if the article is assigned a specific domain in a taxonomy.

We then propose to wrap each distinct meaning in a separate <div/> with the type attribute set to sense, in reference to the <sense/> element that would have been used within the dictionaries module. Each sense should be numbered with the n attribute.

In addition, each line within the article must start with a <lb/> to mark its beginning, including before the <head/> element. Although this is a surprising setup, it underlines the fact that in the dense layout of encyclopedias, the carriage return separating two articles is meaningful. Stating each new line explicitly keeps enough information to reconstruct a faithful facsimile, and it also has the advantage of highlighting the fact that even though the definition is separated from the headword by being in a different XML element, they still occur on the same line, a typographic choice usually made in both encyclopedias and dictionaries, where space is at a premium.

Finally, the various sections and sub-sections occurring within the article body may be nested as usual with <div/> and sub-<div/> elements, each of which can be titled with its own local <head/> element and filled with <p/> elements for paragraphs.
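To make the shape of the resulting tree concrete, the sketch below builds the skeleton of one article with Python's standard `xml.etree.ElementTree` module. It is an illustration under assumptions and not the output of any existing tool: `encode_article` is a hypothetical helper, the strings passed to it are placeholders rather than the actual text of the article, and we assume that the first line of each sense continues the current printed line, so that no <lb/> precedes it.

```python
import xml.etree.ElementTree as ET

# Qualified name of the xml:id attribute in ElementTree's notation.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def encode_article(article_id, headword, senses):
    """Build the <div/> of one article.

    `senses` is a list of senses, each given as the list of its printed lines.
    """
    article = ET.Element("div", {XML_ID: article_id})
    ET.SubElement(article, "lb")  # line break opening the article
    head = ET.SubElement(article, "head")
    head.text = headword
    for n, lines in enumerate(senses, start=1):
        sense = ET.SubElement(article, "div", {"type": "sense", "n": str(n)})
        paragraph = ET.SubElement(sense, "p")
        for index, line in enumerate(lines):
            if index == 0:
                # Assumption: the sense starts on the current printed line.
                paragraph.text = line
            else:
                # Every following printed line is introduced by a <lb/>.
                ET.SubElement(paragraph, "lb").tail = line
    return article


# Placeholder content, not the real text of the article.
article = encode_article("cathète-1", "CATHÈTE", [["[first line]", "[second line]"]])
serialized = ET.tostring(article, encoding="unicode")
```

Sections and sub-sections inside a sense would be added in the same way, as further <div/> elements with their own <head/> and <p/> children.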

But a typical page of an encyclopedia also features peritext elements, giving information to the reader about the current page number along with the headwords of the first and last articles appearing on the page.

Depending

Moreover, the layout is often

Currently implemented

The reference implementation for this encoding scheme is the program soprano¹, developed within the scope of project DISCO-LGE to automatically identify individual articles in the flow of raw text from the columns and to encode them into XML-TEI files. Though this software has already been used to produce the first TEI version of La Grande Encyclopédie, it does not yet follow the above specification perfectly. Here is for instance the encoded version of article "Cathète" it currently produces:

¹ https://gitlab.huma-num.fr/disco-lge/soprano

The headword detection system is not yet able to capture subject indicators, so they appear outside of the <head/> element. Likewise, since the detection of titles at the beginning of each section is not complete, no structural analysis is performed on the content of the article.

The constraints of automated processing

Comparison to other approaches

Bend the semantics

Custom schema

Conclusion

Despite long discussions and interesting proposals, each with strong arguments both for and against, no consensus could be reached. For one thing, each project has specific constraints depending on the type of study it intends to carry out, the volume of text, or the condition of the physical source documents.

Beyond the technical need for encodings generic enough to share corpora within the community and compare results across various projects, the above results highlight one aspect of a well-known fact within the lexicography community: encyclopedias and dictionaries differ on several key aspects.
