diff --git a/ICHLL_Brenon.md b/ICHLL_Brenon.md index a413530a5dcc02cef94cbadf8a79cefd8868007b..c091d5adffe708c1a6e85c0b54eccd24bc76eddb 100644 --- a/ICHLL_Brenon.md +++ b/ICHLL_Brenon.md @@ -28,32 +28,89 @@ header-includes: {\small \textsuperscript{2} Univ Lyon, INSA Lyon, CNRS, UCBL, LIRIS, UMR5205, F-69621}\\ \end{center} -**Abstract** As witnesses to scientific progress, dictionaries and encyclopedias -draw much interest from digital humanities, which accounts for the number of -projects making them available to the public or studying them. However, the -volume of data involved issues a technical challenge to the digitizing process -required for the study of historical dictionaries. The goal of project DISCO-LGE -was to study a late-19^th^ century encyclopedia, "La Grande Encyclopédie", -working from an OCRised version in XML-ALTO up to an encoding suitable for an -automatic tool to represent and structure the text of encyclopedias. XML-TEI, a -major standard, includes a specialised module for dictionaries which was -identified as a good candidate to build on, but systematic traversal of the +**Abstract** This chapter illustrates the fundamental differences between +dictionaries and encyclopedias by documenting the process of devising an +encoding scheme and applying it to a late-19^th^ century encyclopedia, "La +Grande Encyclopédie" (henceforth *LGE*). The effort, made in the context of project +DISCO-LGE, consisted in working from an OCRised version of the pages in XML-ALTO +to produce a fully XML-TEI-compliant encoding of the individual articles. +Although the TEI guidelines include a specialised module for dictionaries which +was identified as a promising tool for the task, systematic traversal of the schema using graph search methods revealed some limitations when used to encode -this text. 
These shortcomings are described which leads to the identification of -the fundamental differences that prevent encoding encyclopedias with the XML-TEI -module for dictionaries. Alternative encodings for encyclopedias including a -fully XML-TEI-compliant scheme are then proposed along with a discussion of -their advantages and drawbacks. . +this text. These shortcomings are reviewed and illustrated on a series of +examples. An alternative encoding remaining within the *core* module of TEI is +then proposed and demonstrated on articles from *LGE* containing key features. +Finally, different strategies followed by other projects are discussed. **Keywords** digital humanities, XML-TEI, dictionaries, encyclopedias +# Introduction + +Although both terms have been used rather interchangeably over the past few +centuries, a dichotomy is now commonly made between dictionaries and +encyclopedias. A simple opposition can easily justify this distinction: +dictionaries define words and tell one how to use them while encyclopedias +usually go into longer developments to give a more comprehensive and scientific +understanding of the concept being defined. This common intuition links back to +the entry written in the *Encyclopédie ou Dictionnaire raisonné des sciences des +arts et des métiers* (henceforth *EDdA*) by @dalembert_dictionnaire_2022 [article +DICTIONNAIRE, volume 4] who distinguishes three kinds of dictionaries: one to define +*words*, the second to define *facts* and the last one to define *things*, +corresponding respectively to language, history, and science and arts +dictionaries. The first type corresponds to our modern dictionaries while the +other two are similar to what one expects to find in an encyclopedia. 
+ +However, d'Alembert himself does not think of these boundaries as absolute and he +hints at the extreme difficulty in merely defining words without going into +semantics and philosophical considerations: + +> un dictionnaire de langues, qui paroît n'être qu'un dictionnaire de mots, doit +> être souvent un dictionnaire de choses quand il est bien fait + +(*a language dictionary, which appears to be only a word dictionary, must often +be a thing dictionary when it is made properly*). A similar criticism is made by +@haiman_dictionaries_1980 who attacks no fewer than six criteria on which +dictionaries and encyclopedias are generally opposed to reach the conclusion +that there is no distinction between them because "dictionaries *are* +encyclopedias". Regardless of the validity of his reasoning, it only proves one +inclusion: that perhaps, dictionaries would be a special case of encyclopedias. +This, as will be shown, by no means implies that encyclopedias are +dictionaries. + +XML-TEI is a set of guidelines collectively developed by the +@tei_consortium_tei_2023 under the form of XML schemas, along with a range of +tools to handle them and training resources in order to represent text in a +highly structured and machine-readable format. Its toolbox has a modular +structure consisting of optional parts each covering specific needs such as the +physical features of a source document, the transcription of oral corpora or +particular requirements for textual domains like poetry, or, in the case at +hand, dictionaries. + +After describing why the dedicated +module was a natural candidate to consider, I formalise tools from graph +theory to browse the specifications of this guideline in a rational way and +explore this module in detail. + +@romary_formal_2007 + +(@ide_encoding_1995 *dictionaries* only for western dictionaries) have been +applied for both historical (@bohbot2018) and digitally native +(@bowers_bridging_2018). 
In addition, a specific set of guidelines tailored to encoding +dictionaries, TEI-Lex0, has been published [@banski_tei_lex0_2017]. + +Systematic study of the guidelines @ide_background_1998 but here's a new method. + +Less than ten years after the beginnings of the TEI, @ide_background_1998 gives a +thorough account of the criteria + + # Dictionaries and encyclopedias -After emerging from dictionaries during the 18^th^ century, encyclopedias became -a fertile subgenre in themselves and a rich subject of study to digital -humanities for their particular relation to knowledge and its evolution. In this -section we will describe the goal of our project, then look at the origin of the -term "encyclopedia" itself before comparing the approaches of encyclopedias and +After emerging over the course of the 18^th^ century, encyclopedias became a +fertile subgenre in themselves and a rich subject of study to digital humanities +for their particular relation to knowledge and its evolution. This section +describes the goal of the project, then looks at the origin of the term +"encyclopedia" itself before comparing the approaches of encyclopedias and dictionaries. ## Context of the project @@ -91,9 +148,9 @@ near synonyms to refer to books compiling vast amounts of knowledge into lists of definitions ordered alphabetically. Their similarity is even visible in the way they are coordinated in the full title of the *Encyclopédie* which is probably the most famous work of the genre and a symbol of the Age of -Enlightenment. If the word "encyclopedia" is nowadays part of our vocabulary, it -was much more unusual and in fact controversial when Diderot and d'Alembert -decided to use it in the title of their book. +Enlightenment. If the word "encyclopedia" is nowadays part of everyday +vocabulary, it was much more unusual and in fact controversial when Diderot and +d'Alembert decided to use it in the title of their book. 
The definition given by Furetière in his *Dictionnaire Universel* in 1690 is still close to its greek etymology: a "ring of all knowledges", from *κύκλος*, @@ -171,6 +228,8 @@ and remain within the linguistic level of things. Entries in a dictionary often feature information such as the part of speech, the pronunciation or the etymology of the word they define. +# <FIXME + The entry for "Dictionnaire" in the *Encyclopédie* distinguishes between three types of dictionaries: one to define *words*, the second to define *facts* and the last one to define *things*, corresponding to the distinction between @@ -181,36 +240,44 @@ means of the coordinating conjunction "ou" to a *Dictionnaire raisonné*, "reasoned dictionary", introducing the idea of encyclopedias as dictionaries with additional structure and a philosophical dimension. -Back to the "Encyclopédie" article we read that a dictionary remaining strictly -at the language level, a vocabulary, can be seen as the empty frame required for -an encyclopedic dictionary that will fill it with additional depth. Given how -d'Alembert insists on the importance of brevity for a clear definition in the -"Dictionnaire de Langues" entry, it is clear that the *encyclopédistes* did not -consider encyclopedias superior to dictionaries but really as a new subgenre -departing from them in terms of purpose. +# FIXME> + +Back to the "Encyclopédie" article one can read that a dictionary remaining +strictly at the language level, a vocabulary, can be seen as the empty frame +required for an encyclopedic dictionary that will fill it with additional depth. +Given how d'Alembert insists on the importance of brevity for a clear definition +in the "Dictionnaire de Langues" entry, it is clear that the *encyclopédistes* +did not consider encyclopedias superior to dictionaries but really as a new +subgenre departing from them in terms of purpose. 
# The *dictionaries* TEI module {#sec:dictionaries-module} -The XML-TEI standard has a modular structure consisting of optional parts each +# <FIXME +The XML-TEI toolbox has a modular structure consisting of optional parts each covering specific needs such as the physical features of a source document, the transcription of oral corpora or particular requirements for textual domains -like poetry, or, in our case, dictionaries. After describing why the dedicated -module was a natural candidate to meet our needs, we formalise tools from -graph theory to browse the specifications of this standard in a rational way and +like poetry, or, in the case at hand, dictionaries. After describing why the dedicated +module was a natural candidate to consider, I formalise tools from graph +theory to browse the specifications of this guideline in a rational way and explore this module in detail. +# FIXME> -## A good starting point +## A good starting point {#sec:starting-point} Data produced in the context of a project such as DISCO-LGE cannot be useful to future scientific projects unless it is *interoperable* and *reusable*. These are the two last key aspects of the FAIR ([https://www.go-fair.org/fair-principles/](https://www.go-fair.org/fair-principles/)) principles (*findability*, -*accessibility*, *interoperability* and *reusability*) which we strive to follow -as a guideline for efficient and quality research. It entails using standard +*accessibility*, *interoperability* and *reusability*) which I strive to follow +as a guideline for efficient and quality research. + +# <FIXME +It entails using standard formats and a standard for encoding historical texts in the context of digital humanities is XML-TEI, collectively developped by the *Text Encoding Initiative* consortium which publishes a set of technical specifications under the form of XML schemas, along with a range of tools to handle them and training resources. 
+# FIXME> The *dictionaries* module has been leveraged to encode dictionaries in projects NENUFAR @@ -218,28 +285,30 @@ NENUFAR and BASNUM ([https://anr.fr/Projet-ANR-18-CE38-0003](https://anr.fr/Projet-ANR-18-CE38-0003)) to encode respectively the *Petit Larousse Illustré* published by Pierre -Larousse in 1905 [@bohbot2018, p. 1], roughly contemporary to our target encyclopedia -and the *Dictionnaire Universel* by Furetière, or rather its second edition -edited by Henri Basnage de Beauval, an encyclopedic dictionary from the very -early 18^th^ century [@williams2017, p. 1]. These successes made it a good starting -point for our own encoding but the former does not have the encyclopedic -dimension our corpus has and the latter is a much older text which had a -tremendous influence on the european encyclopedic effort of the 18^th^ century -but is not as clearly separated from the dictionaric stem as *La Grande -Encyclopédie* is. For these reasons, we could not directly reuse the encoding -schemes used in these projects and had to explore the XML-TEI schema -systematically to devise our own. - -In this chapter, we need to name and manipulate XML elements. We choose to -represent them in a monospace font, in the standard XML autoclosing form within -angle brackets and with a slash following the element name like `<div/>` for a -`div` element -([https://tei-c.org/release/doc/tei-p5-doc/en/html/ref-div.html](https://tei-c.org/release/doc/tei-p5-doc/en/html/ref-div.html)). We do not mean by this notation that they cannot contain -raw text or other XML elements, merely that we are referring to such an element, -with all the subtree that spans from it in the context of a concrete document -instance or as an empty structure when we are considering the abstract element -and the rules that govern its use in relation to other elements or its -attributes. +Larousse in 1905 [@bohbot2018, p. 
1], roughly contemporary to *LGE*, and the +*Dictionnaire Universel* by Furetière, or rather its second edition edited by +Henri Basnage de Beauval, an encyclopedic dictionary from the very early 18^th^ +century [@williams2017, p. 1]. These successes suggested that it could be a useful tool +to encode encyclopedias but a few differences remained between both projects and +DISCO-LGE: the text studied by NENUFAR does not have the encyclopedic dimension +*LGE* has and BASNUM studies a much older text which had a tremendous influence on the +European encyclopedic effort of the 18^th^ century but is not as clearly +separated from the dictionaric stem as *La Grande Encyclopédie* is. For these +reasons, the encoding schemes used in these projects could not be reused +directly, prompting a systematic exploration of the XML-TEI schema to devise +a new one. + +This chapter discusses XML elements in depth and hence needs to name and +manipulate them. They will be represented in a monospace font, in the standard +XML self-closing form within angle brackets and with a slash following the +element name like `<div/>` for a `div` element +([https://tei-c.org/release/doc/tei-p5-doc/en/html/ref-div.html](https://tei-c.org/release/doc/tei-p5-doc/en/html/ref-div.html)). +This notation is not meant to imply that they cannot contain raw text or other +XML elements; it merely denotes such an element, without any additional +assumption. In the context of a concrete document instance this can refer to the +markup with all the subtree that possibly spans from it, but the same notation +will be used when considering the abstract element and the rules that govern its +use in relation to other elements or its attributes. ## A graph problem @@ -249,26 +318,27 @@ almost 80 possible child elements (79.91) within any given element, manually browsing such an massive network can prove quite difficult as the number of combinations sharply increases with each step. 
-We transform the problem by representing this network as a directed graph, using -elements of XML-TEI as nodes and placing edges if the destination node may be -contained within the source node according to the schema. Please note that the -word "element" is here used with the same meaning as in the TEI documentation to -refer to the conceptual device characterised by a given tag name such as `p` or -`div` and not to a particular instance of them that may occur in a given -document. Figure @fig:dictionaries-subgraph, by using this transformation to -display the *dictionaries* module, hints at the overall complexity of the whole -specification. +The problem can be advantageously transformed by representing this network as a +directed graph, using elements of XML-TEI as nodes and placing edges if the +destination node may be contained within the source node according to the +schema. Please note that the word "element" is here used with the same meaning +as in the TEI documentation to refer to the conceptual device characterised by a +given tag name such as `p` or `div` and not to a particular instance of them +that may occur in a given document. Figure @fig:dictionaries-subgraph, by using +this transformation to display the *dictionaries* module, hints at the overall +complexity of the whole specification. {height=830px #fig:dictionaries-subgraph} By iterating several times the operation of moving on that graph along one edge, that is, by considering the transitive closure of the relation "be connected by -an edge" we define *inclusion paths* which allow us to explore which elements -may be nested under which other. +an edge" one defines *inclusion paths*, which make it possible to explore which elements may +be nested under which others. The nodes visited along the way represent the intermediate XML elements to construct a valid XML tree according to the TEI schema. Given the top-down -semantics of those trees, we call the length of an inclusion path its *depth*. 
+semantics of those trees, the length of an inclusion path will be called its +*depth*. The ability for an element to contain itself corresponds directly to loops on the graph (that is an edge from a node to itself) as can be illustrated by the @@ -276,17 +346,17 @@ the graph (that is an edge from a node to itself) as can be illustrated by the another one. The generalisation of this to inclusion paths of any length greater than one is -usually called a cycle and we may be tempted in our context to refine this and -name them *inclusion cycles*. The `<address/>` element provides us with an -example for this configuration: although an `<address/>` element may not -directly contain another one, it may contain a `<geogName/>` which, in turn, may -contain a new `<address/>` element. From a graph theory perspective, we can say -that it admits an inclusion cycle of length two. +usually called a cycle and it appears natural to refine this and name them +*inclusion cycles*. The `<address/>` element provides an example for this +configuration: although an `<address/>` element may not directly contain another +one, it may contain a `<geogName/>` which, in turn, may contain a new +`<address/>` element. From a graph theory perspective, one can say that it +admits an inclusion cycle of length two. Using classical, well-known methods such as Dijkstra's algorithm [@dijkstra59] -allows us to explore the shortest inclusion paths that exist between elements. +lets one explore the shortest inclusion paths that exist between elements. Though a particular caution should be applied because there is no guarantee that -the shortest path is meaningful in general, it at least provides us with an +the shortest path is meaningful in general, it at least provides an efficient way to check whether a given element may or not be nested at all under another one and gives a lower bound on the length of the path to expect. 
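
The search for a shortest inclusion path can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the `TOY_GRAPH` dictionary and the function name `shortest_inclusion_path` are hypothetical, the handful of edges is a hand-written excerpt rather than the graph extracted from the actual schema, and a breadth-first search is swapped in for Dijkstra's algorithm since both return the same shortest paths when every edge has the same weight.

```python
from collections import deque

# Hypothetical toy excerpt of an inclusion graph: each key may directly
# contain the listed elements. These edges are illustrative assumptions,
# not the graph extracted from the actual TEI schema.
TOY_GRAPH = {
    "body": ["div", "p"],
    "div": ["div", "p"],          # a loop: <div/> may contain <div/>
    "p": ["address"],
    "address": ["geogName"],      # together with the next edge: a cycle of length two
    "geogName": ["address"],
}


def shortest_inclusion_path(graph, source, target):
    """Return a shortest inclusion path from source to target, or None.

    The path is a list of element names with at least one edge, so asking
    for source == target returns the shortest inclusion cycle (or loop).
    All edges have unit weight: breadth-first search is sufficient here.
    """
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        for child in graph.get(path[-1], ()):
            if child == target:
                return path + [child]
            if child not in seen:
                seen.add(child)
                queue.append(path + [child])
    return None


if __name__ == "__main__":
    print(shortest_inclusion_path(TOY_GRAPH, "body", "geogName"))
    print(shortest_inclusion_path(TOY_GRAPH, "address", "address"))
    print(shortest_inclusion_path(TOY_GRAPH, "geogName", "body"))
```

The depth of a returned path is its number of edges, that is `len(path) - 1`; a `None` result means that, in the graph provided, the target can never be nested under the source.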
Of course the accuracy of this heuristic decreases as the length of the elements @@ -297,7 +367,7 @@ This is still very useful when taking into account the fact that TEI modules are merely "bags" to group the elements and provide hints to human encoders about the tools they might need but have no implication on the inclusion paths between elements which cross module boundaries freely. The general graph formalism -enables us to describe complex filtering patterns and to implement queries to +enables one to describe complex filtering patterns and to implement queries to look for them among the elements exhaustively by algorithmic means even when the shortest-path approach is not enough. @@ -316,12 +386,12 @@ A last relevant example on the use of these methods can be given by querying the shortest inclusion path of a `<pos/>` under the `<body/>` of the document: it yields an inclusion directly through `<entryFree/>` (with an inclusion path of length 2), which unlike `<entry/>` accepts it as a direct child node. Possibly -not what we want depending on the regularity of the articles we are encoding and +not what is wanted depending on the regularity of the articles being encoded and the occurrence of other grammatical information such as `<case/>` or `<gen/>` to justify the use of the `<gramGrp/>`, but searching exhaustively for paths up to -length 3 returns as expected the path through `<entry/>`, among others. Overall, -we get a good general idea: `<pos/>` does not need to be nested very deep, it -can appear quite near the "surface" of article entries. +length 3 returns as expected the path through `<entry/>`, among others. The big +picture starts to appear: `<pos/>` does not need to be nested very deep, it can +appear quite near the "surface" of article entries. ## Content of the module @@ -333,15 +403,15 @@ element to the dictionary module: indeed, although `<body/>` may also contain `<entry/>` while the latter is a device to group several related entries together. 
Both can contain an `<entry/` directly while no obvious inclusion exists the other way around: most (> 96.2%) of the inclusion paths of -"reasonable" depth (which we define as strictly inferior to 5, that is twice the -average shortest depth between any two nodes) either include `<figure/>` or -`<castList/>`, two very specific elements which should not need to appear in an -article in general, showing that the purpose of `<entry/>` is not to contain an -`<entryFree/>` or `<superEntry/>`. Hence, not only the semantics conveyed by the -documentation but also the structure of the elements graph evidence `<entry/>` -as the natural top-most element for an article. This somewhat contrived example -hopes to further demonstrate the application of a graph-centred approach to -understand the inner workings of the XML-TEI schema. +"reasonable" depth (which will be arbitrarily defined as strictly inferior to 5, +that is twice the average shortest depth between any two nodes) either include +`<figure/>` or `<castList/>`, two very specific elements which should not need +to appear in an article in general, showing that the purpose of `<entry/>` is +not to contain an `<entryFree/>` or `<superEntry/>`. Hence, not only the +semantics conveyed by the documentation but also the structure of the elements +graph evidence `<entry/>` as the natural top-most element for an article. This +somewhat contrived example hopes to further demonstrate the application of a +graph-centred approach to understand the inner workings of the XML-TEI schema. Once a block for an article is created, it may contain elements useful to represent various of its features. Its written and spoken forms are usually @@ -370,7 +440,7 @@ redirection, with an imperative locution like "please see […]". 
The "active" part of the cross-reference, that is the very word within the `<xr/>` that is considered to be the link or, to make a modern-day HTML metaphor, the region that would be clickable, is represented by a `<ref/>` -element. Though it is not specific to the *dictionaries* module, we include it +element. Though it is not specific to the *dictionaries* module, it is included in this description of the toolbox because it is particularly useful in the context of dictionaries. This element may have a target attribute which points to the other resource to be accessed by the interested reader. @@ -387,7 +457,7 @@ under the `<entry/>`. Before concluding this description of the *dictionaries* module from the perspective of someone trying to concretely encode a particular dictionary or -encyclopedia, we make use of the graph approach again to evidence some its +encyclopedia, the graph approach is again leveraged to evidence some of its aspects in terms of inclusion structure. First, it is remarkable that all elements in the *dictionaries* module have a @@ -405,25 +475,25 @@ official documentation. Among those (shortest) cycles, 20 include the `<cit/>` element made to group quotations with a bibliographic reference to their source which should clearly be unnecessary to encode an article in the general case. -Secondly, although we have seen examples of connections from this module to the -rest of the XML-TEI, especially to the *core* module (to which belongs for -example the `<ref/>` element), the *dictionaries* module appears somewhat -isolated from important structural elements like `<head/>` or `<div/>`. Indeed, -computing all the paths from either `<entry/>` or `<sense/>` elements to the -latter of length shorter or equal to 5 by a systematic traversal of the graph -yields exclusively paths (respectively 9042 and 39093 of them) containing either -a `<floatingText/>` or an `<app/>` element. 
The first one, as its name aptly -suggests, is used to encode text that does not quite fit the regular flow of the -document, as for example in the context of an embedded narrative. Both examples -displayed in the online documentation feature a `<body/>` as direct child of -`<floatingText/>`, neatly separating its content as independent. The purpose of -the second one, although its name — short for apparatus — is less clear, is to -wrap together several versions of the same excerpts, for instance when there are -several possible readings of an unclear group of words in a manuscript, or when -the encoder is trying to compile a single version of a piece of work from -several sources which disagree over some passage. In both case, it appears -obvious that it is not something that is expected to occur naturally in the -course of an article in general. +Secondly, although examples of connections from this module to the rest of the +XML-TEI have been evidenced in this section, especially to the *core* module (to +which belongs for example the `<ref/>` element), the *dictionaries* module +appears somewhat isolated from important structural elements like `<head/>` or +`<div/>`. Indeed, computing all the paths from either `<entry/>` or `<sense/>` +elements to the latter of length shorter or equal to 5 by a systematic traversal +of the graph yields exclusively paths (respectively 9042 and 39093 of them) +containing either a `<floatingText/>` or an `<app/>` element. The first one, as +its name aptly suggests, is used to encode text that does not quite fit the +regular flow of the document, as for example in the context of an embedded +narrative. Both examples displayed in the online documentation feature a +`<body/>` as direct child of `<floatingText/>`, neatly separating its content as +independent. 
The purpose of the second one, although its name — short for +apparatus — is less clear, is to wrap together several versions of the same +excerpts, for instance when there are several possible readings of an unclear +group of words in a manuscript, or when the encoder is trying to compile a +single version of a piece of work from several sources which disagree over some +passage. In both cases, it appears obvious that it is not something that is +expected to occur naturally in the course of an article in general. Thus, despite a rather dense internal connectivity, the *dictionaries* module fails to provide encoders with a device to represent recursively nesting @@ -432,21 +502,21 @@ structures like `<div/>`. # A new standard ? Studying the content of *La Grande Encyclopédie* and considering several -articles in particular, we identify structures which are specific to +articles in particular, one can identify structures which are specific to encyclopedias and not compatible with the *dictionaries* module presented in the -previous section. We hence conclude that this module is not able to encode -arbitrary encyclopedic content and propose a new fully TEI-compliant encoding -scheme remaining outside of it. We proceed with remarks about the needs of -automated encoding processes and compare our proposal with other strategies to +previous section. It follows that this module is not able to encode arbitrary +encyclopedic content, and a new fully TEI-compliant encoding scheme +remaining outside of it is proposed. The rest of the section is concerned with the needs of +automated encoding processes and compares the proposal with other strategies to overcome the issues previously identified with the dedicated module for dictionaries. ## Idiosynchrasies of encyclopedias Browsing through the pages of an encyclopedia reveals a certain number of -noticeable differences. 
It is difficult to make a precise list because the -editorial choices may vary greatly between encyclopedias but we discuss some of -the most obvious. +noticeable differences. A comprehensive list would be difficult to draw up because +of the great variety in terms of editorial choices, but the most obvious ones can be +discussed. The first immediately visible feature that sets encyclopedias apart from dictionaries and can be found in the *Encyclopédie* as well as in *La Grande @@ -456,24 +526,24 @@ system. Those generally cover a broad range of subjects from scientific disciplines to litterature, and extending to political subjects and law. No element in the *dictionaries* module is explicitely designed for the purpose -of encoding these indicators. As we have seen, the elements set is geared -towards the words themselves instead of the concept they represent. The closest -tool for what we need is found in the `<usg/>` element used with a specific -`type` attribute set to `dom` for "domain". Indeed several examples from the -documentation encode subject indicators very similar to the ones found in -encyclopedias within this element, but the match is not perfect either: all -appear within one of multiple senses, as if to clarify each context in which the -word can be used, as expected from the element's name, "usage". In -encyclopedias, if the domain indicator does in certain cases help to distinguish -between several entries sharing the same headword, the concept itself has -evolved beyond this mere distinction. Looking back at the *Encyclopédie*, the -adjective *raisonné* in the rest of the title directly introduces a notion of -structure that links back to the "Systême figuré des connoissances humaines" -[@blanchard2002, p. 1] which schematic structure is shown in Figure -@fig:systeme-figure. 
The authors have devised a branching system to classify all -knowledge, and the occurrence at the beginning of articles, more than a tool to -clear up possible ambiguities also points the reader to the correct place in -this mind map. +of encoding these indicators. As section @sec:dictionaries-module illustrates, +the elements set is geared towards the words themselves instead of the concept +they represent. The tool closest to what is needed can be found in the `<usg/>` +element used with a specific `type` attribute set to `dom` for "domain". Indeed +several examples from the documentation encode subject indicators very similar +to the ones found in encyclopedias within this element, but the match is not +perfect either: all appear within one of multiple senses, as if to clarify each +context in which the word can be used, as expected from the element's name, +"usage". In encyclopedias, if the domain indicator does in certain cases help to +distinguish between several entries sharing the same headword, the concept +itself has evolved beyond this mere distinction. Looking back at the +*Encyclopédie*, the adjective *raisonné* in the rest of the title directly +introduces a notion of structure that links back to the "Systême figuré des +connoissances humaines" [@blanchard2002, p. 1] whose schematic structure is +shown in Figure @fig:systeme-figure. The authors have devised a branching system +to classify all knowledge, and the occurrence of the indicator at the beginning of articles, more +than a tool to clear up possible ambiguities, also points the reader to the +correct place in this mind map. )](ressources/arbre.png){width=300px #fig:systeme-figure} @@ -537,17 +607,17 @@ which are in turn generally developed over several paragraphs. )](ressources/europe_t16.png){#fig:europe} -The nested structure that we have just evidenced demands of course a nesting -structure to accomodate it. 
More precisely it guides our search of XML elements -by giving us several constraints: we are looking for a pair of elements, the -first representing a (sub)section must be able to include both itself and the -second element, which does not have any special constraint except the one to -have a semantics compatible with our purpose of using it to represent section -titles. In addition, the first element must be able to contain several `<p/>` -elements, `<p/>` being the reference element to encode paragraphs according to -the XML-TEI documentation. - -We have seen that the *dictionaries* module was equiped with a questionable but +The nested structure that has just been evidenced demands of course a nesting +structure to accommodate it. More precisely, it guides the search for XML elements +by adding several constraints: what is required is a pair of elements. The first +one, representing a (sub)section, must be able to include both itself and the +second one, which does not have any special constraint except that of having +semantics compatible with the purpose of being used to represent section titles. +In addition, the first element must be able to contain several `<p/>` elements, +`<p/>` being the reference element to encode paragraphs according to the XML-TEI +documentation. + +The *dictionaries* module has been shown to be equipped with a questionable but possible element for subject domains. However, it does not include any element for section titles. In the rest of the TEI specification, the elements `<head/>` and `<title/>` — the latter with the possibility to set its `type` attribute to @@ -562,41 +632,42 @@ article with an `<entryFree/>`, an element supposed to relax some constraint to accommodate more unusual structure in dictionaries does not bring any improvement. -The lack of results from these simple queries forces us to somewhat release the -constraints on the encoding we are willing to use.
We can for instance make the -asumption that the occurrence of an intermediate element could be needed between -the element wrapping the whole article and the recursing one used to encode each -section. This "section" element could also need a companion element to be able -to include itself, or, to formalise it in terms of graph theory, we could relax -the condition that this element admits a loop to consider instead cycles of a -given (small, this still needs to represent a fairly direct inclusion) length to -be enough. We simultaneously extend the maximum depth of the inclusion paths we -are looking for between `<entry/>`, the pair of elements and the `<p/>` element. +The lack of results from these simple queries forces one to somewhat release the +constraints on the encoding one is willing to use. The occurrence of an +intermediate element could for instance be needed between the element wrapping +the whole article and the recursing one used to encode each section. This +"section" element could also need a companion element to be able to include +itself, or, to formalise it in terms of graph theory, the condition that this +element admits a loop could be relaxed to consider instead cycles of a given +(small, this still needs to represent a fairly direct inclusion) length to be +enough. Simultaneously the maximum depth of the inclusion paths between +`<entry/>`, the pair of elements and the `<p/>` element will be increased to +yield more results. 
By setting this depth to 3, that is, by accepting one intermediate element to occur in the middle of each one of the inclusion paths that define the structure -required to encode encyclopedic discourse, we find 21 elements but none of them -stand out as an obvious good solution: all paths to include the `<p/>` element -from any *dictionaries* element either contains a `<figure/>` (which we have -encountered earlier when we were practising our graph approach to search for -inclusions between `<entry/>` and `<entryFree/>` and dismissed as not useful in -general), a `<stage/>` (reserved to stage direction in dramatic works) or a -`<state/>` (used to describe a temporary quality in a person or place), again -not even close to what we want. The paths to either `<head/>` or `<title/>` are -similarly disappointing. Again, changing `<entry/>` for `<entryFree/>` returns -the exact same candidates. If that is not a thorough proof that none of these -elements could fulfill our purpose, it is a fact than no element in this module -appears as an obvious good solution and a serious hint to keep looking somewhere -else. - -We hence widen our search to include elements outside the *dictionaries* module -which could be used to encode our sections and subsections, under the same -constraint as before to try and find a composite solution that would remain -under the `<entry/>` element even if resorting to subcomponents outside of the -dedicated module. Only three elements are returned: `<figure/>`, `<metamark/>` -and `<note/>`. 
- -The first one as we have repeatedly underlined is meant for graphic information +required to encode encyclopedic discourse, 21 elements can be found, none of +which stands out as an obvious good solution: all paths to include the `<p/>` +element from any *dictionaries* element either contain a `<figure/>` (already +discussed in section @sec:dictionaries-module when practising the graph approach +to search for inclusions between `<entry/>` and `<entryFree/>` and dismissed as +not useful in general), a `<stage/>` (reserved for stage directions in dramatic +works) or a `<state/>` (used to describe a temporary quality in a person or +place), again not even close to what is wanted. The paths to either `<head/>` or +`<title/>` are similarly disappointing. Again, changing `<entry/>` for +`<entryFree/>` returns the exact same candidates. If that is not a definite +proof that none of these elements could meet the investigated criteria, it is a fact +that no element in this module stands out as the obvious good solution and a +serious hint to keep looking somewhere else. + +Therefore, the search is extended again to include elements outside the +*dictionaries* module which could be used to encode the sections and +subsections, under the same constraint as before to try and find a composite +solution that would remain under the `<entry/>` element even if resorting to +subcomponents outside of the dedicated module. Only three elements are returned: +`<figure/>`, `<metamark/>` and `<note/>`. + +The first one, as has been repeatedly underlined, is meant for graphic information and is not suitable for text content in general. The purpose of `<metamark/>` is to transcribe the edition marks that may appear @@ -605,14 +676,14 @@ suggest an alternative reading (deletion, insertion, reordering, this is about a human editing the text from a given physical copy of it), but it is unfortunately of no use to encode a section of an article.
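
The bounded search for inclusion paths described above can be sketched in a few lines of Python. This is only a minimal illustration, not the tool used in the project: the `TOY_INCLUSIONS` graph is a hypothetical stand-in whose edges do not reproduce the actual TEI content model, the function names are invented for the example, and the bound is counted here in inclusion steps, which is an assumption about how the depth is measured.

```python
from collections import deque

# Toy inclusion graph: maps an element to the elements it may directly contain.
# These edges are illustrative assumptions, NOT the actual TEI content model.
TOY_INCLUSIONS = {
    "entry": ["sense", "form", "figure"],
    "sense": ["sense", "def", "usg"],
    "figure": ["head", "p", "figure"],
    "div": ["div", "head", "p"],
}


def inclusion_paths(graph, source, target, max_steps):
    """Return every path from source to target using at most max_steps inclusions."""
    paths = []
    queue = deque([[source]])
    while queue:
        path = queue.popleft()
        if len(path) - 1 >= max_steps:
            continue  # this path already uses the maximum number of inclusions
        for child in graph.get(path[-1], []):
            if child == target:
                paths.append(path + [child])
            elif child not in path:  # never walk through the same element twice
                queue.append(path + [child])
    return paths


def admits_loop(graph, element):
    """True when the element may directly contain itself."""
    return element in graph.get(element, [])
```

On the toy graph, `inclusion_paths(TOY_INCLUSIONS, "entry", "p", 3)` only finds a path going through `figure`, which mirrors the kind of disappointing intermediate element discussed above.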
-The first element that might at least resemble what we are looking for is the -last one, `<note/>`. It is meant to contain text, is about explaning something -and seems general enough (not specific to a given genre, or to the occurrence of -a particular object on the page). Unfortunately, its semantics still seems a bit -off compared to our need. The documentation describes it as an "additional -comment" which appears "out of the main textual stream" whereas the long -developments in articles are the very matter of the text of encyclopedias, not -mere remarks in the margins or at the foot of pages. +The first element that might at least seem acceptable is the last one, +`<note/>`. It is meant to contain text, is about explaining something and seems +general enough (not specific to a given genre, or to the occurrence of a +particular object on the page). Unfortunately, its semantics still seems a bit +off compared to what is required. The documentation describes it as an +"additional comment" which appears "out of the main textual stream" whereas the +long developments in articles are the very matter of the text of encyclopedias, +not mere remarks in the margins or at the foot of pages. ## Encoding within the *core* module {#sec:core-module} @@ -620,63 +691,75 @@ The remarks made in section @sec:dictionaries-module explain why the *dictionaries* module is unable to represent encyclopedias, where the notion of "meaning" is less central than in dictionaries and where discourse with nested structures of arbitrary depth can occur. Even composite encodings using elements -outside of the *dictionaries* module under an `<entry/>` element do not meet our -requirements.
Since the *core* module obviously accomodates these structures by -means of the `<div/>`, `<head/>` and `<p/>` elements which have the additional -advantage of carrying less semantical payload than `<sense/>` or `<def/>` we -devise an encoding scheme using them which we recommend using for other projects -aiming at representing encyclopedias. - -To remain consistent with the way we studied the *dictionaries* module we will -only concern ourselves with what happens at the level of each article, right -under the `<body/>` element. Everything related to metadata happens as expected -in the file's `<teiHeader/>` which is well-enough equiped to handle them. In -order to present our scheme throughout the following section we will be -progressively encoding a reference article, "Cathète" from tome 9 reproduced in -Figure @fig:cathete-photo. +outside of the *dictionaries* module under an `<entry/>` element do not meet the +requirements of the project. Since the *core* module obviously accommodates these +structures by means of the `<div/>`, `<head/>` and `<p/>` elements which have +the additional advantage of carrying less semantic payload than `<sense/>` or +`<def/>`, these elements will be used to devise an encoding scheme which can be +recommended for other projects aiming at representing encyclopedias. + +To remain consistent with the way the *dictionaries* module was studied, only +what happens at the level of each individual article will be considered, that is +right under the `<body/>` element representing a whole volume. Everything +related to its metadata happens as expected in the file's `<teiHeader/>` which +is well-enough equipped to handle it. In order to present the scheme throughout +the following section, a reference article, "Cathète" from tome 9 (reproduced in +Figure @fig:cathete-photo), will be progressively encoded.
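
As a rough illustration of the nesting that `<div/>`, `<head/>` and `<p/>` allow, the sketch below builds a small tree with Python's standard `xml.etree.ElementTree` module. The `make_section` and `depth` helpers and all titles and paragraphs are hypothetical placeholders introduced for the example; this is not how `soprano` produces its output.

```python
import xml.etree.ElementTree as ET


def make_section(title, paragraphs=(), subsections=()):
    """Build a <div> holding a <head>, some <p> and nested <div> elements."""
    div = ET.Element("div")
    ET.SubElement(div, "head").text = title
    for text in paragraphs:
        ET.SubElement(div, "p").text = text
    for sub in subsections:
        div.append(sub)
    return div


def depth(div):
    """Nesting depth of <div> elements, counting the given one."""
    return 1 + max((depth(child) for child in div.findall("div")), default=0)


# Placeholder content, not text taken from the encyclopedia.
article = make_section(
    "Headword",
    paragraphs=["Introductory paragraph."],
    subsections=[
        make_section(
            "Section title",
            paragraphs=["First paragraph.", "Second paragraph."],
            subsections=[make_section("Subsection title", ["Nested paragraph."])],
        )
    ],
)
xml_text = ET.tostring(article, encoding="unicode")
```

Because a `<div/>` can contain further `<div/>` elements, the same helper covers sections, subsections and deeper levels without any change to the element set.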
)](ressources/cathète_t9.png){#fig:cathete-photo} Remaining within the *core* module for the structure, almost all useful elements -are available and our encoding scheme merely quotes the official documentation. -Each article is represented by a `<div/>`. We suggest setting an `xml:id` -attribute on it with the head word of the entry — unique in the whole corpus, or -made so by suffixing a number representing its rank among the various -occurrences, even when there's only one for the sake of regularity — as its -value, normalised to lowercase, stripping spaces and replacing all +are available and practically no additional documentation is needed beyond the +official TEI guidelines. Each article is represented by a `<div/>`. Setting an +`xml:id` attribute on it with a unique value will make it easier to identify, browse and +retrieve the articles from the encoded corpus. An auto-increasing serial would +of course provide an appropriate value for such a unique attribute but has some +drawbacks: as long as the article segmentation isn't fixed (which could happen +if choices regarding entries and sub-entries were to change over the course of a project or +if, as is the case for DISCO-LGE, the automatic segmentation went through +successive improvement steps), the identifiers of articles would massively +change from one version to the other, even for articles segmented correctly. Given +the iterative nature of many studies in digital humanities, this would make it +harder to use results found early in a project. For this reason, the values used +for `xml:id` in project DISCO-LGE depend only on the local quality of the +segmentation and remain globally stable. They are computed as the headword of +the entries normalised to lowercase, stripping spaces and replacing all non-alphanumerical characters by a dash (`'-'`) to avoid issues with the XML -encoding. Figure @fig:cathete-xml-0 illustrates this choice for the container -element on the article "Cathète" previously displayed.
+encoding, and suffixed by a serial to distinguish between the few entries +sharing the same headword. Thus, if an oversegmentation or an undersegmentation is +fixed (meaning respectively that two "articles" get merged or that one +"article" actually contained several which get split as such) only articles with +the same headword are impacted. Figure @fig:cathete-xml-0 illustrates this +choice for the container element on the article "Cathète" previously displayed. {#fig:cathete-xml-0} Inside this element should be a `<head/>` enclosing the headword of the article. The usual sub-`<hi/>` elements are available within `<head/>` if the headword is highlighted by any special typographic means such as bold, small capitals, etc. -The one disappointment of the encoding scheme we are defining in this chapter is +The one disappointment of the encoding scheme being defined in this chapter is the lack of support for a proper way to encode subject indicators. -The best candidate we have found so far was `<usg/>` from the *dictionaries* -module but it is not available directly under a `<head/>` element. All inclusion -paths from the latter to the former of length less than or equal to 3 contain -irrelevant elements (`<cit/>`, `<figure/>`, `<castList/>` and `<nym/>`) so it -must be discarded. The next best elements appear to be `<term/>` (not very -accurate) and `<rs/>` ("referring string", quite a general semantics but a -possible match — subject indicators refer to a given domain of knowledge — -although all the examples in the documentation refer to concrete persons, -places or object, not to the abstract objects that mathematics or poetry are).
- -For this reason, we do not recommend any special encoding of the subject -indicator but leave it open to each particular context: they are often -abbreviated so an `<abbr/>` may apply, in *La Grande Encyclopédie*, biographies -are not labeled by a knowledge domain but usually include the first name of the -person when it is known so in that case an element like `<persName/>` is still -appropriate. This choice applied to the same article "Cathète" produces Figure -@fig:cathete-xml-1. +The best candidate found so far was `<usg/>` from the *dictionaries* module but +it is not available directly under a `<head/>` element. All inclusion paths from +the latter to the former of length less than or equal to 3 contain irrelevant +elements (`<cit/>`, `<figure/>`, `<castList/>` and `<nym/>`) so it must be +discarded. The next best elements appear to be `<term/>` (not very accurate) and +`<rs/>` ("referring string", quite a general semantics but a possible match, +since subject indicators refer to a given domain of knowledge, although all the +examples in the documentation refer to concrete persons, places or objects, not +to the abstract objects that mathematics or poetry are). + +For this reason, no particular encoding of the subject indicator is recommended +and it is left open to each particular context: they are often abbreviated so an +`<abbr/>` may apply; in *La Grande Encyclopédie*, biographies are not labeled by +a knowledge domain but usually include the first name of the person when it is +known, so in that case an element like `<persName/>` is still appropriate. This +choice applied to the same article "Cathète" produces Figure @fig:cathete-xml-1. {#fig:cathete-xml-1} -We then propose to wrap each different meaning in a separate `<div/>` with the +Each different meaning could then be wrapped in a separate `<div/>` with the `type` attribute set to `sense` to refer to the `<sense/>` element that would have been used within the *dictionaries* module.
The `<div/>`s should be numbered according to the order they appear in with the `n` attribute starting from `0` @@ -711,16 +794,16 @@ Figure @fig:boumerang-photo, which should be encoded the standard way by {#fig:boumerang-xml} Another issue arising from giving up on `<entry/>` is the unavailability of the -`<xr/>` element, not allowed under any of the *core* elements we use but which -is useful to represent cross-references occurring in encyclopedias as well as in +`<xr/>` element, not allowed under any of the *core* elements used but which is +useful to represent cross-references occurring in encyclopedias as well as in dictionaries, for example in article "Gelocus" (see Figure @fig:gelocus-photo). -We prefer to use the `<ref/>` element instead which is available in the context -of a `<p/>`. Its `target` attribute should be set to the `xml:id` of the +The `<ref/>` element, which is available in the context of a `<p/>`, is +preferred instead. Its `target` attribute should be set to the `xml:id` of the article it points to, prefixed with a `'#'` as shown in Figure @fig:gelocus-xml. Another solution would have been to introduce a `<dictScrap/>` element for the -sole purpose of placing an `<xr/>` but we advocate against it on account of the -verbosity it would add to the encoding and the fact that it implicitly suggests -that the previous context was not the one of a dictionary. +sole purpose of placing an `<xr/>` but this would add unwanted verbosity to the +encoding and implicitly suggest that the previous context was not the one of a +dictionary, which is rather problematic. )](ressources/gelocus_t18.png){#fig:gelocus-photo} @@ -739,7 +822,7 @@ the text, like the beginning of a new column of text or of a new page. Figure @fig:alcala-photo shows the top left of the last page of the first tome of *La Grande Encyclopédie* which features peritext elements while marking the beginning of a new page.
The usual appropriate elements (`<pb/>` for page -beginning, `<cb/>` for column beginning) may and should be used with our +beginning, `<cb/>` for column beginning) may and should be used with this encoding scheme as demonstrated by Figure @fig:alcala-xml. )](ressources/last_page_top_left_t1.png){width=350px #fig:alcala-photo} @@ -752,8 +835,8 @@ developed within the scope of project DISCO-LGE to automatically identify individual articles in the flow of raw text from the columns and to encode them into XML-TEI files. Though this software has already been used to produce the first TEI version of *La Grande Encyclopédie*, it does not follow perfectly yet -the specification we have just described. Figure @fig:cathete-xml-current shows -the encoded version of article "Cathète" it currently produces: +the specification described in this chapter. Figure @fig:cathete-xml-current +shows the encoded version of article "Cathète" it currently produces: {#fig:cathete-xml-current} @@ -797,7 +880,7 @@ information (for the second one, adjacent to a notion as elusive as truth) which requires a very deep understanding of a text in its entirety and about which even some human experts may disagree. -For these reasons, a central concern in the design of our encoding scheme was to +For these reasons, a central concern in the design of an encoding scheme was to remain within the boundaries of information that can be described objectively and extracted automatically by an algorithm. Most of the tags presented in section @sec:core-module contain information about the positions of the elements @@ -806,30 +889,29 @@ like `<head/>` can be inferred simply from their position and the frequent use of a special typography like bold or upper-case characters. The case of cross-references is particular and may appear as a counter-example -to the main principle on which our scheme is based. Actually, the process of +to the main principle on which this scheme is based. 
Actually, the process of linking from an article to another one is so frequent (in dictionaries as well as in encyclopedias) that it generally escapes the scope of regular discourse to take a special and often fixed form, inside parentheses and after a special token which invites the reader to perform the redirection. In *La Grande -Encyclopédie*, virtually all the redirections (that is, to the extent of our -knowledge, absolutely all of them though of course some special case may exist, -but they are statistically rare enough that we have not found any yet) appear -within parenthesis, and start with the verb "voir" abbreviated as a single, -capital "V." as illustrated in the article "Gelocus" (see again Figure -@fig:gelocus-photo). - -Although this has not been implemented yet either, we hope to be able to detect -and exploit those patterns to correctly encode cross-references. Getting the -`target` attributes right is certainly more difficult to achieve and may require +Encyclopédie*, virtually all the redirections appear within parentheses (at +least no counter-example has been found within the scope of the project), and +start with the verb "voir" abbreviated as a single, capital "V." as illustrated +in the article "Gelocus" (see again Figure @fig:gelocus-photo). + +Although this has not been implemented yet either, being able to detect and +exploit those patterns to correctly encode cross-references does not pose any +fundamental theoretical problem and should be achievable. Getting the `target` +attributes right is certainly more difficult to achieve and may require processing the articles in several steps, to first discover all the existing headwords — and hence article IDs — before trying to match the words following -"V." with them.
Since our automated encoder handles tomes separately and since -references may cross the boundaries of tomes, it cannot wait for the target of a -cross-reference to be discovered by keeping the articles in memory before -outputting them. +"V." with them. Since the automated encoder implemented in the project handles +tomes separately and since references may cross the boundaries of tomes, it +cannot wait for the target of a cross-reference to be discovered by keeping the +articles in memory before outputting them. -This is in line with the last important aspect of our encoder. If many -lexicographers may deem our encoding too shallow, it has the advantage of not +This is in line with the last important aspect of the encoder. If many +lexicographers may deem this encoding too shallow, it has the advantage of not requiring overly complex data structures to be kept in memory for a long time. The algorithm implementing it in `soprano` outputs elements as soon as it can. This is immediate for simple elements such as `<pb/>` or `<fw/>`; for articles, it @@ -843,50 +925,55 @@ lowered to around forty minutes on a machine with 16 GB of RAM for the whole of ## Comparison to other approaches The previous section about the structure of the *dictionaries* module and the -features found in encyclopedias follows quite closely our own journey trying to -encode first manually then by automatic means the articles of our corpus. This -back and forth between trying to find patterns in the graph which reflects the patterns -found in the text and questioning the relevance of the results explains the -choice we ended up making but also the alternatives we have considered. - -Several times, the issue of the semantics of some elements which posess the -properties we need came up. This is the case for instance of the `<sense/>` and -`<node/>` elements.
It is very tempting to bend their documented semantics or to -consider that their inclusion properties is part of what defines them, and hence -justifies their ways in creative ways not directly recommended by the TEI -specifications. - -This is the approach followed by project BASNUM[^BASNUM]. In the articles -encoded for this project, `<note/>` elements are nested and used to structure -the encyclopedic developments that occur in the articles. - -We have chosen not to follow the same path in the name of the FAIR principles to -avoid the emergence of a custom usage differing from the documented one. - -The other major reason behind our choice was the inclusion rules which exist -between TEI elements and pushed us to look for different combinations. Another -valid approach would have consisted in changing the structure of the inclusion -graph itself, that is to say modify the rules. If `<entry/>` is the perfect -element to encode article themselves, all that is really missing is the ability -to accomodate nested structures with the `<div/>` element. This would also have -the advantage of recovering the `<usg/>` and `<xr/>` elements which we have -recognised as useful and which we lose as part of the tradeoff to get nested -sections. Generating customised TEI schemas is made really easy with tools like -ROMA ([https://roma.tei-c.org/](https://roma.tei-c.org/)), which we used to -preview our change and suggest it to the TEI community. +features found in encyclopedias reflects the issues which have arisen +over the course of the project while trying to encode first manually and then +by automatic means the articles of its corpus. This back and forth between +trying to find patterns in the graph which reflect the patterns found in the +text and questioning the relevance of the results explains the choice advocated +in this chapter but also the alternatives considered.
+ +Several elements exhibited some interesting properties, having for instance +inclusion paths corresponding to the structure needed to represent +the nested structure of articles. This is the case for instance of the +`<sense/>` and `<note/>` elements. It is very tempting to bend their documented +semantics or to consider that their inclusion properties are part of what defines +them, and hence justify using them in creative ways not directly recommended +by the TEI specifications. + +This is the approach followed by project BASNUM (see section +@sec:starting-point). In the articles encoded for this project, `<note/>` +elements are nested and used to structure the encyclopedic developments that +occur in the articles. + +For the sake of the FAIR principles, this was not the path chosen by project +DISCO-LGE, in order to avoid the emergence of a custom usage differing from the +one documented in the official guidelines. + +The other major reason behind the choice that was ultimately made was the +existing TEI rules governing element inclusions, which prompted the search for +different combinations. Another valid approach would have consisted in changing +the structure of the inclusion graph itself, that is to say modifying the rules. If +`<entry/>` is the perfect element to encode articles themselves, all that is +really missing is the ability to accommodate nested structures with the `<div/>` +element. This would also have the advantage of recovering the `<usg/>` and +`<xr/>` elements which appear useful and which are lost as part of the tradeoff +to get nested sections. Generating customised TEI schemas is made really easy +with tools like ROMA ([https://roma.tei-c.org/](https://roma.tei-c.org/)), which +was used to preview this change and suggest it to the TEI community. Although the suggestion did not gain wide support, some suggested it could be used locally within the scope of project DISCO-LGE.
However we chose not to do so, partially -for the same reasons of interoperability as the previous scenario, but also for -reasons of sturdiness in front of future evolutions. Making sure the alternative -schema would remain useful entails to maintain it, regenerating it should the -schema format evolve, with the risk that the tools to edit it might change or -stop being maintained. +within the scope of project DISCO-LGE. However it was preferred not to do so, +partially for the same reasons of interoperability as the previous scenario, but +also for reasons of robustness against future changes. Making sure the +alternative schema would remain useful entails maintaining it, regenerating it +should the schema format evolve, with the risk that the tools to edit it might +stop being maintained or that some conflicts between this change and future +modifications of the official guidelines might arise. # Conclusion -Though they are very close genres and share a common history, we have evidenced -key aspects on which dictionaries and encyclopedias differ. Not only do entries +Though they are very close genres and share a common history, key differences +between dictionaries and encyclopedias have been evidenced. Not only do entries tend to be longer in encyclopedias, they often have a deeper structure too. Their purpose also departs from the purpose of dictionaries from their inception, and, as anticipated by their pioneers, results in a different form of @@ -894,15 +981,16 @@ discourse. The structure of the XML-TEI *dictionaries* module reflects the assumptions made by the eponymous genre and does not appear to be flexible enough to accommodate -encyclopedias. Forcing its use to some encyclopedic articles would breach the -semantics of some elements or require the encoder to break the rules of the -consortium's schema which we think would result in a less reusable encoding in -opposition to the FAIR principles.
- -We have devised and presented an encoding scheme which fully complies with -XML-TEI while being able to represent the content of encyclopedias in all their -complexity. A first implementation of this encoding, incomplete as it may be, -demonstrates its practical usefulness. +encyclopedias, despite the colossal effort which has gone into making it +expressive enough for the wide variety of existing dictionaries. Forcing its use +on some encyclopedic articles would breach the semantics of some elements or +require the encoder to break the rules of the consortium's schema, which would +result in a less reusable encoding in opposition to the FAIR principles. + +An encoding scheme which fully complies with XML-TEI while being able to +represent the content of encyclopedias in all their complexity has been provided +and demonstrated on concrete examples. The tool `soprano`, partially +implementing this set of conventions, demonstrates their practical usefulness. # Acknowledgement {-}