From b64c9ddb1e2816ea0316c2c4b28fa5d02e54f342 Mon Sep 17 00:00:00 2001 From: Alice BRENON <alice.brenon@ens-lyon.fr> Date: Thu, 1 Jun 2023 02:57:12 +0200 Subject: [PATCH] Is it over yet ? Can I go to bed ? --- ICHLL_Brenon.md | 217 +++++++++++++++++++++++++----------------------- biblio.bib | 8 ++ 2 files changed, 120 insertions(+), 105 deletions(-) diff --git a/ICHLL_Brenon.md b/ICHLL_Brenon.md index db101cd..71d57d0 100644 --- a/ICHLL_Brenon.md +++ b/ICHLL_Brenon.md @@ -48,7 +48,7 @@ Finally, different strategies followed by other projects are discussed. Although both terms have been used rather interchangeably over the past few centuries, a dichotomy is now commonly being made between dictionaries and -encyclopedias. A simple oppositon can easily justify this distinction: +encyclopedias. A simple opposition can easily justify this distinction: dictionaries define words and tell one how to use them while encyclopedia usually go into longer development to give a more comprehensive and scientific understanding of the concept being defined. This common intuition links back to @@ -60,8 +60,8 @@ corresponding respectively to language, history, and science and arts dictionaries. The first type corresponds to modern dictionaries while the two others are similar to what one expects to find in an encyclopedia. -However, d'Alembert himself doesn't think of these boundaries as absolute and he -hints at the extreme difficulty in merely defining words without going into +However, d'Alembert himself doesn't think of these boundaries as very strict and +he hints at the extreme difficulty in merely defining words without going into semantics and philosophical considerations: > un dictionnaire de langues, qui paroît n'être qu'un dictionnaire de mots, doit @@ -87,23 +87,25 @@ dictionaries. 
The intrinsic complexity of dictionaries has been well identified since the inception of the project [@tei_vault] and @ide_encoding_1995 underlines the amount of work which went into the third version of the guidelines (P3) to provide a toolbox both general and expressive enough to -account for the variety of conventions found in dictionaries. -@romary_formal_2007 This module has been successfully used to encode both -historical [@williams2017], [@bohbot2018] and digitally native dictionaries -[@bowers_bridging_2018]. In addition, a specific guidelines tailored at encoding -dictionaries named TEI-Lex0 has also been published [@banski_tei_lex0_2017]. +account for the variety of conventions found in dictionaries. This module has +been successfully used to encode both historical [@williams2017], [@bohbot2018] +and digitally native dictionaries [@bowers_bridging_2018]. In addition, a +specific guidelines tailored at encoding dictionaries named TEI-Lex0 has also +been published [@banski_tei_lex0_2017]. The TEI effort is described as "first steps" by @ide_background_1998 to reach a -standard to encode corpora and lay a common basis for corpora comparisons and +standard to encode corpora and lay a common basis for corpora comparison and reuse. They point some light inconsistencies in the design, remark that there is generally more than one way to encode a given text in XML-TEI and identify nine criteria to design a sound standard. Their claims are backed by concrete -examples of encoding situations but without giving any idea of the prevalence of -the issues found. In fact, the sheer complexity of the guidelines can make it -hard to ascertain whether a particular element structure is impossible to -represent (not finding a suitable encoding is not a proof that there is none). -This chapter will use results from graph theory to give a systematic study of -the possibilities and shortcomings of the TEI *dictionaries* module. 
+examples of encoding situations but give no idea of the prevalence of the issues +reported. In fact, the sheer complexity of the guidelines can make it hard to +ascertain whether a particular element structure is impossible to represent (not +finding a suitable encoding is not a proof that there is none). This chapter +will use results from graph theory to make a systematic study of the +possibilities and shortcomings of the TEI *dictionaries* module, hence providing +an additional proof that encyclopedias are not dictionaries and that the +inclusion claimed by Haiman is a strict one. # Context of the study @@ -134,7 +136,7 @@ pictures with an Optical Characters Recognition (OCR) system. This prevented an exhaustive study of the text with textometry tools such as TXM [@heiden2010]. As a prelude to project GEODE ([https://geode-project.github.io/](https://geode-project.github.io/)), the goal -of CollEx-Persée was to produce a digital version of *LGE* with a quality +of DISCO-LGE was to produce a digital version of *LGE* with a quality comparable to the one of l'*EDdA* provided by the ARTFL ([http://artfl-project.uchicago.edu/](http://artfl-project.uchicago.edu/)) project in order to conduct a diachronic study of both encyclopedias. @@ -163,7 +165,7 @@ Beyond this moral reproach, the concept that pleased Rabelais was somewhat dated at the end of the 17^th^ century and attacked in the *Dictionnaire Universel François et Latin*, commonly refered to as the *Dictionnaire de Trevoux*, as utterly "burlesque" ("parodic"). The entry for -"Encyclopédie" remained unchanged in the four editons issued between 1721 and +"Encyclopédie" remained unchanged in the four editions issued between 1721 and 1752, mocking the use of the word and discouraging his readers to pursue it. 
In that intent, he quotes a poem from Pibrac encouraging people to specialise in only one discipline lest they should not reach perfection, based on an @@ -187,13 +189,13 @@ what could possibly not be within reach of a single man, within a single lifetime may be achieved by a common effort throughout generations. History hints that Diderot's opponents took his defence of the feasability of -the project quite seriously, considering the fact that they got the -*EDdA*'s privileges to be revoked again six years after its publication -was resumed [@moureau2001]. As a consequence, the remaining ten volumes -containing the text of the articles had to be published illegally until 1765, -thanks to the secret protection of Malesherbes who — despite being head of royal -censorship — saved the manuscripts from destruction. They were printed secretly -outside of Paris and the books were (falsely) labeled as coming from Neufchâtel. +the project quite seriously, considering the fact that they got the *EDdA*'s +privileges to be revoked again six years after its publication was resumed +[@moureau2001]. As a consequence, the remaining ten volumes containing the text +of the articles had to be published illegally until 1765, thanks to the secret +protection of Malesherbes who — despite being head of royal censorship — saved +the manuscripts from destruction. They were printed secretly outside of Paris +and the books were (falsely) labeled as coming from "Neufchâtel" (*sic*). Following the high demand from the booksellers who feared they would lose the money they had invested in the project, a special privilege was issued for the volumes containing the plates, which were released publicly from 1762 to 1772. 
@@ -245,11 +247,10 @@ to future scientific projects, which in particular requires it to be ([https://www.go-fair.org/fair-principles/](https://www.go-fair.org/fair-principles/)) principles (*findability*, *accessibility*, *interoperability* and *reusability*) which are important guideline for efficient, high-quality -research. The XML-TEI guidelines provide tools to achieve this goal. This -section therefore starts by describing the existing toolset it provides, before -introducing some notations and tools from graph theory which will be used to -browse the guidelines in a systematic and thorough way in section -@sec:new-standard. +research. This section starts by describing the existing toolset provided by the +XML-TEI guidelines to achieve this goal, before introducing some notations and +tools from graph theory which will be used to browse the guidelines in a +systematic and thorough way in section @sec:new-standard. ## A good starting point {#sec:starting-point} @@ -292,32 +293,57 @@ almost 80 possible child elements (79.91) within any given element, manually browsing such an massive network can prove quite difficult as the number of combinations sharply increases with each step. -The problem can be advantageously transformed by representing this network as a -graph to benefit from the results of graph theory. Classical, well-known methods -such as Dijkstra's algorithm [@dijkstra59] which computes the shortest path -between two nodes in a graph can then be applied +The problem can be advantageously transformed to benefit from the results of +graph theory by representing the network of the XML elements as a directed graph +which nodes are connected or not depending on the inclusion rules of the +guidelines. 
Classical, well-known traversal techniques such as Dijkstra's algorithm +[@dijkstra59] which computes the shortest path between two nodes in a graph and +reports when they are not connected can then be applied to compute +systematically all the possible ways to nest a given element under another +without any risk of forgetting a route because of human error. + +Though a particular caution should be applied on the results provided by this +algorithm because there is no guarantee that the shortest path is meaningful in +general, it at least provides an efficient way to check whether a given element +may or may not be nested at all under another one and gives a lower bound on the +length of a meaningful path if it exists. The accuracy of this heuristic +decreases as the length of the path increases in the perfect graph representing +the intended, meaningful path between two nodes that a human specialist of the +TEI framework could build. + +The XML-TEI guidelines graph will hence be defined as follows. One node is +created for each one of the 590 elements found in the specification. Then, an +edge is placed between source node `A` and destination `B` if the schema states +that the element represented by `B` can be contained directly under the element +represented by `A`. That is, the edges in the graph represent the relation "is +an admissible direct parent of". Please note that the word "element" is here +used with the same meaning as in the TEI documentation to refer to the +conceptual device characterised by a given tag name such as `p` or `div` and not +to a particular instance of them that may occur in a given document. Figure +@fig:dictionaries-subgraph, by using this transformation to display only the +*dictionaries* module, hints at the overall complexity of the whole +specification. 
+{height=830px #fig:dictionaries-subgraph} -directed graph, using elements of XML-TEI as nodes and placing edges if the -destination node may be contained within the source node according to the -schema. Please note that the word "element" is here used with the same meaning -as in the TEI documentation to refer to the conceptual device characterised by a -given tag name such as `p` or `div` and not to a particular instance of them -that may occur in a given document. Figure @fig:dictionaries-subgraph, by using -this transformation to display the *dictionaries* module, hints at the overall -complexity of the whole specification. +With this definition, moving from one node to another on the graph has an +XML-TEI counterpart. Following an edge from `A` to `B` can be understood as +preparing an XML structure of an `<A/>` element containing a `<B/>` element like +this: -{height=830px #fig:dictionaries-subgraph} +```xml +<A> + <B/> +</A> +``` By iterating several times the operation of moving on that graph along one edge, that is, by considering the transitive closure of the relation "be connected by an edge" one defines *inclusion paths*, allowing to explore which elements may -be nested under which other. - -The nodes visited along the way represent the intermediate XML elements to -construct a valid XML tree according to the TEI schema. Given the top-down -semantics of those trees, the length of an inclusion path will be called its -*depth*. +be nested (arbitrarily deep) under which other. The nodes visited along the way +represent the intermediate XML elements required to construct a valid XML tree +according to the TEI schema. Given the top-down semantics of those trees, the +length of an inclusion path will be called its *depth*. 
The ability for an element to contain itself corresponds directly to loops on the graph (that is an edge from a node to itself) as can be illustrated by the @@ -332,56 +358,37 @@ one, it may contain a `<geogName/>` which, in turn, may contain a new `<address/>` element. From a graph theory perspective, one can say that it admits an inclusion cycle of length two. -Using classical, well-known methods such as Dijkstra's algorithm [@dijkstra59] -lets one explore the shortest inclusion paths that exist between elements. -Though a particular caution should be applied because there is no guarantee that -the shortest path is meaningful in general, it at least provides an -efficient way to check whether a given element may or not be nested at all under -another one and gives a lower bound on the length of the path to expect. Of -course the accuracy of this heuristic decreases as the length of the elements -increases in the perfect graph representing the intended, meaningful path -between two nodes that a human specialist of the TEI framework could build. - -This is still very useful when taking into account the fact that TEI modules are -merely "bags" to group the elements and provide hints to human encoders about -the tools they might need but have no implication on the inclusion paths between -elements which cross module boundaries freely. The general graph formalism -enables one to describe complex filtering patterns and to implement queries to -look for them among the elements exhaustively by algorithmic means even when the -shortest-path approach is not enough. 
- -For instance, it lets one find that although `<pos/>` may not be directly -included within `<entry/>` elements to include information about the +Using inclusion paths lets one find for instance that although `<pos/>` may not +be directly included within `<entry/>` elements to include information about the part-of-speech of the word that an article defines, the correct way to do so is -through a `<form/>` or a `<gramGrp/>`. - -On the other hand, trying to discover the shortest inclusion path to `<pos/>` -from the `<TEI/>` root of the document yields a `<standOff/>`, an element -dedicated to store contextual data that accompanies but is not part of the text, -not unlike an annex, and widely unrelated to the context of encoding an -encyclopedia. - -A last relevant example on the use of these methods can be given by querying the -shortest inclusion path of a `<pos/>` under the `<body/>` of the document: it -yields an inclusion directly through `<entryFree/>` (with an inclusion path of -length 2), which unlike `<entry/>` accepts it as a direct child node. Possibly -not what is wanted depending on the regularity of the articles being encoded and -the occurrence of other grammatical information such as `<case/>` or `<gen/>` to -justify the use of the `<gramGrp/>`, but searching exhaustively for paths up to -length 3 returns as expected the path through `<entry/>`, among others. The big -picture starts to appear: `<pos/>` does not need to be nested very deep, it can -appear quite near the "surface" of article entries. +through a `<form/>` or a `<gramGrp/>` because a thorough traversal reporting all +the possible paths will contain `entry-form-pos` and `entry-gramGrp-pos`. It is +left to the human encoder to rate the relevance of the paths found and to select +an appropriate one. A total lack of path proves the impossibility of an +inclusion; an abnormally high length for the shortest path is a serious hint +that the inclusion should not be possible and is not meaningful. 
+ +Another relevant example on the use of these methods can be given by querying +the shortest inclusion path of a `<pos/>` under the `<body/>` of the document: +it yields an inclusion directly through `<entryFree/>` (with an inclusion path +of length 2), which unlike `<entry/>` accepts it as a direct child node. +Possibly not what is wanted depending on the regularity of the articles being +encoded and the occurrence of other grammatical information such as `<case/>` or +`<gen/>` to justify the use of the `<gramGrp/>`, but searching exhaustively for +paths up to length 3 returns as expected the path through `<entry/>`, among +others. The big picture starts to appear: `<pos/>` does not need to be nested +very deep, it can appear quite near the "surface" of article entries. ## Content of the module The central element of the *dictionaries* module is the `<entry/>` element meant to encode one single entry in a dictionary, that is to say a head word associated to its definition. It is the natural way in from the `<body/>` -element to the dictionary module: indeed, although `<body/>` may also contain -`<entryFree/>` or `<superEntry/>` elements, the former is a relaxed version of -`<entry/>` while the latter is a device to group several related entries -together. Both can contain an `<entry/` directly while no obvious inclusion -exists the other way around: most (> 96.2%) of the inclusion paths of +element to the *dictionaries* module: indeed, although `<body/>` may also +contain `<entryFree/>` or `<superEntry/>` elements, the former is a relaxed +version of `<entry/>` while the latter is a device to group several related +entries together. 
Both can contain an `<entry/>` directly while no obvious +inclusion exists the other way around: most (> 96.2%) of the inclusion paths of "reasonable" depth (which will be arbitrarily defined as strictly inferior to 5, that is twice the average shortest depth between any two nodes) either include `<figure/>` or `<castList/>`, two very specific elements which should not need @@ -389,8 +396,8 @@ to appear in an article in general, showing that the purpose of `<entry/>` is not to contain an `<entryFree/>` or `<superEntry/>`. Hence, not only the semantics conveyed by the documentation but also the structure of the elements graph evidence `<entry/>` as the natural top-most element for an article. This -somewhat contrived example hopes to further demonstrate the application of a -graph-centred approach to understand the inner workings of the XML-TEI schema. +example demonstrates again how a graph-centred approach can provide insights +about the XML-TEI schema. Once a block for an article is created, it may contain elements useful to represent various of its features. Its written and spoken forms are usually @@ -504,10 +511,10 @@ which organise them into a domain classification system. Those generally cover a broad range of subjects from scientific disciplines to litterature, and extending to political subjects and law. -No element in the *dictionaries* module is explicitely designed for the purpose -of encoding these indicators. As section @sec:dictionaries-module illustrates, -the elements set is geared towards the words themselves instead of the concept -they represent. The tool closest to what is needed can be found in the `<usg/>` +These indicators have no element in the *dictionaries* module explicitly +designed to encode them. As section @sec:dictionaries-module illustrates, the +elements set is geared towards the words themselves instead of the concept they +represent. 
The tool closest to what is needed can be found in the `<usg/>` element used with a specific `type` attribute set to `dom` for "domain". Indeed several examples from the documentation encode subject indicators very similar to the ones found in encyclopedias within this element, but the match is not @@ -515,14 +522,14 @@ perfect either: all appear within one of multiple senses, as if to clarify each context in which the word can be used, as expected from the element's name, "usage". In encyclopedias, if the domain indicator does in certain cases help to distinguish between several entries sharing the same headword, the concept -itself has evolved beyond this mere distinction. Looking back at the -*EDdA*, the adjective *raisonné* in the rest of the title directly -introduces a notion of structure that links back to the "Systême figuré des -connoissances humaines" [@blanchard2002, p. 1] which schematic structure is -shown in Figure @fig:systeme-figure. The authors have devised a branching system -to classify all knowledge, and the occurrence at the beginning of articles, more -than a tool to clear up possible ambiguities also points the reader to the -correct place in this mind map. +itself has evolved beyond this mere distinction. Looking back at the *EDdA*, the +adjective *raisonné* in the rest of the title directly introduces a notion of +structure that links back to the "Systême figuré des connoissances humaines" +[@blanchard2002, p. 1] which schematic structure is shown in Figure +@fig:systeme-figure. The authors have devised a branching system to classify all +knowledge, and the occurrence at the beginning of articles, more than a tool to +clear up possible ambiguities also points the reader to the correct place in +this mind map. 
)](ressources/arbre.png){width=300px #fig:systeme-figure} diff --git a/biblio.bib b/biblio.bib index 1296359..c94abb9 100644 --- a/biblio.bib +++ b/biblio.bib @@ -269,3 +269,11 @@ author = {d'Alembert}, editor = {Morrissey, Robert and Roe, Glenn}, } + +@misc{tei_vault, + type = {Text}, + title = {Previous drafts of the {Guidelines}}, + url = {https://tei-c.org/Vault/Vault-GL.html}, + language = {en}, + urldate = {2023-05-31}, +} -- GitLab