Is it over yet ? Can I go to bed ?

Commit b64c9ddb authored 2 years ago by Alice Brenon
--- a/ICHLL_Brenon.md
+++ b/ICHLL_Brenon.md
@@ -48,7 +48,7 @@ Finally, different strategies followed by other projects are discussed.
 Although both terms have been used rather interchangeably over the past few
 centuries, a dichotomy is now commonly being made between dictionaries and
-encyclopedias. A simple oppositon can easily justify this distinction:
+encyclopedias. A simple opposition can easily justify this distinction:
 dictionaries define words and tell one how to use them while encyclopedia
 usually go into longer development to give a more comprehensive and scientific
 understanding of the concept being defined. This common intuition links back to
@@ -60,8 +60,8 @@ corresponding respectively to language, history, and science and arts
 dictionaries. The first type corresponds to modern dictionaries while the two
 others are similar to what one expects to find in an encyclopedia.
-However, d'Alembert himself doesn't think of these boundaries as absolute and he
+However, d'Alembert himself doesn't think of these boundaries as very strict and
-hints at the extreme difficulty in merely defining words without going into
+he hints at the extreme difficulty in merely defining words without going into
 semantics and philosophical considerations:
 > un dictionnaire de langues, qui paroît n'être qu'un dictionnaire de mots, doit
@@ -87,23 +87,25 @@ dictionaries. The intrinsic complexity of dictionaries has been well identified
 since the inception of the project [@tei_vault] and @ide_encoding_1995
 underlines the amount of work which went into the third version of the
 guidelines (P3) to provide a toolbox both general and expressive enough to
-account for the variety of conventions found in dictionaries.
+account for the variety of conventions found in dictionaries. This module has
-@romary_formal_2007 This module has been successfully used to encode both
+been successfully used to encode both historical [@williams2017], [@bohbot2018]
-historical [@williams2017], [@bohbot2018] and digitally native dictionaries
+and digitally native dictionaries [@bowers_bridging_2018]. In addition, a
-[@bowers_bridging_2018]. In addition, a specific guidelines tailored at encoding
+specific set of guidelines tailored to encoding dictionaries, named TEI-Lex0, has also
-dictionaries named TEI-Lex0 has also been published [@banski_tei_lex0_2017].
+been published [@banski_tei_lex0_2017].
 The TEI effort is described as "first steps" by @ide_background_1998 to reach a
-standard to encode corpora and lay a common basis for corpora comparisons and
+standard to encode corpora and lay a common basis for corpora comparison and
 reuse. They point out some slight inconsistencies in the design, remark that there is
 generally more than one way to encode a given text in XML-TEI and identify nine
 criteria to design a sound standard. Their claims are backed by concrete
-examples of encoding situations but without giving any idea of the prevalence of
+examples of encoding situations but give no idea of the prevalence of the issues
-the issues found. In fact, the sheer complexity of the guidelines can make it
+reported. In fact, the sheer complexity of the guidelines can make it hard to
-hard to ascertain whether a particular element structure is impossible to
+ascertain whether a particular element structure is impossible to represent (not
-represent (not finding a suitable encoding is not a proof that there is none).
+finding a suitable encoding is not a proof that there is none). This chapter
-This chapter will use results from graph theory to give a systematic study of
+will use results from graph theory to make a systematic study of the
-the possibilities and shortcomings of the TEI *dictionaries* module.
+possibilities and shortcomings of the TEI *dictionaries* module, hence providing
+an additional proof that encyclopedias are not dictionaries and that the
+inclusion claimed by Haiman is a strict one.
 # Context of the study
@@ -134,7 +136,7 @@ pictures with an Optical Characters Recognition (OCR) system. This prevented an
 exhaustive study of the text with textometry tools such as TXM [@heiden2010]. As
 a prelude to project GEODE
 ([https://geode-project.github.io/](https://geode-project.github.io/)), the goal
-of CollEx-Persée was to produce a digital version of *LGE* with a quality
+of DISCO-LGE was to produce a digital version of *LGE* with a quality
 comparable to the one of l'*EDdA* provided by the ARTFL
 ([http://artfl-project.uchicago.edu/](http://artfl-project.uchicago.edu/))
 project in order to conduct a diachronic study of both encyclopedias.
@@ -163,7 +165,7 @@ Beyond this moral reproach, the concept that pleased Rabelais was somewhat dated
 at the end of the 17^th^ century and attacked in the
 *Dictionnaire Universel François et Latin*, commonly referred to as the
 *Dictionnaire de Trevoux*, as utterly "burlesque" ("parodic"). The entry for
-"Encyclopédie" remained unchanged in the four editons issued between 1721 and
+"Encyclopédie" remained unchanged in the four editions issued between 1721 and
 1752, mocking the use of the word and discouraging his readers to pursue it. In
 that intent, he quotes a poem from Pibrac encouraging people to specialise in
 only one discipline lest they should not reach perfection, based on an
@@ -187,13 +189,13 @@ what could possibly not be within reach of a single man, within a single
 lifetime may be achieved by a common effort throughout generations.
 History hints that Diderot's opponents took his defence of the feasibility of
-the project quite seriously, considering the fact that they got the
+the project quite seriously, considering the fact that they got the *EDdA*'s
-*EDdA*'s privileges to be revoked again six years after its publication
+privileges to be revoked again six years after its publication was resumed
-was resumed [@moureau2001]. As a consequence, the remaining ten volumes
+[@moureau2001]. As a consequence, the remaining ten volumes containing the text
-containing the text of the articles had to be published illegally until 1765,
+of the articles had to be published illegally until 1765, thanks to the secret
-thanks to the secret protection of Malesherbes who — despite being head of royal
+protection of Malesherbes who — despite being head of royal censorship — saved
-censorship — saved the manuscripts from destruction. They were printed secretly
+the manuscripts from destruction. They were printed secretly outside of Paris
-outside of Paris and the books were (falsely) labeled as coming from Neufchâtel.
+and the books were (falsely) labeled as coming from "Neufchâtel" (*sic*).
 Following the high demand from the booksellers who feared they would lose the
 money they had invested in the project, a special privilege was issued for the
 volumes containing the plates, which were released publicly from 1762 to 1772.
@@ -245,11 +247,10 @@ to future scientific projects, which in particular requires it to be
 ([https://www.go-fair.org/fair-principles/](https://www.go-fair.org/fair-principles/))
 principles (*findability*, *accessibility*, *interoperability* and
 *reusability*) which are important guidelines for efficient, high-quality
-research. The XML-TEI guidelines provide tools to achieve this goal. This
+research. This section starts by describing the existing toolset provided by the
-section therefore starts by describing the existing toolset it provides, before
+XML-TEI guidelines to achieve this goal, before introducing some notations and
-introducing some notations and tools from graph theory which will be used to
+tools from graph theory which will be used to browse the guidelines in a
-browse the guidelines in a systematic and thorough way in section
+systematic and thorough way in section @sec:new-standard. 
-@sec:new-standard. 
 ## A good starting point {#sec:starting-point}
@@ -292,32 +293,57 @@ almost 80 possible child elements (79.91) within any given element, manually
 browsing such a massive network can prove quite difficult as the number of
 combinations sharply increases with each step.
-The problem can be advantageously transformed by representing this network as a
+The problem can be advantageously transformed to benefit from the results of
-graph to benefit from the results of graph theory. Classical, well-known methods
+graph theory by representing the network of the XML elements as a directed graph
-such as Dĳkstra's algorithm [@dĳkstra59] which computes the shortest path
+whose nodes are connected or not depending on the inclusion rules of the
-between two nodes in a graph can then be applied 
+guidelines. Classical, well-known traversal techniques such as Dĳkstra's algorithm
+[@dĳkstra59] which computes the shortest path between two nodes in a graph and
+reports when they are not connected can then be applied to compute
+systematically all the possible ways to nest a given element under another
+without any risk of forgetting a route because of human error.
+Though a particular caution should be applied on the results provided by this
+algorithm because there is no guarantee that the shortest path is meaningful in
+general, it at least provides an efficient way to check whether a given element
+may or not be nested at all under another one and gives a lower bound on the
+length of a meaningful path if it exists. The accuracy of this heuristic
+decreases as the length of the path increases in the perfect graph representing
+the intended, meaningful path between two nodes that a human specialist of the
+TEI framework could build.
+The XML-TEI guidelines graph will hence be defined as follows. One node is
+created for each one of the 590 elements found in the specification. Then, an
+edge is placed between source node `A` and destination `B` if the schema states
+that the element represented by `B` can be contained directly under the element
+represented by `A`. That is, the edges in the graph represent the relation "is
+an admissible direct parent of". Please note that the word "element" is here
+used with the same meaning as in the TEI documentation to refer to the
+conceptual device characterised by a given tag name such as `p` or `div` and not
+to a particular instance of them that may occur in a given document. Figure
+@fig:dictionaries-subgraph, by using this transformation to display only the
+*dictionaries* module, hints at the overall complexity of the whole
+specification.
+![The subgraph of the *dictionaries* module](ressources/dictionaries.png){height=830px #fig:dictionaries-subgraph}
-directed graph, using elements of XML-TEI as nodes and placing edges if the
+With this definition, moving from one node to another on the graph has an
-destination node may be contained within the source node according to the
+XML-TEI counterpart. Following an edge from `A` to `B` can be understood as
-schema. Please note that the word "element" is here used with the same meaning
+preparing an XML structure of an `<A/>` element containing a `<B/>` element like
-as in the TEI documentation to refer to the conceptual device characterised by a
+this:
-given tag name such as `p` or `div` and not to a particular instance of them
-that may occur in a given document. Figure @fig:dictionaries-subgraph, by using
-this transformation to display the *dictionaries* module, hints at the overall
-complexity of the whole specification.
-![The subgraph of the *dictionaries* module](ressources/dictionaries.png){height=830px #fig:dictionaries-subgraph}
+```xml
+<A>
+    <B/>
+</A>
+```
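The graph construction and shortest-path query described in this hunk can be sketched as follows. This is only an illustration under stated assumptions: the `CONTAINS` mapping is a hypothetical toy excerpt built from the handful of inclusions mentioned in the text, not the 590-element graph extracted from the schema, and the function name is made up for the example. Because every edge has the same cost, a breadth-first search is used here in place of Dĳkstra's algorithm; on an unweighted graph both return a path with the fewest edges.

```python
from collections import deque

# Hypothetical toy excerpt of the relation "is an admissible direct parent of".
# It only lists inclusions mentioned in the text, NOT the real TEI schema.
CONTAINS = {
    "body": ["entry", "entryFree", "superEntry"],
    "superEntry": ["entry"],
    "entryFree": ["entry", "pos"],
    "entry": ["form", "gramGrp"],
    "form": ["pos"],
    "gramGrp": ["pos"],
    "pos": [],
}


def shortest_inclusion_path(graph, source, target):
    """Return one shortest inclusion path as a list of element names,
    or None when target can never be nested under source."""
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for child in graph.get(path[-1], []):
            if child not in seen:
                seen.add(child)
                queue.append(path + [child])
    return None


path = shortest_inclusion_path(CONTAINS, "body", "pos")
depth = len(path) - 1  # the depth of a path is its number of edges
```

On this toy graph the query from `body` to `pos` goes through `entryFree`, and a query in the opposite direction returns `None`, which is the "not connected" case the text relies on to prove that an inclusion is impossible.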
 By iterating several times the operation of moving on that graph along one edge,
 that is, by considering the transitive closure of the relation "be connected by
 an edge" one defines *inclusion paths*, allowing to explore which elements may
-be nested under which other.
+be nested (arbitrarily deep) under which other. The nodes visited along the way
+represent the intermediate XML elements required to construct a valid XML tree
-The nodes visited along the way represent the intermediate XML elements to
+according to the TEI schema. Given the top-down semantics of those trees, the
-construct a valid XML tree according to the TEI schema. Given the top-down
+length of an inclusion path will be called its *depth*.
-semantics of those trees, the length of an inclusion path will be called its
-*depth*.
 The ability for an element to contain itself corresponds directly to loops on
 the graph (that is an edge from a node to itself) as can be illustrated by the
@@ -332,56 +358,37 @@ one, it may contain a `<geogName/>` which, in turn, may contain a new
 `<address/>` element. From a graph theory perspective, one can say that it
 admits an inclusion cycle of length two.
-Using classical, well-known methods such as Dĳkstra's algorithm [@dĳkstra59]
+Using inclusion paths lets one find for instance that although `<pos/>` may not
-lets one explore the shortest inclusion paths that exist between elements.
+be directly included within `<entry/>` elements to include information about the
-Though a particular caution should be applied because there is no guarantee that
-the shortest path is meaningful in general, it at least provides an
-efficient way to check whether a given element may or not be nested at all under
-another one and gives a lower bound on the length of the path to expect. Of
-course the accuracy of this heuristic decreases as the length of the elements
-increases in the perfect graph representing the intended, meaningful path
-between two nodes that a human specialist of the TEI framework could build.
-This is still very useful when taking into account the fact that TEI modules are
-merely "bags" to group the elements and provide hints to human encoders about
-the tools they might need but have no implication on the inclusion paths between
-elements which cross module boundaries freely. The general graph formalism
-enables one to describe complex filtering patterns and to implement queries to
-look for them among the elements exhaustively by algorithmic means even when the
-shortest-path approach is not enough.
-For instance, it lets one find that although `<pos/>` may not be directly
-included within `<entry/>` elements to include information about the
 part-of-speech of the word that an article defines, the correct way to do so is
-through a `<form/>` or a `<gramGrp/>`.
+through a `<form/>` or a `<gramGrp/>` because a thorough traversal reporting all
+the possible paths will contain `entry-form-pos` and `entry-gramGrp-pos`. It is
-On the other hand, trying to discover the shortest inclusion path to `<pos/>`
+left to the human encoder to rate the relevance of the path found and to select
-from the `<TEI/>` root of the document yields a `<standOff/>`, an element
+an appropriate one. A total lack of path proves the impossibility of an
-dedicated to store contextual data that accompanies but is not part of the text,
+inclusion; an abnormally high length for the shortest path is a serious hint
-not unlike an annex, and widely unrelated to the context of encoding an
+that the inclusion should not be possible and is not meaningful.
-encyclopedia.
+Another relevant example on the use of these methods can be given by querying
-A last relevant example on the use of these methods can be given by querying the
+the shortest inclusion path of a `<pos/>` under the `<body/>` of the document:
-shortest inclusion path of a `<pos/>` under the `<body/>` of the document: it
+it yields an inclusion directly through `<entryFree/>` (with an inclusion path
-yields an inclusion directly through `<entryFree/>` (with an inclusion path of
+of length 2), which unlike `<entry/>` accepts it as a direct child node.
-length 2), which unlike `<entry/>` accepts it as a direct child node. Possibly
+Possibly not what is wanted depending on the regularity of the articles being
-not what is wanted depending on the regularity of the articles being encoded and
+encoded and the occurrence of other grammatical information such as `<case/>` or
-the occurrence of other grammatical information such as `<case/>` or `<gen/>` to
+`<gen/>` to justify the use of the `<gramGrp/>`, but searching exhaustively for
-justify the use of the `<gramGrp/>`, but searching exhaustively for paths up to
+paths up to length 3 returns as expected the path through `<entry/>`, among
-length 3 returns as expected the path through `<entry/>`, among others. The big
+others. The big picture starts to appear: `<pos/>` does not need to be nested
-picture starts to appear: `<pos/>` does not need to be nested very deep, it can
+very deep, it can appear quite near the "surface" of article entries.
-appear quite near the "surface" of article entries.
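The exhaustive search for inclusion paths up to a given depth, as opposed to the single shortest one, can be sketched in the same spirit. Again this is a minimal illustration and not the tooling used for the study: `CONTAINS` is a hypothetical toy excerpt limited to inclusions mentioned in the text, and `inclusion_paths` is a name invented for the example. Nodes may be revisited, since an element can legitimately reappear deeper in an XML tree; the depth bound is what guarantees termination.

```python
# Hypothetical toy excerpt of the "may directly contain" relation,
# restricted to inclusions mentioned in the text (NOT the real TEI schema).
CONTAINS = {
    "body": ["entry", "entryFree", "superEntry"],
    "superEntry": ["entry"],
    "entryFree": ["entry", "pos"],
    "entry": ["form", "gramGrp"],
    "form": ["pos"],
    "gramGrp": ["pos"],
    "address": ["geogName"],
    "geogName": ["address"],
    "pos": [],
}


def inclusion_paths(graph, source, target, max_depth):
    """Yield every inclusion path from source to target that uses
    between 1 and max_depth edges (cycles are allowed)."""
    stack = [[source]]
    while stack:
        path = stack.pop()
        if path[-1] == target and len(path) > 1:
            yield path
        if len(path) - 1 < max_depth:
            for child in graph.get(path[-1], []):
                stack.append(path + [child])


# All the ways to nest <pos/> under <entry/> in at most two steps.
entry_to_pos = sorted(inclusion_paths(CONTAINS, "entry", "pos", 2))
# An element reaching itself again: the inclusion cycle of length two.
address_cycle = list(inclusion_paths(CONTAINS, "address", "address", 2))
```

With a bound of 3 from `body`, the toy graph returns the direct route through `entryFree` together with the two routes through `entry`, which mirrors the behaviour described above; rating the relevance of each path remains a task for the human encoder.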
 ## Content of the module
 The central element of the *dictionaries* module is the `<entry/>` element meant
 to encode one single entry in a dictionary, that is to say a head word
 associated to its definition. It is the natural way in from the `<body/>`
-element to the dictionary module: indeed, although `<body/>` may also contain
+element to the *dictionaries* module: indeed, although `<body/>` may also
-`<entryFree/>` or `<superEntry/>` elements, the former is a relaxed version of
+contain `<entryFree/>` or `<superEntry/>` elements, the former is a relaxed
-`<entry/>` while the latter is a device to group several related entries
+version of `<entry/>` while the latter is a device to group several related
-together. Both can contain an `<entry/` directly while no obvious inclusion
+entries together. Both can contain an `<entry/>` directly while no obvious
-exists the other way around: most (> 96.2%) of the inclusion paths of
+inclusion exists the other way around: most (> 96.2%) of the inclusion paths of
 "reasonable" depth (which will be arbitrarily defined as strictly inferior to 5,
 that is twice the average shortest depth between any two nodes) either include
 `<figure/>` or `<castList/>`, two very specific elements which should not need
@@ -389,8 +396,8 @@ to appear in an article in general, showing that the purpose of `<entry/>` is
 not to contain an `<entryFree/>` or `<superEntry/>`. Hence, not only the
 semantics conveyed by the documentation but also the structure of the elements
 graph evidence `<entry/>` as the natural top-most element for an article. This
-somewhat contrived example hopes to further demonstrate the application of a
+example demonstrates again how a graph-centred approach can provide insights
-graph-centred approach to understand the inner workings of the XML-TEI schema.
+about the XML-TEI schema.
 Once a block for an article is created, it may contain elements useful to
 represent various of its features. Its written and spoken forms are usually
@@ -504,10 +511,10 @@ which organise them into a domain classification system. Those generally cover a
 broad range of subjects from scientific disciplines to literature, and
 extending to political subjects and law.
-No element in the *dictionaries* module is explicitely designed for the purpose
+These indicators have no element in the *dictionaries* module explicitly
-of encoding these indicators. As section @sec:dictionaries-module illustrates,
+designed to encode them. As section @sec:dictionaries-module illustrates, the
-the elements set is geared towards the words themselves instead of the concept
+elements set is geared towards the words themselves instead of the concept they
-they represent. The tool closest to what is needed can be found in the `<usg/>`
+represent. The tool closest to what is needed can be found in the `<usg/>`
 element used with a specific `type` attribute set to `dom` for "domain". Indeed
 several examples from the documentation encode subject indicators very similar
 to the ones found in encyclopedias within this element, but the match is not
@@ -515,14 +522,14 @@ perfect either: all appear within one of multiple senses, as if to clarify each
 context in which the word can be used, as expected from the element's name,
 "usage". In encyclopedias, if the domain indicator does in certain cases help to
 distinguish between several entries sharing the same headword, the concept
-itself has evolved beyond this mere distinction. Looking back at the
+itself has evolved beyond this mere distinction. Looking back at the *EDdA*, the
-*EDdA*, the adjective *raisonné* in the rest of the title directly
+adjective *raisonné* in the rest of the title directly introduces a notion of
-introduces a notion of structure that links back to the "Systême figuré des
+structure that links back to the "Systême figuré des connoissances humaines"
-connoissances humaines" [@blanchard2002, p. 1] which schematic structure is
+[@blanchard2002, p. 1] whose schematic structure is shown in Figure
-shown in Figure @fig:systeme-figure. The authors have devised a branching system
+@fig:systeme-figure. The authors have devised a branching system to classify all
-to classify all knowledge, and the occurrence at the beginning of articles, more
+knowledge, and the occurrence at the beginning of articles, more than a tool to
-than a tool to clear up possible ambiguities also points the reader to the
+clear up possible ambiguities also points the reader to the correct place in
-correct place in this mind map.
+this mind map.
 !["Systême figuré des connoissances humaines", the taxonomy at the heart of the Encyclopédie ([Wikimedia Commons](https://commons.wikimedia.org/wiki/File:ENC_SYSTEME_FIGURE.jpeg?uselang=fr#filelinks))](ressources/arbre.png){width=300px #fig:systeme-figure}

--- a/biblio.bib
+++ b/biblio.bib
@@ -269,3 +269,11 @@
 	author = {d'Alembert},
 	editor = {Morrissey, Robert and Roe, Glenn},
 }
+@misc{tei_vault,
+	type = {Text},
+	title = {Previous drafts of the {Guidelines}},
+	url = {https://tei-c.org/Vault/Vault-GL.html},
+	language = {en},
+	urldate = {2023-05-31},
+}