Commit b64c9ddb authored by Alice Brenon

Is it over yet ? Can I go to bed ?

parent 34300a87
...@@ -48,7 +48,7 @@ Finally, different strategies followed by other projects are discussed. ...@@ -48,7 +48,7 @@ Finally, different strategies followed by other projects are discussed.
Although both terms have been used rather interchangeably over the past few Although both terms have been used rather interchangeably over the past few
centuries, a dichotomy is now commonly being made between dictionaries and centuries, a dichotomy is now commonly being made between dictionaries and
encyclopedias. A simple oppositon can easily justify this distinction: encyclopedias. A simple opposition can easily justify this distinction:
dictionaries define words and tell one how to use them while encyclopedia dictionaries define words and tell one how to use them while encyclopedia
usually go into longer development to give a more comprehensive and scientific usually go into longer development to give a more comprehensive and scientific
understanding of the concept being defined. This common intuition links back to understanding of the concept being defined. This common intuition links back to
...@@ -60,8 +60,8 @@ corresponding respectively to language, history, and science and arts ...@@ -60,8 +60,8 @@ corresponding respectively to language, history, and science and arts
dictionaries. The first type corresponds to modern dictionaries while the two dictionaries. The first type corresponds to modern dictionaries while the two
others are similar to what one expects to find in an encyclopedia. others are similar to what one expects to find in an encyclopedia.
However, d'Alembert himself doesn't think of these boundaries as absolute and he However, d'Alembert himself doesn't think of these boundaries as very strict and
hints at the extreme difficulty in merely defining words without going into he hints at the extreme difficulty in merely defining words without going into
semantics and philosophical considerations: semantics and philosophical considerations:
> un dictionnaire de langues, qui paroît n'être qu'un dictionnaire de mots, doit > un dictionnaire de langues, qui paroît n'être qu'un dictionnaire de mots, doit
...@@ -87,23 +87,25 @@ dictionaries. The intrinsic complexity of dictionaries has been well identified ...@@ -87,23 +87,25 @@ dictionaries. The intrinsic complexity of dictionaries has been well identified
since the inception of the project [@tei_vault] and @ide_encoding_1995 since the inception of the project [@tei_vault] and @ide_encoding_1995
underlines the amount of work which went into the third version of the underlines the amount of work which went into the third version of the
guidelines (P3) to provide a toolbox both general and expressive enough to guidelines (P3) to provide a toolbox both general and expressive enough to
account for the variety of conventions found in dictionaries. account for the variety of conventions found in dictionaries. This module has
@romary_formal_2007 This module has been successfully used to encode both been successfully used to encode both historical [@williams2017], [@bohbot2018]
historical [@williams2017], [@bohbot2018] and digitally native dictionaries and digitally native dictionaries [@bowers_bridging_2018]. In addition, a
[@bowers_bridging_2018]. In addition, a specific set of guidelines tailored to encoding specific set of guidelines tailored to encoding dictionaries named TEI-Lex0 has also
dictionaries named TEI-Lex0 has also been published [@banski_tei_lex0_2017]. been published [@banski_tei_lex0_2017].
The TEI effort is described as "first steps" by @ide_background_1998 to reach a The TEI effort is described as "first steps" by @ide_background_1998 to reach a
standard to encode corpora and lay a common basis for corpora comparisons and standard to encode corpora and lay a common basis for corpora comparison and
reuse. They point out some light inconsistencies in the design, remark that there is reuse. They point out some light inconsistencies in the design, remark that there is
generally more than one way to encode a given text in XML-TEI and identify nine generally more than one way to encode a given text in XML-TEI and identify nine
criteria to design a sound standard. Their claims are backed by concrete criteria to design a sound standard. Their claims are backed by concrete
examples of encoding situations but without giving any idea of the prevalence of examples of encoding situations but give no idea of the prevalence of the issues
the issues found. In fact, the sheer complexity of the guidelines can make it reported. In fact, the sheer complexity of the guidelines can make it hard to
hard to ascertain whether a particular element structure is impossible to ascertain whether a particular element structure is impossible to represent (not
represent (not finding a suitable encoding is not a proof that there is none). finding a suitable encoding is not a proof that there is none). This chapter
This chapter will use results from graph theory to give a systematic study of will use results from graph theory to make a systematic study of the
the possibilities and shortcomings of the TEI *dictionaries* module. possibilities and shortcomings of the TEI *dictionaries* module, hence providing
an additional proof that encyclopedias are not dictionaries and that the
inclusion claimed by Haiman is a strict one.
# Context of the study # Context of the study
...@@ -134,7 +136,7 @@ pictures with an Optical Characters Recognition (OCR) system. This prevented an ...@@ -134,7 +136,7 @@ pictures with an Optical Characters Recognition (OCR) system. This prevented an
exhaustive study of the text with textometry tools such as TXM [@heiden2010]. As exhaustive study of the text with textometry tools such as TXM [@heiden2010]. As
a prelude to project GEODE a prelude to project GEODE
([https://geode-project.github.io/](https://geode-project.github.io/)), the goal ([https://geode-project.github.io/](https://geode-project.github.io/)), the goal
of CollEx-Persée was to produce a digital version of *LGE* with a quality of DISCO-LGE was to produce a digital version of *LGE* with a quality
comparable to the one of l'*EDdA* provided by the ARTFL comparable to the one of l'*EDdA* provided by the ARTFL
([http://artfl-project.uchicago.edu/](http://artfl-project.uchicago.edu/)) ([http://artfl-project.uchicago.edu/](http://artfl-project.uchicago.edu/))
project in order to conduct a diachronic study of both encyclopedias. project in order to conduct a diachronic study of both encyclopedias.
...@@ -163,7 +165,7 @@ Beyond this moral reproach, the concept that pleased Rabelais was somewhat dated ...@@ -163,7 +165,7 @@ Beyond this moral reproach, the concept that pleased Rabelais was somewhat dated
at the end of the 17^th^ century and attacked in the at the end of the 17^th^ century and attacked in the
*Dictionnaire Universel François et Latin*, commonly referred to as the *Dictionnaire Universel François et Latin*, commonly referred to as the
*Dictionnaire de Trevoux*, as utterly "burlesque" ("parodic"). The entry for *Dictionnaire de Trevoux*, as utterly "burlesque" ("parodic"). The entry for
"Encyclopédie" remained unchanged in the four editons issued between 1721 and "Encyclopédie" remained unchanged in the four editions issued between 1721 and
1752, mocking the use of the word and discouraging his readers to pursue it. In 1752, mocking the use of the word and discouraging his readers to pursue it. In
that intent, he quotes a poem from Pibrac encouraging people to specialise in that intent, he quotes a poem from Pibrac encouraging people to specialise in
only one discipline lest they should not reach perfection, based on an only one discipline lest they should not reach perfection, based on an
...@@ -187,13 +189,13 @@ what could possibly not be within reach of a single man, within a single ...@@ -187,13 +189,13 @@ what could possibly not be within reach of a single man, within a single
lifetime may be achieved by a common effort throughout generations. lifetime may be achieved by a common effort throughout generations.
History hints that Diderot's opponents took his defence of the feasibility of History hints that Diderot's opponents took his defence of the feasibility of
the project quite seriously, considering the fact that they got the the project quite seriously, considering the fact that they got the *EDdA*'s
*EDdA*'s privileges to be revoked again six years after its publication privileges to be revoked again six years after its publication was resumed
was resumed [@moureau2001]. As a consequence, the remaining ten volumes [@moureau2001]. As a consequence, the remaining ten volumes containing the text
containing the text of the articles had to be published illegally until 1765, of the articles had to be published illegally until 1765, thanks to the secret
thanks to the secret protection of Malesherbes who — despite being head of royal protection of Malesherbes who — despite being head of royal censorship — saved
censorship — saved the manuscripts from destruction. They were printed secretly the manuscripts from destruction. They were printed secretly outside of Paris
outside of Paris and the books were (falsely) labeled as coming from Neufchâtel. and the books were (falsely) labeled as coming from "Neufchâtel" (*sic*).
Following the high demand from the booksellers who feared they would lose the Following the high demand from the booksellers who feared they would lose the
money they had invested in the project, a special privilege was issued for the money they had invested in the project, a special privilege was issued for the
volumes containing the plates, which were released publicly from 1762 to 1772. volumes containing the plates, which were released publicly from 1762 to 1772.
...@@ -245,11 +247,10 @@ to future scientific projects, which in particular requires it to be ...@@ -245,11 +247,10 @@ to future scientific projects, which in particular requires it to be
([https://www.go-fair.org/fair-principles/](https://www.go-fair.org/fair-principles/)) ([https://www.go-fair.org/fair-principles/](https://www.go-fair.org/fair-principles/))
principles (*findability*, *accessibility*, *interoperability* and principles (*findability*, *accessibility*, *interoperability* and
*reusability*) which are important guidelines for efficient, high-quality *reusability*) which are important guidelines for efficient, high-quality
research. The XML-TEI guidelines provide tools to achieve this goal. This research. This section starts by describing the existing toolset provided by the
section therefore starts by describing the existing toolset it provides, before XML-TEI guidelines to achieve this goal, before introducing some notations and
introducing some notations and tools from graph theory which will be used to tools from graph theory which will be used to browse the guidelines in a
browse the guidelines in a systematic and thorough way in section systematic and thorough way in section @sec:new-standard.
@sec:new-standard.
## A good starting point {#sec:starting-point} ## A good starting point {#sec:starting-point}
...@@ -292,32 +293,57 @@ almost 80 possible child elements (79.91) within any given element, manually ...@@ -292,32 +293,57 @@ almost 80 possible child elements (79.91) within any given element, manually
browsing such a massive network can prove quite difficult as the number of browsing such a massive network can prove quite difficult as the number of
combinations sharply increases with each step. combinations sharply increases with each step.
The problem can be advantageously transformed by representing this network as a The problem can be advantageously transformed to benefit from the results of
graph to benefit from the results of graph theory. Classical, well-known methods graph theory by representing the network of the XML elements as a directed graph
such as Dijkstra's algorithm [@dijkstra59] which computes the shortest path whose nodes are connected or not depending on the inclusion rules of the
between two nodes in a graph can then be applied guidelines. Classical, well-known traversal techniques such as Dijkstra's algorithm
[@dijkstra59] which computes the shortest path between two nodes in a graph and
reports when they are not connected can then be applied to compute
systematically all the possible ways to nest a given element under another
without any risk of forgetting a route because of human error.
Though particular caution should be applied to the results provided by this
algorithm because there is no guarantee that the shortest path is meaningful in
general, it at least provides an efficient way to check whether a given element
may or may not be nested at all under another one and gives a lower bound on the
length of a meaningful path if it exists. The accuracy of this heuristic
decreases as the length of the path increases in the perfect graph representing
the intended, meaningful path between two nodes that a human specialist of the
TEI framework could build.
The XML-TEI guidelines graph will hence be defined as follows. One node is
created for each one of the 590 elements found in the specification. Then, an
edge is placed between source node `A` and destination `B` if the schema states
that the element represented by `B` can be contained directly under the element
represented by `A`. That is, the edges in the graph represent the relation "is
an admissible direct parent of". Please note that the word "element" is here
used with the same meaning as in the TEI documentation to refer to the
conceptual device characterised by a given tag name such as `p` or `div` and not
to a particular instance of them that may occur in a given document. Figure
@fig:dictionaries-subgraph, by using this transformation to display only the
*dictionaries* module, hints at the overall complexity of the whole
specification.
![The subgraph of the *dictionaries* module](ressources/dictionaries.png){height=830px #fig:dictionaries-subgraph}
directed graph, using elements of XML-TEI as nodes and placing edges if the With this definition, moving from one node to another on the graph has an
destination node may be contained within the source node according to the XML-TEI counterpart. Following an edge from `A` to `B` can be understood as
schema. Please note that the word "element" is here used with the same meaning preparing an XML structure of an `<A/>` element containing a `<B/>` element like
as in the TEI documentation to refer to the conceptual device characterised by a this:
given tag name such as `p` or `div` and not to a particular instance of them
that may occur in a given document. Figure @fig:dictionaries-subgraph, by using
this transformation to display the *dictionaries* module, hints at the overall
complexity of the whole specification.
![The subgraph of the *dictionaries* module](ressources/dictionaries.png){height=830px #fig:dictionaries-subgraph} ```xml
<A>
<B/>
</A>
```
By iterating several times the operation of moving on that graph along one edge, By iterating several times the operation of moving on that graph along one edge,
that is, by considering the transitive closure of the relation "be connected by that is, by considering the transitive closure of the relation "be connected by
an edge" one defines *inclusion paths*, allowing to explore which elements may an edge" one defines *inclusion paths*, allowing to explore which elements may
be nested under which other. be nested (arbitrarily deep) under which other. The nodes visited along the way
represent the intermediate XML elements required to construct a valid XML tree
The nodes visited along the way represent the intermediate XML elements to according to the TEI schema. Given the top-down semantics of those trees, the
construct a valid XML tree according to the TEI schema. Given the top-down length of an inclusion path will be called its *depth*.
semantics of those trees, the length of an inclusion path will be called its
*depth*.
The ability for an element to contain itself corresponds directly to loops on The ability for an element to contain itself corresponds directly to loops on
the graph (that is an edge from a node to itself) as can be illustrated by the the graph (that is an edge from a node to itself) as can be illustrated by the
...@@ -332,56 +358,37 @@ one, it may contain a `<geogName/>` which, in turn, may contain a new ...@@ -332,56 +358,37 @@ one, it may contain a `<geogName/>` which, in turn, may contain a new
`<address/>` element. From a graph theory perspective, one can say that it `<address/>` element. From a graph theory perspective, one can say that it
admits an inclusion cycle of length two. admits an inclusion cycle of length two.
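The shortest-path search described above can be sketched as follows. This is only an illustration under stated assumptions: the `CONTAINS` dictionary is a hand-picked excerpt limited to the inclusions mentioned in the text, not the 590-element graph extracted from the schema, and the function name `shortest_inclusion_path` is hypothetical. Since every edge has the same weight, a breadth-first search is used here in place of Dijkstra's algorithm; on such a graph both return paths of the same depth.

```python
from collections import deque

# Hand-picked excerpt of the inclusion relation (assumption: only the edges
# stated in the text are listed, the real TEI graph is far larger).
# An edge A -> B means "A is an admissible direct parent of B".
CONTAINS = {
    "body": ["entry", "entryFree", "superEntry"],
    "superEntry": ["entry"],
    "entryFree": ["entry", "pos"],
    "entry": ["form", "gramGrp"],
    "form": ["pos"],
    "gramGrp": ["pos"],
    "address": ["geogName"],
    "geogName": ["address"],
}


def shortest_inclusion_path(graph, source, target):
    """Return one shortest inclusion path from source down to target.

    The result is the list of elements visited, or None when target cannot
    be nested under source at all. Asking for source == target returns the
    shortest inclusion cycle through that element, if any.
    """
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        for child in graph.get(path[-1], []):
            if child == target:
                return path + [child]
            if child not in seen:
                seen.add(child)
                queue.append(path + [child])
    return None


def depth(path):
    """Depth of an inclusion path, that is its number of edges."""
    return len(path) - 1
```

On this excerpt, `shortest_inclusion_path(CONTAINS, "address", "address")` returns `["address", "geogName", "address"]`, the inclusion cycle of length two, while a query from `"pos"` to `"body"` returns `None` because no path exists in that direction.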
Using classical, well-known methods such as Dijkstra's algorithm [@dijkstra59] Using inclusion paths lets one find for instance that although `<pos/>` may not
lets one explore the shortest inclusion paths that exist between elements. be directly included within `<entry/>` elements to include information about the
Though a particular caution should be applied because there is no guarantee that
the shortest path is meaningful in general, it at least provides an
efficient way to check whether a given element may or not be nested at all under
another one and gives a lower bound on the length of the path to expect. Of
course the accuracy of this heuristic decreases as the length of the elements
increases in the perfect graph representing the intended, meaningful path
between two nodes that a human specialist of the TEI framework could build.
This is still very useful when taking into account the fact that TEI modules are
merely "bags" to group the elements and provide hints to human encoders about
the tools they might need but have no implication on the inclusion paths between
elements which cross module boundaries freely. The general graph formalism
enables one to describe complex filtering patterns and to implement queries to
look for them among the elements exhaustively by algorithmic means even when the
shortest-path approach is not enough.
For instance, it lets one find that although `<pos/>` may not be directly
included within `<entry/>` elements to include information about the
part-of-speech of the word that an article defines, the correct way to do so is part-of-speech of the word that an article defines, the correct way to do so is
through a `<form/>` or a `<gramGrp/>`. through a `<form/>` or a `<gramGrp/>` because a thorough traversal reporting all
the possible paths will contain `entry-form-pos` and `entry-gramGrp-pos`. It is
On the other hand, trying to discover the shortest inclusion path to `<pos/>` left to the human encoder to rate the relevance of the path found and to select
from the `<TEI/>` root of the document yields a `<standOff/>`, an element an appropriate one. A total lack of path proves the impossibility of an
dedicated to store contextual data that accompanies but is not part of the text, inclusion; an abnormally high length for the shortest path is a serious hint
not unlike an annex, and widely unrelated to the context of encoding an that the inclusion should not be possible and is not meaningful.
encyclopedia.
Another relevant example on the use of these methods can be given by querying
A last relevant example on the use of these methods can be given by querying the the shortest inclusion path of a `<pos/>` under the `<body/>` of the document:
shortest inclusion path of a `<pos/>` under the `<body/>` of the document: it it yields an inclusion directly through `<entryFree/>` (with an inclusion path
yields an inclusion directly through `<entryFree/>` (with an inclusion path of of length 2), which unlike `<entry/>` accepts it as a direct child node.
length 2), which unlike `<entry/>` accepts it as a direct child node. Possibly Possibly not what is wanted depending on the regularity of the articles being
not what is wanted depending on the regularity of the articles being encoded and encoded and the occurrence of other grammatical information such as `<case/>` or
the occurrence of other grammatical information such as `<case/>` or `<gen/>` to `<gen/>` to justify the use of the `<gramGrp/>`, but searching exhaustively for
justify the use of the `<gramGrp/>`, but searching exhaustively for paths up to paths up to length 3 returns as expected the path through `<entry/>`, among
length 3 returns as expected the path through `<entry/>`, among others. The big others. The big picture starts to appear: `<pos/>` does not need to be nested
picture starts to appear: `<pos/>` does not need to be nested very deep, it can very deep, it can appear quite near the "surface" of article entries.
appear quite near the "surface" of article entries.
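The exhaustive search for paths up to a given depth can be sketched in the same spirit. Again, this is a minimal illustration and not the tooling actually used for the study: the `CONTAINS` excerpt only lists inclusions stated in the text, and `inclusion_paths` is a hypothetical name. The depth bound is what keeps the enumeration finite in the presence of inclusion cycles.

```python
# Hand-picked excerpt of the inclusion relation (assumption: only edges
# stated in the text, not the full TEI schema).
CONTAINS = {
    "body": ["entry", "entryFree", "superEntry"],
    "superEntry": ["entry"],
    "entryFree": ["entry", "pos"],
    "entry": ["form", "gramGrp"],
    "form": ["pos"],
    "gramGrp": ["pos"],
}


def inclusion_paths(graph, source, target, max_depth):
    """List every inclusion path from source to target of depth <= max_depth.

    A path stops as soon as it reaches target; the depth bound guarantees
    termination even when the graph contains loops or cycles.
    """
    paths = []

    def walk(path):
        # len(path) - 1 is the depth (number of edges) reached so far.
        if len(path) - 1 >= max_depth:
            return
        for child in graph.get(path[-1], ()):
            extended = path + [child]
            if child == target:
                paths.append(extended)
            else:
                walk(extended)

    walk([source])
    return paths


if __name__ == "__main__":
    for found in inclusion_paths(CONTAINS, "body", "pos", 3):
        print("-".join(found))
```

With a bound of 2 only the route through `entryFree` is found on this excerpt; raising the bound to 3 adds the two routes through `entry`, via `form` and via `gramGrp`. Rating the relevance of each path remains the job of the human encoder.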
## Content of the module ## Content of the module
The central element of the *dictionaries* module is the `<entry/>` element meant The central element of the *dictionaries* module is the `<entry/>` element meant
to encode one single entry in a dictionary, that is to say a head word to encode one single entry in a dictionary, that is to say a head word
associated to its definition. It is the natural way in from the `<body/>` associated to its definition. It is the natural way in from the `<body/>`
element to the dictionary module: indeed, although `<body/>` may also contain element to the *dictionaries* module: indeed, although `<body/>` may also
`<entryFree/>` or `<superEntry/>` elements, the former is a relaxed version of contain `<entryFree/>` or `<superEntry/>` elements, the former is a relaxed
`<entry/>` while the latter is a device to group several related entries version of `<entry/>` while the latter is a device to group several related
together. Both can contain an `<entry/>` directly while no obvious inclusion entries together. Both can contain an `<entry/>` directly while no obvious
exists the other way around: most (> 96.2%) of the inclusion paths of inclusion exists the other way around: most (> 96.2%) of the inclusion paths of
"reasonable" depth (which will be arbitrarily defined as strictly inferior to 5, "reasonable" depth (which will be arbitrarily defined as strictly inferior to 5,
that is twice the average shortest depth between any two nodes) either include that is twice the average shortest depth between any two nodes) either include
`<figure/>` or `<castList/>`, two very specific elements which should not need `<figure/>` or `<castList/>`, two very specific elements which should not need
...@@ -389,8 +396,8 @@ to appear in an article in general, showing that the purpose of `<entry/>` is ...@@ -389,8 +396,8 @@ to appear in an article in general, showing that the purpose of `<entry/>` is
not to contain an `<entryFree/>` or `<superEntry/>`. Hence, not only the not to contain an `<entryFree/>` or `<superEntry/>`. Hence, not only the
semantics conveyed by the documentation but also the structure of the elements semantics conveyed by the documentation but also the structure of the elements
graph evidence `<entry/>` as the natural top-most element for an article. This graph evidence `<entry/>` as the natural top-most element for an article. This
somewhat contrived example hopes to further demonstrate the application of a example demonstrates again how a graph-centred approach can provide insights
graph-centred approach to understand the inner workings of the XML-TEI schema. about the XML-TEI schema.
Once a block for an article is created, it may contain elements useful to Once a block for an article is created, it may contain elements useful to
represent various of its features. Its written and spoken forms are usually represent various of its features. Its written and spoken forms are usually
...@@ -504,10 +511,10 @@ which organise them into a domain classification system. Those generally cover a ...@@ -504,10 +511,10 @@ which organise them into a domain classification system. Those generally cover a
broad range of subjects from scientific disciplines to literature, and broad range of subjects from scientific disciplines to literature, and
extending to political subjects and law. extending to political subjects and law.
No element in the *dictionaries* module is explicitly designed for the purpose These indicators have no element in the *dictionaries* module explicitly
of encoding these indicators. As section @sec:dictionaries-module illustrates, designed to encode them. As section @sec:dictionaries-module illustrates, the
the elements set is geared towards the words themselves instead of the concept elements set is geared towards the words themselves instead of the concept they
they represent. The tool closest to what is needed can be found in the `<usg/>` represent. The tool closest to what is needed can be found in the `<usg/>`
element used with a specific `type` attribute set to `dom` for "domain". Indeed element used with a specific `type` attribute set to `dom` for "domain". Indeed
several examples from the documentation encode subject indicators very similar several examples from the documentation encode subject indicators very similar
to the ones found in encyclopedias within this element, but the match is not to the ones found in encyclopedias within this element, but the match is not
...@@ -515,14 +522,14 @@ perfect either: all appear within one of multiple senses, as if to clarify each ...@@ -515,14 +522,14 @@ perfect either: all appear within one of multiple senses, as if to clarify each
context in which the word can be used, as expected from the element's name, context in which the word can be used, as expected from the element's name,
"usage". In encyclopedias, if the domain indicator does in certain cases help to "usage". In encyclopedias, if the domain indicator does in certain cases help to
distinguish between several entries sharing the same headword, the concept distinguish between several entries sharing the same headword, the concept
itself has evolved beyond this mere distinction. Looking back at the itself has evolved beyond this mere distinction. Looking back at the *EDdA*, the
*EDdA*, the adjective *raisonné* in the rest of the title directly adjective *raisonné* in the rest of the title directly introduces a notion of
introduces a notion of structure that links back to the "Systême figuré des structure that links back to the "Systême figuré des connoissances humaines"
connoissances humaines" [@blanchard2002, p. 1] whose schematic structure is [@blanchard2002, p. 1] whose schematic structure is shown in Figure
shown in Figure @fig:systeme-figure. The authors have devised a branching system @fig:systeme-figure. The authors have devised a branching system to classify all
to classify all knowledge, and the occurrence at the beginning of articles, more knowledge, and the occurrence at the beginning of articles, more than a tool to
than a tool to clear up possible ambiguities also points the reader to the clear up possible ambiguities also points the reader to the correct place in
correct place in this mind map. this mind map.
!["Systême figuré des connoissances humaines", the taxonomy at the heart of the Encyclopédie ([Wikimedia Commons](https://commons.wikimedia.org/wiki/File:ENC_SYSTEME_FIGURE.jpeg?uselang=fr#filelinks))](ressources/arbre.png){width=300px #fig:systeme-figure} !["Systême figuré des connoissances humaines", the taxonomy at the heart of the Encyclopédie ([Wikimedia Commons](https://commons.wikimedia.org/wiki/File:ENC_SYSTEME_FIGURE.jpeg?uselang=fr#filelinks))](ressources/arbre.png){width=300px #fig:systeme-figure}
......
...@@ -269,3 +269,11 @@ ...@@ -269,3 +269,11 @@
author = {d'Alembert}, author = {d'Alembert},
editor = {Morrissey, Robert and Roe, Glenn}, editor = {Morrissey, Robert and Roe, Glenn},
} }
@misc{tei_vault,
type = {Text},
title = {Previous drafts of the {Guidelines}},
url = {https://tei-c.org/Vault/Vault-GL.html},
language = {en},
urldate = {2023-05-31},
}