principles (*findability*, *accessibility*, *interoperability* and
principles (*findability*, *accessibility*, *interoperability* and
*reusability*) which are important guidelines for efficient, high-quality
*reusability*) which are important guidelines for efficient, high-quality
research. The XML-TEI guidelines provide tools to achieve this goal. This
research. This section starts by describing the existing toolset provided by the
section therefore starts by describing the existing toolset it provides, before
XML-TEI guidelines to achieve this goal, before introducing some notations and
introducing some notations and tools from graph theory which will be used to
tools from graph theory which will be used to browse the guidelines in a
browse the guidelines in a systematic and thorough way in section
systematic and thorough way in section @sec:new-standard.
@sec:new-standard.
## A good starting point {#sec:starting-point}
## A good starting point {#sec:starting-point}
...
@@ -292,32 +293,57 @@ almost 80 possible child elements (79.91) within any given element, manually
...
@@ -292,32 +293,57 @@ almost 80 possible child elements (79.91) within any given element, manually
browsing such a massive network can prove quite difficult as the number of
browsing such a massive network can prove quite difficult as the number of
combinations sharply increases with each step.
combinations sharply increases with each step.
The problem can be advantageously transformed by representing this network as a
The problem can be advantageously transformed to benefit from the results of
graph to benefit from the results of graph theory. Classical, well-known methods
graph theory by representing the network of the XML elements as a directed graph
such as Dijkstra's algorithm [@dijkstra59] which computes the shortest path
whose nodes are connected or not depending on the inclusion rules of the
between two nodes in a graph can then be applied
guidelines. Classical, well-known traversal techniques such as Dijkstra's algorithm
[@dijkstra59], which computes the shortest path between two nodes in a graph and
reports when they are not connected, can then be applied to compute
systematically all the possible ways to nest a given element under another
without any risk of missing a route through human error.
Particular caution should be applied to the results provided by this
algorithm because there is no guarantee that the shortest path is meaningful in
general, but it at least provides an efficient way to check whether a given element
may or may not be nested at all under another one and gives a lower bound on the
length of a meaningful path if one exists. The accuracy of this heuristic
decreases as the length of the path increases in the ideal graph representing
the intended, meaningful paths between two nodes that a human specialist of the
TEI framework could build.
The XML-TEI guidelines graph will hence be defined as follows. One node is
created for each one of the 590 elements found in the specification. Then, an
edge is placed between source node `A` and destination `B` if the schema states
that the element represented by `B` can be contained directly under the element
represented by `A`. That is, the edges in the graph represent the relation "is
an admissible direct parent of". Please note that the word "element" is here
used with the same meaning as in the TEI documentation to refer to the
conceptual device characterised by a given tag name such as `p` or `div` and not
to a particular instance of them that may occur in a given document. Figure
@fig:dictionaries-subgraph, by using this transformation to display only the
*dictionaries* module, hints at the overall complexity of the whole
specification.
{height=830px #fig:dictionaries-subgraph}
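The graph definition and the shortest-path query can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `CONTENT_MODEL` is a hand-picked toy subset of inclusion rules, not the actual TEI schema, and the names `build_edges` and `shortest_path` are hypothetical. Since every edge has the same weight, a breadth-first search is used here in place of Dijkstra's algorithm; on an unweighted graph both return a shortest path.

```python
from collections import deque

# Toy content model (assumption): parent element -> admissible direct children.
CONTENT_MODEL = {
    "body": ["entry", "entryFree"],
    "entry": ["form", "gramGrp"],
    "entryFree": ["pos"],
    "form": ["pos"],
    "gramGrp": ["pos"],
    "pos": [],
}


def build_edges(content_model):
    """Return the set of directed edges (parent, child) of the graph."""
    return {(parent, child)
            for parent, children in content_model.items()
            for child in children}


def shortest_path(content_model, source, target):
    """Breadth-first search: return a shortest inclusion path, or None."""
    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        if path[-1] == target:
            return path
        for child in content_model.get(path[-1], []):
            if child not in visited:
                visited.add(child)
                queue.append(path + [child])
    return None  # the two elements are not connected


print(shortest_path(CONTENT_MODEL, "body", "pos"))  # ['body', 'entryFree', 'pos']
print(shortest_path(CONTENT_MODEL, "pos", "body"))  # None
```

A `None` result corresponds to the case where one element can never be nested under the other, however deep.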
directed graph, using elements of XML-TEI as nodes and placing edges if the
With this definition, moving from one node to another on the graph has an
destination node may be contained within the source node according to the
XML-TEI counterpart. Following an edge from `A` to `B` can be understood as
schema. Please note that the word "element" is here used with the same meaning
preparing an XML structure of an `<A/>` element containing a `<B/>` element like
as in the TEI documentation to refer to the conceptual device characterised by a
this:
given tag name such as `p` or `div` and not to a particular instance of them
that may occur in a given document. Figure @fig:dictionaries-subgraph, by using
this transformation to display the *dictionaries* module, hints at the overall
complexity of the whole specification.
{height=830px #fig:dictionaries-subgraph}
```xml
<A>
<B/>
</A>
```
By iterating several times the operation of moving on that graph along one edge,
By iterating several times the operation of moving on that graph along one edge,
that is, by considering the transitive closure of the relation "be connected by
that is, by considering the transitive closure of the relation "be connected by
an edge", one defines *inclusion paths*, which show which elements may
an edge", one defines *inclusion paths*, which show which elements may
be nested under which others.
be nested (arbitrarily deep) under which others. The nodes visited along the way
represent the intermediate XML elements required to construct a valid XML tree
The nodes visited along the way represent the intermediate XML elements to
according to the TEI schema. Given the top-down semantics of those trees, the
construct a valid XML tree according to the TEI schema. Given the top-down
length of an inclusion path will be called its *depth*.
semantics of those trees, the length of an inclusion path will be called its
*depth*.
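The correspondence between an inclusion path and the XML structure it prepares can be made concrete with a small helper. This is a minimal sketch: the function names `path_to_xml` and `depth` are hypothetical, and a path is assumed to be a non-empty list of element names ordered from the outermost to the innermost element.

```python
def path_to_xml(path, indent="  "):
    """Render an inclusion path as a nested XML skeleton."""
    if not path:
        return ""
    lines = []
    # Open every element except the innermost one.
    for level, name in enumerate(path[:-1]):
        lines.append(f"{indent * level}<{name}>")
    # The innermost element is left empty.
    lines.append(f"{indent * (len(path) - 1)}<{path[-1]}/>")
    # Close the opened elements in reverse order.
    for level in range(len(path) - 2, -1, -1):
        lines.append(f"{indent * level}</{path[level]}>")
    return "\n".join(lines)


def depth(path):
    """Depth of an inclusion path: the number of edges it follows."""
    return max(len(path) - 1, 0)


example = ["entry", "form", "pos"]
print(path_to_xml(example))
print(depth(example))  # 2
```

Each node visited along the path becomes one level of nesting, so the depth equals the number of opening tags that wrap the innermost element.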
The ability for an element to contain itself corresponds directly to loops on
The ability for an element to contain itself corresponds directly to loops on
the graph (that is, an edge from a node to itself) as can be illustrated by the
the graph (that is, an edge from a node to itself) as can be illustrated by the
...
@@ -332,56 +358,37 @@ one, it may contain a `<geogName/>` which, in turn, may contain a new
...
@@ -332,56 +358,37 @@ one, it may contain a `<geogName/>` which, in turn, may contain a new
`<address/>` element. From a graph theory perspective, one can say that it
`<address/>` element. From a graph theory perspective, one can say that it
admits an inclusion cycle of length two.
admits an inclusion cycle of length two.
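Loops and cycles of length two can be detected mechanically. The sketch below is illustrative only: in the toy `GRAPH`, the `address`/`geogName` pair comes from the example above, while `A` is a placeholder for an element that may contain itself, and the function names are hypothetical.

```python
# Toy graph (assumption): parent element -> admissible direct children.
# "A" is a placeholder, not an actual TEI element.
GRAPH = {
    "address": ["geogName"],
    "geogName": ["address"],
    "A": ["A"],
}


def loops(graph):
    """Elements that may directly contain themselves (cycles of length one)."""
    return sorted(node for node, children in graph.items() if node in children)


def two_cycles(graph):
    """Unordered pairs of distinct elements that may contain each other."""
    return sorted(
        (a, b)
        for a, children in graph.items()
        for b in children
        if a < b and a in graph.get(b, [])
    )


print(loops(GRAPH))       # ['A']
print(two_cycles(GRAPH))  # [('address', 'geogName')]
```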
Using classical, well-known methods such as Dijkstra's algorithm [@dijkstra59]
Using inclusion paths lets one find for instance that although `<pos/>` may not
lets one explore the shortest inclusion paths that exist between elements.
be directly included within `<entry/>` elements to include information about the
Particular caution should be applied because there is no guarantee that
the shortest path is meaningful in general, but it at least provides an
efficient way to check whether a given element may or may not be nested at all under
another one and gives a lower bound on the length of the path to expect. Of
course, the accuracy of this heuristic decreases as the length of the path
increases in the ideal graph representing the intended, meaningful paths
between two nodes that a human specialist of the TEI framework could build.
This is still very useful when taking into account the fact that TEI modules are
merely "bags" to group the elements and provide hints to human encoders about
the tools they might need but have no implication on the inclusion paths between
elements which cross module boundaries freely. The general graph formalism
enables one to describe complex filtering patterns and to implement queries to
look for them among the elements exhaustively by algorithmic means even when the
shortest-path approach is not enough.
For instance, it lets one find that although `<pos/>` may not be directly
included within `<entry/>` elements to include information about the
part-of-speech of the word that an article defines, the correct way to do so is
part-of-speech of the word that an article defines, the correct way to do so is
through a `<form/>` or a `<gramGrp/>`.
through a `<form/>` or a `<gramGrp/>` because a thorough traversal reporting all
the possible paths will contain `entry-form-pos` and `entry-gramGrp-pos`. It is
On the other hand, trying to discover the shortest inclusion path to `<pos/>`
left to the human encoder to rate the relevance of the paths found and to select
from the `<TEI/>` root of the document yields a `<standOff/>`, an element
an appropriate one. The absence of any path proves the impossibility of an
dedicated to storing contextual data that accompanies but is not part of the text,
inclusion; an abnormally high length for the shortest path is a serious hint
not unlike an annex, and largely unrelated to the context of encoding an
that the inclusion should not be possible and is not meaningful.
encyclopedia.
Another relevant example of the use of these methods can be given by querying
A last relevant example of the use of these methods can be given by querying the
the shortest inclusion path of a `<pos/>` under the `<body/>` of the document:
shortest inclusion path of a `<pos/>` under the `<body/>` of the document: it
it yields an inclusion directly through `<entryFree/>` (with an inclusion path
yields an inclusion directly through `<entryFree/>` (with an inclusion path of
of length 2), which unlike `<entry/>` accepts it as a direct child node.
length 2), which unlike `<entry/>` accepts it as a direct child node. This is possibly
This is possibly not what is wanted depending on the regularity of the articles being
not what is wanted depending on the regularity of the articles being encoded and
encoded and the occurrence of other grammatical information such as `<case/>` or
the occurrence of other grammatical information such as `<case/>` or `<gen/>` to
`<gen/>` to justify the use of the `<gramGrp/>`, but searching exhaustively for
justify the use of the `<gramGrp/>`, but searching exhaustively for paths up to
paths up to length 3 returns as expected the path through `<entry/>`, among
length 3 returns as expected the path through `<entry/>`, among others. The big
others. The big picture starts to appear: `<pos/>` does not need to be nested
picture starts to appear: `<pos/>` does not need to be nested very deep; it can
very deep; it can appear quite near the "surface" of article entries.
appear quite near the "surface" of article entries.
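The exhaustive search for paths up to a given length can be sketched as a depth-bounded traversal. The toy `GRAPH` below only contains the inclusions discussed in this section and is not the actual TEI schema; the function name `paths_up_to` is hypothetical. Elements may be revisited, since inclusion cycles are allowed, and the depth bound guarantees termination.

```python
# Toy graph (assumption): parent element -> admissible direct children.
GRAPH = {
    "body": ["entry", "entryFree"],
    "entry": ["form", "gramGrp"],
    "entryFree": ["pos"],
    "form": ["pos"],
    "gramGrp": ["pos"],
    "pos": [],
}


def paths_up_to(graph, source, target, max_depth):
    """Return every inclusion path from source to target
    that follows at most max_depth edges, shortest first."""
    found = []

    def explore(path):
        if len(path) > 1 and path[-1] == target:
            found.append(path)
        # Only extend the path while the depth bound allows one more edge.
        if len(path) - 1 < max_depth:
            for child in graph.get(path[-1], []):
                explore(path + [child])

    explore([source])
    return sorted(found, key=len)


for path in paths_up_to(GRAPH, "body", "pos", 3):
    print("-".join(path))
# body-entryFree-pos
# body-entry-form-pos
# body-entry-gramGrp-pos
```

With a bound of 2 only the path through `entryFree` is reported; raising the bound to 3 also reveals the two paths through `entry`, leaving it to the encoder to pick the relevant one.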
## Content of the module
## Content of the module
The central element of the *dictionaries* module is the `<entry/>` element meant
The central element of the *dictionaries* module is the `<entry/>` element meant
to encode one single entry in a dictionary, that is to say a head word
to encode one single entry in a dictionary, that is to say a head word
associated with its definition. It is the natural way in from the `<body/>`
associated with its definition. It is the natural way in from the `<body/>`
element to the dictionary module: indeed, although `<body/>` may also contain
element to the *dictionaries* module: indeed, although `<body/>` may also
`<entryFree/>` or `<superEntry/>` elements, the former is a relaxed version of
contain `<entryFree/>` or `<superEntry/>` elements, the former is a relaxed
`<entry/>` while the latter is a device to group several related entries
version of `<entry/>` while the latter is a device to group several related
together. Both can contain an `<entry/>` directly while no obvious inclusion
entries together. Both can contain an `<entry/>` directly while no obvious
exists the other way around: most (> 96.2%) of the inclusion paths of
inclusion exists the other way around: most (> 96.2%) of the inclusion paths of
"reasonable" depth (which will be arbitrarily defined as strictly less than 5,
"reasonable" depth (which will be arbitrarily defined as strictly less than 5,
that is twice the average shortest depth between any two nodes) either include
that is twice the average shortest depth between any two nodes) either include
`<figure/>` or `<castList/>`, two very specific elements which should not need
`<figure/>` or `<castList/>`, two very specific elements which should not need
...
@@ -389,8 +396,8 @@ to appear in an article in general, showing that the purpose of `<entry/>` is
...
@@ -389,8 +396,8 @@ to appear in an article in general, showing that the purpose of `<entry/>` is
not to contain an `<entryFree/>` or `<superEntry/>`. Hence, not only the
not to contain an `<entryFree/>` or `<superEntry/>`. Hence, not only the
semantics conveyed by the documentation but also the structure of the elements
semantics conveyed by the documentation but also the structure of the elements
graph evidence `<entry/>` as the natural top-most element for an article. This
graph evidence `<entry/>` as the natural top-most element for an article. This
somewhat contrived example hopes to further demonstrate the application of a
example demonstrates again how a graph-centred approach can provide insights
graph-centred approach to understand the inner workings of the XML-TEI schema.
about the XML-TEI schema.
Once a block for an article is created, it may contain elements useful to
Once a block for an article is created, it may contain elements useful to
represent various of its features. Its written and spoken forms are usually
represent various of its features. Its written and spoken forms are usually
...
@@ -504,10 +511,10 @@ which organise them into a domain classification system. Those generally cover a
...
@@ -504,10 +511,10 @@ which organise them into a domain classification system. Those generally cover a
broad range of subjects from scientific disciplines to literature, and
broad range of subjects from scientific disciplines to literature, and
extending to political subjects and law.
extending to political subjects and law.
No element in the *dictionaries* module is explicitly designed for the purpose
These indicators have no element in the *dictionaries* module explicitly
of encoding these indicators. As section @sec:dictionaries-module illustrates,
designed to encode them. As section @sec:dictionaries-module illustrates, the
the element set is geared towards the words themselves rather than the concepts
element set is geared towards the words themselves rather than the concepts they
they represent. The tool closest to what is needed can be found in the `<usg/>`
represent. The tool closest to what is needed can be found in the `<usg/>`
element used with a specific `type` attribute set to `dom` for "domain". Indeed
element used with a specific `type` attribute set to `dom` for "domain". Indeed
several examples from the documentation encode subject indicators very similar
several examples from the documentation encode subject indicators very similar
to the ones found in encyclopedias within this element, but the match is not
to the ones found in encyclopedias within this element, but the match is not
...
@@ -515,14 +522,14 @@ perfect either: all appear within one of multiple senses, as if to clarify each
...
@@ -515,14 +522,14 @@ perfect either: all appear within one of multiple senses, as if to clarify each
context in which the word can be used, as expected from the element's name,
context in which the word can be used, as expected from the element's name,
"usage". In encyclopedias, while the domain indicator does in certain cases help to
"usage". In encyclopedias, while the domain indicator does in certain cases help to
distinguish between several entries sharing the same headword, the concept
distinguish between several entries sharing the same headword, the concept
itself has evolved beyond this mere distinction. Looking back at the
itself has evolved beyond this mere distinction. Looking back at the *EDdA*, the
*EDdA*, the adjective *raisonné* in the rest of the title directly
adjective *raisonné* in the rest of the title directly introduces a notion of
introduces a notion of structure that links back to the "Systême figuré des
structure that links back to the "Systême figuré des connoissances humaines"
connoissances humaines" [@blanchard2002, p. 1] whose schematic structure is
[@blanchard2002, p. 1] whose schematic structure is shown in Figure
shown in Figure @fig:systeme-figure. The authors have devised a branching system
@fig:systeme-figure. The authors have devised a branching system to classify all
to classify all knowledge, and the occurrence at the beginning of articles, more
knowledge, and the occurrence at the beginning of articles, more than a tool to
than a tool to clear up possible ambiguities, also points the reader to the
clear up possible ambiguities, also points the reader to the correct place in
correct place in this mind map.
this mind map.
)](ressources/arbre.png){width=300px #fig:systeme-figure}
)](ressources/arbre.png){width=300px #fig:systeme-figure}