Commit a9aa1bcb authored by Alice Brenon

Restructure the graph-theory approach introduction to air it a bit

parent 0e39df93
@@ -167,20 +167,43 @@ The XML-TEI specification contains 590 elements, which are each documented on
the consortium's website in the online reference pages. With an average of
almost 80 possible child elements (79.91) within any given element, manually
browsing such a massive network can prove quite difficult as the number of
combinations sharply increases with each step.

We transform the problem by representing this network as a directed graph, using
elements of XML-TEI as nodes and placing an edge whenever the destination node
may be contained within the source node according to the schema. Note that the
word "element" is used here with the same meaning as in the TEI documentation:
it refers to the conceptual device characterised by a given tag name such as `p`
or `div`, not to a particular instance of it that may occur in a given
document.

![The subgraph of the *dictionaries* module](ressources/dictionaries.png)
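
As a rough illustration of this representation, the sketch below builds such a directed graph as an adjacency mapping. The edge list is a tiny hand-written excerpt limited to containment relations mentioned in this text, not the actual TEI schema, and the names `CONTAINS` and `build_graph` are hypothetical.

```python
# Toy excerpt of the "may contain" relation (assumption: hand-picked pairs
# taken from the examples in this text, NOT extracted from the TEI schema).
CONTAINS = [
    ("body", "entry"),
    ("body", "entryFree"),
    ("entry", "form"),
    ("entry", "gramGrp"),
    ("form", "pos"),
    ("gramGrp", "pos"),
    ("entryFree", "pos"),
    ("abbr", "abbr"),
    ("address", "geogName"),
    ("geogName", "address"),
]


def build_graph(pairs):
    """Return a mapping: element name -> set of elements it may directly contain."""
    graph = {}
    for parent, child in pairs:
        graph.setdefault(parent, set()).add(child)
        graph.setdefault(child, set())  # elements without children are nodes too
    return graph


graph = build_graph(CONTAINS)
print(sorted(graph["entry"]))  # direct children allowed under <entry/>
```

In a real setting the pairs would presumably be derived from the schema itself rather than written by hand.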

### Definitions

By iterating several times the operation of moving on that graph along one edge,
that is, by considering the transitive closure of the relation "be connected by
an edge", we define *inclusion paths*, which allow us to explore which elements
may be nested under one another.

The nodes visited along the way represent the intermediate XML elements needed
to construct a valid XML tree according to the TEI schema. Given the top-down
semantics of those trees, we call the length of an inclusion path its *depth*.
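
To make the notions of inclusion path and depth concrete, here is a minimal sketch that lists every element reachable from a given one, together with the depth of its shortest inclusion path. The `GRAPH` mapping is a toy excerpt assumed for illustration and `descendants` is a hypothetical helper name.

```python
from collections import deque

# Toy containment graph (assumption: a hand-written excerpt, not the TEI schema).
GRAPH = {
    "body": {"entry", "entryFree"},
    "entry": {"form", "gramGrp"},
    "entryFree": {"pos"},
    "form": {"pos"},
    "gramGrp": {"pos"},
    "pos": set(),
}


def descendants(graph, source):
    """Map each element reachable from `source` to the smallest depth at which it can appear."""
    depths = {}
    queue = deque((child, 1) for child in sorted(graph[source]))
    while queue:
        node, depth = queue.popleft()
        if node in depths:
            continue  # already reached through a path at least as short
        depths[node] = depth
        queue.extend((child, depth + 1) for child in sorted(graph[node]))
    return depths


print(descendants(GRAPH, "body"))
```

The keys of the returned mapping are exactly the elements related to `source` by the transitive closure of the edge relation.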

The ability for an element to contain itself corresponds directly to loops on
the graph (that is, edges from a node to itself), as illustrated by the
`<abbr/>` element: an `<abbr/>` element (abbreviation) can directly contain
another one.

The generalisation of this to inclusion paths of any length greater than one is
usually called a cycle, and in our context we may be tempted to refine this
name to *inclusion cycle*. The `<address/>` element provides us with an
example of this configuration: although an `<address/>` element may not
directly contain another one, it may contain a `<geogName/>` which, in turn, may
contain a new `<address/>` element. From a graph theory perspective, we can say
that it admits an inclusion cycle of length two.
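
The two situations above can be told apart by searching for the shortest path leading from an element back to itself. The sketch below assumes a toy graph restricted to the elements named in this section, and `shortest_cycle` is a hypothetical helper.

```python
from collections import deque

# Toy containment graph (assumption: only the relations cited in this section).
GRAPH = {
    "abbr": {"abbr"},
    "address": {"geogName"},
    "geogName": {"address"},
    "pos": set(),
}


def shortest_cycle(graph, start):
    """Return the length of the shortest inclusion cycle through `start`, or None if there is none."""
    seen = set()
    queue = deque((child, 1) for child in graph[start])
    while queue:
        node, length = queue.popleft()
        if node == start:
            return length  # we came back to the starting element
        if node in seen:
            continue
        seen.add(node)
        queue.extend((child, length + 1) for child in graph[node])
    return None


for name in ("abbr", "address", "pos"):
    print(name, shortest_cycle(GRAPH, name))
```

A result of 1 corresponds to a loop, a larger value to a longer inclusion cycle, and `None` to an element that can never end up nested under itself in this toy graph.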

### Applications

Using classical, well-known methods such as Dijkstra's algorithm (Dijkstra, 1959)
allows us to explore the shortest inclusion paths that exist between elements.
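
A minimal sketch of such a search follows, assuming every edge has weight 1 (in which case Dijkstra's algorithm behaves like a breadth-first search). The graph is a toy excerpt and `shortest_inclusion_path` is a hypothetical name, not the implementation used for this work.

```python
import heapq

# Toy containment graph (assumption: a hand-written excerpt, not the TEI schema).
GRAPH = {
    "body": {"entry", "entryFree"},
    "entry": {"form", "gramGrp"},
    "entryFree": {"pos"},
    "form": {"pos"},
    "gramGrp": {"pos"},
    "pos": set(),
}


def shortest_inclusion_path(graph, source, target):
    """Dijkstra with unit weights: return one shortest path as a list of elements, or None."""
    dist = {source: 0}
    prev = {}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if node == target:
            path = [node]
            while node in prev:  # walk back to the source
                node = prev[node]
                path.append(node)
            return path[::-1]
        if d > dist[node]:
            continue  # stale heap entry
        for child in graph[node]:
            if d + 1 < dist.get(child, float("inf")):
                dist[child] = d + 1
                prev[child] = node
                heapq.heappush(heap, (d + 1, child))
    return None


print(shortest_inclusion_path(GRAPH, "body", "pos"))
```

When several shortest paths exist, as between `entry` and `pos` here, only one of them is returned.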
@@ -190,8 +213,9 @@ efficient way to check whether a given element may or may not be nested at all under
another one and gives an order of magnitude on the length of the path to expect.
Of course the accuracy of this heuristic decreases as the length of the elements
increases in the perfect graph representing the intended, meaningful path
between two nodes that a human specialist of the TEI framework could build.

This is still very useful when taking into account the fact that TEI modules are
merely "bags" to group the elements and provide hints to human encoders about
the tools they might need but have no implication on the inclusion paths between
elements, which cross module boundaries freely. The general graph formalism
@@ -202,21 +226,24 @@ shortest-path approach is not enough.
For instance, it lets one find that although `<pos/>` may not be included
directly within an `<entry/>` element to record the part of speech of the word
that an article defines, it can be included through a `<form/>` or a
`<gramGrp/>`.

On the other hand, trying to discover the shortest inclusion path to `<pos/>`
from the `<TEI/>` root of the document yields a path through `<standOff/>`, an
element dedicated to storing contextual data that accompanies but is not part of
the text, not unlike an annex, and largely unrelated to the context of encoding
an encyclopedia.

A last relevant example of the use of these methods is querying the shortest
inclusion path to a `<pos/>` under the `<body/>` of the document: it yields an
inclusion directly through `<entryFree/>` (an inclusion path of length 2), which
unlike `<entry/>` accepts it as a direct child node. This is possibly not what
we want, depending on the regularity of the articles we are encoding and on
whether other grammatical information such as `<case/>` or `<gen/>` occurs and
justifies the use of `<gramGrp/>`, but searching exhaustively for paths up to
length 3 returns, as expected, the path through `<entry/>`, among others.
Overall, we get a good general idea: `<pos/>` does not need to be nested very
deep; it can appear quite near the "surface" of article entries.
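
The exhaustive search mentioned above can be sketched as a bounded depth-first enumeration. The graph below is again a toy excerpt assumed for illustration, and `inclusion_paths` is a hypothetical helper.

```python
# Toy containment graph (assumption: a hand-written excerpt, not the TEI schema).
GRAPH = {
    "body": {"entry", "entryFree"},
    "entry": {"form", "gramGrp"},
    "entryFree": {"pos"},
    "form": {"pos"},
    "gramGrp": {"pos"},
    "pos": set(),
}


def inclusion_paths(graph, source, target, max_depth):
    """Yield every inclusion path from source to target that uses at most max_depth edges."""
    def walk(path):
        node = path[-1]
        if node == target and len(path) > 1:
            yield list(path)
        if len(path) - 1 < max_depth:  # edges used so far
            for child in sorted(graph[node]):
                path.append(child)
                yield from walk(path)
                path.pop()
    yield from walk([source])


for path in inclusion_paths(GRAPH, "body", "pos", 3):
    print(" > ".join(path))
```

With a bound of 2 only the path through `entryFree` is found in this toy graph, while a bound of 3 also returns the paths through `entry`. The bound is what keeps the enumeration finite when the graph contains cycles.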

### The `<entry/>` element