From a9aa1bcb53e1872211817c7601e84f0b5eca2502 Mon Sep 17 00:00:00 2001
From: Alice BRENON <alice.brenon@ens-lyon.fr>
Date: Sun, 27 Feb 2022 18:33:41 +0100
Subject: [PATCH] Restructure the graph-theory approach introduction to air it a bit

---
 ICHLL_Brenon.md | 77 +++++++++++++++++++++++++++++++++----------------
 1 file changed, 52 insertions(+), 25 deletions(-)

diff --git a/ICHLL_Brenon.md b/ICHLL_Brenon.md
index 7184e56..028b734 100644
--- a/ICHLL_Brenon.md
+++ b/ICHLL_Brenon.md
@@ -167,20 +167,43 @@ The XML-TEI specification contains 590 elements, which are each documented on
 the consortium's website in the online reference pages. With an average of
 almost 80 possible child elements (79.91) within any given element, manually
 browsing such an massive network can prove quite difficult as the number of
-combinations sharply increases with each step. We transform the problem by
-representing this network as a directed graph, using elements of XML-TEI as
-nodes and placing edges if the destination node may be contained within the
-source node according to the schema.
+combinations sharply increases with each step.
+
+We transform the problem by representing this network as a directed graph, using
+elements of XML-TEI as nodes and placing edges if the destination node may be
+contained within the source node according to the schema. Please note that the
+"element" word is here used with the same meaning as in the TEI documentation to
+refer to the conceptual device characterised by a given tag name such as `p` or
+`div` and not to a particular instance of them that may occur in a given
+document.
 
+### Definitions
+
 By iterating several times the operation of moving on that graph along one edge,
 that is, by considering the transitive closure of the relation "be connected by
 an edge" we define *inclusion paths* which allow us to explore which elements
-may be nested under one another. The nodes visited along the way represent the
-intermediate XML elements to construct a valid XML tree according to the TEI
-schema. Given the top-down semantics of those trees, we call the length of an
-inclusion path its *depth*.
+may be nested under one another.
+
+The nodes visited along the way represent the intermediate XML elements to
+construct a valid XML tree according to the TEI schema. Given the top-down
+semantics of those trees, we call the length of an inclusion path its *depth*.
+
+The ability for an element to contain itself corresponds directly to loops on
+the graph (that is an edge from a node to itself) as can be illustrated by the
+`<abbr/>` element: an `<abbr/>` element (abbreviation) can directly contain
+another one.
+
+The generalisation of this to inclusion paths of any length greater than one is
+usually called a cycle in and we may be tempted in our context to refine this
+name them to *inclusion cycles*. The `<address/>` element provides us with an
+example for this configuration: although an `<address/>` element may not
+directly contain another one, it may contain a `<geogName/>` which, in turn, may
+contain a new `<address/>` element. From a graph theory perspective, we can say
+that it admits an inclusion cycle of length two.
+
+### Applications
 
 Using classical, well-known methods such as Dijkstra's algorithm (Dijkstra, 1959)
 allows us to explore the shortest inclusion paths that exist between elements.
@@ -190,8 +213,9 @@ efficient way to check whether a given element may or not be nested at all under
 another one and gives an order of magnitude on the length of the path to expect.
 Of course the accuracy of this heuristic decreases as the length of the elements
 increases in the perfect graph representing the intended, meaningful path
-between two nodes that a human specialist of the TEI framework could build. This
-is still very useful when taking into account the fact that TEI modules are
+between two nodes that a human specialist of the TEI framework could build.
+
+This is still very useful when taking into account the fact that TEI modules are
 merely "bags" to group the elements and provide hints to human encoders about
 the tools they might need but have no implication on the inclusion paths between
 element which cross module boundaries freely. The general graph formalism
@@ -202,21 +226,24 @@ shortest-path approach is not enough.
 For instance, it lets one find that although `<pos/>` may not be directly
 included within `<entry/>` elements to include information about the
 part-of-speech of the word that an article defines, the correct way to do so is
-through a `<form/>` or a `<gramGrp/>`. On the other hand, trying to discover the
-shortest inclusion path to `<pos/>` from the `<TEI/>` root of the document
-yields a `<standOff/>`, an element dedicated to store contextual data that
-accompanies but is not part of the text, not unlike an annex, and widely
-unrelated to the context of encoding an encyclopedia. A last relevant example on
-the use of these methods can be given by querying the shortest inclusion path of
-a `<pos/>` under the `<body/>` of the document: it yields an inclusion directly
-through `<entryFree/>` (with an inclusion path of length 2), which unlike
-`<entry/>` accepts it as a direct child node. Possibly not what we want
-depending on the regularity of the articles we are encoding and the occurrence
-of other grammatical information such as `<case/>` or `<gen/>` to justify the
-use of the `<gramGrp/>`, but searching exhaustively for paths up to length 3
-returns as expected the path through `<entry/>`, among others. Overall, we get a
-good general idea: `<pos/>` does not need to be nested very deep, it can appear
-quite near the "surface" of article entries.
+through a `<form/>` or a `<gramGrp/>`.
+
+On the other hand, trying to discover the shortest inclusion path to `<pos/>`
+from the `<TEI/>` root of the document yields a `<standOff/>`, an element
+dedicated to store contextual data that accompanies but is not part of the text,
+not unlike an annex, and widely unrelated to the context of encoding an
+encyclopedia.
+
+A last relevant example on the use of these methods can be given by querying the
+shortest inclusion path of a `<pos/>` under the `<body/>` of the document: it
+yields an inclusion directly through `<entryFree/>` (with an inclusion path of
+length 2), which unlike `<entry/>` accepts it as a direct child node. Possibly
+not what we want depending on the regularity of the articles we are encoding and
+the occurrence of other grammatical information such as `<case/>` or `<gen/>` to
+justify the use of the `<gramGrp/>`, but searching exhaustively for paths up to
+length 3 returns as expected the path through `<entry/>`, among others. Overall,
+we get a good general idea: `<pos/>` does not need to be nested very deep, it
+can appear quite near the "surface" of article entries.
 
 ### The `<entry/>` element
 
-- 
GitLab