diff --git a/ICHLL_Brenon.md b/ICHLL_Brenon.md index b8efb0b0ace7f6e5c98dae9e62c97f25d430cc58..c57bbbf340e1eec3655b919d7050c35661863b28 100644 --- a/ICHLL_Brenon.md +++ b/ICHLL_Brenon.md @@ -183,30 +183,32 @@ Using classical, well-known methods such as Dijkstra's algorithm (Dijkstra, 1959 allows us to explore the shortest inclusion paths that exist between elements. Though a particular caution should be applied because there is no guarantee that the shortest path is meaningful in general, it at least provides us with an -efficient way to check whether a given element may or not be nested under -another one at all and gives an order of magnitude on the length of the path to -expect. Of course the accuracy of this heuristic decreases as the length of the -elements increases in a perfect graph representing the intended, meaningful path -between two nodes, but this formalism lets us consider elements combinations -rationally and exhaustively by algorithmic means. +efficient way to check whether a given element may or may not be nested at all under +another one and gives an order of magnitude for the length of the path to expect. +Of course the accuracy of this heuristic decreases as the length of the elements +increases in a perfect graph representing the intended, meaningful path between +two nodes, but the general graph formalism enables us to extend the results +produced by the shortest-path approach and consider element combinations +rationally and exhaustively by algorithmic means should the need arise. For instance, it lets one find that although `<pos/>` may not be directly included within `<entry/>` elements to include information about the part-of-speech of the word that an article defines, the correct way to do so is -through a `<gramGrp/>`.
On the other hand, trying to discover the shortest -inclusion path to `<pos/>` from the `<TEI/>` root of the document yields a -`<standOff/>`, an element dedicated to store contextual data that accompanies -but is not part of the text, not unlike an annex, and probably not what we want -in the context of encoding an encyclopedia. A last relevant example on the use -of this approach can be given by querying the shortest inclusion path of a -`<pos/>` under the `<body/>` of the document: it yields an inclusion directly -through `<entryFree/>` (with an inclusion path of length 2), which, unlike -`<entry/>` allows it as a direct child node. Possibly not what we want depending -on the regularity of the articles we are encoding and the existence of other -grammatical information such as `<case/>` or `<gen/>` in languages with an -inflexion system to justify the use of the `<gramGrp/>`, but it gives a good -general idea: `<pos/>` does not need to be nested very deep, it can appear quite -near the "surface" of article entries. +through a `<form/>` or a `<gramGrp/>`. On the other hand, trying to discover the +shortest inclusion path to `<pos/>` from the `<TEI/>` root of the document +yields a `<standOff/>`, an element dedicated to storing contextual data that +accompanies but is not part of the text, not unlike an annex, and largely +unrelated to the context of encoding an encyclopedia. A last relevant example of +the use of these methods can be given by querying the shortest inclusion path of +a `<pos/>` under the `<body/>` of the document: it yields an inclusion directly +through `<entryFree/>` (with an inclusion path of length 2), which, unlike +`<entry/>`, accepts it as a direct child node.
Possibly not what we want +depending on the regularity of the articles we are encoding and the occurrence +of other grammatical information such as `<case/>` or `<gen/>` to justify the +use of the `<gramGrp/>`, but searching exhaustively for paths up to length 3 +returns, as expected, the path through `<entry/>`, among others. Overall, we get a +good general idea: `<pos/>` does not need to be nested very deep, it can appear +quite near the "surface" of article entries. ### The `<entry/>` element @@ -234,7 +236,8 @@ represent features such as - a group of grammatical information: `<gramGrp/>`, that may itself contain as we've seen above `<case/>`, `<gen/>`, `<number/>` or `<pers/>` to describe the form itself for instance, but also information about the categories it belongs - to like `<iType/>` for its inflexion class or `<pos/>` for its part-of-speech + to like `<iType/>` for its inflection class in languages with a declension + system or `<pos/>` for its part-of-speech - its etymology - its variants if there is a different spelling in a variety of the language or if it has changed through time @@ -274,7 +277,38 @@ definition of the term with `<def/>`, usage examples with `<usg/>` and other high-level information such as translations in other languages. Both `<def/>` and `<usg/>` elements may appear directly under the `<entry/>`. -### Remarks about structure +### Structural remarks + +Before concluding this description of the *dictionaries* module from the +perspective of someone trying to concretely encode a particular dictionary or +encyclopedia, we make use of the graph approach again to highlight some of its +aspects in terms of inclusion structure. + +First, it is remarkable that all elements in the *dictionaries* module have a +cyclic inclusion path, that is to say, there is an inclusion path from each +element of this module to itself.
Although having such a cycle is a widespread +property in the remainder of XML-TEI elements, shared by 73.9% of them (413 out +of the 559 elements in the other modules), all 31 elements of the *dictionaries* +module having one is far above this average. In addition, the cycles appear to +be rather short, with an average length of 1.96 versus 2.50 in the rest of the +population. This observation is all the more surprising considering the fact +that the *dictionaries* module contains short "leaf" elements like `<pos/>` +which do not obviously need to admit cycles since one rather expects them to +contain only one word, like `<pos>adj</pos>` in the example given in the +official documentation. + +Secondly, although we have seen examples of connections from this module to the +rest of the XML-TEI, especially the *core* module (see the case of the `<ref/>` +element above), the *dictionaries* module appears somewhat isolated from important +structural elements like `<head/>` or `<div/>`. Indeed, computing all the paths +from either `<entry/>` or `<sense/>` elements to the latter of length less than or +equal to 5 by a systematic traversal of the graph yields exclusively paths +(respectively 9042 and 39093 of them) containing either a `<floatingText/>` or +an `<app/>` element. The first one is used to encode + +Thus, despite a rather dense internal connectivity, the *dictionaries* module +fails to provide encoders with a device to represent recursively nesting +structures like `<div/>`. # A new standard ?