ICHLL11 Article

Commit a9aa1bcb, authored 3 years ago by Alice Brenon

Restructure the graph-theory approach introduction to air it a bit

parent 0e39df93

Showing 1 changed file: ICHLL_Brenon.md, with 52 additions and 25 deletions
@@ -167,20 +167,43 @@ The XML-TEI specification contains 590 elements, which are each documented on
the consortium's website in the online reference pages. With an average of
almost 80 possible child elements (79.91) within any given element, manually
browsing such a massive network can prove quite difficult as the number of
combinations sharply increases with each step.
We transform the problem by representing this network as a directed graph, using
elements of XML-TEI as nodes and placing an edge whenever the destination node
may be contained within the source node according to the schema. Note that the
word "element" is used here with the same meaning as in the TEI documentation:
it refers to the conceptual device characterised by a given tag name such as
`p` or `div`, and not to a particular instance of it that may occur in a given
document.
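
To make the representation concrete, here is a minimal sketch in Python, assuming the schema has already been reduced to a list of (parent, child) pairs. The `CONTAINMENT` sample, `build_graph` and `average_out_degree` are hypothetical names introduced for illustration only; the sample holds a handful of containment relations mentioned in this article, not the full set of 590 elements.

```python
from collections import defaultdict

# Toy excerpt of "parent may contain child" relations (illustrative sample,
# not the complete TEI schema).
CONTAINMENT = [
    ("abbr", "abbr"),
    ("address", "geogName"),
    ("geogName", "address"),
    ("entryFree", "pos"),
]


def build_graph(pairs):
    """Map each element name to the set of elements it may directly contain."""
    graph = defaultdict(set)
    for parent, child in pairs:
        graph[parent].add(child)
        graph[child]  # accessing the key registers childless elements as nodes
    return dict(graph)


def average_out_degree(graph):
    """Average number of possible child elements per element."""
    return sum(len(children) for children in graph.values()) / len(graph)


graph = build_graph(CONTAINMENT)
print(sorted(graph))              # the nodes of the toy graph
print(average_out_degree(graph))  # the toy counterpart of the 79.91 figure
```

On the real schema, the same average is the figure of almost 80 possible child elements quoted above.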

### Definitions
By repeatedly moving along one edge of that graph, that is, by considering the
transitive closure of the relation "be connected by an edge", we define
*inclusion paths*, which allow us to explore which elements
may be nested under one another.
The nodes visited along the way represent the intermediate XML elements needed
to construct a valid XML tree according to the TEI schema. Given the top-down
semantics of those trees, we call the length of an inclusion path its *depth*.
The ability of an element to contain itself corresponds directly to loops on
the graph (that is, edges from a node to itself), as illustrated by the
`<abbr/>` element: an `<abbr/>` element (abbreviation) can directly contain
another one.
The generalisation of this to inclusion paths of any length greater than one is
usually called a cycle, and in our context we may be tempted to refine this
name to *inclusion cycles*. The `<address/>` element provides us with an
example of this configuration: although an `<address/>` element may not
directly contain another one, it may contain a `<geogName/>` which, in turn, may
contain a new `<address/>` element. From a graph theory perspective, we can say
that it admits an inclusion cycle of length two.
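
The two definitions above can be checked mechanically. The following sketch assumes a toy graph restricted to the three edges named in this section; `GRAPH`, `has_loop` and `shortest_inclusion_cycle` are hypothetical names chosen for the example. A breadth-first search from the children of an element back to that element gives the length of its shortest inclusion cycle.

```python
from collections import deque

# Toy graph holding only the edges discussed above (not the full schema).
GRAPH = {
    "abbr": {"abbr"},
    "address": {"geogName"},
    "geogName": {"address"},
}


def has_loop(graph, element):
    """True if the element may directly contain itself."""
    return element in graph.get(element, set())


def shortest_inclusion_cycle(graph, element):
    """Length of the shortest path leading from element back to itself.

    Returns None when no such path exists. A loop counts as length 1.
    """
    seen = set()
    queue = deque((child, 1) for child in graph.get(element, set()))
    while queue:
        node, depth = queue.popleft()
        if node == element:
            return depth
        if node in seen:
            continue
        seen.add(node)
        for child in graph.get(node, set()):
            queue.append((child, depth + 1))
    return None


print(has_loop(GRAPH, "abbr"))                     # True
print(has_loop(GRAPH, "address"))                  # False
print(shortest_inclusion_cycle(GRAPH, "address"))  # 2
```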
### Applications
Using classical, well-known methods such as Dijkstra's algorithm (Dijkstra, 1959)
allows us to explore the shortest inclusion paths that exist between elements.
@@ -190,8 +213,9 @@ efficient way to check whether a given element may or may not be nested at all under
another one and gives an order of magnitude for the length of the path to expect.
Of course the accuracy of this heuristic decreases as the length of the elements
increases in the perfect graph representing the intended, meaningful path
between two nodes that a human specialist of the TEI framework could build.
This is still very useful when taking into account the fact that TEI modules are
merely "bags" that group elements and provide hints to human encoders about
the tools they might need, but have no bearing on the inclusion paths between
elements, which cross module boundaries freely. The general graph formalism
@@ -202,21 +226,24 @@ shortest-path approach is not enough.
For instance, it lets one find that although `<pos/>` may not be directly
included within `<entry/>` elements to include information about the
part-of-speech of the word that an article defines, the correct way to do so is
through a `<form/>` or a `<gramGrp/>`.
On the other hand, trying to discover the shortest inclusion path to `<pos/>`
from the `<TEI/>` root of the document yields a path through `<standOff/>`, an
element dedicated to storing contextual data that accompanies but is not part of
the text, not unlike an annex, and largely unrelated to the context of encoding
an encyclopedia.
A last relevant example of the use of these methods is given by querying the
shortest inclusion path to a `<pos/>` under the `<body/>` of the document: it
yields an inclusion directly through `<entryFree/>` (with an inclusion path of
length 2), which unlike `<entry/>` accepts it as a direct child node. This is
possibly not what we want, depending on the regularity of the articles we are
encoding and on the occurrence of other grammatical information such as
`<case/>` or `<gen/>` to justify the use of `<gramGrp/>`, but searching
exhaustively for paths up to length 3 returns, as expected, the path through
`<entry/>`, among others. Overall, we get a good general idea: `<pos/>` does not
need to be nested very deep; it can appear quite near the "surface" of article
entries.
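
The `<body/>` to `<pos/>` queries above can be reproduced on a toy graph. This is a sketch under stated assumptions: `GRAPH` contains only the edges named in this section, and `shortest_inclusion_path` and `inclusion_paths` are hypothetical helper names. Since every edge has the same cost, a breadth-first search is used here in place of Dijkstra's algorithm; on an unweighted graph both return a shortest path.

```python
from collections import deque

# Toy graph limited to the edges mentioned in this section (not the full schema).
GRAPH = {
    "body": ["entry", "entryFree"],
    "entry": ["form", "gramGrp"],
    "entryFree": ["pos"],
    "form": ["pos"],
    "gramGrp": ["pos"],
    "pos": [],
}


def shortest_inclusion_path(graph, source, target):
    """Breadth-first search for one shortest inclusion path, or None."""
    previous = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:  # walk back to the source
                path.append(node)
                node = previous[node]
            return path[::-1]
        for child in graph.get(node, []):
            if child not in previous:
                previous[child] = node
                queue.append(child)
    return None


def inclusion_paths(graph, source, target, max_depth):
    """All inclusion paths of at most max_depth edges, stopping at the target."""
    found = []

    def walk(path):
        if len(path) - 1 >= max_depth:  # depth budget exhausted
            return
        for child in graph.get(path[-1], []):
            if child == target:
                found.append(path + [child])
            else:
                walk(path + [child])

    walk([source])
    return found


print(shortest_inclusion_path(GRAPH, "body", "pos"))
# ['body', 'entryFree', 'pos']
for path in inclusion_paths(GRAPH, "body", "pos", 3):
    print(" > ".join(path))
```

With a depth limit of 3, the exhaustive search also lists the two paths going through `entry`, which the shortest-path query alone does not reveal.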
### The `<entry/>` element
...
...