Commit 5853ebcf authored by Alice Brenon
Add development about the graph approach used to explore XML-TEI and start developing the encoding scheme we propose
parent 00af3a2d
......@@ -77,7 +77,7 @@ was entirely reworked, mildly stating that good encyclopedias are difficult to
make because of the amount of knowledge necessary and work needed to keep up
with scientific progress instead of calling the effort a parody. It credits
Chambers' *Cyclopædia* for being a decent attempt before referring anonymously
though quite explicitely to Diderot and d'Alembert's project by naming the
though quite explicitly to Diderot and d'Alembert's project by naming the
collective "Une Société de gens de Lettres" and writing that it started in 1751.
Even more importantly, two new entries were added after it: one for the adjective
"encyclopédique" and another one for the noun "encyclopédiste", silently admitting
......@@ -127,15 +127,16 @@ was to digitize and make *La Grande Encyclopédie* available to the scientific
community as well as the general public. A previous version was partially
available on
[Gallica](https://gallica.bnf.fr/services/engine/search/sru?operation=searchRetrieve&version=1.2&collapsing=disabled&query=%28dc.title%20all%20%22La%20Grande%20encyclop%C3%A9die%22%29%20and%20dc.relation%20all%20%22cb377013071%22&rk=42918;4#)
but lacked in quality and had not been fully OCRized.
but lacked in quality and its text had not been fully extracted from the
pictures with an Optical Character Recognition (OCR) system.
# The *dictionaries* TEI module
Producing *interoperable* and *reusable* data is paramount for them to be useful
for future other scientific projects. These are the two last key aspects of the
in other future scientific projects. These are the last two key aspects of the
[FAIR](https://www.go-fair.org/fair-principles/) principles (*findability*,
*accessibility*, *interoperability* and *reusability*) which we strive to
enforce as a guideline for efficient and quality research. It entails using
follow as a guideline for efficient and quality research. It entails using
standard formats, and a standard for encoding historical texts in the context of
digital humanities is XML-TEI, collectively developed by the *Text Encoding
Initiative* consortium. It consists of a set of technical specifications under
......@@ -152,24 +153,83 @@ represent them in a monospace font, in the standard XML autoclosing form within
angle brackets and with a slash following the element name like `<div/>` for a
[`div` element](https://tei-c.org/release/doc/tei-p5-doc/en/html/ref-div.html).
We do not mean by this notation that they cannot contain raw text or other XML
elements.
elements, merely that we are referring to such an element, with all the subtree
that spans from it in the context of a concrete document instance or as an empty
structure when we are considering the abstract element and the rules that govern
its use in relation to other elements or its attributes.
## Content
## A graph problem
The XML-TEI specification contains 590 elements, which are each documented on
the consortium's website in the online reference pages. With an average of
almost 80 possible child elements (79.91) within any given element, manually
browsing such a massive network can prove quite difficult as the number of
combinations sharply increases with each step. We transform the problem by
representing this network as a directed graph, using elements of XML-TEI as
nodes and placing edges if the destination node may be contained within the
source node according to the schema.
By iterating several times the operation of moving on that graph along one edge,
that is, by considering the transitive closure of the relation "be connected by
an edge" we define *inclusion paths* which allow us to explore which elements
may be nested under one another. The nodes visited along the way represent the
intermediate XML elements to construct a valid XML tree according to the TEI
schema. Given the top-down semantics of those trees, we call the length of an
inclusion path its *depth*.
Using classical, well-known methods such as Dijkstra's algorithm (Dijkstra, 1959)
allows us to explore the shortest inclusion paths that exist between elements.
Some caution is required because there is no guarantee that the shortest path is
meaningful in general, but it at least provides us with an efficient way to
check whether a given element may be nested under another one at all, and it
gives an order of magnitude for the length of the path to expect. Of course, the
accuracy of this heuristic decreases as the length of the intended, meaningful
path between two nodes increases, but this formalism lets us consider
combinations of elements rationally and exhaustively by algorithmic means.
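
The search described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the graph `INCLUSIONS` is a hand-written toy subset restricted to the inclusions mentioned in this section (the real graph has 590 nodes), and the names `INCLUSIONS` and `shortest_inclusion_path` are hypothetical, not part of any tool. Every edge has weight 1, so Dijkstra's algorithm behaves here like a breadth-first search.

```python
import heapq

# Toy subset of the inclusion graph: an edge A -> B means that element B may
# appear as a direct child of element A. Hypothetical data, limited to the
# relations mentioned in the text.
INCLUSIONS = {
    "body": ["entry", "entryFree", "superEntry"],
    "superEntry": ["entry"],
    "entryFree": ["entry", "pos"],
    "entry": ["gramGrp"],
    "gramGrp": ["case", "gen", "number", "pos"],
}


def shortest_inclusion_path(graph, source, target):
    """Dijkstra's algorithm with unit weights.

    Returns the list of elements from source to target, or None when target
    cannot be nested under source at all.
    """
    queue = [(0, source, [source])]  # (depth, current element, path so far)
    visited = set()
    while queue:
        depth, node, path = heapq.heappop(queue)
        if node == target:
            return path
        if node in visited:
            continue
        visited.add(node)
        for child in graph.get(node, []):
            if child not in visited:
                heapq.heappush(queue, (depth + 1, child, path + [child]))
    return None
```

Calling `shortest_inclusion_path(INCLUSIONS, "body", "pos")` on this toy graph returns a list of three elements, and the depth of the path is the length of that list minus one. A `None` result answers the reachability question negatively.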
For instance, it lets one find that although `<pos/>` may not be directly
included within `<entry/>` elements to include information about the
part-of-speech of the word that an article defines, the correct way to do so is
through a `<gramGrp/>`. On the other hand, trying to discover the shortest
inclusion path to `<pos/>` from the `<TEI/>` root of the document yields a
`<standOff/>`, an element dedicated to storing contextual data that accompanies
but is not part of the text, not unlike an annex, and probably not what we want
in the context of encoding an encyclopedia. A last relevant example on the use
of this approach can be given by querying the shortest inclusion path of a
`<pos/>` under the `<body/>` of the document: it yields an inclusion directly
through `<entryFree/>` (with an inclusion path of length 2), which, unlike
`<entry/>`, allows it as a direct child node. This is possibly not what we want,
depending on the regularity of the articles we are encoding and on the existence
of other grammatical information such as `<case/>` or `<gen/>` in languages with
an inflexion system that would justify the use of `<gramGrp/>`, but it gives a
good general idea: `<pos/>` does not need to be nested very deep, it can appear
quite near the "surface" of article entries.
### The `<entry/>` element
The central element of the *dictionaries* module is the `<entry/>` element meant
to encode one single entry in a dictionary, that is to say a head word
associated to its definition. Although it may be contained by `<entryFree/>` or
`<superEntry/>` elements which are respectively tools to relax some constraints
on `<entry/>` elements or to group several of them together, it is the
associated with its definition. It is the natural entry point from the `<body/>`
element to the *dictionaries* module: indeed, although `<body/>` may also contain
`<entryFree/>` or `<superEntry/>` elements, the former is a relaxed version of
`<entry/>` while the latter is a device to group several related entries
together. Both can contain an `<entry/>` directly while no obvious inclusion
exists the other way around. Most of the inclusion paths of "reasonable" depth
(which we define as strictly less than 5, that is twice the average shortest
depth between any two nodes) seem to either include `<figure/>`
Once a block for an article is created
It contains elements useful to represent the features occurring at the beginning
of an article such as its written and spoken forms (`<form/>`), a group of
grammatical information (`<gramGrp/>`), that may itself contain as we've seen
above `<case/>`, `<gen/>`, `<number/>` or `<pos/>` to describe the form itself for instance, or `
- hom
- model.entryPart.top
- model.global
- model.ptrLike
- pc
- sense
All these are quite exhaustive and seem general enough to accommodate any book
structure indexing entries by words. A more
# A new standard?
......@@ -181,44 +241,101 @@ propose a new encoding scheme.
## Nested structures
### The `<entryFree/>` option
### Example
### Candidates in the *dictionaries* module
- `<sense/>`
- `<entryFree/>`
- `<note/>`
## Encoding within the *core* module
The above remark explains why the *dictionaries* module by itself is unable to
represent encyclopedias, where discourse with nested structures of arbitrary
depth can occur. Since the *core* module of course accommodates these structures
by means of the `<div/>`, `<head/>` and `<p/>` elements, we devise an encoding
scheme based on them, which we recommend for other projects aiming at
representing encyclopedias.
To remain consistent with the above remarks we will only concern ourselves with
what happens at the level of each article, right under the `<body/>` element.
Everything related to metadata happens as expected in the file's `<teiHeader/>`
which is well enough equipped to handle them. In order to present our scheme
throughout the following section we will be progressively encoding a reference
article, "Cathète" from tome 9.
![La Grande Encyclopédie, tome 9, article "Cathète"](ressources/cathète_t9.png)
### The scheme
Each article is represented by a `<div/>`. We suggest setting an `xml:id`
attribute on it whose value is the head word of the entry, normalized to
lowercase, stripped of spaces and with all non-alphanumerical characters
replaced by a dash `'-'` to avoid issues with the XML encoding. To make this
value unique, it is suffixed with a number representing the rank of the article
among the various occurrences of the head word, even when there is only one,
for the sake of regularity.
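
As an illustration, the normalization could look like the following Python sketch. It is not the reference implementation: the function name `article_id` is hypothetical, and several details are assumptions of ours, namely that "stripping spaces" means trimming leading and trailing whitespace, that accented letters count as alphanumerical, and that the rank is joined to the head word with a dash.

```python
def article_id(headword: str, rank: int) -> str:
    """Build a candidate xml:id value from a head word and its rank.

    Assumptions: surrounding whitespace is trimmed, accented letters are
    kept as they are, and the rank is appended after a dash.
    """
    normalized = headword.strip().lower()
    # Replace every character that is neither a letter nor a digit by a dash.
    dashed = "".join(c if c.isalnum() else "-" for c in normalized)
    return f"{dashed}-{rank}"
```

With these assumptions, inner spaces and apostrophes both become dashes. A head word starting with a digit would still need special handling, since an XML identifier cannot begin with one; the sketch does not address that case.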
![](snippets/cathète_0.png)
Inside this element should be a `<head/>` enclosing the headword of the article.
The usual sub-`<hi/>` elements are available within `<head/>` if the headword is
highlighted by any special typographic means such as bold, small capitals, etc.
This element should also contain the optional subject indicator within
parentheses that sometimes accompanies the headword, with the appropriate standard
elements like `<persName/>` occurring in biographical articles or `<interp/>`
with a `theme` attribute if the article is given a specific domain in a
taxonomy.
![](snippets/cathète_1.png)
We then propose to wrap each different meaning in a separate `<div/>` with the
`type` attribute set to `sense` to refer to the `<sense/>` element that would
have been used within the *dictionaries* module. Each sense should be numbered
with the `n` attribute.
![](snippets/cathète_2.png)
In addition, each line within the article must start with a `<lb/>` to mark its
beginning, including before the `<head/>` element. Although this setup is
surprising, it underlines the fact that in the dense layout of encyclopedias,
the carriage return separating two articles is meaningful. Stating each new line
explicitly also keeps enough information to reconstruct a faithful facsimile,
and has the advantage of highlighting the fact that even though the definition
is separated from the headword by being in a separate XML element, they still
occur on the same line, which is a typographic choice usually made both in
encyclopedias and dictionaries where space is at a premium.
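
The structure described so far can be summed up by the following Python sketch built on the standard `xml.etree.ElementTree` module. It is a minimal illustration, not the output of any existing tool: the function `article_skeleton` is hypothetical, the TEI namespace is omitted for brevity, and placing the first `<lb/>` inside the article's `<div/>` as well as putting each sense directly in a `<p/>` are assumptions of ours.

```python
import xml.etree.ElementTree as ET

# Clark notation for the xml:id attribute; ElementTree serializes it back
# with the predefined "xml" prefix.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def article_skeleton(identifier, headword, senses):
    """Return a <div/> for one article, with one sub-<div/> per sense."""
    article = ET.Element("div", {XML_ID: identifier})
    ET.SubElement(article, "lb")  # line beginning, placed before the head
    head = ET.SubElement(article, "head")
    head.text = headword
    for number, text in enumerate(senses, start=1):
        sense = ET.SubElement(article, "div", {"type": "sense", "n": str(number)})
        paragraph = ET.SubElement(sense, "p")
        paragraph.text = text
    return article
```

For instance, `ET.tostring(article_skeleton("some-id", "HEADWORD", ["first meaning", "second meaning"]), encoding="unicode")` serializes the resulting tree, where all three arguments are placeholder values.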
https://tei-c.org/release/doc/tei-p5-doc/en/html/ref-entryFree.html
Finally, the various sections and sub-sections occurring within the article body
may be nested as usual with `<div/>` and sub-`<div/>`s, filled with `<p/>` for
paragraphs which can each be titled with `<head/>` elements local to each
`<div/>`.
- gLike
- model.entryPart
- model.inter
- model.global
- model.morphLike
- model.phrase
![](snippets/cathète_3.png)
diff:
But a typical page of an encyclopedia also features peritext elements, giving
information to the reader about the current page number along with the headwords
of the first and last articles appearing on the page.
- hom
+ gLike
- model.entryPart.top
+ model.entryPart
+ model.inter
- model.ptrLike
+ model.morphLike
+ model.phrase
- pc
- sense
Depending
gLike: glyphs…
model.entryPart: non-morphological elements like usage (<usg/>), collocation
(<colloc/>) or <sense/>
model.inter: inter §
model.morphLike: morphological elements
model.phrase: individual words or phrases (within § so still no)
Moreover, the layout is
often
=> + free text, but still no structure ! (<div/>, <p/>…)
### Currently implemented
## The *core* module
The reference implementation for this encoding scheme is the program `soprano`
developed within the scope of project DISCO-LGE. Though this software is already
useful to segment the text of the encyclopedia into articles and encode them
into XML-TEI, it doesn't yet follow the above specification perfectly. Here is,
for instance, the encoded version of the article "Cathète" that it currently produces:
### Implemented
![](snippets/cathète_current.png)
### Left-overs
The headword detection system is not able to capture the subject indicators yet,
so they appear outside of the `<head/>` element. Likewise, since the detection of
titles at the beginning of each section isn't complete, no structure analysis is
performed on the content of the article.
## The constraints of automated processing
......
......@@ -9,11 +9,11 @@ ICHLL_Brenon.pdf: $(DEPEDENCIES)
ICHLL_Brenon.docx: $(DEPEDENCIES)
%.pdf: %.md
pandoc $< -o $@
LANG=fr_FR.UTF-8 pandoc $< -o $@
%.png: %.pdf
pdftocairo -png -singlefile -r 400 $^ $(basename $@)
%.docx: %.md
pandoc $< -o $@
LANG=fr_FR.UTF-8 pandoc $< -o $@
ressources/cathète_t9.png

310 KiB
