to encode respectively the *Petit Larousse Illustré* published by Pierre
Larousse in 1905 [@bohbot2018, p. 1], roughly contemporary to *LGE*, and the
*Dictionnaire Universel* by Furetière, or rather its second edition edited by
*Dictionnaire Universel* by Furetière, or rather its second version edited by
Henri Basnage de Beauval, an encyclopedic dictionary from the very early 18^th^
century [@williams2017, p. 1]. These successes suggested it to be a useful tool
to encode encyclopedias but a few differences remained between both projects and
...
...
@@ -313,7 +313,7 @@ TEI framework could build.
The XML-TEI guidelines graph will hence be defined as follows. One node is
created for each one of the 590 elements found in the specification. Then, an
edge is placed between source node `A` and destination `B` if the schema states
that the element represented by `B` can be contained directly under the element
that the element represented by `B` can be contained directly by the element
represented by `A`. That is, the edges in the graph represent the relation "is
an admissible direct parent of" (written infix, as in "A is connected to B" if
and only if "A is an admissible direct parent of B"). Please note that the word
...
...
@@ -347,8 +347,8 @@ length of an inclusion path will be called its *depth*.
The ability for an element to contain itself corresponds directly to loops on
the graph (that is an edge from a node to itself) as can be illustrated by the
`<abbr/>` element: an `<abbr/>` element (abbreviation) can directly contain
another one.
`<entry/>` element on figure \ref{fig:dictionaries-subgraph}: an `<entry/>`
element (dictionary entry) can directly contain another one.
The generalisation of this to inclusion paths of any length greater than one is
usually called a cycle and it appears natural to refine this and name them
...
...
@@ -365,7 +365,7 @@ through a `<form/>` or a `<gramGrp/>` because a thorough traversal reporting all
the possible paths will contain `entry-form-pos` and `entry-gramGrp-pos`. It is
left to the human encoder to rate the relevance of the path found and to select
an appropriate one. A total lack of path proves the impossibility of an
inclusion; an abnormally high length for the shortest path is a serious hint
inclusion; an abnormally high depth for the shortest path is a serious hint
that the inclusion should not be possible and is not meaningful.
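
The traversal described above can be sketched on a toy excerpt of the graph. This is only an illustration: the `GRAPH` dictionary is limited to the four elements mentioned in the text, and the name `inclusion_paths` is hypothetical, not the tool actually used in this work.

```python
from collections import deque

# Toy excerpt of the inclusion graph (assumption: only the edges mentioned in
# the text, not the full 590-element specification).
GRAPH = {
    "entry": ["form", "gramGrp"],
    "form": ["pos"],
    "gramGrp": ["pos"],
    "pos": [],
}


def inclusion_paths(graph, source, target, max_depth):
    """Return every loop-free path from source to target of depth <= max_depth.

    The depth of a path is its number of edges, as defined in the text.
    """
    found = []
    queue = deque([[source]])
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == target and len(path) > 1:
            found.append(path)
            continue
        if len(path) - 1 >= max_depth:
            continue  # depth budget exhausted, do not extend this path
        for child in graph.get(node, []):
            if child not in path:  # skip loops and cycles
                queue.append(path + [child])
    return found


paths = inclusion_paths(GRAPH, "entry", "pos", max_depth=3)
shortest_depth = min(len(p) - 1 for p in paths) if paths else None
```

Because the traversal is breadth-first, the paths come out sorted by increasing depth. An empty result within the chosen `max_depth` only shows that no path exists up to that bound; proving that an inclusion is impossible requires exhausting the whole graph.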
Another relevant example of the use of these methods can be given by querying
...
...
@@ -465,8 +465,8 @@ Secondly, although examples of connections from this module to the rest of the
XML-TEI have been evidenced in this section, especially to the *core* module (to
which belongs for example the `<ref/>` element), the *dictionaries* module
appears somewhat isolated from important structural elements like `<head/>` or
`<div/>`. Indeed, computing all the paths from either `<entry/>` or `<sense/>`
elements to the latter of length shorter or equal to 5 by a systematic traversal
`<div/>`. Indeed, computing all the paths of length at most 5 from either
`<entry/>` or `<sense/>` elements to the latter by a systematic traversal
of the graph yields exclusively paths (respectively 8 943 and 38 649 of them
excluding loops) containing either a `<floatingText/>` or an `<app/>` element.
The first one, as its name aptly suggests, is used to encode text that does not
...
...
@@ -530,7 +530,7 @@ knowledge, and the occurrence at the beginning of articles, more than a tool to
clear up possible ambiguities also points the reader to the correct place in
this mind map.
)](ressources/arbre.png){width=300px #fig:systeme-figure}
)](ressources/arbre.png){#fig:systeme-figure}
The situation regarding subject indicators is hardly better outside of the
module. The `<domain/>` element despite its name belongs exclusively in the
...
...
@@ -559,7 +559,7 @@ describing their relation to events and other persons comes out even further
from the notion of meaning. Entries such as the one about SANJO Sanetomi (see
Figure @fig:sanjo) do not constitute a *definition*.
)](ressources/sanjo_t29.png){#fig:sanjo}
)](ressources/sanjo_t29.png){#fig:sanjo width=65%}
Moreover, encyclopedias, because of all that they have inherited from the
philosophical Enlightenment, are not only spaces designed to assert, they also
...
...
@@ -568,7 +568,7 @@ basis required to understand the complexity of an issue and invite the reader to
consider it without providing a definitive answer, going as far as to explicitly
use question marks as in the article "Action" displayed in Figure @fig:action.
)](ressources/action_t1.png){#fig:action}
)](ressources/action_t1.png){#fig:action width=65%}
In this extract, the author devises a hypothetical situation to illustrate how
difficult it is to draw the line between two supposedly mutually exclusive
...
...
@@ -579,9 +579,9 @@ idea that the term eludes definition, wrapping it in a `<sense/>`, or worse, a
As a result, the use of `<sense/>` and `<def/>` is not appropriate for
encyclopedic content in general.
The final difficulty can be considered as a partial consequence of the previous
one on the structure of articles. The difficulty to define complex concepts is
the very reason why authors approach their subjects from various angles,
The final difficulty can be considered a partial consequence of the previous one
on the structure of articles. The difficulty of defining complex concepts is the
very reason why authors approach their subjects from various angles,
circumnavigating it as a best approximation. This strategy favours long,
structured developments with sections and subsections covering the multiple
aspects of the topic: from a historical, political, scientific point of view…
...
...
@@ -613,23 +613,22 @@ Filtering the content of the module to keep only the elements which can at the
same time contain themselves, be included under `<entry/>` and include a `<p/>`
and either the `<head/>` or `<title/>` elements yields absolutely no candidates.
It is remarkable that even replacing the `<entry/>` element for the root of each
article with an `<entryFree/>`, an element supposed to relax some constraint to
accomodate more unusual structure in dictionaries does not bring any
article with an `<entryFree/>`, an element supposed to relax the constraints to
accommodate more unusual structures in dictionaries, does not bring any
improvement.
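
The filter described in this paragraph can be written as a conjunction of edge tests on the graph. The sketch below is only an illustration: the edges in `GRAPH` are a small hand-picked sample assumed for the example (the real ones come from the schema), and the names `candidates` and `DICTIONARIES` are hypothetical.

```python
# Toy sample of edges "A is an admissible direct parent of B" (assumption:
# hand-picked for illustration, to be replaced by the edges from the schema).
GRAPH = {
    "body": ["div"],
    "div": ["div", "head", "p"],
    "entry": ["entry", "form", "sense"],
    "sense": ["sense", "form"],
    "form": ["form"],
}

# Toy sample of the elements belonging to the *dictionaries* module.
DICTIONARIES = {"entry", "sense", "form"}


def candidates(graph, elements, root="entry"):
    """Keep the elements that have a loop, sit directly under root, and
    directly include a <p/> and either a <head/> or a <title/>."""
    kept = []
    for element in sorted(elements):
        children = set(graph.get(element, []))
        if (
            element in children                   # can contain itself
            and element in graph.get(root, [])    # can be placed under root
            and "p" in children                   # can include a paragraph
            and children & {"head", "title"}      # can include a heading
        ):
            kept.append(element)
    return kept


in_module = candidates(GRAPH, DICTIONARIES)
outside_module = candidates(GRAPH, {"div"}, root="body")
```

On this sample the first query returns an empty list, because none of the three elements directly includes a `<p/>`. The second query shows what a successful match looks like when the root element is changed.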
The lack of results from these simple queries forces one to somewhat release the
constraints on the encoding one is willing to use. The occurrence of an
intermediate element could for instance be needed between the element wrapping
the whole article and the recursing one used to encode each section. This
"section" element could also need a companion element to be able to include
itself, or, to formalise it in terms of graph theory, the condition that this
element admits a loop could be relaxed to consider instead cycles of a given
(small, this still needs to represent a fairly direct inclusion) length to be
enough. Simultaneously the maximum depth of the inclusion paths between
`<entry/>`, the pair of elements and the `<p/>` element will be increased to
yield more results.
By setting this depth to 3, that is, by accepting one intermediate element to
The lack of results from these simple queries forces one to adopt a less
restrictive approach to find an encoding. The occurrence of an intermediate
element could for instance be needed between the element wrapping the whole
article and the recursing one used to encode each section. This "section"
element could also need a companion element to be able to include itself, or, to
formalise it in terms of graph theory, the condition that this element admits a
loop could be relaxed to consider instead cycles of a given (small, this still
needs to represent a fairly direct inclusion) length to be enough.
Simultaneously the maximum depth of the inclusion paths between `<entry/>`, the
pair of elements and the `<p/>` element will be increased to yield more results.
By setting this depth to 2, that is, by accepting one intermediate element to
occur in the middle of each one of the inclusion paths that define the structure
required to encode encyclopedic discourse, 21 elements can be found, none of
which stands out as an obvious good solution: all paths to include the `<p/>`
...
...
@@ -641,9 +640,9 @@ works) or a `<state/>` (used to describe a temporary quality in a person or
place), again not even close to what is wanted. The paths to either `<head/>` or
`<title/>` are similarly disappointing. Again, changing `<entry/>` for
`<entryFree/>` returns the exact same candidates. If that is not a definite
proof that none of these elements could the investigated criteria, it is a fact
than no element in this module stands out as the obvious good solution and a
serious hint to keep looking somewhere else.
proof that none of these elements could meet the investigated criteria, it is a
fact that no element in this module stands out as the obvious good solution and
a serious hint to keep looking somewhere else.
Therefore, the search is extended again to include elements outside the
*dictionaries* module which could be used to encode the sections and
...
...
@@ -689,7 +688,7 @@ right under the `<body/>` element representing a whole volume. Everything
related to its metadata happens as expected in the file's `<teiHeader/>` which
is well enough equipped to handle them. In order to present the scheme throughout
the following section a reference article, "Cathète" from tome 9 — reproduced in
Figure @fig:cathete-photo — will be progressively encoding.
Figure @fig:cathete-photo — will be encoded step by step.
)](ressources/cathète_t9.png){#fig:cathete-photo}
...
...
@@ -715,12 +714,13 @@ sharing the same head. Thus, if an oversegmentation or a subsegmentation are
fixed (meaning respectively that two "articles" get merged or that one
"article" actually contained several which get split as such) only articles with
the same headword are impacted. Figure @fig:cathete-xml-0 illustrates this
choice for the container element on the article "Cathète" previously displayed.
choice for the container element on the article "Cathète" displayed on figure
\ref{fig:cathete-photo}.
{#fig:cathete-xml-0}
Inside this element should be a `<head/>` enclosing the headword of the article.
The usual sub-`<hi/>` elements are available within `<head/>` if the headword is
The usual `<hi/>` elements are available within `<head/>` if the headword is
highlighted by any special typographic means such as bold, small capitals, etc.
The one disappointment of the encoding scheme being defined in this chapter is
the lack of support for a proper way to encode subject indicators.
...
...
@@ -746,9 +746,9 @@ choice applied to the same article "Cathète" produces Figure @fig:cathete-xml-1
Each different meaning could then be wrapped in a separate `<div/>` with the
`type` attribute set to `sense` to refer to the `<sense/>` element that would
have been used within the *core* module. The `<div/>`s should be numbered
according to the order they appear in with the `n` attribute starting from `0`
as shown in Figure @fig:cathete-xml-2.
have been used within the *dictionaries* module. The `<div/>`s should be
numbered according to the order they appear in with the `n` attribute starting
from `0` as shown in Figure @fig:cathete-xml-2.
{#fig:cathete-xml-2}
...
...
@@ -761,7 +761,7 @@ information to reconstruct a faithful facsimile but it also has the advantage of
highlighting the fact that even though the definition is cut from the headword
by being in a separate XML element, they still occur on the same line, which is
a typographic choice usually made both in encyclopedias and dictionaries where
space is at a premium. .
space is at a premium.
To complete the structure, the various sections and subsections occurring
within the article body may be nested as usual with `<div/>` and sub-`<div/>`s,
...
...
@@ -790,7 +790,7 @@ sole purpose of placing an `<xr/>` but this would add unwanted verbosity to the
encoding and implicitly suggest that the previous context was not the one of a
dictionary which is rather problematic.
)](ressources/gelocus_t18.png){#fig:gelocus-photo}
)](ressources/gelocus_t18.png){#fig:gelocus-photo width=65%}
{#fig:gelocus-xml}
...
...
@@ -836,27 +836,24 @@ entry's `<div/>` element instead of under a set of nested `<div/>` elements. The
paragraphs are not yet identified and for this reason not encoded.
However, the figures and their captions are already handled correctly when they
occur. The encoder also keeps track of the current lines, pages, and columns and
inserts the corresponding empty elements (`<lb/>`, `<pb/>` or `<cb/>`) and
numbers pages so that the numbering corresponding to the physical pages are
available, as compared to the "high-level" pages numbers inserted by the
editors, which start with an offset because the first, blank or almost empty
pages at the beginning of each book do not have a number and which sometimes have
gaps when a full-page geographical map is inserted since those are printed
separately on a different folio which remains outside of the textual numbering
system. The place at which these layout-related elements occur is determined by
the place where the OCR software detected them and by the reordering performed
by `soprano` when inferring the reading order before segmenting the articles.
occur. The encoder also keeps track of the current lines, pages, and columns to
insert the corresponding empty elements (`<lb/>`, `<pb/>` or `<cb/>`) and to
number pages according to the order of the physical pages in the book. This
numbering differs from the "high-level" page numbers inserted by the editors,
which start with an offset because the first, blank or almost empty pages at the
beginning of each book do not have a number, and which sometimes have gaps when a
full-page geographical map is inserted, since those are printed separately on a
different folio which remains outside of the textual numbering system. The place
at which these layout-related elements occur is determined by the place where
the OCR software detected them and by the reordering performed by `soprano` when
inferring the reading order before segmenting the articles.
## The constraints of automated processing
Encyclopedias are particularly long books, spanning numerous tomes and
containing several tenths of thousands of articles. The *EDdA* comprises
over 74k articles and *LGE* certainly more than 100k (the latest
version produced by `soprano` created 160k articles, but their segmentation is
still not perfect and if some article beginning remain undetected, all the very
long and deeply-structured articles are unduly split into many parts, resulting
globally in an overestimation of the total number).
containing several tens of thousands of articles. The *EDdA* comprises over
74k articles and *LGE* certainly more than 100k (the latest version produced by
`soprano` created 160k articles, but their segmentation is still not perfect).
XML-TEI is a very broad tool useful for very different applications. Some
elements like `<unclear/>` or `<factuality/>` can encode subtle semantics