to encode respectively the *Petit Larousse Illustré* published by Pierre
to encode respectively the *Petit Larousse Illustré* published by Pierre
Larousse in 1905 [@bohbot2018, p. 1], roughly contemporary to *LGE*, and the
Larousse in 1905 [@bohbot2018, p. 1], roughly contemporary to *LGE*, and the
*Dictionnaire Universel* by Furetière, or rather its second edition edited by
*Dictionnaire Universel* by Furetière, or rather its second version edited by
Henri Basnage de Beauval, an encyclopedic dictionary from the very early 18^th^
Henri Basnage de Beauval, an encyclopedic dictionary from the very early 18^th^
century [@williams2017, p. 1]. These successes suggested it to be a useful tool
century [@williams2017, p. 1]. These successes suggested it to be a useful tool
to encode encyclopedias but a few differences remained between both projects and
to encode encyclopedias but a few differences remained between both projects and
...
@@ -313,7 +313,7 @@ TEI framework could build.
...
@@ -313,7 +313,7 @@ TEI framework could build.
The XML-TEI guidelines graph will hence be defined as follows. One node is
The XML-TEI guidelines graph will hence be defined as follows. One node is
created for each one of the 590 elements found in the specification. Then, an
created for each one of the 590 elements found in the specification. Then, an
edge is placed between source node `A` and destination `B` if the schema states
edge is placed between source node `A` and destination `B` if the schema states
that the element represented by `B` can be contained directly under the element
that the element represented by `B` can be contained directly by the element
represented by `A`. That is, the edges in the graph represent the relation "is
represented by `A`. That is, the edges in the graph represent the relation "is
an admissible direct parent of" (written infix, as in "A is connected to B" if
an admissible direct parent of" (written infix, as in "A is connected to B" if
and only if "A is an admissible direct parent of B"). Please note that the word
and only if "A is an admissible direct parent of B"). Please note that the word
...
@@ -347,8 +347,8 @@ length of an inclusion path will be called its *depth*.
...
@@ -347,8 +347,8 @@ length of an inclusion path will be called its *depth*.
The ability for an element to contain itself corresponds directly to loops on
The ability for an element to contain itself corresponds directly to loops on
the graph (that is an edge from a node to itself) as can be illustrated by the
the graph (that is an edge from a node to itself) as can be illustrated by the
`<abbr/>` element: an `<abbr/>` element (abbreviation) can directly contain
`<entry/>` element on figure \ref{fig:dictionaries-subgraph}: an `<entry/>`
another one.
element can directly contain another one.
The generalisation of this to inclusion paths of any length greater than one is
The generalisation of this to inclusion paths of any length greater than one is
usually called a cycle and it appears natural to refine this and name them
usually called a cycle and it appears natural to refine this and name them
...
@@ -365,7 +365,7 @@ through a `<form/>` or a `<gramGrp/>` because a thorough traversal reporting all
...
@@ -365,7 +365,7 @@ through a `<form/>` or a `<gramGrp/>` because a thorough traversal reporting all
the possible paths will contain `entry-form-pos` and `entry-gramGrp-pos`. It is
the possible paths will contain `entry-form-pos` and `entry-gramGrp-pos`. It is
left to the human encoder to rate the relevance of the path found and to select
left to the human encoder to rate the relevance of the path found and to select
an appropriate one. A total lack of path proves the impossibility of an
an appropriate one. A total lack of path proves the impossibility of an
inclusion; an abnormally high length for the shortest path is a serious hint
inclusion; an abnormally high depth for the shortest path is a serious hint
that the inclusion should not be possible and is not meaningful.
that the inclusion should not be possible and is not meaningful.
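The notions of loop, inclusion path and depth used above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the `CONTAINS` dictionary is a hand-written toy excerpt and not the real XML-TEI schema, the function names are hypothetical, and depth is counted here as the number of edges of a path.

```python
from collections import deque

# Toy excerpt of the containment relation (an assumption for illustration,
# not the real XML-TEI schema): parent -> admissible direct children.
CONTAINS = {
    "entry": {"entry", "form", "gramGrp", "sense"},
    "form": {"pos"},
    "gramGrp": {"pos"},
    "sense": {"def"},
    "pos": set(),
    "def": set(),
}


def has_loop(graph, element):
    """Return True if the element can directly contain itself."""
    return element in graph.get(element, set())


def shortest_depth(graph, source, target):
    """Return the number of edges of the shortest inclusion path, or None."""
    queue = deque([(source, 0)])
    seen = {source}
    while queue:
        node, depth = queue.popleft()
        for child in graph.get(node, set()):
            if child == target:
                return depth + 1
            if child not in seen:
                seen.add(child)
                queue.append((child, depth + 1))
    return None  # no path at all: the inclusion is impossible


def simple_paths(graph, source, target, max_depth):
    """List the paths of at most max_depth edges that visit no node twice."""
    paths = []

    def walk(node, path):
        if len(path) - 1 >= max_depth:
            return
        for child in sorted(graph.get(node, set())):
            if child in path:
                continue  # skip loops and cycles
            if child == target:
                paths.append(path + [child])
            else:
                walk(child, path + [child])

    walk(source, [source])
    return paths
```

On this toy graph, `simple_paths(CONTAINS, "entry", "pos", 2)` reports both `entry-form-pos` and `entry-gramGrp-pos`, leaving the choice between them to the human encoder, while `shortest_depth(CONTAINS, "pos", "entry")` returns `None` because no inclusion path exists in that direction.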
Another relevant example of the use of these methods can be given by querying
Another relevant example of the use of these methods can be given by querying
...
@@ -465,8 +465,8 @@ Secondly, although examples of connections from this module to the rest of the
...
@@ -465,8 +465,8 @@ Secondly, although examples of connections from this module to the rest of the
XML-TEI have been evidenced in this section, especially to the *core* module (to
XML-TEI have been evidenced in this section, especially to the *core* module (to
which belongs for example the `<ref/>` element), the *dictionaries* module
which belongs for example the `<ref/>` element), the *dictionaries* module
appears somewhat isolated from important structural elements like `<head/>` or
appears somewhat isolated from important structural elements like `<head/>` or
`<div/>`. Indeed, computing all the paths from either `<entry/>` or `<sense/>`
`<div/>`. Indeed, computing all the paths of length less than or equal to 5 from
elements to the latter of length less than or equal to 5 by a systematic traversal
either `<entry/>` or `<sense/>` elements to the latter by a systematic traversal
of the graph yields exclusively paths (respectively 8 943 and 38 649 of them
of the graph yields exclusively paths (respectively 8 943 and 38 649 of them
excluding loops) containing either a `<floatingText/>` or an `<app/>` element.
excluding loops) containing either a `<floatingText/>` or an `<app/>` element.
The first one, as its name aptly suggests, is used to encode text that does not
The first one, as its name aptly suggests, is used to encode text that does not
...
@@ -530,7 +530,7 @@ knowledge, and the occurrence at the beginning of articles, more than a tool to
...
@@ -530,7 +530,7 @@ knowledge, and the occurrence at the beginning of articles, more than a tool to
clear up possible ambiguities also points the reader to the correct place in
clear up possible ambiguities also points the reader to the correct place in
this mind map.
this mind map.
)](ressources/arbre.png){width=300px #fig:systeme-figure}
)](ressources/arbre.png){#fig:systeme-figure}
The situation regarding subject indicators is hardly better outside of the
The situation regarding subject indicators is hardly better outside of the
module. The `<domain/>` element despite its name belongs exclusively in the
module. The `<domain/>` element despite its name belongs exclusively in the
...
@@ -559,7 +559,7 @@ describing their relation to events and other persons comes out even further
...
@@ -559,7 +559,7 @@ describing their relation to events and other persons comes out even further
from the notion of meaning. Entries such as the one about SANJO Sanetomi (see
from the notion of meaning. Entries such as the one about SANJO Sanetomi (see
Figure @fig:sanjo) do not constitute a *definition*.
Figure @fig:sanjo) do not constitute a *definition*.
)](ressources/sanjo_t29.png){#fig:sanjo}
)](ressources/sanjo_t29.png){#fig:sanjo width=65%}
Moreover, encyclopedias, because of all that they have inherited from the
Moreover, encyclopedias, because of all that they have inherited from the
philosophical Enlightenment, are not only spaces designed to assert, they also
philosophical Enlightenment, are not only spaces designed to assert, they also
...
@@ -568,7 +568,7 @@ basis required to understand the complexity of an issue and invite the reader to
...
@@ -568,7 +568,7 @@ basis required to understand the complexity of an issue and invite the reader to
consider it without providing a definitive answer, going as far as to explicitly
consider it without providing a definitive answer, going as far as to explicitly
use question marks as in the article "Action" displayed in Figure @fig:action.
use question marks as in the article "Action" displayed in Figure @fig:action.
)](ressources/action_t1.png){#fig:action}
)](ressources/action_t1.png){#fig:action width=65%}
In this extract, the author devises a hypothetical situation to illustrate how
In this extract, the author devises a hypothetical situation to illustrate how
difficult it is to draw the line between two supposedly mutually exclusive
difficult it is to draw the line between two supposedly mutually exclusive
...
@@ -579,9 +579,9 @@ idea that the term eludes definition, wrapping it in a `<sense/>`, or worse, a
...
@@ -579,9 +579,9 @@ idea that the term eludes definition, wrapping it in a `<sense/>`, or worse, a
As a result, the use of `<sense/>` and `<def/>` is not appropriate for
As a result, the use of `<sense/>` and `<def/>` is not appropriate for
encyclopedic content in general.
encyclopedic content in general.
The final difficulty can be considered as a partial consequence of the previous
The final difficulty can be considered a partial consequence of the previous one
one on the structure of articles. The difficulty to define complex concepts is
on the structure of articles. The difficulty to define complex concepts is the
the very reason why authors approach their subjects from various angles,
very reason why authors approach their subjects from various angles,
circumnavigating it as a best approximation. This strategy favours long,
circumnavigating it as a best approximation. This strategy favours long,
structured developments with sections and subsections covering the multiple
structured developments with sections and subsections covering the multiple
aspects of the topic: from a historical, political, scientific point of view…
aspects of the topic: from a historical, political, scientific point of view…
...
@@ -613,23 +613,22 @@ Filtering the content of the module to keep only the elements which can at the
...
@@ -613,23 +613,22 @@ Filtering the content of the module to keep only the elements which can at the
same time contain themselves, be included under `<entry/>` and include a `<p/>`
same time contain themselves, be included under `<entry/>` and include a `<p/>`
and either the `<head/>` or `<title/>` elements yields absolutely no candidates.
and either the `<head/>` or `<title/>` elements yields absolutely no candidates.
It is remarkable that even replacing the `<entry/>` element for the root of each
It is remarkable that even replacing the `<entry/>` element for the root of each
article with an `<entryFree/>`, an element supposed to relax some constraint to
article with an `<entryFree/>`, an element supposed to relax the constraints to
accommodate more unusual structure in dictionaries does not bring any
accommodate more unusual structures in dictionaries does not bring any
improvement.
improvement.
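The filter described in this paragraph can be expressed as a simple predicate over the containment graph. The sketch below is a minimal illustration, not the query actually run on the specification: the `TOY` graph, the `MODULE` set and the `candidates` function are all hypothetical and only cover a handful of elements.

```python
# Toy containment relation (an assumption for illustration, not the real
# XML-TEI schema): parent -> admissible direct children.
TOY = {
    "entry": {"entry", "form", "sense"},
    "sense": {"sense", "def"},
    "form": {"form", "orth"},
    "div": {"div", "head", "p"},
}

# Hypothetical subset of elements standing in for the *dictionaries* module.
MODULE = {"entry", "form", "sense"}


def candidates(graph, module, root="entry"):
    """Return the elements of `module` that contain themselves, can sit
    directly under `root`, and can directly hold <p/> plus <head/> or
    <title/>."""
    found = []
    for element in sorted(module):
        children = graph.get(element, set())
        if (
            element in children  # loop: contains itself
            and element in graph.get(root, set())  # included under root
            and "p" in children
            and {"head", "title"} & children
        ):
            found.append(element)
    return found
```

With this toy data, `candidates(TOY, MODULE)` returns an empty list, which mirrors the absence of candidates reported above. Relaxing the criteria amounts to replacing the direct membership tests with bounded path searches.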
The lack of results from these simple queries forces one to somewhat release the
The lack of results from these simple queries forces one to adopt a less
constraints on the encoding one is willing to use. The occurrence of an
restrictive approach to find an encoding. The occurrence of an intermediate
intermediate element could for instance be needed between the element wrapping
element could for instance be needed between the element wrapping the whole
the whole article and the recursing one used to encode each section. This
article and the recursing one used to encode each section. This "section"
"section" element could also need a companion element to be able to include
element could also need a companion element to be able to include itself, or, to
itself, or, to formalise it in terms of graph theory, the condition that this
formalise it in terms of graph theory, the condition that this element admits a
element admits a loop could be relaxed to consider instead cycles of a given
loop could be relaxed to consider instead cycles of a given (small, this still
(small, this still needs to represent a fairly direct inclusion) length to be
needs to represent a fairly direct inclusion) length to be enough.
enough. Simultaneously the maximum depth of the inclusion paths between
Simultaneously the maximum depth of the inclusion paths between `<entry/>`, the
`<entry/>`, the pair of elements and the `<p/>` element will be increased to
pair of elements and the `<p/>` element will be increased to yield more results.
yield more results.
By setting this depth to 2, that is, by accepting one intermediate element to
By setting this depth to 3, that is, by accepting one intermediate element to
occur in the middle of each one of the inclusion paths that define the structure
occur in the middle of each one of the inclusion paths that define the structure
required to encode encyclopedic discourse, 21 elements can be found, none of
required to encode encyclopedic discourse, 21 elements can be found, none of
which stands out as an obvious good solution: all paths to include the `<p/>`
which stands out as an obvious good solution: all paths to include the `<p/>`
...
@@ -641,9 +640,9 @@ works) or a `<state/>` (used to describe a temporary quality in a person or
...
@@ -641,9 +640,9 @@ works) or a `<state/>` (used to describe a temporary quality in a person or
place), again not even close to what is wanted. The paths to either `<head/>` or
place), again not even close to what is wanted. The paths to either `<head/>` or
`<title/>` are similarly disappointing. Again, changing `<entry/>` for
`<title/>` are similarly disappointing. Again, changing `<entry/>` for
`<entryFree/>` returns the exact same candidates. If that is not a definite
`<entryFree/>` returns the exact same candidates. If that is not a definite
proof that none of these elements could the investigated criteria, it is a fact
proof that none of these elements could meet the investigated criteria, it is a
that no element in this module stands out as the obvious good solution and a
fact that no element in this module stands out as the obvious good solution and
serious hint to keep looking somewhere else.
a serious hint to keep looking somewhere else.
Therefore, the search is extended again to include elements outside the
Therefore, the search is extended again to include elements outside the
*dictionaries* module which could be used to encode the sections and
*dictionaries* module which could be used to encode the sections and
...
@@ -689,7 +688,7 @@ right under the `<body/>` element representing a whole volume. Everything
...
@@ -689,7 +688,7 @@ right under the `<body/>` element representing a whole volume. Everything
related to its metadata happens as expected in the file's `<teiHeader/>` which
related to its metadata happens as expected in the file's `<teiHeader/>` which
is well enough equipped to handle them. In order to present the scheme throughout
is well enough equipped to handle them. In order to present the scheme throughout
the following section a reference article, "Cathète" from tome 9 — reproduced in
the following section a reference article, "Cathète" from tome 9 — reproduced in
Figure @fig:cathete-photo — will be progressively encoding.
Figure @fig:cathete-photo — will be encoded step by step.
)](ressources/cathète_t9.png){#fig:cathete-photo}
)](ressources/cathète_t9.png){#fig:cathete-photo}
...
@@ -715,12 +714,13 @@ sharing the same head. Thus, if an oversegmentation or a subsegmentation are
...
@@ -715,12 +714,13 @@ sharing the same head. Thus, if an oversegmentation or a subsegmentation are
fixed (meaning respectively that two "articles" get merged or that one
fixed (meaning respectively that two "articles" get merged or that one
"article" actually contained several which get split as such) only articles with
"article" actually contained several which get split as such) only articles with
the same headword are impacted. Figure @fig:cathete-xml-0 illustrates this
the same headword are impacted. Figure @fig:cathete-xml-0 illustrates this
choice for the container element on the article "Cathète" previously displayed.
choice for the container element on the article "Cathète" displayed on figure
\ref{fig:cathete-photo}.
{#fig:cathete-xml-0}
{#fig:cathete-xml-0}
Inside this element should be a `<head/>` enclosing the headword of the article.
Inside this element should be a `<head/>` enclosing the headword of the article.
The usual sub-`<hi/>` elements are available within `<head/>` if the headword is
The usual `<hi/>` elements are available within `<head/>` if the headword is
highlighted by any special typographic means such as bold, small capitals, etc.
highlighted by any special typographic means such as bold, small capitals, etc.
The one disappointment of the encoding scheme being defined in this chapter is
The one disappointment of the encoding scheme being defined in this chapter is
the lack of support for a proper way to encode subject indicators.
the lack of support for a proper way to encode subject indicators.
...
@@ -746,9 +746,9 @@ choice applied to the same article "Cathète" produces Figure @fig:cathete-xml-1
...
@@ -746,9 +746,9 @@ choice applied to the same article "Cathète" produces Figure @fig:cathete-xml-1
Each different meaning could then be wrapped in a separate `<div/>` with the
Each different meaning could then be wrapped in a separate `<div/>` with the
`type` attribute set to `sense` to refer to the `<sense/>` element that would
`type` attribute set to `sense` to refer to the `<sense/>` element that would
have been used within the *core* module. The `<div/>`s should be numbered
have been used within the *dictionaries* module. The `<div/>`s should be
according to the order they appear in with the `n` attribute starting from `0`
numbered according to the order they appear in with the `n` attribute starting
as shown in Figure @fig:cathete-xml-2.
from `0` as shown in Figure @fig:cathete-xml-2.
{#fig:cathete-xml-2}
{#fig:cathete-xml-2}
...
@@ -761,7 +761,7 @@ information to reconstruct a faithful facsimile but it also has the advantage of
...
@@ -761,7 +761,7 @@ information to reconstruct a faithful facsimile but it also has the advantage of
highlighting the fact that even though the definition is cut from the headword
highlighting the fact that even though the definition is cut from the headword
by being in a separate XML element, they still occur on the same line, which is
by being in a separate XML element, they still occur on the same line, which is
a typographic choice usually made both in encyclopedias and dictionaries where
a typographic choice usually made both in encyclopedias and dictionaries where
space is at a premium. .
space is at a premium.
To complete the structure, the various sections and subsections occurring
To complete the structure, the various sections and subsections occurring
within the article body may be nested as usual with `<div/>` and sub-`<div/>`s,
within the article body may be nested as usual with `<div/>` and sub-`<div/>`s,
...
@@ -790,7 +790,7 @@ sole purpose of placing an `<xr/>` but this would add unwanted verbosity to the
...
@@ -790,7 +790,7 @@ sole purpose of placing an `<xr/>` but this would add unwanted verbosity to the
encoding and implicitly suggest that the previous context was not the one of a
encoding and implicitly suggest that the previous context was not the one of a
dictionary which is rather problematic.
dictionary which is rather problematic.
)](ressources/gelocus_t18.png){#fig:gelocus-photo}
)](ressources/gelocus_t18.png){#fig:gelocus-photo width=65%}
{#fig:gelocus-xml}
{#fig:gelocus-xml}
...
@@ -836,27 +836,24 @@ entry's `<div/>` element instead of under a set of nested `<div/>` elements. The
...
@@ -836,27 +836,24 @@ entry's `<div/>` element instead of under a set of nested `<div/>` elements. The
paragraphs are not yet identified and for this reason not encoded.
paragraphs are not yet identified and for this reason not encoded.
However, the figures and their captions are already handled correctly when they
However, the figures and their captions are already handled correctly when they
occur. The encoder also keeps track of the current lines, pages, and columns and
occur. The encoder also keeps track of the current lines, pages, and columns to
inserts the corresponding empty elements (`<lb/>`, `<pb/>` or `<cb/>`) and
insert the corresponding empty elements (`<lb/>`, `<pb/>` or `<cb/>`) and number
numbers pages so that the numbering corresponding to the physical pages are
pages according to the order of the physical pages in the book, as compared to
available, as compared to the "high-level" page numbers inserted by the
the "high-level" page numbers inserted by the editors, which start with an
editors, which start with an offset because the first, blank or almost empty
offset because the first, blank or almost empty pages at the beginning of each
pages at the beginning of each book do not have a number and which sometimes have
book do not have a number and which sometimes have gaps when a full-page
gaps when a full-page geographical map is inserted since those are printed
geographical map is inserted since those are printed separately on a different
separately on a different folio which remains outside of the textual numbering
folio which remains outside of the textual numbering system. The place at which
system. The place at which these layout-related elements occur is determined by
these layout-related elements occur is determined by the place where the OCR
the place where the OCR software detected them and by the reordering performed
software detected them and by the reordering performed by `soprano` when
by `soprano` when inferring the reading order before segmenting the articles.
inferring the reading order before segmenting the articles.
## The constraints of automated processing
## The constraints of automated processing
Encyclopedias are particularly long books, spanning numerous tomes and
Encyclopedias are particularly long books, spanning numerous tomes and
containing several tens of thousands of articles. The *EDdA* comprises
containing several tens of thousands of articles. The *EDdA* comprises over
over 74k articles and *LGE* certainly more than 100k (the latest
74k articles and *LGE* certainly more than 100k (the latest version produced by
version produced by `soprano` created 160k articles, but their segmentation is
`soprano` created 160k articles, but their segmentation is still not perfect).
still not perfect and if some article beginnings remain undetected, all the very
long and deeply-structured articles are unduly split into many parts, resulting
globally in an overestimation of the total number).
XML-TEI is a very broad tool useful for very different applications. Some
XML-TEI is a very broad tool useful for very different applications. Some
elements like `<unclear/>` or `<factuality/>` can encode subtle semantics
elements like `<unclear/>` or `<factuality/>` can encode subtle semantics