to encode respectively the *Petit Larousse Illustré* published by Pierre
Larousse in 1905 [@bohbot2018, p. 1], roughly contemporary to *LGE*, and the
*Dictionnaire Universel* by Furetière, or rather its second edition edited by
Henri Basnage de Beauval, an encyclopedic dictionary from the very early 18^th^
century [@williams2017, p. 1]. These successes suggested that it could be a
useful tool to encode encyclopedias, but a few differences remained between both
projects and DISCO-LGE: the text studied by NENUFAR does not have the
encyclopedic dimension *LGE* has, and BASNUM studies a much older text which had
a tremendous influence on the European encyclopedic effort of the 18^th^ century
but is not as clearly separated from the dictionaric stem as *La Grande
Encyclopédie* is. For these reasons, the encoding schemes used in these projects
could not be reused directly, prompting a systematic exploration of the XML-TEI
schema to devise a new one.

This chapter discusses XML elements in depth and hence needs to name and
manipulate them. They will be represented in a monospace font, in the standard
XML self-closing form within angle brackets and with a slash following the
element name like `<div/>` for a `div` element
([https://tei-c.org/release/doc/tei-p5-doc/en/html/ref-div.html](https://tei-c.org/release/doc/tei-p5-doc/en/html/ref-div.html)).
This notation does not mean to imply that they cannot contain raw text or other
XML elements; it merely denotes such an element, without any additional
assumption. In the context of a concrete document instance this can refer to the
markup together with the whole subtree that possibly spans from it, but the same
notation will be used when considering the abstract element and the rules that
govern its use in relation to other elements or its attributes.

## A graph problem

...
@@ -249,26 +318,27 @@ almost 80 possible child elements (79.91) within any given element, manually
browsing such a massive network can prove quite difficult as the number of
combinations sharply increases with each step.

The problem can be advantageously transformed by representing this network as a
directed graph, using elements of XML-TEI as nodes and placing an edge whenever
the destination node may be contained within the source node according to the
schema. Please note that the word "element" is here used with the same meaning
as in the TEI documentation to refer to the conceptual device characterised by a
given tag name such as `p` or `div` and not to a particular instance of them
that may occur in a given document. Figure @fig:dictionaries-subgraph, by using
this transformation to display the *dictionaries* module, hints at the overall
complexity of the whole specification.

{height=830px #fig:dictionaries-subgraph}

By iterating several times the operation of moving on that graph along one edge,
that is, by considering the transitive closure of the relation "be connected by
an edge", one defines *inclusion paths*, which make it possible to explore which
elements may be nested under which others.

The nodes visited along the way represent the intermediate XML elements needed
to construct a valid XML tree according to the TEI schema. Given the top-down
semantics of those trees, the length of an inclusion path will be called its
*depth*.

The ability for an element to contain itself corresponds directly to loops on
the graph (that is, an edge from a node to itself) as can be illustrated by the
...
@@ -276,17 +346,17 @@ the graph (that is an edge from a node to itself) as can be illustrated by the
another one.

The generalisation of this to inclusion paths of any length greater than one is
usually called a cycle and it appears natural to refine this and name them
*inclusion cycles*. The `<address/>` element provides an example of this
configuration: although an `<address/>` element may not directly contain another
one, it may contain a `<geogName/>` which, in turn, may contain a new
`<address/>` element. From a graph theory perspective, one can say that it
admits an inclusion cycle of length two.

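
As an illustration of loops and inclusion cycles, the following sketch computes
the length of the shortest cycle going through a given element with a
breadth-first search. The graph literal is a toy assumption: the
`address`/`geogName` edges come from the example above, while the `div` entries
are only assumed here to show a loop. The function name
`shortest_cycle_length` is hypothetical.

```python
from collections import deque

# Toy graph: element name -> elements it may directly contain (assumed excerpt).
GRAPH = {
    "div": {"div", "p"},        # assumed loop: a div nested in a div
    "p": set(),
    "address": {"geogName"},
    "geogName": {"address"},
}


def shortest_cycle_length(graph, start):
    """Length of the shortest inclusion cycle through `start`, or None."""
    queue = deque((child, 1) for child in graph.get(start, ()))
    seen = set()
    while queue:
        node, depth = queue.popleft()
        if node == start:
            return depth  # back on the starting element: cycle closed
        if node in seen:
            continue
        seen.add(node)
        queue.extend((child, depth + 1) for child in graph.get(node, ()))
    return None
```

With this convention a loop is simply a cycle of length one, so the same query
covers both notions.
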
Using classical, well-known methods such as Dijkstra's algorithm [@dijkstra59]
lets one explore the shortest inclusion paths that exist between elements.
Though particular caution should be applied because there is no guarantee that
the shortest path is meaningful in general, it at least provides an
efficient way to check whether a given element may or may not be nested at all
under another one and gives a lower bound on the length of the path to expect.
Of course the accuracy of this heuristic decreases as the length of the elements
...
@@ -297,7 +367,7 @@ This is still very useful when taking into account the fact that TEI modules are
merely "bags" to group the elements and provide hints to human encoders about
the tools they might need but have no implication on the inclusion paths between
elements, which cross module boundaries freely. The general graph formalism
enables one to describe complex filtering patterns and to implement queries to
look for them among the elements exhaustively by algorithmic means even when the
shortest-path approach is not enough.

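
A minimal sketch of the shortest-path query follows. Since every edge of the
inclusion graph has the same cost, it uses a plain breadth-first search, which
returns the same result as Dijkstra's algorithm with unit weights; this is a
substitution made for brevity, not necessarily the implementation used for this
work. The graph literal is a toy excerpt and `shortest_inclusion_path` is a
hypothetical name.

```python
from collections import deque

# Toy graph: element name -> elements it may directly contain (assumed excerpt).
GRAPH = {
    "body": {"entry", "entryFree"},
    "entry": {"gramGrp"},
    "entryFree": {"pos"},
    "gramGrp": {"pos"},
    "pos": set(),
}


def shortest_inclusion_path(graph, source, target):
    """Return the shortest list of elements from source down to target, or None."""
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for child in sorted(graph.get(node, ())):
            if child == target:
                # Walk back up the recorded parents to rebuild the path.
                path = [child]
                while node is not None:
                    path.append(node)
                    node = parents[node]
                return path[::-1]
            if child not in parents:
                parents[child] = node
                queue.append(child)
    return None
```

A `None` result means that the target can never be nested under the source,
whatever the depth, which is the "may it be nested at all" check mentioned
above.
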
...
@@ -316,12 +386,12 @@ A last relevant example on the use of these methods can be given by querying the
shortest inclusion path of a `<pos/>` under the `<body/>` of the document: it
yields an inclusion directly through `<entryFree/>` (with an inclusion path of
length 2), which unlike `<entry/>` accepts it as a direct child node. This is
possibly not what is wanted, depending on the regularity of the articles being
encoded and the occurrence of other grammatical information such as `<case/>` or
`<gen/>` to justify the use of the `<gramGrp/>`, but searching exhaustively for
paths up to length 3 returns as expected the path through `<entry/>`, among
others. The big picture starts to appear: `<pos/>` does not need to be nested
very deep, it can appear quite near the "surface" of article entries.

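
The exhaustive search mentioned in this example can be sketched as a bounded
depth-first enumeration. The graph below is a toy excerpt built only from the
inclusions discussed in this paragraph, `inclusion_paths` is a hypothetical
name, and restricting the search to paths that never visit the same element
twice is an assumption of this sketch.

```python
# Toy graph: element name -> elements it may directly contain (assumed excerpt).
GRAPH = {
    "body": {"entry", "entryFree"},
    "entry": {"gramGrp"},
    "entryFree": {"pos"},
    "gramGrp": {"pos"},
    "pos": set(),
}


def inclusion_paths(graph, source, target, max_depth):
    """Yield every inclusion path from source to target of depth <= max_depth."""
    stack = [[source]]
    while stack:
        path = stack.pop()
        for child in sorted(graph.get(path[-1], ())):
            extended = path + [child]
            if child == target:
                yield extended
            elif child not in path and len(extended) - 1 < max_depth:
                # Keep exploring only while another edge is still allowed.
                stack.append(extended)


paths_depth_2 = sorted(inclusion_paths(GRAPH, "body", "pos", 2))
paths_depth_3 = sorted(inclusion_paths(GRAPH, "body", "pos", 3))
```

On this excerpt, raising the bound from 2 to 3 is what makes the path through
`entry` and `gramGrp` appear next to the shorter one through `entryFree`.
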
## Content of the module
...
@@ -333,15 +403,15 @@ element to the dictionary module: indeed, although `<body/>` may also contain
`<entry/>` while the latter is a device to group several related entries
together. Both can contain an `<entry/>` directly while no obvious inclusion
exists the other way around: most (> 96.2%) of the inclusion paths of
"reasonable" depth (which will be arbitrarily defined as strictly less than 5,
that is twice the average shortest depth between any two nodes) include either
`<figure/>` or `<castList/>`, two very specific elements which should not need
to appear in an article in general, showing that the purpose of `<entry/>` is
not to contain an `<entryFree/>` or `<superEntry/>`. Hence, not only the
semantics conveyed by the documentation but also the structure of the elements
graph point to `<entry/>` as the natural top-most element for an article. This
somewhat contrived example hopes to further demonstrate the application of a
graph-centred approach to understand the inner workings of the XML-TEI schema.

Once a block for an article is created, it may contain elements useful to
represent various of its features. Its written and spoken forms are usually
...
@@ -370,7 +440,7 @@ redirection, with an imperative locution like "please see […]".
The "active" part of the cross-reference, that is the very word within the
`<xr/>` that is considered to be the link or, to make a modern-day HTML
metaphor, the region that would be clickable, is represented by a `<ref/>`
element. Though it is not specific to the *dictionaries* module, it is included
in this description of the toolbox because it is particularly useful in the
context of dictionaries. This element may have a `target` attribute which points
to the other resource to be accessed by the interested reader.
...
@@ -387,7 +457,7 @@ under the `<entry/>`.
Before concluding this description of the *dictionaries* module from the
perspective of someone trying to concretely encode a particular dictionary or
encyclopedia, the graph approach is again leveraged to evidence some of its
aspects in terms of inclusion structure.

First, it is remarkable that all elements in the *dictionaries* module have a
...
@@ -405,25 +475,25 @@ official documentation. Among those (shortest) cycles, 20 include the `<cit/>`
element made to group quotations with a bibliographic reference to their source
which should clearly be unnecessary to encode an article in the general case.

Secondly, although examples of connections from this module to the rest of the
XML-TEI have been evidenced in this section, especially to the *core* module (to
which belongs for example the `<ref/>` element), the *dictionaries* module
appears somewhat isolated from important structural elements like `<head/>` or
`<div/>`. Indeed, computing all the paths from either `<entry/>` or `<sense/>`
elements to the latter of length less than or equal to 5 by a systematic
traversal of the graph yields exclusively paths (respectively 9042 and 39093 of
them) containing either a `<floatingText/>` or an `<app/>` element. The first
one, as its name aptly suggests, is used to encode text that does not quite fit
the regular flow of the document, as for example in the context of an embedded
narrative. Both examples displayed in the online documentation feature a
`<body/>` as direct child of `<floatingText/>`, neatly separating its content as
independent. The purpose of the second one, although its name (short for
apparatus) is less clear, is to wrap together several versions of the same
excerpts, for instance when there are several possible readings of an unclear
group of words in a manuscript, or when the encoder is trying to compile a
single version of a piece of work from several sources which disagree over some
passage. In both cases, it appears obvious that it is not something that is
expected to occur naturally in the course of an article in general.

Thus, despite a rather dense internal connectivity, the *dictionaries* module
fails to provide encoders with a device to represent recursively nesting
...
@@ -432,21 +502,21 @@ structures like `<div/>`.
# A new standard?

Studying the content of *La Grande Encyclopédie* and considering several
articles in particular, one can identify structures which are specific to
encyclopedias and not compatible with the *dictionaries* module presented in the
previous section. It follows that this module is not able to encode arbitrary
encyclopedic content, and a new fully TEI-compliant encoding scheme remaining
outside of it is proposed. The rest of the section is concerned with the needs
of automated encoding processes and compares the proposal with other strategies
to overcome the issues previously identified with the dedicated module for
dictionaries.

## Idiosyncrasies of encyclopedias

Browsing through the pages of an encyclopedia reveals a certain number of
noticeable differences. A comprehensive list would be difficult to draw up
because of the great variety of editorial choices, but the most obvious ones can
be discussed.

The first immediately visible feature that sets encyclopedias apart from
dictionaries and can be found in the *Encyclopédie* as well as in *La Grande
...
@@ -456,24 +526,24 @@ system. Those generally cover a broad range of subjects from scientific
disciplines to literature, and extending to political subjects and law.

No element in the *dictionaries* module is explicitly designed for the purpose
of encoding these indicators. As section @sec:dictionaries-module illustrates,
the element set is geared towards the words themselves instead of the concepts
they represent. The tool closest to what is needed can be found in the `<usg/>`
element used with a specific `type` attribute set to `dom` for "domain". Indeed
several examples from the documentation encode subject indicators very similar
to the ones found in encyclopedias within this element, but the match is not
perfect either: all appear within one of multiple senses, as if to clarify each
context in which the word can be used, as expected from the element's name,
"usage". In encyclopedias, if the domain indicator does in certain cases help to
distinguish between several entries sharing the same headword, the concept
itself has evolved beyond this mere distinction. Looking back at the
*Encyclopédie*, the adjective *raisonné* in the rest of the title directly
introduces a notion of structure that links back to the "Systême figuré des
connoissances humaines" [@blanchard2002, p. 1] whose schematic structure is
shown in Figure @fig:systeme-figure. The authors have devised a branching system
to classify all knowledge, and the occurrence of these indicators at the
beginning of articles, more than a tool to clear up possible ambiguities, also
points the reader to the correct place in this mind map.

)](ressources/arbre.png){width=300px #fig:systeme-figure}
...
@@ -537,17 +607,17 @@ which are in turn generally developed over several paragraphs.
)](ressources/europe_t16.png){#fig:europe}
)](ressources/europe_t16.png){#fig:europe}
The nested structure that we have just evidenced demands of course a nesting
The nested structure that have just been evidenced demands of course a nesting
structure to accomodate it. More precisely it guides our search of XML elements
structure to accomodate it. More precisely, it guides the search of XML elements
by giving us several constraints: we are looking for a pair of elements, the
by adding several constraints: what is required is a pair of elements. The first
first representing a (sub)section must be able to include both itself and the
one representing a (sub)section must be able to include both itself and the
second element, which does not have any special constraint except the one to
second one, which does not have any special constraint except the one to have a
have a semantics compatible with our purpose of using it to represent section
semantics compatible with the purpose of being used to represent section titles.
titles. In addition, the first element must be able to contain several `<p/>`
In addition, the first element must be able to contain several `<p/>` elements,
elements, `<p/>` being the reference element to encode paragraphs according to
`<p/>` being the reference element to encode paragraphs according to the XML-TEI
the XML-TEI documentation.
documentation.
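
These constraints translate directly into a query on the inclusion graph, which
the following sketch illustrates. It rests on several assumptions: the graph is
a small toy excerpt rather than the real schema, the function names
`reachable_within` and `section_candidates` are hypothetical, and the title and
paragraph elements are required here to be direct children of the candidate.

```python
# Toy graph: element name -> elements it may directly contain (assumed excerpt).
GRAPH = {
    "body": {"div", "entry"},
    "div": {"div", "head", "p"},
    "entry": {"sense"},
    "sense": {"sense"},
    "head": set(),
    "p": set(),
}


def reachable_within(graph, source, max_depth):
    """Elements reachable from source using between 1 and max_depth edges."""
    frontier, found = {source}, set()
    for _ in range(max_depth):
        frontier = {child for node in frontier for child in graph.get(node, ())}
        found |= frontier
    return found


def section_candidates(graph, root, title, paragraph, max_depth=1):
    """Elements under root that admit a loop and accept both title and paragraph."""
    return sorted(
        node
        for node in reachable_within(graph, root, max_depth)
        if node in graph.get(node, ())        # may contain itself
        and title in graph.get(node, ())      # may contain a section title
        and paragraph in graph.get(node, ())  # may contain paragraphs
    )


under_body = section_candidates(GRAPH, "body", "head", "p", max_depth=1)
under_entry = section_candidates(GRAPH, "entry", "head", "p", max_depth=3)
```

On this toy excerpt only `div` qualifies, and only when the search starts from
`body`; starting from `entry` returns nothing, even with a larger depth.
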
The *dictionaries* module has been shown to be equipped with a questionable but
possible element for subject domains. However, it does not include any element
for section titles. In the rest of the TEI specification, the elements `<head/>`
and `<title/>` — the latter with the possibility to set its `type` attribute to
...
@@ -562,41 +632,42 @@ article with an `<entryFree/>`, an element supposed to relax some constraint to
accommodate more unusual structure in dictionaries does not bring any
improvement.

The lack of results from these simple queries forces one to somewhat relax the
constraints on the encoding one is willing to use. An intermediate element could
for instance be needed between the element wrapping the whole article and the
recursing one used to encode each section. This "section" element could also
need a companion element to be able to include itself, or, to formalise it in
terms of graph theory, the condition that this element admits a loop could be
relaxed to consider instead cycles of a given (small, since this still needs to
represent a fairly direct inclusion) length to be enough. Simultaneously, the
maximum depth of the inclusion paths between `<entry/>`, the pair of elements
and the `<p/>` element will be increased to yield more results.

By setting this depth to 3, that is, by accepting one intermediate element to
occur in the middle of each one of the inclusion paths that define the structure
required to encode encyclopedic discourse, 21 elements can be found, none of
which stands out as an obvious good solution: all paths to include the `<p/>`
element from any *dictionaries* element contain either a `<figure/>` (already
discussed in section @sec:dictionaries-module when practising the graph approach
to search for inclusions between `<entry/>` and `<entryFree/>` and dismissed as
not useful in general), a `<stage/>` (reserved for stage directions in dramatic
works) or a `<state/>` (used to describe a temporary quality in a person or
place), again not even close to what is wanted. The paths to either `<head/>` or
`<title/>` are similarly disappointing. Again, changing `<entry/>` for
`<entryFree/>` returns the exact same candidates. If that is not a definite
proof that none of these elements could satisfy the investigated criteria, it is
a fact that no element in this module stands out as the obvious good solution
and a serious hint to keep looking somewhere else.

Therefore, the search is extended again to include elements outside the
*dictionaries* module which could be used to encode the sections and
subsections, under the same constraint as before to try and find a composite
solution that would remain under the `<entry/>` element even if resorting to
subcomponents outside of the dedicated module. Only three elements are returned:
`<figure/>`, `<metamark/>` and `<note/>`.

The first one, as has been repeatedly underlined, is meant for graphic
information and is not suitable for text content in general.

The purpose of `<metamark/>` is to transcribe the edition marks that may appear
...
@@ -605,14 +676,14 @@ suggest an alternative reading (deletion, insertion, reordering, this is about a
human editing the text from a given physical copy of it), but it is
unfortunately of no use to encode a section of an article.

The first element that might at least seem acceptable is the last one,
`<note/>`. It is meant to contain text, is about explaining something and seems
general enough (not specific to a given genre, or to the occurrence of a
particular object on the page). Unfortunately, its semantics still seems a bit
off compared to what is required. The documentation describes it as an
"additional comment" which appears "out of the main textual stream" whereas the
long developments in articles are the very matter of the text of encyclopedias,
not mere remarks in the margins or at the foot of pages.

## Encoding within the *core* module {#sec:core-module}
...
@@ -620,63 +691,75 @@ The remarks made in section @sec:dictionaries-module explain why the
*dictionaries* module is unable to represent encyclopedias, where the notion of
"meaning" is less central than in dictionaries and where discourse with nested
structures of arbitrary depth can occur. Even composite encodings using elements
outside of the *dictionaries* module under an `<entry/>` element do not meet the
requirements of the project. Since the *core* module obviously accommodates
these structures by means of the `<div/>`, `<head/>` and `<p/>` elements, which
have the additional advantage of carrying less semantic payload than `<sense/>`
or `<def/>`, these elements will be used to devise an encoding scheme which can
be recommended for other projects aiming at representing encyclopedias.

To remain consistent with the way the *dictionaries* module was studied, only
what happens at the level of each individual article will be considered, that is
right under the `<body/>` element representing a whole volume. Everything
related to its metadata is handled as expected in the file's `<teiHeader/>`,
which is well enough equipped for it. In order to present the scheme throughout
the following section, a reference article, "Cathète" from tome 9 (reproduced in
Figure @fig:cathete-photo), will be progressively encoded.

)](ressources/cathète_t9.png){#fig:cathete-photo}

Remaining within the *core* module for the structure, almost all useful elements
are available and practically no additional documentation is needed beyond the
official TEI guidelines. Each article is represented by a `<div/>`. Setting an
`xml:id` attribute on it with a unique value makes it easier to identify, browse
and retrieve the articles from the encoded corpus. An auto-increasing serial
would of course provide an appropriate value for such a unique attribute but has
some drawbacks: as long as the segmentation into articles isn't fixed (which
could happen if choices regarding entries and sub-entries were to change along a
project or if, as is the case of DISCO-LGE, the automatic segmentation went
through successive improvement steps), the identifiers of articles would change
massively from one version to the other, even for articles segmented correctly.
Given the iterative nature of many studies in digital humanities, this would
make it harder to use results found early in a project. For this reason, the
values used for `xml:id` in project DISCO-LGE depend only on the local quality
of the segmentation and remain globally stable. They are computed as the head
word of the entries normalised to lowercase, stripping spaces and replacing all
non-alphanumerical characters by a dash (`'-'`) to avoid issues with the XML
encoding, and suffixed by a serial to distinguish between the few entries
sharing the same head. Thus, if an oversegmentation or an undersegmentation is
fixed (meaning respectively that two "articles" get merged or that one "article"
actually contained several which get split as such), only articles with the same
headword are impacted. Figure @fig:cathete-xml-0 illustrates this choice for the
container element on the article "Cathète" previously displayed.

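
The computation of identifiers described above can be sketched in a few lines of
Python. This is only an illustration under stated assumptions, not the code used
in the project: the function names are hypothetical, accented letters are
assumed to count as alphanumerical, inner spaces are assumed to become dashes
like any other non-alphanumerical character, and the serial is assumed to start
at `0` and to be separated from the headword by a dash.

```python
from collections import defaultdict


def normalise_headword(headword: str) -> str:
    """Lowercase a headword, strip surrounding spaces and replace every
    non-alphanumerical character with a dash."""
    lowered = headword.strip().lower()
    # Assumption: str.isalnum() is Unicode-aware, so accented letters are kept.
    return "".join(c if c.isalnum() else "-" for c in lowered)


def assign_ids(headwords):
    """Return one identifier per headword, in reading order, suffixed with a
    serial counted separately for each normalised headword."""
    seen = defaultdict(int)
    ids = []
    for headword in headwords:
        base = normalise_headword(headword)
        # Assumption: the serial starts at 0, even for unique headwords.
        ids.append(f"{base}-{seen[base]}")
        seen[base] += 1
    return ids


print(assign_ids(["CATHÈTE", "Cathète", "Saint Louis"]))
```

Because the serial is counted per headword, inserting or removing an article
only shifts the identifiers of articles sharing its headword. Note that an
`xml:id` value must be a valid XML name, so a headword starting with a digit
would need an additional prefix, a case this sketch does not handle.
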
{#fig:cathete-xml-0}

Inside this element should be a `<head/>` enclosing the headword of the article.
The usual sub-`<hi/>` elements are available within `<head/>` if the headword is
highlighted by any special typographic means such as bold, small capitals, etc.

The one disappointment of the encoding scheme being defined in this chapter is
the lack of support for a proper way to encode subject indicators.

The best candidate found so far was `<usg/>` from the *dictionaries* module but
it is not available directly under a `<head/>` element. All inclusion paths from
the latter to the former of length less than or equal to 3 contain irrelevant
elements (`<cit/>`, `<figure/>`, `<castList/>` and `<nym/>`) so it must be
discarded. The next best elements appear to be `<term/>` (not very accurate) and
`<rs/>` ("referring string", quite a general semantics but a possible match
since subject indicators refer to a given domain of knowledge, although all the
examples in the documentation refer to concrete persons, places or objects, not
to the abstract objects that mathematics or poetry are).

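
The search for bounded inclusion paths mentioned above can be illustrated by a
short depth-first enumeration. The sketch below is hypothetical: the function
name is invented for the example and the graph is a toy excerpt written by hand,
not the actual TEI inclusion graph.

```python
def paths_up_to(graph, source, target, max_length):
    """Enumerate all simple paths from source to target made of at most
    max_length edges, using a depth-first search."""
    found = []

    def explore(path):
        node = path[-1]
        if node == target and len(path) > 1:
            found.append(path)
            return
        if len(path) - 1 == max_length:
            return  # the edge budget is spent
        for child in graph.get(node, ()):
            if child not in path:  # never visit an element twice in a path
                explore(path + [child])

    explore([source])
    return found


# Toy data only: an edge a -> b stands for "a may directly contain b".
TOY_INCLUSIONS = {
    "head": ["hi", "cit", "figure"],
    "cit": ["usg", "hi"],
    "figure": ["cit"],
    "hi": ["hi"],
}

paths = paths_up_to(TOY_INCLUSIONS, "head", "usg", 3)
for path in paths:
    print(" > ".join(path))
```

Each returned path lists the intermediate elements an encoder would have to
introduce, which is what makes it possible to judge whether they are relevant.
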
For this reason, no particular encoding of the subject indicator is recommended
and it is left open to each particular context: they are often abbreviated so an
`<abbr/>` may apply; in *La Grande Encyclopédie*, biographies are not labeled by
a knowledge domain but usually include the first name of the person when it is
known so in that case an element like `<persName/>` is still appropriate. This
choice applied to the same article "Cathète" produces Figure @fig:cathete-xml-1.

{#fig:cathete-xml-1}

Each different meaning could then be wrapped in a separate `<div/>` with the
`type` attribute set to `sense` to refer to the `<sense/>` element that would
have been used within the *dictionaries* module. The `<div/>`s should be
numbered according to the order they appear in with the `n` attribute starting
from `0`
...
@@ -711,16 +794,16 @@ Figure @fig:boumerang-photo, which should be encoded the standard way by
{#fig:boumerang-xml}

Another issue arising from giving up on `<entry/>` is the unavailability of the
`<xr/>` element, not allowed under any of the *core* elements used but which is
useful to represent cross-references occurring in encyclopedias as well as in
dictionaries, for example in article "Gelocus" (see Figure @fig:gelocus-photo).
The `<ref/>` element, which is available in the context of a `<p/>`, is
preferred instead. Its `target` attribute should be set to the `xml:id` of the
article it points to, prefixed with a `'#'` as shown in Figure @fig:gelocus-xml.

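
As a minimal sketch of this convention, the following Python snippet builds a
`<p/>` containing a `<ref/>` with the standard library. The helper name, the
placeholder sentence and the identifier `headword-0` are all hypothetical and do
not come from the corpus.

```python
import xml.etree.ElementTree as ET


def make_ref(label: str, target_id: str) -> ET.Element:
    """Build a <ref/> whose target is the given xml:id prefixed with '#'."""
    ref = ET.Element("ref", {"target": f"#{target_id}"})
    ref.text = label
    return ref


# Placeholder content: neither the sentence nor the identifier is real.
p = ET.Element("p")
p.text = "Placeholder text (V. "
ref = make_ref("Headword", "headword-0")
ref.tail = ")."  # text following the reference inside the paragraph
p.append(ref)

xml = ET.tostring(p, encoding="unicode")
print(xml)
```
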
Another solution would have been to introduce a `<dictScrap/>` element for the
sole purpose of placing an `<xr/>` but this would add unwanted verbosity to the
encoding and implicitly suggest that the previous context was not the one of a
dictionary which is rather problematic.

)](ressources/gelocus_t18.png){#fig:gelocus-photo}
...
@@ -739,7 +822,7 @@ the text, like the beginning of a new column of text or of a new page. Figure
@fig:alcala-photo shows the top left of the last page of the first tome of *La
Grande Encyclopédie* which features peritext elements while marking the
beginning of a new page. The usual appropriate elements (`<pb/>` for page
beginning, `<cb/>` for column beginning) may and should be used with this
encoding scheme as demonstrated by Figure @fig:alcala-xml.

)](ressources/last_page_top_left_t1.png){width=350px #fig:alcala-photo}
...
@@ -752,8 +835,8 @@ developed within the scope of project DISCO-LGE to automatically identify
individual articles in the flow of raw text from the columns and to encode them
into XML-TEI files. Though this software has already been used to produce the
first TEI version of *La Grande Encyclopédie*, it does not yet perfectly follow
the specification described in this chapter. Figure @fig:cathete-xml-current
shows the encoded version of article "Cathète" it currently produces:

{#fig:cathete-xml-current}
...
@@ -797,7 +880,7 @@ information (for the second one, adjacent to a notion as elusive as truth)
which requires a very deep understanding of a text in its entirety and about
which even some human experts may disagree.

For these reasons, a central concern in the design of an encoding scheme was to
remain within the boundaries of information that can be described objectively
and extracted automatically by an algorithm. Most of the tags presented in
section @sec:core-module contain information about the positions of the elements
...
@@ -806,30 +889,29 @@ like `<head/>` can be inferred simply from their position and the frequent use
of a special typography like bold or upper-case characters.

The case of cross-references is particular and may appear as a counter-example
to the main principle on which this scheme is based. In fact, the practice of
linking from one article to another is so frequent (in dictionaries as well as
in encyclopedias) that it generally escapes the scope of regular discourse to
take a special and often fixed form, inside parentheses and after a special
token which invites the reader to perform the redirection. In *La Grande
Encyclopédie*, virtually all the redirections appear within parentheses (at
least no counter-example has been found within the scope of the project), and
start with the verb "voir" abbreviated as a single, capital "V." as illustrated
in the article "Gelocus" (see again Figure @fig:gelocus-photo).

Although this has not been implemented yet either, being able to detect and
exploit those patterns to correctly encode cross-references does not pose any
fundamental theoretical problem and should be achievable. Getting the `target`
attributes right is certainly more difficult to achieve and may require
processing the articles in several steps, to first discover all the existing
headwords (and hence article IDs) before trying to match the words following
"V." with them. Since the automated encoder implemented in the project handles
tomes separately and since references may cross the boundaries of tomes, it
cannot wait for the target of a cross-reference to be discovered by keeping the
articles in memory before outputting them.

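
The several-step processing evoked above could look like the following Python
sketch. It is not part of `soprano`, where this feature is not implemented: all
names are hypothetical, the pattern assumes that the target headword is exactly
the text between "V." and the closing parenthesis, and homonyms are arbitrarily
resolved to the first article with that headword.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Assumed fixed form of a cross-reference: "(V. Headword)".
XREF = re.compile(r"\(V\.\s+([^)]+)\)")


def normalise(headword):
    """Same normalisation as the one assumed for xml:id values."""
    lowered = headword.strip().lower()
    return "".join(c if c.isalnum() else "-" for c in lowered)


def collect_ids(headwords):
    """First step: map each normalised headword to its first identifier."""
    index = {}
    for headword in headwords:
        base = normalise(headword)
        index.setdefault(base, f"{base}-0")
    return index


def encode_xrefs(text, index):
    """Second step: wrap known targets in a <ref/> and escape everything
    else, leaving unresolved references as plain text."""
    parts, last = [], 0
    for match in XREF.finditer(text):
        parts.append(escape(text[last:match.start()]))
        label = match.group(1)
        target = index.get(normalise(label))
        if target is None:
            parts.append(escape(match.group(0)))
        else:
            attr = quoteattr("#" + target)
            parts.append(f"(V. <ref target={attr}>{escape(label)}</ref>)")
        last = match.end()
    parts.append(escape(text[last:]))
    return "".join(parts)


index = collect_ids(["Alpha", "Beta"])
print(encode_xrefs("See also (V. Beta) & (V. Gamma).", index))
```

The first step needs the headwords of every tome before the second one can
start, which is precisely what an encoder handling tomes separately cannot do in
a single run.
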
This is in line with the last important aspect of the encoder. Although many
lexicographers may deem this encoding too shallow, it has the advantage of not
requiring overly complex data structures to be kept in memory for a long time.
The algorithm implementing it in `soprano` outputs elements as soon as it can.
This is immediate for simple elements such as `<pb/>` or `<fw/>`; for articles,
it
...
@@ -843,50 +925,55 @@ lowered to around forty minutes on a machine with 16Go of RAM for the whole of
## Comparison to other approaches

The previous section about the structure of the *dictionaries* module and the
features found in encyclopedias reflects the issues which have arisen along the
course of the project while trying to encode first manually and then by
automatic means the articles of its corpus. This back and forth between trying
to find patterns in the graph which reflect the patterns found in the text and
questioning the relevance of the results explains the choice advocated in this
chapter but also the alternatives considered.

Several elements exhibited interesting properties, for instance inclusion paths
corresponding to the structure needed to represent the nested structure of
articles. This is the case for instance of the `<sense/>` and `<note/>`
elements. It is very tempting to bend their documented semantics or to consider
that their inclusion properties are part of what defines them, and hence justify
using them in creative ways not directly recommended by the TEI specifications.

This is the approach followed by project BASNUM (see section
@sec:starting-point). In the articles encoded for this project, `<note/>`
elements are nested and used to structure the encyclopedic developments that
occur in the articles.

For the sake of the FAIR principles, this was not the path chosen by project
DISCO-LGE, in order to avoid the emergence of a custom usage differing from the
one documented in the official guidelines.

The other major reason behind the choice that was ultimately made was the
existing TEI rules governing element inclusions which prompted the search for
different combinations. Another valid approach would have consisted in changing
the structure of the inclusion graph itself, that is to say modifying the rules.
If `<entry/>` is the perfect element to encode articles themselves, all that is
really missing is the ability to accommodate nested structures with the `<div/>`
element. This would also have the advantage of recovering the `<usg/>` and
`<xr/>` elements which appear useful and which are lost as part of the tradeoff
to get nested sections. Generating customised TEI schemas is made really easy
with tools like ROMA ([https://roma.tei-c.org/](https://roma.tei-c.org/)), which
was used to preview this change and suggest it to the TEI community.

Despite it not getting wide support, some suggested it could be used locally
within the scope of project DISCO-LGE. However it was preferred not to do so,
partially for the same reasons of interoperability as the previous scenario, but
also for reasons of robustness against future evolutions. Making sure the
alternative schema would remain useful entails maintaining it, regenerating it
should the schema format evolve, with the risk that the tools to edit it might
stop being maintained or that some conflicts between this change and future
modifications of the official guidelines might arise.

# Conclusion

Though they are very close genres and share a common history, key differences
between dictionaries and encyclopedias have been brought to light. Not only do
entries tend to be longer in encyclopedias, they often have a deeper structure
too. Their purpose also departs from the purpose of dictionaries from their
inception, and, as anticipated by their pioneers, results in a different form of
...
@@ -894,15 +981,16 @@ discourse.
The structure of the XML-TEI *dictionaries* module reflects the assumptions made
by the eponymous genre and does not appear to be flexible enough to accommodate
encyclopedias, despite the colossal effort which has gone into making it
expressive enough for the wide variety of existing dictionaries. Forcing its use
on some encyclopedic articles would breach the semantics of some elements or
require the encoder to break the rules of the consortium's schema which would
result in a less reusable encoding in opposition to the FAIR principles.

An encoding scheme which fully complies with XML-TEI while being able to
represent the content of encyclopedias in all their complexity has been provided
and demonstrated on concrete examples. The tool `soprano`, partially
implementing this set of conventions, demonstrates their practical usefulness.