Commit 6e8e6305 authored by Alice Brenon

Fix some more typos and set a text width on articles pictures

parent c38691ae
...@@ -190,7 +190,7 @@ lifetime may be achieved by a common effort throughout generations. ...@@ -190,7 +190,7 @@ lifetime may be achieved by a common effort throughout generations.
History hints that Diderot's opponents took his defence of the feasability of History hints that Diderot's opponents took his defence of the feasability of
the project quite seriously, considering the fact that they got the *EDdA*'s the project quite seriously, considering the fact that they got the *EDdA*'s
privileges to be revoked again six years after its publication was resumed privileges revoked again six years after its publication was resumed
[@moureau2001]. As a consequence, the remaining ten volumes containing the text [@moureau2001]. As a consequence, the remaining ten volumes containing the text
of the articles had to be published illegally until 1765, thanks to the secret of the articles had to be published illegally until 1765, thanks to the secret
protection of Malesherbes who — despite being head of royal censorship — saved protection of Malesherbes who — despite being head of royal censorship — saved
...@@ -246,7 +246,7 @@ to future scientific projects, which in particular requires it to be ...@@ -246,7 +246,7 @@ to future scientific projects, which in particular requires it to be
*interoperable* and *reusable*. These are the two last key aspects of the FAIR *interoperable* and *reusable*. These are the two last key aspects of the FAIR
([https://www.go-fair.org/fair-principles/](https://www.go-fair.org/fair-principles/)) ([https://www.go-fair.org/fair-principles/](https://www.go-fair.org/fair-principles/))
principles (*findability*, *accessibility*, *interoperability* and principles (*findability*, *accessibility*, *interoperability* and
*reusability*) which are important guideline for efficient, high-quality *reusability*) which are important guidelines for efficient, high-quality
research. This section starts by describing the existing toolset provided by the research. This section starts by describing the existing toolset provided by the
XML-TEI guidelines to achieve this goal, before introducing some notations and XML-TEI guidelines to achieve this goal, before introducing some notations and
tools from graph theory which will be used to browse the guidelines in a tools from graph theory which will be used to browse the guidelines in a
...@@ -261,7 +261,7 @@ and BASNUM ...@@ -261,7 +261,7 @@ and BASNUM
([https://anr.fr/Projet-ANR-18-CE38-0003](https://anr.fr/Projet-ANR-18-CE38-0003)) ([https://anr.fr/Projet-ANR-18-CE38-0003](https://anr.fr/Projet-ANR-18-CE38-0003))
to encode respectively the *Petit Larousse Illustré* published by Pierre to encode respectively the *Petit Larousse Illustré* published by Pierre
Larousse in 1905 [@bohbot2018, p. 1], roughly contemporary to *LGE*, and the Larousse in 1905 [@bohbot2018, p. 1], roughly contemporary to *LGE*, and the
*Dictionnaire Universel* by Furetière, or rather its second edition edited by *Dictionnaire Universel* by Furetière, or rather its second version edited by
Henri Basnage de Beauval, an encyclopedic dictionary from the very early 18^th^ Henri Basnage de Beauval, an encyclopedic dictionary from the very early 18^th^
century [@williams2017, p. 1]. These successes suggested it to be a useful tool century [@williams2017, p. 1]. These successes suggested it to be a useful tool
to encode encyclopedias but a few differences remained between both projects and to encode encyclopedias but a few differences remained between both projects and
...@@ -313,7 +313,7 @@ TEI framework could build. ...@@ -313,7 +313,7 @@ TEI framework could build.
The XML-TEI guidelines graph will hence be defined as follows. One node is The XML-TEI guidelines graph will hence be defined as follows. One node is
created for each one of the 590 elements found in the specification. Then, an created for each one of the 590 elements found in the specification. Then, an
edge is placed between source node `A` and destination `B` if the schema states edge is placed between source node `A` and destination `B` if the schema states
that the element represented by `B` can be contained directly under the element that the element represented by `B` can be contained directly by the element
represented by `A`. That is, the edges in the graph represent the relation "is represented by `A`. That is, the edges in the graph represent the relation "is
an admissible direct parent of" (written infix, as in "A is connected to B" if an admissible direct parent of" (written infix, as in "A is connected to B" if
and only if "A is an admissible direct parent of B"). Please note that the word and only if "A is an admissible direct parent of B"). Please note that the word
...@@ -347,8 +347,8 @@ length of an inclusion path will be called its *depth*. ...@@ -347,8 +347,8 @@ length of an inclusion path will be called its *depth*.
The ability for an element to contain itself corresponds directly to loops on The ability for an element to contain itself corresponds directly to loops on
the graph (that is an edge from a node to itself) as can be illustrated by the the graph (that is an edge from a node to itself) as can be illustrated by the
`<abbr/>` element: an `<abbr/>` element (abbreviation) can directly contain `<entry/>` element on figure \ref{fig:dictionaries-subgraph}: an `<entry/>`
another one. element (abbreviation) can directly contain another one.
The generalisation of this to inclusion paths of any length greater than one is The generalisation of this to inclusion paths of any length greater than one is
usually called a cycle and it appears natural to refine this and name them usually called a cycle and it appears natural to refine this and name them
...@@ -365,7 +365,7 @@ through a `<form/>` or a `<gramGrp/>` because a thorough traversal reporting all ...@@ -365,7 +365,7 @@ through a `<form/>` or a `<gramGrp/>` because a thorough traversal reporting all
the possible paths will contain `entry-form-pos` and `entry-gramGrp-pos`. It is the possible paths will contain `entry-form-pos` and `entry-gramGrp-pos`. It is
left to the human encoder to rate the relevance of the path found and to select left to the human encoder to rate the relevance of the path found and to select
an appropriate one. A total lack of path proves the impossibility of an an appropriate one. A total lack of path proves the impossibility of an
inclusion; an abnormally high length for the shortest path is a serious hint inclusion; an abnormally high depth for the shortest path is a serious hint
that the inclusion should not be possible and is not meaningful. that the inclusion should not be possible and is not meaningful.
Another relevant example of the use of these methods can be given by querying Another relevant example of the use of these methods can be given by querying
...@@ -465,8 +465,8 @@ Secondly, although examples of connections from this module to the rest of the ...@@ -465,8 +465,8 @@ Secondly, although examples of connections from this module to the rest of the
XML-TEI have been evidenced in this section, especially to the *core* module (to XML-TEI have been evidenced in this section, especially to the *core* module (to
which belongs for example the `<ref/>` element), the *dictionaries* module which belongs for example the `<ref/>` element), the *dictionaries* module
appears somewhat isolated from important structural elements like `<head/>` or appears somewhat isolated from important structural elements like `<head/>` or
`<div/>`. Indeed, computing all the paths from either `<entry/>` or `<sense/>` `<div/>`. Indeed, computing all the paths of length shorter or equal to 5 from
elements to the latter of length shorter or equal to 5 by a systematic traversal either `<entry/>` or `<sense/>` elements to the latter by a systematic traversal
of the graph yields exclusively paths (respectively 8 943 and 38 649 of them of the graph yields exclusively paths (respectively 8 943 and 38 649 of them
excluding loops) containing either a `<floatingText/>` or an `<app/>` element. excluding loops) containing either a `<floatingText/>` or an `<app/>` element.
The first one, as its name aptly suggests, is used to encode text that does not The first one, as its name aptly suggests, is used to encode text that does not
...@@ -530,7 +530,7 @@ knowledge, and the occurrence at the beginning of articles, more than a tool to ...@@ -530,7 +530,7 @@ knowledge, and the occurrence at the beginning of articles, more than a tool to
clear up possible ambiguities also points the reader to the correct place in clear up possible ambiguities also points the reader to the correct place in
this mind map. this mind map.
!["Systême figuré des connoissances humaines", the taxonomy at the heart of the Encyclopédie ([Wikimedia Commons](https://commons.wikimedia.org/wiki/File:ENC_SYSTEME_FIGURE.jpeg?uselang=fr#filelinks))](ressources/arbre.png){width=300px #fig:systeme-figure} !["Systême figuré des connoissances humaines", the taxonomy at the heart of the Encyclopédie ([Wikimedia Commons](https://commons.wikimedia.org/wiki/File:ENC_SYSTEME_FIGURE.jpeg?uselang=fr#filelinks))](ressources/arbre.png){#fig:systeme-figure}
The situation regarding subject indicators is hardly better outside of the The situation regarding subject indicators is hardly better outside of the
module. The `<domain/>` element despite its name belongs exclusively in the module. The `<domain/>` element despite its name belongs exclusively in the
...@@ -559,7 +559,7 @@ describing their relation to events and other persons comes out even further ...@@ -559,7 +559,7 @@ describing their relation to events and other persons comes out even further
from the notion of meaning. Entries such as the one about SANJO Sanetomi (see from the notion of meaning. Entries such as the one about SANJO Sanetomi (see
Figure @fig:sanjo) do not constitute a *definition*. Figure @fig:sanjo) do not constitute a *definition*.
![Beginning of the article relating the life of SANJO Sanetomi, in La Grande Encyclopédie, tome 29 ([BnF - Gallica](http://ark.bnf.fr/ark:/12148/cb41651490t))](ressources/sanjo_t29.png){#fig:sanjo} ![Beginning of the article relating the life of SANJO Sanetomi, in La Grande Encyclopédie, tome 29 ([BnF - Gallica](http://ark.bnf.fr/ark:/12148/cb41651490t))](ressources/sanjo_t29.png){#fig:sanjo width=65%}
Moreover, encyclopedias, because of all that they have inherited from the Moreover, encyclopedias, because of all that they have inherited from the
philosophical Enlightenment, are not only spaces designed to assert, they also philosophical Enlightenment, are not only spaces designed to assert, they also
...@@ -568,7 +568,7 @@ basis required to understand the complexity of an issue and invite the reader to ...@@ -568,7 +568,7 @@ basis required to understand the complexity of an issue and invite the reader to
consider it without providing a definitive answer, going as far as to explicitly consider it without providing a definitive answer, going as far as to explicitly
use question marks as in the article "Action" displayed in Figure @fig:action. use question marks as in the article "Action" displayed in Figure @fig:action.
![Excerpt from article "Action", in La Grande Encyclopédie, tome 1 ([BnF - Gallica](http://ark.bnf.fr/ark:/12148/cb41651490t))](ressources/action_t1.png){#fig:action} ![Excerpt from article "Action", in La Grande Encyclopédie, tome 1 ([BnF - Gallica](http://ark.bnf.fr/ark:/12148/cb41651490t))](ressources/action_t1.png){#fig:action width=65%}
In this extract, the author devises a hypothetical situation to illustrate how In this extract, the author devises a hypothetical situation to illustrate how
difficult it is to draw the line between two supposedly mutually exclusive difficult it is to draw the line between two supposedly mutually exclusive
...@@ -579,9 +579,9 @@ idea that the term eludes definition, wrapping it in a `<sense/>`, or worse, a ...@@ -579,9 +579,9 @@ idea that the term eludes definition, wrapping it in a `<sense/>`, or worse, a
As a result, the use of `<sense/>` and `<def/>` is not appropriate for As a result, the use of `<sense/>` and `<def/>` is not appropriate for
encyclopedic content in general. encyclopedic content in general.
The final difficulty can be considered as a partial consequence of the previous The final difficulty can be considered a partial consequence of the previous one
one on the structure of articles. The difficulty to define complex concepts is on the structure of articles. The difficulty to define complex concepts is the
the very reason why authors approach their subjects from various angles, very reason why authors approach their subjects from various angles,
circumnavigating it as a best approximation. This strategy favours long, circumnavigating it as a best approximation. This strategy favours long,
structured developments with sections and subsections covering the multiple structured developments with sections and subsections covering the multiple
aspects of the topic: from a historical, political, scientific point of view… aspects of the topic: from a historical, political, scientific point of view…
...@@ -613,23 +613,22 @@ Filtering the content of the module to keep only the elements which can at the ...@@ -613,23 +613,22 @@ Filtering the content of the module to keep only the elements which can at the
same time contain themselves, be included under `<entry/>` and include a `<p/>` same time contain themselves, be included under `<entry/>` and include a `<p/>`
and either the `<head/>` or `<title/>` elements yields absolutely no candidates. and either the `<head/>` or `<title/>` elements yields absolutely no candidates.
It is remarkable that even replacing the `<entry/>` element for the root of each It is remarkable that even replacing the `<entry/>` element for the root of each
article with an `<entryFree/>`, an element supposed to relax some constraint to article with an `<entryFree/>`, an element supposed to relax the constraints to
accomodate more unusual structure in dictionaries does not bring any accomodate more unusual structures in dictionaries does not bring any
improvement. improvement.
The lack of results from these simple queries forces one to somewhat release the The lack of results from these simple queries forces one to adopt a less
constraints on the encoding one is willing to use. The occurrence of an restrictive approach to find an encoding. The occurrence of an intermediate
intermediate element could for instance be needed between the element wrapping element could for instance be needed between the element wrapping the whole
the whole article and the recursing one used to encode each section. This article and the recursing one used to encode each section. This "section"
"section" element could also need a companion element to be able to include element could also need a companion element to be able to include itself, or, to
itself, or, to formalise it in terms of graph theory, the condition that this formalise it in terms of graph theory, the condition that this element admits a
element admits a loop could be relaxed to consider instead cycles of a given loop could be relaxed to consider instead cycles of a given (small, this still
(small, this still needs to represent a fairly direct inclusion) length to be needs to represent a fairly direct inclusion) length to be enough.
enough. Simultaneously the maximum depth of the inclusion paths between Simultaneously the maximum depth of the inclusion paths between `<entry/>`, the
`<entry/>`, the pair of elements and the `<p/>` element will be increased to pair of elements and the `<p/>` element will be increased to yield more results.
yield more results.
By setting this depth to 2, that is, by accepting one intermediate element to
By setting this depth to 3, that is, by accepting one intermediate element to
occur in the middle of each one of the inclusion paths that define the structure occur in the middle of each one of the inclusion paths that define the structure
required to encode encyclopedic discourse, 21 elements can be found, none of required to encode encyclopedic discourse, 21 elements can be found, none of
which stands out as an obvious good solution: all paths to include the `<p/>` which stands out as an obvious good solution: all paths to include the `<p/>`
...@@ -641,9 +640,9 @@ works) or a `<state/>` (used to describe a temporary quality in a person or ...@@ -641,9 +640,9 @@ works) or a `<state/>` (used to describe a temporary quality in a person or
place), again not even close to what is wanted. The paths to either `<head/>` or place), again not even close to what is wanted. The paths to either `<head/>` or
`<title/>` are similarly disappointing. Again, changing `<entry/>` for `<title/>` are similarly disappointing. Again, changing `<entry/>` for
`<entryFree/>` returns the exact same candidates. If that is not a definite `<entryFree/>` returns the exact same candidates. If that is not a definite
proof that none of these elements could the investigated criteria, it is a fact proof that none of these elements could meet the investigated criteria, it is a
than no element in this module stands out as the obvious good solution and a fact than no element in this module stands out as the obvious good solution and
serious hint to keep looking somewhere else. a serious hint to keep looking somewhere else.
Therefore, the search is extended again to include elements outside the Therefore, the search is extended again to include elements outside the
*dictionaries* module which could be used to encode the sections and *dictionaries* module which could be used to encode the sections and
...@@ -689,7 +688,7 @@ right under the `<body/>` element representing a whole volume. Everything ...@@ -689,7 +688,7 @@ right under the `<body/>` element representing a whole volume. Everything
related to its metadata happens as expected in the file's `<teiHeader/>` which related to its metadata happens as expected in the file's `<teiHeader/>` which
is well-enough equiped to handle them. In order to present the scheme throughout is well-enough equiped to handle them. In order to present the scheme throughout
the following section a reference article, "Cathète" from tome 9 — reproduced in the following section a reference article, "Cathète" from tome 9 — reproduced in
Figure @fig:cathete-photo — will be progressively encoding. Figure @fig:cathete-photo — will be encoded step by step.
![La Grande Encyclopédie, tome 9, article "Cathète" ([BnF - Gallica](http://ark.bnf.fr/ark:/12148/cb41651490t))](ressources/cathète_t9.png){#fig:cathete-photo} ![La Grande Encyclopédie, tome 9, article "Cathète" ([BnF - Gallica](http://ark.bnf.fr/ark:/12148/cb41651490t))](ressources/cathète_t9.png){#fig:cathete-photo}
...@@ -715,12 +714,13 @@ sharing the same head. Thus, if an oversegmentation or a subsegmentation are ...@@ -715,12 +714,13 @@ sharing the same head. Thus, if an oversegmentation or a subsegmentation are
fixed (meaning respectively that two "articles" get fusioned or that one fixed (meaning respectively that two "articles" get fusioned or that one
"article" actually contained several which get split as such) only articles with "article" actually contained several which get split as such) only articles with
the same headword are impacted. Figure @fig:cathete-xml-0 illustrates this the same headword are impacted. Figure @fig:cathete-xml-0 illustrates this
choice for the container element on the article "Cathète" previously displayed. choice for the container element on the article "Cathète" displayed on figure
\ref{fig:cathete-photo}.
![The container `div` element for article "Cathète"](snippets/cathète_0.png){#fig:cathete-xml-0} ![The container `div` element for article "Cathète"](snippets/cathète_0.png){#fig:cathete-xml-0}
Inside this element should be a `<head/>` enclosing the headword of the article. Inside this element should be a `<head/>` enclosing the headword of the article.
The usual sub-`<hi/>` elements are available within `<head/>` if the headword is The usual `<hi/>` elements are available within `<head/>` if the headword is
highlighted by any special typographic means such as bold, small capitals, etc. highlighted by any special typographic means such as bold, small capitals, etc.
The one disappointment of the encoding scheme being defined in this chapter is The one disappointment of the encoding scheme being defined in this chapter is
the lack of support for a proper way to encode subject indicators. the lack of support for a proper way to encode subject indicators.
...@@ -746,9 +746,9 @@ choice applied to the same article "Cathète" produces Figure @fig:cathete-xml-1 ...@@ -746,9 +746,9 @@ choice applied to the same article "Cathète" produces Figure @fig:cathete-xml-1
Each different meaning could then be wrapped in a separate `<div/>` with the Each different meaning could then be wrapped in a separate `<div/>` with the
`type` attribute set to `sense` to refer to the `<sense/>` element that would `type` attribute set to `sense` to refer to the `<sense/>` element that would
have been used within the *core* module. The `<div/>`s should be numbered have been used within the *dictionaries* module. The `<div/>`s should be
according to the order they appear in with the `n` attribute starting from `0` numbered according to the order they appear in with the `n` attribute starting
as shown in Figure @fig:cathete-xml-2. from `0` as shown in Figure @fig:cathete-xml-2.
![The empty structure for the only meaning of the word "Cathète"](snippets/cathète_2.png){#fig:cathete-xml-2} ![The empty structure for the only meaning of the word "Cathète"](snippets/cathète_2.png){#fig:cathete-xml-2}
...@@ -761,7 +761,7 @@ information to reconstruct a faithful facsimile but it also has the advantage of ...@@ -761,7 +761,7 @@ information to reconstruct a faithful facsimile but it also has the advantage of
highlighting the fact than even though the definition is cut from the headword highlighting the fact than even though the definition is cut from the headword
by being in a separate XML element, they still occur on the same line, which is by being in a separate XML element, they still occur on the same line, which is
a typographic choice usually made both in encyclopedias and dictionaries where a typographic choice usually made both in encyclopedias and dictionaries where
space is at a premium. . space is at a premium.
To complete the structure, the various sections and subsections occurring To complete the structure, the various sections and subsections occurring
within the article body may be nested as usual with `<div/>` and sub-`<div/>`s, within the article body may be nested as usual with `<div/>` and sub-`<div/>`s,
...@@ -790,7 +790,7 @@ sole purpose of placing an `<xr/>` but this would add unwanted verbosity to the ...@@ -790,7 +790,7 @@ sole purpose of placing an `<xr/>` but this would add unwanted verbosity to the
encoding and implicitly suggest that the previous context was not the one of a encoding and implicitly suggest that the previous context was not the one of a
dictionary which is rather problematic. dictionary which is rather problematic.
![La Grande Encyclopédie, tome 18, article "Gelocus" ([BnF - Gallica](http://ark.bnf.fr/ark:/12148/cb41651490t))](ressources/gelocus_t18.png){#fig:gelocus-photo} ![La Grande Encyclopédie, tome 18, article "Gelocus" ([BnF - Gallica](http://ark.bnf.fr/ark:/12148/cb41651490t))](ressources/gelocus_t18.png){#fig:gelocus-photo width=65%}
![Encoding the cross-references in article "Gelocus"](snippets/gelocus.png){#fig:gelocus-xml} ![Encoding the cross-references in article "Gelocus"](snippets/gelocus.png){#fig:gelocus-xml}
...@@ -836,27 +836,24 @@ entry's `<div/>` element instead of under a set of nested `<div/>` elements. The ...@@ -836,27 +836,24 @@ entry's `<div/>` element instead of under a set of nested `<div/>` elements. The
paragraphs are not yet identified and for this reason not encoded. paragraphs are not yet identified and for this reason not encoded.
However, the figures and their captions are already handled correctly when they However, the figures and their captions are already handled correctly when they
occur. The encoder also keeps track of the current lines, pages, and columns and occur. The encoder also keeps track of the current lines, pages, and columns to
inserts the corresponding empty elements (`<lb/>`, `<pb/>` or `<cb/>`) and insert the corresponding empty elements (`<lb/>`, `<pb/>` or `<cb/>`) and number
numbers pages so that the numbering corresponding to the physical pages are pages according to the order of the physical pages in the book, as compared to
available, as compared to the "high-level" pages numbers inserted by the the "high-level" pages numbers inserted by the editors, which start with an
editors, which start with an offset because the first, blank or almost empty offset because the first, blank or almost empty pages at the beginning of each
pages at the beginning of each book do not have a number and which sometimes have book do not have a number and which sometimes have gaps when a full-page
gaps when a full-page geographical map is inserted since those are printed geographical map is inserted since those are printed separately on a different
separately on a different folio which remains outside of the textual numbering folio which remains outside of the textual numbering system. The place at which
system. The place at which these layout-related elements occur is determined by these layout-related elements occur is determined by the place where the OCR
the place where the OCR software detected them and by the reordering performed software detected them and by the reordering performed by `soprano` when
by `soprano` when inferring the reading order before segmenting the articles. inferring the reading order before segmenting the articles.
## The constraints of automated processing ## The constraints of automated processing
Encyclopedias are particularly long books, spanning numerous tomes and Encyclopedias are particularly long books, spanning numerous tomes and
containing several tenths of thousands of articles. The *EDdA* comprises containing several tenths of thousands of articles. The *EDdA* comprises over
over 74k articles and *LGE* certainly more than 100k (the latest 74k articles and *LGE* certainly more than 100k (the latest version produced by
version produced by `soprano` created 160k articles, but their segmentation is `soprano` created 160k articles, but their segmentation is still not perfect).
still not perfect and if some article beginning remain undetected, all the very
long and deeply-structured articles are unduly split into many parts, resulting
globally in an overestimation of the total number).
XML-TEI is a very broad tool useful for very different applications. Some XML-TEI is a very broad tool useful for very different applications. Some
elements like `<unclear/>` or `<factuality/>` can encode subtle semantics elements like `<unclear/>` or `<factuality/>` can encode subtle semantics
......