From 6e8e63050ff6e69afe62d520406a64093dbdfcc5 Mon Sep 17 00:00:00 2001 From: Alice BRENON <alice.brenon@ens-lyon.fr> Date: Tue, 23 Jul 2024 20:05:32 +0200 Subject: [PATCH] Fix some more typos and set a text width on articles pictures --- ICHLL_Brenon.md | 113 +++++++++++++++++++++++------------------------- 1 file changed, 55 insertions(+), 58 deletions(-) diff --git a/ICHLL_Brenon.md b/ICHLL_Brenon.md index 7afe17c..9628489 100644 --- a/ICHLL_Brenon.md +++ b/ICHLL_Brenon.md @@ -190,7 +190,7 @@ lifetime may be achieved by a common effort throughout generations. History hints that Diderot's opponents took his defence of the feasability of the project quite seriously, considering the fact that they got the *EDdA*'s -privileges to be revoked again six years after its publication was resumed +privileges revoked again six years after its publication was resumed [@moureau2001]. As a consequence, the remaining ten volumes containing the text of the articles had to be published illegally until 1765, thanks to the secret protection of Malesherbes who — despite being head of royal censorship — saved @@ -246,7 +246,7 @@ to future scientific projects, which in particular requires it to be *interoperable* and *reusable*. These are the two last key aspects of the FAIR ([https://www.go-fair.org/fair-principles/](https://www.go-fair.org/fair-principles/)) principles (*findability*, *accessibility*, *interoperability* and -*reusability*) which are important guideline for efficient, high-quality +*reusability*) which are important guidelines for efficient, high-quality research. This section starts by describing the existing toolset provided by the XML-TEI guidelines to achieve this goal, before introducing some notations and tools from graph theory which will be used to browse the guidelines in a @@ -261,7 +261,7 @@ and BASNUM ([https://anr.fr/Projet-ANR-18-CE38-0003](https://anr.fr/Projet-ANR-18-CE38-0003)) to encode respectively the *Petit Larousse Illustré* published by Pierre Larousse in 1905 [@bohbot2018, p. 1], roughly contemporary to *LGE*, and the -*Dictionnaire Universel* by Furetière, or rather its second edition edited by +*Dictionnaire Universel* by Furetière, or rather its second version edited by Henri Basnage de Beauval, an encyclopedic dictionary from the very early 18^th^ century [@williams2017, p. 1]. These successes suggested it to be a useful tool to encode encyclopedias but a few differences remained between both projects and @@ -313,7 +313,7 @@ TEI framework could build. The XML-TEI guidelines graph will hence be defined as follows. One node is created for each one of the 590 elements found in the specification. Then, an edge is placed between source node `A` and destination `B` if the schema states -that the element represented by `B` can be contained directly under the element +that the element represented by `B` can be contained directly by the element represented by `A`. That is, the edges in the graph represent the relation "is an admissible direct parent of" (written infix, as in "A is connected to B" if and only if "A is an admissible direct parent of B"). Please note that the word @@ -347,8 +347,8 @@ length of an inclusion path will be called its *depth*. The ability for an element to contain itself corresponds directly to loops on the graph (that is an edge from a node to itself) as can be illustrated by the -`<abbr/>` element: an `<abbr/>` element (abbreviation) can directly contain -another one. +`<entry/>` element on figure \ref{fig:dictionaries-subgraph}: an `<entry/>` +element (abbreviation) can directly contain another one. The generalisation of this to inclusion paths of any length greater than one is usually called a cycle and it appears natural to refine this and name them @@ -365,7 +365,7 @@ through a `<form/>` or a `<gramGrp/>` because a thorough traversal reporting all the possible paths will contain `entry-form-pos` and `entry-gramGrp-pos`. It is left to the human encoder to rate the relevance of the path found and to select an appropriate one. A total lack of path proves the impossibility of an -inclusion; an abnormally high length for the shortest path is a serious hint +inclusion; an abnormally high depth for the shortest path is a serious hint that the inclusion should not be possible and is not meaningful. Another relevant example of the use of these methods can be given by querying @@ -465,8 +465,8 @@ Secondly, although examples of connections from this module to the rest of the XML-TEI have been evidenced in this section, especially to the *core* module (to which belongs for example the `<ref/>` element), the *dictionaries* module appears somewhat isolated from important structural elements like `<head/>` or -`<div/>`. Indeed, computing all the paths from either `<entry/>` or `<sense/>` -elements to the latter of length shorter or equal to 5 by a systematic traversal +`<div/>`. Indeed, computing all the paths of length shorter or equal to 5 from +either `<entry/>` or `<sense/>` elements to the latter by a systematic traversal of the graph yields exclusively paths (respectively 8 943 and 38 649 of them excluding loops) containing either a `<floatingText/>` or an `<app/>` element. The first one, as its name aptly suggests, is used to encode text that does not @@ -530,7 +530,7 @@ knowledge, and the occurrence at the beginning of articles, more than a tool to clear up possible ambiguities also points the reader to the correct place in this mind map. -)](ressources/arbre.png){width=300px #fig:systeme-figure} +)](ressources/arbre.png){#fig:systeme-figure} The situation regarding subject indicators is hardly better outside of the module. The `<domain/>` element despite its name belongs exclusively in the @@ -559,7 +559,7 @@ describing their relation to events and other persons comes out even further from the notion of meaning. Entries such as the one about SANJO Sanetomi (see Figure @fig:sanjo) do not constitute a *definition*. -)](ressources/sanjo_t29.png){#fig:sanjo} +)](ressources/sanjo_t29.png){#fig:sanjo width=65%} Moreover, encyclopedias, because of all that they have inherited from the philosophical Enlightenment, are not only spaces designed to assert, they also @@ -568,7 +568,7 @@ basis required to understand the complexity of an issue and invite the reader to consider it without providing a definitive answer, going as far as to explicitly use question marks as in the article "Action" displayed in Figure @fig:action. -)](ressources/action_t1.png){#fig:action} +)](ressources/action_t1.png){#fig:action width=65%} In this extract, the author devises a hypothetical situation to illustrate how difficult it is to draw the line between two supposedly mutually exclusive @@ -579,9 +579,9 @@ idea that the term eludes definition, wrapping it in a `<sense/>`, or worse, a As a result, the use of `<sense/>` and `<def/>` is not appropriate for encyclopedic content in general. -The final difficulty can be considered as a partial consequence of the previous -one on the structure of articles. The difficulty to define complex concepts is -the very reason why authors approach their subjects from various angles, +The final difficulty can be considered a partial consequence of the previous one +on the structure of articles. The difficulty to define complex concepts is the +very reason why authors approach their subjects from various angles, circumnavigating it as a best approximation. This strategy favours long, structured developments with sections and subsections covering the multiple aspects of the topic: from a historical, political, scientific point of view… @@ -613,23 +613,22 @@ Filtering the content of the module to keep only the elements which can at the same time contain themselves, be included under `<entry/>` and include a `<p/>` and either the `<head/>` or `<title/>` elements yields absolutely no candidates. It is remarkable that even replacing the `<entry/>` element for the root of each -article with an `<entryFree/>`, an element supposed to relax some constraint to -accomodate more unusual structure in dictionaries does not bring any +article with an `<entryFree/>`, an element supposed to relax the constraints to +accomodate more unusual structures in dictionaries does not bring any improvement. -The lack of results from these simple queries forces one to somewhat release the -constraints on the encoding one is willing to use. The occurrence of an -intermediate element could for instance be needed between the element wrapping -the whole article and the recursing one used to encode each section. This -"section" element could also need a companion element to be able to include -itself, or, to formalise it in terms of graph theory, the condition that this -element admits a loop could be relaxed to consider instead cycles of a given -(small, this still needs to represent a fairly direct inclusion) length to be -enough. Simultaneously the maximum depth of the inclusion paths between -`<entry/>`, the pair of elements and the `<p/>` element will be increased to -yield more results. - -By setting this depth to 3, that is, by accepting one intermediate element to +The lack of results from these simple queries forces one to adopt a less +restrictive approach to find an encoding. The occurrence of an intermediate +element could for instance be needed between the element wrapping the whole +article and the recursing one used to encode each section. This "section" +element could also need a companion element to be able to include itself, or, to +formalise it in terms of graph theory, the condition that this element admits a +loop could be relaxed to consider instead cycles of a given (small, this still +needs to represent a fairly direct inclusion) length to be enough. +Simultaneously the maximum depth of the inclusion paths between `<entry/>`, the +pair of elements and the `<p/>` element will be increased to yield more results. + +By setting this depth to 2, that is, by accepting one intermediate element to occur in the middle of each one of the inclusion paths that define the structure required to encode encyclopedic discourse, 21 elements can be found, none of which stands out as an obvious good solution: all paths to include the `<p/>` @@ -641,9 +640,9 @@ works) or a `<state/>` (used to describe a temporary quality in a person or place), again not even close to what is wanted. The paths to either `<head/>` or `<title/>` are similarly disappointing. Again, changing `<entry/>` for `<entryFree/>` returns the exact same candidates. If that is not a definite -proof that none of these elements could the investigated criteria, it is a fact -than no element in this module stands out as the obvious good solution and a -serious hint to keep looking somewhere else. +proof that none of these elements could meet the investigated criteria, it is a +fact than no element in this module stands out as the obvious good solution and +a serious hint to keep looking somewhere else. Therefore, the search is extended again to include elements outside the *dictionaries* module which could be used to encode the sections and @@ -689,7 +688,7 @@ right under the `<body/>` element representing a whole volume. Everything related to its metadata happens as expected in the file's `<teiHeader/>` which is well-enough equiped to handle them. In order to present the scheme throughout the following section a reference article, "Cathète" from tome 9 — reproduced in -Figure @fig:cathete-photo — will be progressively encoding. +Figure @fig:cathete-photo — will be encoded step by step. )](ressources/cathète_t9.png){#fig:cathete-photo} @@ -715,12 +714,13 @@ sharing the same head. Thus, if an oversegmentation or a subsegmentation are fixed (meaning respectively that two "articles" get fusioned or that one "article" actually contained several which get split as such) only articles with the same headword are impacted. Figure @fig:cathete-xml-0 illustrates this -choice for the container element on the article "Cathète" previously displayed. +choice for the container element on the article "Cathète" displayed on figure +\ref{fig:cathete-photo}. {#fig:cathete-xml-0} Inside this element should be a `<head/>` enclosing the headword of the article. -The usual sub-`<hi/>` elements are available within `<head/>` if the headword is +The usual `<hi/>` elements are available within `<head/>` if the headword is highlighted by any special typographic means such as bold, small capitals, etc. The one disappointment of the encoding scheme being defined in this chapter is the lack of support for a proper way to encode subject indicators. @@ -746,9 +746,9 @@ choice applied to the same article "Cathète" produces Figure @fig:cathete-xml-1 Each different meaning could then be wrapped in a separate `<div/>` with the `type` attribute set to `sense` to refer to the `<sense/>` element that would -have been used within the *core* module. The `<div/>`s should be numbered -according to the order they appear in with the `n` attribute starting from `0` -as shown in Figure @fig:cathete-xml-2. +have been used within the *dictionaries* module. The `<div/>`s should be +numbered according to the order they appear in with the `n` attribute starting +from `0` as shown in Figure @fig:cathete-xml-2. {#fig:cathete-xml-2} @@ -761,7 +761,7 @@ information to reconstruct a faithful facsimile but it also has the advantage of highlighting the fact than even though the definition is cut from the headword by being in a separate XML element, they still occur on the same line, which is a typographic choice usually made both in encyclopedias and dictionaries where -space is at a premium. . +space is at a premium. To complete the structure, the various sections and subsections occurring within the article body may be nested as usual with `<div/>` and sub-`<div/>`s, @@ -790,7 +790,7 @@ sole purpose of placing an `<xr/>` but this would add unwanted verbosity to the encoding and implicitly suggest that the previous context was not the one of a dictionary which is rather problematic. -)](ressources/gelocus_t18.png){#fig:gelocus-photo} +)](ressources/gelocus_t18.png){#fig:gelocus-photo width=65%} {#fig:gelocus-xml} @@ -836,27 +836,24 @@ entry's `<div/>` element instead of under a set of nested `<div/>` elements. The paragraphs are not yet identified and for this reason not encoded. However, the figures and their captions are already handled correctly when they -occur. The encoder also keeps track of the current lines, pages, and columns and -inserts the corresponding empty elements (`<lb/>`, `<pb/>` or `<cb/>`) and -numbers pages so that the numbering corresponding to the physical pages are -available, as compared to the "high-level" pages numbers inserted by the -editors, which start with an offset because the first, blank or almost empty -pages at the beginning of each book do not have a number and which sometimes have -gaps when a full-page geographical map is inserted since those are printed -separately on a different folio which remains outside of the textual numbering -system. The place at which these layout-related elements occur is determined by -the place where the OCR software detected them and by the reordering performed -by `soprano` when inferring the reading order before segmenting the articles. +occur. The encoder also keeps track of the current lines, pages, and columns to +insert the corresponding empty elements (`<lb/>`, `<pb/>` or `<cb/>`) and number +pages according to the order of the physical pages in the book, as compared to +the "high-level" pages numbers inserted by the editors, which start with an +offset because the first, blank or almost empty pages at the beginning of each +book do not have a number and which sometimes have gaps when a full-page +geographical map is inserted since those are printed separately on a different +folio which remains outside of the textual numbering system. The place at which +these layout-related elements occur is determined by the place where the OCR +software detected them and by the reordering performed by `soprano` when +inferring the reading order before segmenting the articles. ## The constraints of automated processing Encyclopedias are particularly long books, spanning numerous tomes and -containing several tenths of thousands of articles. The *EDdA* comprises -over 74k articles and *LGE* certainly more than 100k (the latest -version produced by `soprano` created 160k articles, but their segmentation is -still not perfect and if some article beginning remain undetected, all the very -long and deeply-structured articles are unduly split into many parts, resulting -globally in an overestimation of the total number). +containing several tenths of thousands of articles. The *EDdA* comprises over +74k articles and *LGE* certainly more than 100k (the latest version produced by +`soprano` created 160k articles, but their segmentation is still not perfect). XML-TEI is a very broad tool useful for very different applications. Some elements like `<unclear/>` or `<factuality/>` can encode subtle semantics -- GitLab