Restructure repos to make it more modular, add filter to handle bibliography...

Restructure repos to make it more modular, add filter to handle bibliography and a weird .SECONDEXPANSION trick to the Makefile

Commit bd86a2db authored 2 years ago by Alice Brenon
--- a/Bibliographie.md
+++ b/Bibliographie.md
-# Bibliography
--- a/Conclusion.md
+++ b/Conclusion.md
-# Conclusion {-}
-
 ## Regrets

-## Wishes
+
--- a/Conclusion/Souhaits.md
+++ b/Conclusion/Souhaits.md
+## Wishes
+
--- a/Conclusion/text.sh
+++ b/Conclusion/text.sh
+#!/bin/bash
+
+# `source` (and passing arguments to it) is a bash feature, so run under bash.
+source ./chapter.sh 'Conclusion {-}'
+
+cat Conclusion/Regrets.md
+cat Conclusion/Souhaits.md
--- a/Contrastes/Centralité.md
+++ b/Contrastes/Centralité.md
+## Statistics
+
+### Centrality measure
+
+(DKE)
+
+
--- a/Contrastes.md
+++ b/Contrastes.md
-
-# Études contrastives
-
 ## Lexico-grammatical analysis (Lexicometry, Textometry, ?…)

 ### Internal contrasts
@@ -19,11 +16,4 @@ Np vs. Nc

 #### Preferred adjectives

-## Phraseology, disciplinary discourse and Recurring Lexico-syntactic Trees
-
-## Statistics
-
-### Centrality measure
-
-(DKE)

--- a/Contrastes/Phraséologie.md
+++ b/Contrastes/Phraséologie.md
+## Phraseology, disciplinary discourse and Recurring Lexico-syntactic Trees
+
+
--- a/Contrastes/text.sh
+++ b/Contrastes/text.sh
+#!/bin/bash
+
+# `source` (and passing arguments to it) is a bash feature, so run under bash.
+source ./chapter.sh 'Études contrastives'
+
+cat Contrastes/Lexicométrie.md
+cat Contrastes/Phraséologie.md
+cat Contrastes/Centralité.md
--- a/Corpus/Annotation.md
+++ b/Corpus/Annotation.md
+## Part-of-speech and syntactic annotation
+
+### Tagset
+
+We use the [tagset]() of the
+[PRESTO](http://presto.ens-lyon.fr/) project
+
+Actually, Stanza works well too, with the
+[UPOS](https://universaldependencies.org/docs/u/pos/) tags
+
+### Processing pipelines
+
+- PRESTO
+- Stanza
+
+
+
--- a/Corpus.md
+++ b/Corpus.md
-# Preparing and enriching the corpus
-
-## Formats and states of the texts
-
-### L'Encyclopédie
-
-In common parlance, the terms "dictionaries" and "encyclopedias" are used as
-near synonyms to refer to books compiling vast amounts of knowledge into lists
-of definitions ordered alphabetically. Their similarity is even visible in the
-way they are coordinated in the full title of the *Encyclopédie* which is
-probably the most famous work of the genre and a symbol of the Age of
-Enlightenment. If the word "encyclopedia" is nowadays part of our vocabulary, it
-was much more unusual and in fact controversial when Diderot and d'Alembert
-decided to use it in the title of their book.
-
-The definition given by Furetière in his *Dictionnaire Universel* in 1690 is
-still close to its Greek etymology: a "ring of all knowledges", from *κύκλος*,
-"circle", and *παιδεία*, "knowledge". This meaning is the one used for instance
-by Rabelais in *Pantagruel*, when he has Thaumaste declare that Panurge opened
-to him "le vray puys et abisme de Encyclopedie" ("the true well and abyss of
-Encyclopedia"). At the time the word still mostly refers to the abstract concept
-of mastering all knowledge at once. Furetière adds that it's a quality one
-is unlikely to possess, and even seems to condemn its search as a form of
-hubris: "C'est une témérité à un homme de vouloir posséder l'Encyclopédie"
-("it is a recklessness for a man to want to possess Encyclopedia").
-
-Beyond this moral reproach, the concept that pleased Rabelais was somewhat dated
-at the end of the 17^th^ century and attacked in the
-*Dictionnaire Universel François et Latin*, commonly referred to as the
-*Dictionnaire de Trevoux*, as utterly "burlesque" ("parodic"). The entry for
-"Encyclopédie" remained unchanged in the four editions issued between 1721 and
-1752, mocking the use of the word and discouraging readers from pursuing it. To
-that end, it quotes a poem by Pibrac encouraging people to specialise in
-only one discipline lest they fail to reach perfection, based on an
-argumentation that resembles the saying "Jack of all trades, master of none". It
-is all the more interesting that the definition remains unaltered until 1752,
-one year after the publication of the first volume of the *Encyclopédie*. The
-Jesuits who edited the *Dictionnaire de Trevoux* frowned upon the project of the
-*Encyclopédie* which they managed to get banned the same year by the Council of
-State on the charge of attempting to destroy the royal authority, inspiring
-rebellion and corrupting morality in general. There is much more at stake than
-words here, but the attempt to deprecate the word itself is part of their fight
-against the philosophers of the Enlightenment.
-
-The attacks did not go unanswered by Diderot, who opens the very definition of
-the word "Encyclopédie" in the *Encyclopédie* itself with a strong rebuttal. He
-directly dismisses the concerns expressed in the *Dictionnaire de Trevoux* as
-mere self-doubt that its authors should not project onto everyone, then leaves
-the main point to a Latin quote by Chancellor Bacon [@lojkine2013], who argues
-that a collaborative work can achieve much more than any talented man could:
-what could not possibly be within reach of a single man, within a single
-lifetime, may be achieved by a common effort throughout generations.
-
-History hints that Diderot's opponents took his defence of the feasibility of
-the project quite seriously, considering the fact that they had the
-*Encyclopédie*'s privileges revoked again six years after its publication
-was resumed [@moureau2001]. As a consequence, the remaining ten volumes
-containing the text of the articles had to be published illegally until 1765,
-thanks to the secret protection of Malesherbes who — despite being head of royal
-censorship — saved the manuscripts from destruction. They were printed secretly
-outside of Paris and the books were (falsely) labeled as coming from Neufchâtel.
-Following the high demand from the booksellers who feared they would lose the
-money they had invested in the project, a special privilege was issued for the
-volumes containing the plates, which were released publicly from 1762 to 1772.
-
-In any case, in their last edition in 1771 the authors of the *Dictionnaire de
-Trevoux* had no choice but to acknowledge the success of the encyclopedic
-projects of the 18^th^ century. In this version, the definition
-was entirely reworked, mildly stating that good encyclopedias are difficult to
-make because of the amount of knowledge necessary and work needed to keep up
-with scientific progress instead of calling the effort a parody. It credits
-Chambers's *Cyclopædia* for being a decent attempt before referring anonymously
-though quite explicitly to Diderot and d'Alembert's project by naming the
-collective "Une Société de gens de Lettres" and writing that it started in 1751.
-Even more importantly, two new entries were added after it: one for the
-adjective "encyclopédique" and another one for the noun "encyclopédiste",
-silently admitting how the project had changed its time and the relation to
-knowledge itself.
-
-#### Context of the work
-
-#### Available versions
-
-The ARTFL[^ARTFL] provides one version.
-
-[^ARTFL]: [http://artfl-project.uchicago.edu/](http://artfl-project.uchicago.edu/)
-
-#### Processing
-
-### La Grande Encyclopédie
-
-#### Context of the work
-
-*La Grande Encyclopédie, Inventaire raisonné des Sciences, des Lettres et des
-Arts par une Société de savants et de gens de lettres* (hereafter *LGE*) was
-published in France between 1885 and 1902 by a team of more than two hundred
-specialists organised into eleven sections. The text comprises 31 volumes of
-about 1200 pages each and was, according to @jacquet_pfau2015, the last major
-French encyclopedic undertaking to follow in the footsteps of its prestigious
-ancestor, the *EDdA*, published about 130 years earlier.
-
-The full title of the work already shows its intent to claim descent from the
-*EDdA* and to update it [@jacquet_pfau_actualiser_2022].
-
-#### Available versions
-
-A digital version of this work was produced by the BnF and put
-online[^LGE-V1] in 2007. Based on an undated reprint of the original edition,
-it comprises one image per page of the work, digitised in greyscale at a
-resolution of 300x300 pixels per inch. From these image files a partial version
-of the text was extracted by applying optical character recognition
-([@=OCR]) software. This version has a number of limitations which prevented a
-comprehensive study of the text by automatic means such as textometry.
-
-[^LGE-V1]: [http://catalogue.bnf.fr/ark:/12148/cb377013071](http://catalogue.bnf.fr/ark:/12148/cb377013071)
-
-First, the text is incomplete: although a large number of volumes have been
-OCRised, some, such as volumes 5 and 18, have not been processed and no text is
-available for them on the Gallica website[^LGE-V1-liste-des-tomes]. This makes
-an exhaustive study impossible, but nothing suggests that a study based on the
-available volumes would be subject to any particular bias, since the OCRised
-volumes do not seem to have been chosen according to any specific logic: the
-missing ones are not, for instance, contiguous, nor located at the beginning or
-the end of the work. Next, this "plain text" version (actually an HTML page
-with minimal markup) carries only very superficial annotation and in
-particular is not segmented into articles. This is a major obstacle to the use
-we want to make of it, since the unit of study of our corpus is the article,
-which makes it possible both to carry out a contrastive study by grouping
-articles by domain of knowledge or by author and to observe the structure of
-the domains by comparing, between two encyclopedias, which articles have been
-kept or not and, where applicable, whether the domain of knowledge associated
-with them is the same. Finally, errors in the detection of the page layout
-([@=OLR]) significantly obscure the text by causing local permutations of its
-content, which sometimes mix pieces of articles together (considerably
-complicating the segmentation of the text into articles) and in all cases
-damage the structure of sentences, which introduces errors into the later
-stages of morpho-syntactic and syntactic annotation that we need to apply to
-the text in order to do textometry.
-
-[^LGE-V1-liste-des-tomes]: [https://gallica.bnf.fr/ark:/12148/bpt6k246407#](https://gallica.bnf.fr/ark:/12148/bpt6k246407#)
-
-To remedy these shortcomings, the CollEx Persée DISCO-LGE[^DISCO-LGE] project
-undertook to redigitise this encyclopedia in partnership with the BnF and to
-produce a version encoded in XML-TEI. This new version was made from
-photographs of an original copy[^LGE-V2] held at the Bibliothèque de l'Arsenal
-in Paris[^bib-arsenal].
-
-[^DISCO-LGE]: [https://www.collexpersee.eu/projet/disco-lge/](https://www.collexpersee.eu/projet/disco-lge/)
-[^LGE-V2]: [https://catalogue.bnf.fr/ark:/12148/cb41651490t](https://catalogue.bnf.fr/ark:/12148/cb41651490t)
-[^bib-arsenal]: [https://www.bnf.fr/fr/arsenal](https://www.bnf.fr/fr/arsenal)
-
-This project ended in August 2021 and resulted in the publication on
-Nakala[^nakala], the data repository of the Huma-Num very large research
-infrastructure, of a new version of the work in several formats.
-
-[^nakala]: [https://nakala.fr/](https://nakala.fr/)
-
-#### Encoding
-
-##### Structure of the *dictionaries* module
-
-**Definitions**
-
-By iterating several times the operation of moving on that graph along one edge,
-that is, by considering the transitive closure of the relation "be connected by
-an edge", we define *inclusion paths* which allow us to explore which elements
-may be nested under which others.
-
-The nodes visited along the way represent the intermediate XML elements to
-construct a valid XML tree according to the TEI schema. Given the top-down
-semantics of those trees, we call the length of an inclusion path its *depth*.
-
-The ability for an element to contain itself corresponds directly to loops on
-the graph (that is an edge from a node to itself) as can be illustrated by the
-`<abbr/>` element: an `<abbr/>` element (abbreviation) can directly contain
-another one.
-
-The generalisation of this to inclusion paths of any length greater than one is
-usually called a cycle and we may be tempted in our context to refine this and
-name them *inclusion cycles*. The `<address/>` element provides us with an
-example for this configuration: although an `<address/>` element may not
-directly contain another one, it may contain a `<geogName/>` which, in turn, may
-contain a new `<address/>` element. From a graph theory perspective, we can say
-that it admits an inclusion cycle of length two.
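
The definitions above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the graph is a toy excerpt holding just the three inclusions named in the text, and the function names `has_loop` and `shortest_cycle_length` are hypothetical, not taken from the author's tooling.

```python
from collections import deque

# Toy excerpt of the inclusion graph: an edge a -> b means that <a/> may
# directly contain <b/>. Only the inclusions named in the text are listed.
INCLUSIONS = {
    "abbr": {"abbr"},
    "address": {"geogName"},
    "geogName": {"address"},
}


def has_loop(graph, element):
    """True when the element may directly contain itself."""
    return element in graph.get(element, set())


def shortest_cycle_length(graph, element):
    """Length of the shortest inclusion path from element back to itself.

    Returns None when the element admits no inclusion cycle at all.
    A loop is reported as a cycle of length 1.
    """
    queue = deque((child, 1) for child in graph.get(element, set()))
    seen = set()
    while queue:
        node, depth = queue.popleft()
        if node == element:
            return depth
        if node in seen:
            continue
        seen.add(node)
        queue.extend((child, depth + 1) for child in graph.get(node, set()))
    return None
```

With this excerpt, `has_loop(INCLUSIONS, "abbr")` holds, while `shortest_cycle_length(INCLUSIONS, "address")` finds the cycle of length two through `geogName`.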
-
-**Applications**
-
-Using classical, well-known methods such as Dĳkstra's algorithm [@dĳkstra59]
-allows us to explore the shortest inclusion paths that exist between elements.
-Though a particular caution should be applied because there is no guarantee that
-the shortest path is meaningful in general, it at least provides us with an
-efficient way to check whether a given element may or may not be nested at all
-under another one and gives a lower bound on the length of the path to expect.
-Of course the accuracy of this heuristic decreases as the length of the
-intended, meaningful path between two nodes, the one a human specialist of the
-TEI framework would build, increases.
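
As a minimal sketch of such a shortest-path query: since every inclusion edge has the same cost, a breadth-first search returns the same result as Dijkstra's algorithm and is used here for brevity. The graph is a toy excerpt limited to inclusions mentioned in this section, and `shortest_inclusion_path` is a hypothetical name.

```python
from collections import deque

# Toy excerpt: an edge a -> b means that <a/> may directly contain <b/>.
TOY_GRAPH = {
    "body": ["entry", "entryFree"],
    "entry": ["form", "gramGrp"],
    "entryFree": ["pos"],
    "form": ["pos"],
    "gramGrp": ["pos"],
    "pos": [],
}


def shortest_inclusion_path(graph, source, target):
    """Return one shortest inclusion path as a list of elements, or None."""
    if source == target:
        return [source]
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, []):
            if child in parents:
                continue
            parents[child] = node
            if child == target:
                # Walk back up the parent links to rebuild the path.
                path = [child]
                while parents[path[-1]] is not None:
                    path.append(parents[path[-1]])
                return path[::-1]
            queue.append(child)
    return None
```

A `None` result answers the question of whether an element can be nested under another at all, and the length of a returned path gives the lower bound discussed above.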
-
-This is still very useful when taking into account the fact that TEI modules are
-merely "bags" to group the elements and provide hints to human encoders about
-the tools they might need but have no implication on the inclusion paths between
-elements which cross module boundaries freely. The general graph formalism
-enables us to describe complex filtering patterns and to implement queries to
-look for them among the elements exhaustively by algorithmic means even when the
-shortest-path approach is not enough.
-
-For instance, it lets one find that although `<pos/>` may not be directly
-included within `<entry/>` elements to include information about the
-part-of-speech of the word that an article defines, the correct way to do so is
-through a `<form/>` or a `<gramGrp/>`.
-
-On the other hand, trying to discover the shortest inclusion path to `<pos/>`
-from the `<TEI/>` root of the document yields a path through `<standOff/>`, an
-element dedicated to storing contextual data that accompanies but is not part of
-the text, not unlike an annex, and largely unrelated to the context of encoding
-an encyclopedia.
-
-A last relevant example on the use of these methods can be given by querying the
-shortest inclusion path of a `<pos/>` under the `<body/>` of the document: it
-yields an inclusion directly through `<entryFree/>` (with an inclusion path of
-length 2), which unlike `<entry/>` accepts it as a direct child node. Possibly
-not what we want depending on the regularity of the articles we are encoding and
-the occurrence of other grammatical information such as `<case/>` or `<gen/>` to
-justify the use of the `<gramGrp/>`, but searching exhaustively for paths up to
-length 3 returns as expected the path through `<entry/>`, among others. Overall,
-we get a good general idea: `<pos/>` does not need to be nested very deep, it
-can appear quite near the "surface" of article entries.
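
The exhaustive search up to a given length mentioned above could look like the following sketch. It assumes the same kind of toy graph, restricted to inclusions named in this section, and `inclusion_paths` is a hypothetical helper that stops each path at the first occurrence of the target.

```python
# Toy excerpt: an edge a -> b means that <a/> may directly contain <b/>.
TOY_GRAPH = {
    "body": ["entry", "entryFree"],
    "entry": ["form", "gramGrp"],
    "entryFree": ["pos"],
    "form": ["pos"],
    "gramGrp": ["pos"],
    "pos": [],
}


def inclusion_paths(graph, source, target, max_length):
    """Yield every inclusion path from source to target of at most max_length edges."""

    def extend(path):
        # A path of n elements uses n - 1 edges.
        if len(path) - 1 == max_length:
            return
        for child in graph.get(path[-1], []):
            if child == target:
                yield path + [child]
            else:
                yield from extend(path + [child])

    yield from extend([source])


for path in inclusion_paths(TOY_GRAPH, "body", "pos", 3):
    print(" > ".join(path))
```

On this excerpt, a bound of 2 only returns the path through `entryFree`, while a bound of 3 also returns the two paths going through `entry`.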
-
-##### Limits
-
-###### The `<entry/>` element
-
-The central element of the *dictionaries* module is the `<entry/>` element meant
-to encode one single entry in a dictionary, that is to say a head word
-associated to its definition. It is the natural way in from the `<body/>`
-element to the dictionary module: indeed, although `<body/>` may also contain
-`<entryFree/>` or `<superEntry/>` elements, the former is a relaxed version of
-`<entry/>` while the latter is a device to group several related entries
-together. Both can contain an `<entry/>` directly while no obvious inclusion
-exists the other way around: most (> 96.2%) of the inclusion paths of
-"reasonable" depth (which we define as strictly less than 5, that is twice the
-average shortest depth between any two nodes) include either `<figure/>` or
-`<castList/>`, two very specific elements which should not need to appear in an
-article in general, showing that the purpose of `<entry/>` is not to contain an
-`<entryFree/>` or `<superEntry/>`. Hence, not only the semantics conveyed by the
-documentation but also the structure of the elements graph point to `<entry/>`
-as the natural top-most element for an article. This somewhat contrived example
-is meant to further demonstrate the application of a graph-centred approach to
-understanding the inner workings of the XML-TEI schema.
-
-###### Information about the headword itself
-
-Once a block for an article is created, it may contain elements useful to
-represent various features of its headword, whose written and spoken forms are usually
-encoded by `<form/>` elements. Grammatical information like the `<case/>`,
-`<gen/>` or `<number/>` and `<pers/>` can be contained within a `<gramGrp/>`,
-along with information about the categories it belongs to like `<iType/>` for
-its inflection class in languages with a declension system or `<pos/>` for its
-part-of-speech. The `<etym/>` element is made to hold the etymology of an entry.
-In the case when there are alternative spellings in varieties of the language or
-if the spelling has changed over time, `<usg/>` can be used.
-
-These examples are by no means an exhaustive list; the complete set provides
-the encoder with a toolbox to describe all the information related to the form
-under which the entry is found and seems general enough to accommodate the
-structure of any book indexing entries by words.
-
-###### Cross-references
-
-A common feature shared by dictionaries and encyclopedias is the ability to
-connect entries together by using a word or short phrase as the link, referring
-the reader to the related concept. Such links are known as cross-references and
-can appear either when the definition of a term is adjacent to another one or to
-catch alternative spellings where some readers might expect to find the word and
-redirect them to the form chosen as the reference. In XML-TEI, this is done with
-the `<xr/>` element. It usually contains the whole phrase performing the
-redirection, with an imperative locution like "please see […]".
-
-The "active" part of the cross-reference, that is the very word within the
-`<xr/>` that is considered to be the link or, to make a modern-day HTML
-metaphor, the region that would be clickable, is represented by a `<ref/>`
-element. Though it is not specific to the *dictionaries* module, we include it
-in this description of the toolbox because it is particularly useful in the
-context of dictionaries. This element may have a target attribute which points
-to the other resource to be accessed by the interested reader.
-
-###### Definitions
-
-The remaining part of entries is also usually the largest and represents the
-content associated to the headword by the entry. In a dictionary, that is its
-meaning.
-
-The `<sense/>` element is a valid child for `<entry/>` and groups together a
-definition of the term with `<def/>`, usage examples with `<usg/>` (another use
-of this versatile element) and other high-level information such as translations
-in other languages. Both `<def/>` and `<usg/>` elements may appear directly
-under the `<entry/>`.
-
-###### Structural remarks
-
-Before concluding this description of the *dictionaries* module from the
-perspective of someone trying to concretely encode a particular dictionary or
-encyclopedia, we make use of the graph approach again to highlight some of its
-aspects in terms of inclusion structure.
-
-First, it is remarkable that all elements in the *dictionaries* module have a
-cyclic inclusion path, that is to say, there is an inclusion path from each
-element of this module to itself. Although having such a cycle is a widespread
-property in the remainder of XML-TEI elements shared by 73.8% of them (411 out
-of the 557 elements in the other modules), all 33 elements of the *dictionaries*
-module having one is far above this average. In addition, the cycles appear to
-be rather short, with an average length of 2.00 versus 2.50 in the rest of the
-population. This observation is all the more surprising considering the fact
-that the *dictionaries* module contains short "leaf" elements like `<pos/>`
-which should not obviously need to admit cycles since one rather expects them to
-contain only one word, like `<pos>adj</pos>` in the example given in the
-official documentation. Among those (shortest) cycles, 20 include the `<cit/>`
-element made to group quotations with a bibliographic reference to their source
-which should clearly be unnecessary to encode an article in the general case.
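
The figures quoted above (share of elements admitting a cycle, average shortest cycle length) can be computed along the following lines. This is a sketch only: the toy graph and the names `shortest_cycle_length` and `cycle_statistics` are assumptions made for illustration, and the real numbers require the full TEI inclusion graph.

```python
from collections import deque
from statistics import mean


def shortest_cycle_length(graph, element):
    """Length of the shortest inclusion cycle through element, or None."""
    queue = deque((child, 1) for child in graph.get(element, ()))
    seen = set()
    while queue:
        node, depth = queue.popleft()
        if node == element:
            return depth
        if node in seen:
            continue
        seen.add(node)
        queue.extend((child, depth + 1) for child in graph.get(node, ()))
    return None


def cycle_statistics(graph, elements):
    """Return (share of elements with a cycle, mean shortest cycle length)."""
    lengths = [shortest_cycle_length(graph, element) for element in elements]
    cyclic = [length for length in lengths if length is not None]
    share = len(cyclic) / len(lengths)
    return share, (mean(cyclic) if cyclic else None)


# Toy graph: one loop, one cycle of length two and one acyclic leaf.
TOY_GRAPH = {
    "abbr": ["abbr"],
    "address": ["geogName"],
    "geogName": ["address"],
    "leaf": [],
}

share, average = cycle_statistics(TOY_GRAPH, sorted(TOY_GRAPH))
print(f"{share:.1%} of elements admit a cycle, mean length {average:.2f}")

# Sanity check of the proportion quoted in the text: 411 out of 557 elements.
print(f"{411 / 557:.1%}")
```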
-
-Secondly, although we have seen examples of connections from this module to the
-rest of the XML-TEI, especially to the *core* module (see the case of the
-`<ref/>` element above), the *dictionaries* module appears somewhat isolated
-from important structural elements like `<head/>` or `<div/>`. Indeed, computing
-all the paths of length at most 5 from either `<entry/>` or `<sense/>` elements
-to `<div/>` by a systematic traversal of the graph yields
-exclusively paths (respectively 9042 and 39093 of them) containing either a
-`<floatingText/>` or an `<app/>` element. The first one, as its name aptly
-suggests, is used to encode text that does not quite fit the regular flow of the
-document, as for example in the context of an embedded narrative. Both examples
-displayed in the online documentation feature a `<body/>` as direct child of
-`<floatingText/>`, neatly separating its content as independent. The purpose of
-the second one, although its name — short for apparatus — is less clear, is to
-wrap together several versions of the same excerpts, for instance when there are
-several possible readings of an unclear group of words in a manuscript, or when
-the encoder is trying to compile a single version of a piece of work from
-several sources which disagree over some passage. In both cases, it appears
-obvious that it is not something that is expected to occur naturally in the
-course of an article in general.
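
One way to check a claim of the form "every path of length at most 5 goes through X or Y" without listing thousands of paths is to forbid X and Y and test whether the target is still reachable within the bound. The sketch below does that on a toy graph whose edges are purely illustrative and not read from the TEI schema; `reachable_within` is a hypothetical name.

```python
def reachable_within(graph, source, target, max_length, forbidden=frozenset()):
    """True when target can be reached from source in at most max_length
    inclusion steps without passing through any forbidden element."""
    frontier = {source}
    for _ in range(max_length):
        # Elements reachable with exactly one more inclusion step.
        frontier = {
            child
            for node in frontier
            for child in graph.get(node, ())
            if child not in forbidden
        }
        if target in frontier:
            return True
    return False


# Illustrative edges only: they mimic the situation described in the text.
TOY_GRAPH = {
    "entry": ["sense", "floatingText"],
    "sense": ["app"],
    "floatingText": ["body"],
    "app": ["lem"],
    "lem": ["div"],
    "body": ["div"],
    "div": [],
}

detours = frozenset({"floatingText", "app"})
print(reachable_within(TOY_GRAPH, "entry", "div", 5))
print(reachable_within(TOY_GRAPH, "entry", "div", 5, forbidden=detours))
```

If the second call returns `False` while the first returns `True`, every path within the bound necessarily contains one of the forbidden elements.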
-
-Thus, despite a rather dense internal connectivity, the *dictionaries* module
-fails to provide encoders with a device to represent recursively nesting
-structures like `<div/>`.
-
-The situation regarding subject indicators is hardly better outside of the
-module. The `<domain/>` element, despite its name, belongs exclusively in the
-header of a document and focuses on the social context of the text, not on the
-knowledge area it covers. The name of `<interp/>` suggests labelling something
-with the interpretation to give to a context (which subject indicators could be
-if you consider that, placed at the beginning, they are used to direct the mind
-frame of the readers towards a particular subject). However, the documentation
-clearly presents it as a tool for annotators of a document, whose text content
-is not part of the original document but some additional result of an analysis
-performed in the context of the encoding, used only through references in XML
-attributes.
-
-This point, although not the most concerning, still remains the hardest to
-address but all things considered the `<usg/>` element stands out as the most
-relevant.
-
-###### The notion of meaning
-
-Notwithstanding the correct way to represent domains of knowledge, their extent
-itself raises concerns regarding the *dictionaries* module. Indeed, among the
-vast collection of domains covered in encyclopedias in general and in *La Grande
-Encyclopédie* in particular are historical articles and biographies. If the
-notion of meaning can appear at least ill-fitting for a text describing a series
-of historical events, one may still argue that it groups them into a concept and
-associates it to the name of the event. But when it comes to relating the life
-of a person, describing their relation to events and other persons comes out
-even further from the notion of meaning. Entries such as the one about SANJO
-Sanetomi (see Figure @fig:sanjo) do not constitute a *definition*.
-
-![Beginning of the article relating the life of SANJO Sanetomi, in La Grande Encyclopédie, tome 29](figure/article/LGE/sanjo_t29.png){#fig:sanjo}
-
-Moreover, encyclopedias, because of all that they have inherited from the
-philosophical Enlightenment, are not only spaces designed to assert, they also
-intrinsically include an interrogative component. Some articles lay down the
-basis required to understand the complexity of an issue and invite the reader to
-consider it without providing a definitive answer, going as far as to explicitly
-use question marks as in the article "Action" displayed in Figure @fig:action.
-
-![Excerpt from article "Action", in La Grande Encyclopédie, tome 1](figure/article/LGE/action_t1.png){#fig:action}
-
-In this extract, the author devises a hypothetical situation to illustrate how
-difficult it is to draw the line between two supposedly mutually exclusive
-subcategories of legal actions. The whole point of the passage is to convey the
-idea that the term eludes definition; wrapping it in a `<sense/>` or, worse, a
-`<def/>` element would be an utter misnomer.
-
-As a result, the use of `<sense/>` and `<def/>` is not appropriate for
-encyclopedic content in general.
-
-###### Nested structures
-
-The final difficulty can be considered as a partial consequence of the previous
-one on the structure of articles. The difficulty to define complex concepts is
-the very reason why authors approach their subjects from various angles,
-circumnavigating it as a best approximation. This strategy favours long,
-structured developments with sections and subsections covering the multiple
-aspects of the topic: from a historical, political, scientific point of view…
-The longest articles, such as article "Europe" shown in Figure @fig:europe, can
-thus span several dozens of pages. They can contain substructures with titles on
-at least three levels (for instance, an `a)` under a `1)` under a `I.`), each of
-which is in turn generally developed over several paragraphs.
-
-![La Grande Encyclopédie, tome 16, article "Europe", spanning from p.782 to p.846, that is 64 pages, and ending after a bibliography longer than one column of text](figure/article/LGE/europe_t16.png){#fig:europe}
-
-The nested structure that we have just highlighted of course demands a nesting
-structure to accommodate it. More precisely, it guides our search for XML
-elements by giving us several constraints. We are looking for a pair of
-elements: the first, representing a (sub)section, must be able to include both
-itself and the second element, which has no special constraint other than
-having semantics compatible with our purpose of using it to represent section
-titles. In addition, the first element must be able to contain several `<p/>`
-elements, `<p/>` being the reference element to encode paragraphs according to
-the XML-TEI documentation.
-
-We have seen that the *dictionaries* module was equipped with a questionable but
-possible element for subject domains. However, it does not include any element
-for section titles. In the rest of the TEI specification, the elements `<head/>`
-and `<title/>` — the latter with the possibility to set its `type` attribute to
-`sub` — stand out as the best candidates for the semantics condition on the
-second element.
-
-##### Choices
-
-###### Candidates in the *dictionaries* module
-
-Filtering the content of the module to keep only the elements which can at the
-same time contain themselves, be included under `<entry/>` and include a `<p/>`
-and either the `<head/>` or `<title/>` elements yields absolutely no candidates.
-It is remarkable that even replacing the `<entry/>` element at the root of each
-article with an `<entryFree/>`, an element supposed to relax some constraints to
-accommodate more unusual structures in dictionaries, does not bring any
-improvement.
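
The filter described in this paragraph amounts to a conjunction of membership tests on the inclusion graph. The sketch below assumes a toy graph whose edges are illustrative rather than read from the TEI schema, and a hypothetical `candidates` function; on the real graph the text reports that the *dictionaries* module yields no candidate.

```python
def candidates(graph, root, elements, paragraph="p", titles=("head", "title")):
    """Elements that contain themselves, sit directly under root and may
    directly contain both a paragraph and at least one title element."""
    return sorted(
        element
        for element in elements
        if element in graph.get(element, ())              # contains itself
        and element in graph.get(root, ())                # allowed under the root
        and paragraph in graph.get(element, ())           # may hold paragraphs
        and any(title in graph.get(element, ()) for title in titles)
    )


# Illustrative edges only: a -> b means that <a/> may directly contain <b/>.
TOY_GRAPH = {
    "body": {"div", "entry"},
    "entry": {"sense", "form"},
    "sense": {"sense", "def"},
    "form": {"form"},
    "div": {"div", "p", "head"},
}

print(candidates(TOY_GRAPH, "entry", {"entry", "sense", "form"}))
print(candidates(TOY_GRAPH, "body", {"div", "entry"}))
```

Swapping the `root` argument is enough to repeat the query for `<entryFree/>` instead of `<entry/>`.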
-
-The lack of results from these simple queries forces us to somewhat relax the
-constraints on the encoding we are willing to use. We can for instance make the
-assumption that the occurrence of an intermediate element could be needed between
-the element wrapping the whole article and the recursing one used to encode each
-section. This "section" element could also need a companion element to be able
-to include itself, or, to formalise it in terms of graph theory, we could relax
-the condition that this element admits a loop to consider instead cycles of a
-given (small, this still needs to represent a fairly direct inclusion) length to
-be enough. We simultaneously extend the maximum depth of the inclusion paths we
-are looking for between `<entry/>`, the pair of elements and the `<p/>` element.
-
-By setting this depth to 3, that is, by accepting one intermediate element to
-occur in the middle of each one of the inclusion paths that define the structure
-required to encode encyclopedic discourse, we find 21 elements but none of them
-stands out as an obvious good solution: all paths to include the `<p/>` element
-from any *dictionaries* element contain either a `<figure/>` (which we have
-encountered earlier when we were practising our graph approach to search for
-inclusions between `<entry/>` and `<entryFree/>` and dismissed as not useful in
-general), a `<stage/>` (reserved to stage direction in dramatic works) or a
-`<state/>` (used to describe a temporary quality in a person or place), again
-not even close to what we want. The paths to either `<head/>` or `<title/>` are
-similarly disappointing. Again, changing `<entry/>` for `<entryFree/>` returns
-the exact same candidates. If that is not a thorough proof that none of these
-elements could fulfill our purpose, it is a fact that no element in this module
-appears as an obvious good solution and a serious hint to keep looking somewhere
-else.
-
-###### Widening the search
-
-We hence widen our search to include elements outside the *dictionaries* module
-which could be used to encode our sections and subsections, under the same
-constraint as before to try and find a composite solution that would remain
-under the `<entry/>` element even if resorting to subcomponents outside of the
-dedicated module. Only three elements are returned: `<figure/>`, `<metamark/>`
-and `<note/>`.
-
-The first one as we have repeatedly underlined is meant for graphic information
-and is not suitable for text content in general.
-
-The purpose of `<metamark/>` is to transcribe the edition marks that may appear
-on a particular primary source in order to alter the normal flow of the text and
-suggest an alternative reading (deletion, insertion, reordering, this is about a
-human editing the text from a given physical copy of it), but it is
-unfortunately of no use to encode a section of an article.
-
-The first element that might at least resemble what we are looking for is the
-last one, `<note/>`. It is meant to contain text, is about explaining something
-and seems general enough (not specific to a given genre, or to the occurrence of
-a particular object on the page). Unfortunately, its semantics still seems a bit
-off compared to our need. The documentation describes it as an "additional
-comment" which appears "out of the main textual stream" whereas the long
-developments in articles are the very matter of the text of encyclopedias, not
-mere remarks in the margins or at the foot of pages.
-
-##### Implementation
-
-The above remarks explain why the *dictionaries* module is unable to represent
-encyclopedias, where the notion of "meaning" is less central than in
-dictionaries and where discourse with nested structures of arbitrary depth can
-occur. Even composite encodings using elements outside of the *dictionaries*
-module under an `<entry/>` element do not meet our requirements. The *core*
-module, however, accommodates these structures by means of the `<div/>`,
-`<head/>` and `<p/>` elements, which have the additional advantage of carrying
-less semantic payload than `<sense/>` or `<def/>`. We therefore devise an
-encoding scheme based on them, which we recommend for other projects aiming at
-representing encyclopedias.
-
-To remain consistent with the above remarks we will only concern ourselves with
-what happens at the level of each article, right under the `<body/>` element.
-Everything related to metadata happens as expected in the file's `<teiHeader/>`
-which is well-enough equipped to handle them. In order to present our scheme
-throughout the following section we will be progressively encoding a reference
-article, "Cathète" from tome 9 reproduced in Figure @fig:cathete-photo.
-
-![La Grande Encyclopédie, tome 9, article "Cathète"](figure/article/LGE/cathète_t9.png){#fig:cathete-photo}
-
-###### The scheme
-
-Remaining within the *core* module for the structure, almost all useful elements
-are available and our encoding scheme merely quotes the official documentation.
-Each article is represented by a `<div/>`. We suggest setting an `xml:id`
-attribute on it with the head word of the entry — unique in the whole corpus, or
-made so by suffixing a number representing its rank among the various
-occurrences, even when there's only one for the sake of regularity — as its
-value, normalised to lowercase, stripping spaces and replacing all
-non-alphanumerical characters by a dash (`'-'`) to avoid issues with the XML
-encoding. Figure @fig:cathete-xml-0 illustrates this choice for the container
-element on the article "Cathète" previously displayed.
-
-![The container `div` element for article "Cathète"](figure/article/LGE/cathète_0.png){#fig:cathete-xml-0}
-
-Inside this element should be a `<head/>` enclosing the headword of the article.
-The usual sub-`<hi/>` elements are available within `<head/>` if the headword is
-highlighted by any special typographic means such as bold, small capitals, etc.
-The one disappointment of the encoding scheme we are defining in this chapter is
-the lack of support for a proper way to encode subject indicators.
-
-The best candidate we have found so far was `<usg/>` from the *dictionaries*
-module but it is not available directly under a `<head/>` element. All inclusion
-paths from the latter to the former of length less than or equal to 3 contain
-irrelevant elements (`<cit/>`, `<figure/>`, `<castList/>` and `<nym/>`) so it
-must be discarded. The next best elements appear to be `<term/>` (not very
-accurate) and `<rs/>` ("referring string", quite a general semantics but a
-possible match — subject indicators refer to a given domain of knowledge —
-although all the examples in the documentation refer to concrete persons,
-places or objects, not to the abstract objects that mathematics or poetry are).
-
-For this reason, we do not recommend any special encoding of the subject
-indicator but leave it open to each particular context: they are often
-abbreviated so an `<abbr/>` may apply, in *La Grande Encyclopédie*, biographies
-are not labeled by a knowledge domain but usually include the first name of the
-person when it is known so in that case an element like `<persName/>` is still
-appropriate. This choice applied to the same article "Cathète" produces Figure
-@fig:cathete-xml-1.
-
-![Encoding the head word of article "Cathète"](figure/article/LGE/cathète_1.png){#fig:cathete-xml-1}
-
-We then propose to wrap each different meaning in a separate `<div/>` with the
-`type` attribute set to `sense` to refer to the `<sense/>` element that would
-have been used within the *dictionaries* module. The `<div/>`s should be numbered
-according to the order they appear in with the `n` attribute starting from `0`
-as shown in Figure @fig:cathete-xml-2.
-
-![The empty structure for the only meaning of the word "Cathète"](figure/article/LGE/cathète_2.png){#fig:cathete-xml-2}
-
-In addition, each line within the article must start with a `<lb/>` to mark its
-beginning including before the `<head/>` element as demonstrated by Figure
-@fig:cathete-xml-3, which, although a surprising setup, underlines the fact that
-in the dense layout of encyclopedias, the carriage return separating two
-articles is meaningful. Stating each new line explicitly keeps enough
-information to reconstruct a faithful facsimile but it also has the advantage of
-highlighting the fact that even though the definition is cut from the headword
-by being in a separate XML element, they still occur on the same line, which is
-a typographic choice usually made both in encyclopedias and dictionaries where
-space is at a premium.
-
-To complete the structure, the various sections and subsections occurring
-within the article body may be nested as usual with `<div/>` and sub-`<div/>`s,
-filled with `<p/>` for paragraphs which can each be titled with `<head/>`
-elements local to each `<div/>`.
-
-![A complete encoding of article "Cathète"](figure/article/LGE/cathète_3.png){#fig:cathete-xml-3}
-
-Some articles such as "Boumerang" have figures with captions, as illustrated by
-Figure @fig:boumerang-photo, which should be encoded the standard way by
-`<figure/>` and `<figDesc/>` as in Figure @fig:boumerang-xml.
-
-![La Grande Encyclopédie, tome 7, article "Boumerang"](figure/article/LGE/boumerang_t7.png){height=300px #fig:boumerang-photo}
-
-![Encoding the figure in article "Boumerang" and its captions](figure/article/LGE/boumerang.png){#fig:boumerang-xml}
-
-Another issue arising from giving up on `<entry/>` is the unavailability of the
-`<xr/>` element, not allowed under any of the *core* elements we use but which
-is useful to represent cross-references occurring in encyclopedias as well as in
-dictionaries, for example in article "Gelocus" (see Figure @fig:gelocus-photo).
-We prefer to use the `<ref/>` element instead which is available in the context
-of a `<p/>`. Its `target` attribute should be set to the `xml:id` of the
-article it points to, prefixed with a `'#'` as shown in Figure @fig:gelocus-xml.
-Another solution would have been to introduce a `<dictScrap/>` element for the
-sole purpose of placing an `<xr/>` but we advocate against it on account of the
-verbosity it would add to the encoding and the fact that it implicitly suggests
-that the previous context was not the one of a dictionary.
-
-![La Grande Encyclopédie, tome 18, article "Gelocus"](figure/article/LGE/gelocus_t18.png){#fig:gelocus-photo}
-
-![Encoding the cross-references in article "Gelocus"](figure/article/LGE/gelocus.png){#fig:gelocus-xml}
-
-A typical page of an encyclopedia also features peritext elements, giving
-information to the reader about the current page number along with the headwords
-of the first and last articles appearing on the page. Those can be encoded by
-`<fw/>` elements ("forme work") whose `place` and `type` attributes should be
-set to position them on the page and identify their function if it has been
-recognised (those short elements on the border of pages are the ones typically
-prone to suffer damages or be misread by the OCR).
-
-Finally there are other TEI elements useful to represent "events" in the flow of
-the text, like the beginning of a new column of text or of a new page. Figure
-@fig:alcala-photo shows the top left of the last page of the first tome of *La
-Grande Encyclopédie* which features peritext elements while marking the
-beginning of a new page. The usual appropriate elements (`<pb/>` for page
-beginning, `<cb/>` for column beginning) may and should be used with our
-encoding scheme as demonstrated by Figure @fig:alcala-xml.
-
-![La Grande Encyclopédie, tome 1, article "Alcala-de-Hénarès"](figure/article/LGE/last_page_top_left_t1.png){width=350px #fig:alcala-photo}
-
-![Encoding the beginning of a page in article "Alcala-de-Hénarès"](figure/article/LGE/alcala.png){#fig:alcala-xml}
-
-###### Currently implemented
-
-The reference implementation for this encoding scheme is the program
-soprano[^soprano] developed within the scope of project DISCO-LGE to
-automatically identify individual articles in the flow of raw text from the
-columns and to encode them into XML-TEI files. Though this software has already
-been used to produce the first TEI version of *La Grande Encyclopédie*, it does
-not yet follow the above specification perfectly. Figure
-@fig:cathete-xml-current shows the encoded version of article "Cathète" it
-currently produces:
-
-[^soprano]: [https://gitlab.huma-num.fr/disco-lge/soprano](https://gitlab.huma-num.fr/disco-lge/soprano)
-
-![The current encoding of article "Cathète" produced by `soprano`](figure/article/LGE/cathète_current.png){#fig:cathete-xml-current}
-
-The headword detection system is not able to capture the subject indicators yet
-so it appears outside of the `<head/>` element. No work is performed either to
-expand abbreviations and encode them as such, or to distinguish between domain
-and people names.
-
-Likewise, since the detection of titles at the beginning of each section is not
-complete, no structure analysis can be performed at the moment on the textual
-development inside the article and it is left unstructured, directly under the
-entry's `<div/>` element instead of under a set of nested `<div/>` elements. The
-paragraphs are not yet identified and for this reason not encoded.
-
-However, the figures and their captions are already handled correctly when they
-occur. The encoder also keeps track of the current lines, pages, and columns and
-inserts the corresponding empty elements (`<lb/>`, `<pb/>` or `<cb/>`) and
-numbers pages so that the numbering corresponding to the physical pages is
-available, as compared to the "high-level" page numbers inserted by the
-editors, which start with an offset because the first, blank or almost empty
-pages at the beginning of each book do not have a number and which sometimes have
-gaps when a full-page geographical map is inserted since those are printed
-separately on a different folio which remains outside of the textual numbering
-system. The place at which these layout-related elements occur is determined by
-the place where the OCR software detected them and by the reordering performed
-by `soprano` when inferring the reading order before segmenting the articles.
-
-###### The constraints of automated processing
-
-Encyclopedias are particularly long books, spanning numerous tomes and
-containing several tens of thousands of articles. The *Encyclopédie* comprises
-over 74k articles and *La Grande Encyclopédie* certainly more than 100k (the latest
-version produced by `soprano` created 160k articles, but their segmentation is
-still not perfect and if some article beginnings remain undetected, all the very
-long and deeply-structured articles are unduly split into many parts, resulting
-globally in an overestimation of the total number).
-
-XML-TEI is a very broad tool useful for very different applications. Some
-elements like `<unclear/>` or `<factuality/>` can encode subtle semantic
-information (for the second one, adjacent to a notion as elusive as truth)
-which requires a very deep understanding of a text in its entirety and about
-which even some human experts may disagree.
-
-For these reasons, a central concern in the design of our encoding scheme was to
-remain within the boundaries of information that can be described objectively
-and extracted automatically by an algorithm. Most of the tags presented above
-contain information about the positions of the elements or their relation to one
-another. Those with an additional semantic implication like `<head/>` can be
-inferred simply from their position and the frequent use of a special typography
-like bold or upper-case characters.
-
-The case of cross-references is particular and may appear as a counter-example
-to the main principle on which our scheme is based. Actually, the process of
-linking from an article to another one is so frequent (in dictionaries as well
-as in encyclopedias) that it generally escapes the scope of regular discourse to
-take a special and often fixed form, inside parentheses and after a special
-token which invites the reader to perform the redirection. In *La Grande
-Encyclopédie*, virtually all the redirections (that is, to the extent of our
-knowledge, absolutely all of them though of course some special case may exist,
-but they are statistically rare enough that we have not found any yet) appear
-within parentheses, and start with the verb "voir" abbreviated as a single,
-capital "V." as illustrated above in the article "Gelocus" (see again Figure
-@fig:gelocus-photo).
-
-Although this has not been implemented yet either, we hope to be able to detect
-and exploit those patterns to correctly encode cross-references. Getting the
-`target` attributes right is certainly more difficult to achieve and may require
-processing the articles in several steps, to first discover all the existing
-headwords — and hence article IDs — before trying to match the words following
-"V." with them. Since our automated encoder handles tomes separately and since
-references may cross the boundaries of tomes, it cannot wait for the target of a
-cross-reference to be discovered by keeping the articles in memory before
-outputting them.
-
-This is in line with the last important aspect of our encoder. If many
-lexicographers may deem our encoding too shallow, it has the advantage of not
-requiring overly complex data structures to be kept in memory for a long time.
-The algorithm implementing it in `soprano` outputs elements as soon as it can,
-for instance the empty elements already discussed above. For articles, it pushes
-lines onto a stack and flushes it each time it encounters the beginning of the
-following article. This keeps the amount of memory required reasonable and even
-lets several tomes be processed in parallel on most modern machines. Thus,
-even taking over three minutes per tome, the total processing time can be
-lowered to around forty minutes on a machine with 16 GB of RAM for the whole of
-*La Grande Encyclopédie* instead of over one hour and a half.
-
 ## Domains

 ### Domain systems
@@ -1499,19 +776,4 @@ TODO Comment être plus maligne dans l'association ?
 TODO Grammar of articles


-## Part-of-speech and syntactic annotation
-
-### Tagset
-
-We use the [tagset]() of the
-[PRESTO](http://presto.ens-lyon.fr/) project
-
-Actually, Stanza works well too, with the
-[UPOS](https://universaldependencies.org/docs/u/pos/)
-
-### Processing pipelines
-
- PRESTO
- Stanza
-

--- a/Corpus/Formats_et_états.md
+++ b/Corpus/Formats_et_états.md
+## Formats and states of the texts
+
+### L'Encyclopédie
+
+In common parlance, the terms "dictionaries" and "encyclopedias" are used as
+near synonyms to refer to books compiling vast amounts of knowledge into lists
+of definitions ordered alphabetically. Their similarity is even visible in the
+way they are coordinated in the full title of the *Encyclopédie* which is
+probably the most famous work of the genre and a symbol of the Age of
+Enlightenment. If the word "encyclopedia" is nowadays part of our vocabulary, it
+was much more unusual and in fact controversial when Diderot and d'Alembert
+decided to use it in the title of their book.
+
+The definition given by Furetière in his *Dictionnaire Universel* in 1690 is
+still close to its Greek etymology: a "ring of all knowledges", from *κύκλος*,
+"circle", and *παιδεία*, "knowledge". This meaning is the one used for instance
+by Rabelais in *Pantagruel*, when he has Thaumaste declare that Panurge opened
+to him "le vray puys et abisme de Encyclopedie" ("the true well and abyss of
+Encyclopedia"). At the time the word still mostly refers to the abstract concept
+of mastering all knowledges at once. Furetière adds that it's a quality one
+is unlikely to possess, and even seems to condemn its search as a form of
+hubris: "C'est une témérité à un homme de vouloir posséder l'Encyclopédie"
+("it is a recklessness for a man to want to possess Encyclopedia").
+
+Beyond this moral reproach, the concept that pleased Rabelais was somewhat dated
+at the end of the 17^th^ century and attacked in the
+*Dictionnaire Universel François et Latin*, commonly referred to as the
+*Dictionnaire de Trevoux*, as utterly "burlesque" ("parodic"). The entry for
+"Encyclopédie" remained unchanged in the four editions issued between 1721 and
+1752, mocking the use of the word and discouraging its readers from pursuing it.
+To that end, it quotes a poem from Pibrac encouraging people to specialise in
+only one discipline lest they should not reach perfection, based on an
+argumentation that resembles the saying "Jack of all trades, master of none". It
+is all the more interesting that the definition remains unaltered until 1752,
+one year after the publication of the first volume of the *Encyclopédie*. The
+Jesuits who edited the *Dictionnaire de Trevoux* frowned upon the project of the
+*Encyclopédie* which they managed to get banned the same year by the Council of
+State on the charge of attempting to destroy the royal authority, inspiring
+rebellion and corrupting morality in general. There is much more at stake than
+words here, but the attempt to deprecate the word itself is part of their fight
+against the philosophers of the Enlightenment.
+
+The attacks do not remain ignored by Diderot who starts the very definition of
+the word "Encyclopédie" in the *Encyclopédie* itself by a strong rebuttal. He
+directly dismisses the concerns expressed in the *Dictionnaire de Trevoux* as
+mere self-doubt that their authors should not generalise to anyone, then leaves
+the main point to a Latin quote by chancellor Bacon [@lojkine2013], who argues
+that a collaborative work can achieve much more than any talented man could:
+what could possibly not be within reach of a single man, within a single
+lifetime may be achieved by a common effort throughout generations.
+
+History hints that Diderot's opponents took his defence of the feasibility of
+the project quite seriously, considering the fact that they got the
+*Encyclopédie*'s privileges revoked again six years after its publication
+was resumed [@moureau2001]. As a consequence, the remaining ten volumes
+containing the text of the articles had to be published illegally until 1765,
+thanks to the secret protection of Malesherbes who — despite being head of royal
+censorship — saved the manuscripts from destruction. They were printed secretly
+outside of Paris and the books were (falsely) labeled as coming from Neufchâtel.
+Following the high demand from the booksellers who feared they would lose the
+money they had invested in the project, a special privilege was issued for the
+volumes containing the plates, which were released publicly from 1762 to 1772.
+
+In any case, in their last edition in 1771 the authors of the *Dictionnaire de
+Trevoux* had no choice but to acknowledge the success of the encyclopedic
+projects of the 18^th^ century. In this version, the definition
+was entirely reworked, mildly stating that good encyclopedias are difficult to
+make because of the amount of knowledge necessary and work needed to keep up
+with scientific progress instead of calling the effort a parody. It credits
+Chambers' *Cyclopædia* for being a decent attempt before referring anonymously
+though quite explicitly to Diderot and d'Alembert's project by naming the
+collective "Une Société de gens de Lettres" and writing that it started in 1751.
+Even more importantly, two new entries were added after it: one for the
+adjective "encyclopédique" and another one for the noun "encyclopédiste",
+silently admitting how the project had changed its time and the relation to
+knowledge itself.
+
+#### Context of the work
+
+#### Available versions
+
+The ARTFL[^ARTFL] project offers a version of it.
+
+[^ARTFL]: [http://artfl-project.uchicago.edu/](http://artfl-project.uchicago.edu/)
+
+#### Processing
+
+### La Grande Encyclopédie
+
+#### Context of the work
+
+*La Grande Encyclopédie, Inventaire raisonné des Sciences, des Lettres et des
+Arts par une Société de savants et de gens de lettres* (hereafter *LGE*) was
+published in France between 1885 and 1902 by a team of more than two hundred
+specialists organised into eleven sections. The text comprises 31 tomes of
+about 1200 pages each and was, according to @jacquet_pfau2015, the last major
+French encyclopedic endeavour to follow in the footsteps of its prestigious
+ancestor, the *EDdA*, published about 130 years earlier.
+
+The full title of the work already shows its intent to claim descent from the
+*EDdA* and to update it [@jacquet_pfau_actualiser_2022].
+
+#### Available versions
+
+A digital version of this work was produced by the BnF and put
+online[^LGE-V1] in 2007. Based on an undated reprint of the original edition,
+it comprises one image per page of the work, digitised in greyscale at a
+resolution of 300x300 pixels per inch. From these image files a partial version
+of the text was derived by applying an optical character recognition program
+([@=OCR]). This version has a number of limitations which prevented a
+comprehensive study of the text by automatic means such as textometry.
+
+[^LGE-V1]: [http://catalogue.bnf.fr/ark:/12148/cb377013071](http://catalogue.bnf.fr/ark:/12148/cb377013071)
+
+First, the text is incomplete: although a large number of tomes have been
+OCRed, some, such as tomes 5 and 18, have not been processed and no text is
+available for these volumes on the Gallica
+website[^LGE-V1-liste-des-tomes]. This makes an exhaustive study impossible,
+but nothing suggests that a study based on the available tomes would be subject
+to any particular bias, since the OCRed tomes do not seem to have been chosen
+according to any precise logic: the missing ones, for instance, are not
+contiguous, nor are they at the beginning or the end of the work. Second, this
+"plain text" version (in reality an HTML page with minimal markup) carries only
+very superficial annotation and in particular is not segmented into articles.
+This is a major obstacle to the use we want to make of it, since the unit of
+study of our corpus is the article, which makes it possible both to conduct a
+contrastive study by grouping articles by knowledge domain or by author and to
+observe the structure of the domains by comparing which articles were kept or
+not between two encyclopedias and, where they were, whether the knowledge
+domain associated with them is the same. Finally, errors in the detection of
+the page layout ([@=OLR]) significantly obscure the text by locally permuting
+its content, which sometimes mixes pieces of articles together (considerably
+complicating the segmentation of the text into articles) and in all cases
+damages the structure of the sentences, introducing errors into the later
+morpho-syntactic and syntactic annotation phases that we need to apply to the
+text in order to do textometry.
+
+[^LGE-V1-liste-des-tomes]: [https://gallica.bnf.fr/ark:/12148/bpt6k246407#](https://gallica.bnf.fr/ark:/12148/bpt6k246407#)
+
+To remedy these shortcomings, the CollEx Persée project
+DISCO-LGE[^DISCO-LGE] undertook to redigitise this encyclopedia in
+partnership with the BnF and to produce a version of it encoded in XML-TEI. This
+new version was made from photographs of an original copy[^LGE-V2] held at the
+Bibliothèque de l'Arsenal in Paris[^bib-arsenal].
+
+[^DISCO-LGE]: [https://www.collexpersee.eu/projet/disco-lge/](https://www.collexpersee.eu/projet/disco-lge/)
+[^LGE-V2]: [https://catalogue.bnf.fr/ark:/12148/cb41651490t](https://catalogue.bnf.fr/ark:/12148/cb41651490t)
+[^bib-arsenal]: [https://www.bnf.fr/fr/arsenal](https://www.bnf.fr/fr/arsenal)
+
+This project ended in August 2021 and resulted in the publication on
+Nakala[^nakala], the data repository of the Très Grande Infrastructure de
+Recherche Huma-Num, of a new version of the work in several formats.
+
+[^nakala]: [https://nakala.fr/](https://nakala.fr/)
+
+#### Encoding
+
+##### Structure of the *dictionaries* module
+
+**Definitions**
+
+By iterating several times the operation of moving on that graph along one edge,
+that is, by considering the transitive closure of the relation "be connected by
+an edge" we define *inclusion paths* which allow us to explore which elements
+may be nested under which other.
+
+The nodes visited along the way represent the intermediate XML elements to
+construct a valid XML tree according to the TEI schema. Given the top-down
+semantics of those trees, we call the length of an inclusion path its *depth*.
+
+The ability for an element to contain itself corresponds directly to loops on
+the graph (that is an edge from a node to itself) as can be illustrated by the
+`<abbr/>` element: an `<abbr/>` element (abbreviation) can directly contain
+another one.
+
+The generalisation of this to inclusion paths of any length greater than one is
+usually called a cycle and we may be tempted in our context to refine this and
+name them *inclusion cycles*. The `<address/>` element provides us with an
+example for this configuration: although an `<address/>` element may not
+directly contain another one, it may contain a `<geogName/>` which, in turn, may
+contain a new `<address/>` element. From a graph theory perspective, we can say
+that it admits an inclusion cycle of length two.
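
The loop and cycle definitions above can be sketched on a toy excerpt of the inclusion graph. The adjacency table below only lists the three edges mentioned in the text, and the names `INCLUSIONS`, `has_loop` and `shortest_cycle_length` are hypothetical helpers introduced for illustration; they are not part of any TEI tooling.

```python
from collections import deque

# Toy excerpt of the inclusion graph: an edge parent -> child means
# "parent may directly contain child". Only the edges quoted in the text
# are listed; the real TEI graph is far larger.
INCLUSIONS = {
    "abbr": {"abbr"},
    "address": {"geogName"},
    "geogName": {"address"},
}


def has_loop(graph, element):
    """Return True if the element may directly contain itself."""
    return element in graph.get(element, set())


def shortest_cycle_length(graph, element):
    """Return the length of the shortest inclusion path from the element
    back to itself, or None if there is none (a loop has length 1)."""
    queue = deque((child, 1) for child in graph.get(element, set()))
    seen = set()
    while queue:
        node, depth = queue.popleft()
        if node == element:
            return depth
        if node in seen:
            continue
        seen.add(node)
        queue.extend((child, depth + 1) for child in graph.get(node, set()))
    return None
```

On this excerpt, `<abbr/>` has a loop while `<address/>` only reaches itself again through `<geogName/>`, which is the two-step inclusion cycle described above.
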
+
+**Applications**
+
+Using classical, well-known methods such as Dĳkstra's algorithm [@dĳkstra59]
+allows us to explore the shortest inclusion paths that exist between elements.
+Though particular caution should be applied because there is no guarantee that
+the shortest path is meaningful in general, it at least provides us with an
+efficient way to check whether a given element may or may not be nested at all
+under another one and gives a lower bound on the length of the path to expect.
+Of course the accuracy of this heuristic decreases as the length of the
+intended, meaningful path between two nodes (the one a human specialist of the
+TEI framework would build) increases.
+
+This is still very useful when taking into account the fact that TEI modules are
+merely "bags" to group the elements and provide hints to human encoders about
+the tools they might need but have no implication on the inclusion paths between
+elements which cross module boundaries freely. The general graph formalism
+enables us to describe complex filtering patterns and to implement queries to
+look for them among the elements exhaustively by algorithmic means even when the
+shortest-path approach is not enough.
+
+For instance, it lets one find that although `<pos/>` may not be directly
+included within `<entry/>` elements to include information about the
+part-of-speech of the word that an article defines, the correct way to do so is
+through a `<form/>` or a `<gramGrp/>`.
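
A minimal sketch of such a shortest-path query follows. It is an assumption-laden illustration, not the code used for this work: since all edges of the inclusion graph have the same weight, a plain breadth-first search is used instead of Dĳkstra's algorithm (both return a shortest path in that case), and `TOY_GRAPH` is a hand-written excerpt containing only inclusions stated in this chapter.

```python
from collections import deque

# Hand-written excerpt of the inclusion graph (parent -> possible children),
# restricted to inclusions mentioned in the text. Hypothetical toy data.
TOY_GRAPH = {
    "body": {"entry", "entryFree", "superEntry"},
    "superEntry": {"entry"},
    "entryFree": {"entry", "pos"},
    "entry": {"form", "gramGrp", "sense"},
    "form": {"pos"},
    "gramGrp": {"pos"},
}


def shortest_inclusion_path(graph, source, target):
    """Breadth-first search returning one shortest inclusion path from
    source to target as a list of element names, or None if the target
    cannot be nested under the source at all."""
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for child in sorted(graph.get(node, ())):
            if child not in parents:
                parents[child] = node
                queue.append(child)
    return None
```

Calling `shortest_inclusion_path(TOY_GRAPH, "entry", "pos")` returns a path of depth 2 going through an intermediate element, whereas asking for a path from `"pos"` to `"entry"` returns `None` on this excerpt.
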
+
+On the other hand, trying to discover the shortest inclusion path to `<pos/>`
+from the `<TEI/>` root of the document yields a `<standOff/>`, an element
+dedicated to store contextual data that accompanies but is not part of the text,
+not unlike an annex, and widely unrelated to the context of encoding an
+encyclopedia.
+
+A last relevant example on the use of these methods can be given by querying the
+shortest inclusion path of a `<pos/>` under the `<body/>` of the document: it
+yields an inclusion directly through `<entryFree/>` (with an inclusion path of
+length 2), which unlike `<entry/>` accepts it as a direct child node. Possibly
+not what we want depending on the regularity of the articles we are encoding and
+the occurrence of other grammatical information such as `<case/>` or `<gen/>` to
+justify the use of the `<gramGrp/>`, but searching exhaustively for paths up to
+length 3 returns as expected the path through `<entry/>`, among others. Overall,
+we get a good general idea: `<pos/>` does not need to be nested very deep, it
+can appear quite near the "surface" of article entries.
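
The exhaustive search for paths up to a given length can be sketched as a depth-limited traversal. As before, this is only an illustration under stated assumptions: `TOY_GRAPH` is a hypothetical excerpt limited to inclusions quoted in this chapter, and `inclusion_paths` is a helper name made up for the example. Paths visiting the same element twice are skipped.

```python
# Hand-written excerpt of the inclusion graph (parent -> possible children),
# restricted to inclusions mentioned in the text. Hypothetical toy data.
TOY_GRAPH = {
    "body": {"entry", "entryFree", "superEntry"},
    "superEntry": {"entry"},
    "entryFree": {"entry", "pos"},
    "entry": {"form", "gramGrp", "sense"},
    "form": {"pos"},
    "gramGrp": {"pos"},
}


def inclusion_paths(graph, source, target, max_depth):
    """Yield every inclusion path from source to target made of at most
    max_depth edges, without visiting any element twice."""

    def walk(path):
        node = path[-1]
        if node == target and len(path) > 1:
            yield list(path)
            return
        if len(path) - 1 == max_depth:
            return  # depth budget exhausted without reaching the target
        for child in sorted(graph.get(node, ())):
            if child not in path:
                yield from walk(path + [child])

    yield from walk([source])
```

With a depth limit of 2 the only path from `"body"` to `"pos"` in this excerpt goes through `"entryFree"`; raising the limit to 3 also returns the paths going through `"entry"`.
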
+
+##### Limits
+
+###### The `<entry/>` element
+
+The central element of the *dictionaries* module is the `<entry/>` element meant
+to encode one single entry in a dictionary, that is to say a head word
+associated to its definition. It is the natural way in from the `<body/>`
+element to the dictionary module: indeed, although `<body/>` may also contain
+`<entryFree/>` or `<superEntry/>` elements, the former is a relaxed version of
+`<entry/>` while the latter is a device to group several related entries
+together. Both can contain an `<entry/>` directly while no obvious inclusion
+exists the other way around: most (> 96.2%) of the inclusion paths of
+"reasonable" depth (which we define as strictly less than 5, that is twice the
+average shortest depth between any two nodes) either include `<figure/>` or
+`<castList/>`, two very specific elements which should not need to appear in an
+article in general, showing that the purpose of `<entry/>` is not to contain an
+`<entryFree/>` or `<superEntry/>`. Hence, not only the semantics conveyed by the
+documentation but also the structure of the elements graph evidence `<entry/>`
+as the natural top-most element for an article. This somewhat contrived example
+hopes to further demonstrate the application of a graph-centred approach to
+understand the inner workings of the XML-TEI schema.
+
+###### Information about the headword itself
+
+Once a block for an article is created, it may contain elements useful to
+represent various of its features. Its written and spoken forms are usually
+encoded by `<form/>` elements. Grammatical information like the `<case/>`,
+`<gen/>` or `<number/>` and `<pers/>` can be contained within a `<gramGrp/>`,
+along with information about the categories it belongs to like `<iType/>` for
+its inflection class in languages with a declension system or `<pos/>` for its
+part-of-speech. The `<etym/>` element is made to hold the etymology of an entry.
+In the case when there are alternative spellings in varieties of the language or
+if the spelling has changed over time, `<usg/>` can be used.
+
+All these examples are by no means an exhaustive list; the complete set provides
+the encoder with a toolbox to describe all the information related to the form
+the entry is found at and seems general enough to accommodate the structure of
+any book indexing entries by words.
+
+###### Cross-references
+
+A common feature shared by dictionaries and encyclopedias is the ability to
+connect entries together by using a word or short phrase as the link, referring
+the reader to the related concept. These links are known as cross-references and can
+appear either when the definition of a term is adjacent to another one or to
+catch alternative spellings where some readers might expect to find the word and
+redirect them to the form chosen as the reference. In XML-TEI, this is done with
+the `<xr/>` element. It usually contains the whole phrase performing the
+redirection, with an imperative locution like "please see […]".
+
+The "active" part of the cross-reference, that is the very word within the
+`<xr/>` that is considered to be the link or, to make a modern-day HTML
+metaphor, the region that would be clickable, is represented by a `<ref/>`
+element. Though it is not specific to the *dictionaries* module, we include it
+in this description of the toolbox because it is particularly useful in the
+context of dictionaries. This element may have a `target` attribute which points
+to the other resource to be accessed by the interested reader.
+
+###### Definitions
+
+The remaining part of an entry is also usually the largest and represents the
+content associated with the headword by the entry. In a dictionary, that is its
+meaning.
+
+The `<sense/>` element is a valid child for `<entry/>` and groups together a
+definition of the term with `<def/>`, usage examples with `<usg/>` (another use
+of this versatile element) and other high-level information such as translations
+in other languages. Both `<def/>` and `<usg/>` elements may appear directly
+under the `<entry/>`.
+
+###### Structural remarks
+
+Before concluding this description of the *dictionaries* module from the
+perspective of someone trying to concretely encode a particular dictionary or
+encyclopedia, we make use of the graph approach again to evidence some of its
+aspects in terms of inclusion structure.
+
+First, it is remarkable that all elements in the *dictionaries* module have a
+cyclic inclusion path, that is to say, there is an inclusion path from each
+element of this module to itself. Such a cycle is a widespread property among
+the remaining XML-TEI elements, shared by 73.8% of them (411 out of the 557
+elements in the other modules), but having all 33 elements of the
+*dictionaries* module admit one is far above this average. In addition, the
+cycles appear to be rather short, with an average length of 2.00 versus 2.50 in
+the rest of the population. This observation is all the more surprising
+considering the fact that the *dictionaries* module contains short "leaf"
+elements like `<pos/>` which have no obvious need to admit cycles since one
+rather expects them to contain only one word, like `<pos>adj</pos>` in the
+example given in the official documentation. Among those (shortest) cycles, 20
+include the `<cit/>` element, made to group quotations with a bibliographic
+reference to their source, which should clearly be unnecessary to encode an
+article in the general case.
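+
+The cycle measurements above can be illustrated by a minimal sketch, assuming
+the schema is available as an adjacency mapping. The `INCLUSIONS` graph below
+is a hypothetical toy example, not data extracted from the TEI specification,
+and `shortest_cycle_length` is a name introduced here for illustration only.
+

```python
from collections import deque

# Hypothetical toy inclusion graph: an edge a -> b means "<a/> may contain <b/>".
# These edges are illustrative only and are NOT extracted from the TEI schema.
INCLUSIONS = {
    "entry": ["sense", "form"],
    "sense": ["sense", "def"],
    "form": ["orth"],
    "def": ["cit"],
    "cit": ["def"],
    "orth": [],
}


def shortest_cycle_length(graph, start):
    """Return the length of the shortest inclusion cycle through `start`,
    or None if `start` cannot (transitively) contain itself."""
    # Breadth-first search: the first time we come back to `start`,
    # the number of edges followed is minimal.
    queue = deque((child, 1) for child in graph.get(start, []))
    seen = set()
    while queue:
        node, length = queue.popleft()
        if node == start:
            return length
        if node in seen:
            continue
        seen.add(node)
        queue.extend((child, length + 1) for child in graph.get(node, []))
    return None


cycles = {name: shortest_cycle_length(INCLUSIONS, name) for name in INCLUSIONS}
cyclic = {name: n for name, n in cycles.items() if n is not None}
share = len(cyclic) / len(INCLUSIONS)           # proportion of cyclic elements
average = sum(cyclic.values()) / len(cyclic)    # mean shortest-cycle length
```

+
+On this toy graph, three of the six elements admit a cycle (`sense` through a
+loop of length 1, `def` and `cit` through each other), so `share` is 0.5. The
+same two aggregates computed on the real graph would correspond to the
+proportions and average lengths reported above.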
+
+Secondly, although we have seen examples of connections from this module to the
+rest of the XML-TEI, especially to the *core* module (see the case of the
+`<ref/>` element above), the *dictionaries* module appears somewhat isolated
+from important structural elements like `<head/>` or `<div/>`. Indeed, a
+systematic traversal of the graph computing all the paths of length less than
+or equal to 5 from either `<entry/>` or `<sense/>` to `<div/>` yields
+exclusively paths (respectively 9042 and 39093 of them) containing either a
+`<floatingText/>` or an `<app/>` element. The first one, as its name aptly
+suggests, is used to encode text that does not quite fit the regular flow of the
+document, as for example in the context of an embedded narrative. Both examples
+displayed in the online documentation feature a `<body/>` as direct child of
+`<floatingText/>`, neatly separating its content as independent. The purpose of
+the second one, whose name (short for apparatus) is less clear, is to wrap
+together several versions of the same excerpt, for instance when there are
+several possible readings of an unclear group of words in a manuscript, or when
+the encoder is trying to compile a single version of a piece of work from
+several sources which disagree over some passage. In both cases, it appears
+obvious that it is not something that is expected to occur naturally in the
+course of an article in general.
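+
+The bounded traversal described above could look like the following sketch. It
+assumes the same kind of adjacency mapping as the schema graph; the edges of
+`INCLUSIONS` are made up for the example and `paths_up_to` is a hypothetical
+helper, not the tool actually used to obtain the figures above.
+

```python
# Hypothetical toy inclusion graph (a -> b means "<a/> may contain <b/>");
# the edges are illustrative and NOT extracted from the real TEI schema.
INCLUSIONS = {
    "entry": ["sense", "cit"],
    "sense": ["cit", "note"],
    "cit": ["floatingText"],
    "note": ["app"],
    "floatingText": ["body"],
    "app": ["rdg"],
    "body": ["div"],
    "rdg": ["div"],
    "div": [],
}


def paths_up_to(graph, source, target, max_length):
    """Yield every inclusion path from `source` to `target` made of at most
    `max_length` edges, by a depth-first traversal."""
    stack = [[source]]
    while stack:
        path = stack.pop()
        if path[-1] == target and len(path) > 1:
            yield path
            continue  # a path stops at its first occurrence of the target
        if len(path) - 1 < max_length:
            stack.extend(path + [child] for child in graph.get(path[-1], []))


paths = list(paths_up_to(INCLUSIONS, "entry", "div", 5))
detours = {"floatingText", "app"}
# True when every path found goes through at least one of the detour elements.
all_detour = all(detours & set(path) for path in paths)
```

+
+On this toy graph three paths of at most 5 edges lead from `entry` to `div`,
+and each of them goes through `floatingText` or `app`, so `all_detour` is
+true. The length bound guarantees termination even when the graph contains
+cycles.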
+
+Thus, despite a rather dense internal connectivity, the *dictionaries* module
+fails to provide encoders with a device to represent recursively nesting
+structures like `<div/>`.
+
+The situation regarding subject indicators is hardly better outside of the
+module. The `<domain/>` element, despite its name, belongs exclusively in the
+header of a document and focuses on the social context of the text, not on the
+knowledge area it covers. The `<interp/>` element, despite its name, is not
+meant to label a passage as an interpretation to give to a context (which
+subject indicators could be if one considers that, placed at the beginning,
+they direct the reader's frame of mind towards a particular subject). Instead,
+the documentation clearly presents it as a tool for the annotators of a
+document: its text content is not part of the original document but the
+additional result of an analysis performed in the context of the encoding, used
+only through references in XML attributes.
+
+This point, although not the most concerning, remains the hardest to address,
+but all things considered the `<usg/>` element stands out as the most relevant
+candidate.
+
+###### The notion of meaning
+
+Notwithstanding the correct way to represent domains of knowledge, their extent
+itself raises concerns regarding the *dictionaries* module. Indeed, among the
+vast collection of domains covered in encyclopedias in general and in *La Grande
+Encyclopédie* in particular are historical articles and biographies. Although
+the notion of meaning may seem ill-fitting for a text describing a series of
+historical events, one may still argue that it groups them into a concept and
+associates it with the name of the event. But when it comes to relating the
+life of a person, describing their relation to events and other persons strays
+even further from the notion of meaning. Entries such as the one about SANJO
+Sanetomi (see Figure @fig:sanjo) do not constitute a *definition*.
+
+![Beginning of the article relating the life of SANJO Sanetomi, in La Grande Encyclopédie, tome 29](figure/article/LGE/sanjo_t29.png){#fig:sanjo}
+
+Moreover, encyclopedias, because of all that they have inherited from the
+philosophical Enlightenment, are not only spaces designed to assert, they also
+intrinsically include an interrogative component. Some articles lay down the
+basis required to understand the complexity of an issue and invite the reader to
+consider it without providing a definitive answer, going as far as to explicitly
+use question marks as in the article "Action" displayed in Figure @fig:action.
+
+![Excerpt from article "Action", in La Grande Encyclopédie, tome 1](figure/article/LGE/action_t1.png){#fig:action}
+
+In this extract, the author devises a hypothetical situation to illustrate how
+difficult it is to draw the line between two supposedly mutually exclusive
+subcategories of legal actions. The whole point of the passage is to convey the
+idea that the term eludes definition; wrapping it in a `<sense/>` or, worse, a
+`<def/>` element would be an utter misnomer.
+
+As a result, the use of `<sense/>` and `<def/>` is not appropriate for
+encyclopedic content in general.
+
+###### Nested structures
+
+The final difficulty can be considered as a partial consequence of the previous
+one on the structure of articles. The difficulty of defining complex concepts
+is the very reason why authors approach their subjects from various angles,
+circumnavigating them as a best approximation. This strategy favours long,
+structured developments with sections and subsections covering the multiple
+aspects of the topic: from a historical, political, scientific point of view…
+The longest articles, such as article "Europe" shown in Figure @fig:europe, can
+thus span several dozens of pages. They can contain substructures with titles on
+at least three levels (for instance, an `a)` under a `1)` under a `I.`), each of
+which is in turn generally developed over several paragraphs.
+
+![La Grande Encyclopédie, tome 16, article "Europe", spanning from p.782 to p.846, that is 64 pages, and ending after a bibliography longer than one column of text](figure/article/LGE/europe_t16.png){#fig:europe}
+
+The nested structure that we have just evidenced of course demands a nesting
+structure to accommodate it. More precisely, it guides our search for XML
+elements by giving us several constraints. We are looking for a pair of
+elements. The first, representing a (sub)section, must be able to include both
+itself and the second element. The second has no special constraint other than
+having semantics compatible with our purpose of using it to represent section
+titles. In addition, the first element must be able to contain several `<p/>`
+elements, `<p/>` being the reference element to encode paragraphs according to
+the XML-TEI documentation.
+
+We have seen that the *dictionaries* module was equipped with a questionable but
+possible element for subject domains. However, it does not include any element
+for section titles. In the rest of the TEI specification, the elements `<head/>`
+and `<title/>` — the latter with the possibility to set its `type` attribute to
+`sub` — stand out as the best candidates for the semantics condition on the
+second element.
+
+##### Choice
+
+###### Candidates in the *dictionaries* module
+
+Filtering the content of the module to keep only the elements which can at the
+same time contain themselves, be included under `<entry/>` and include a `<p/>`
+and either the `<head/>` or `<title/>` elements yields absolutely no candidates.
+It is remarkable that even replacing the `<entry/>` element for the root of each
+article with an `<entryFree/>`, an element supposed to relax some constraints to
+accommodate more unusual structures in dictionaries, does not bring any
+improvement.
+
+The lack of results from these simple queries forces us to somewhat relax the
+constraints on the encoding we are willing to use. We can for instance make the
+assumption that an intermediate element could be needed between the element
+wrapping the whole article and the recursing one used to encode each section.
+This "section" element could also need a companion element to be able to
+include itself or, to formalise it in terms of graph theory, we could relax the
+condition that this element admits a loop and accept instead cycles of a given
+length (a small one, since this still needs to represent a fairly direct
+inclusion). We simultaneously extend the maximum depth of the inclusion paths we
+are looking for between `<entry/>`, the pair of elements and the `<p/>` element.
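+
+Both the strict query and its relaxed variant can be expressed with a single
+depth parameter, as in the sketch below. The graph is again a hypothetical toy
+example rather than the real schema, and `reaches` and `section_candidates`
+are illustrative names, not functions of an existing tool.
+

```python
# Hypothetical toy inclusion graph (a -> b means "<a/> may contain <b/>");
# illustrative only, NOT extracted from the real TEI schema.
INCLUSIONS = {
    "entry": ["sense", "dictScrap"],
    "sense": ["sense", "def"],
    "dictScrap": ["note"],
    "note": ["note", "p", "head"],
    "def": [],
    "p": [],
    "head": [],
}


def reaches(graph, source, target, max_depth):
    """True if `target` can be included under `source` through a path of at
    least one and at most `max_depth` edges."""
    frontier = {source}
    for _ in range(max_depth):
        frontier = {child for node in frontier for child in graph.get(node, [])}
        if target in frontier:
            return True
    return False


def section_candidates(graph, root, max_depth, titles=("head", "title")):
    """Elements that can nest in themselves, sit under `root`, hold a <p/> and
    hold at least one of the title elements, all within `max_depth` edges."""
    return sorted(
        name
        for name in graph
        if reaches(graph, name, name, max_depth)
        and reaches(graph, root, name, max_depth)
        and reaches(graph, name, "p", max_depth)
        and any(reaches(graph, name, title, max_depth) for title in titles)
    )


strict = section_candidates(INCLUSIONS, "entry", max_depth=1)   # direct inclusions
relaxed = section_candidates(INCLUSIONS, "entry", max_depth=3)  # intermediates allowed
```

+
+With `max_depth=1` only direct inclusions and loops are accepted, which on
+this toy graph returns no candidate. Raising the bound to 3 lets `note`
+through because it is reachable from `entry` by way of an intermediate
+element.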
+
+By setting this depth to 3, that is, by accepting one intermediate element to
+occur in the middle of each one of the inclusion paths that define the structure
+required to encode encyclopedic discourse, we find 21 elements but none of them
+stands out as an obvious good solution: all paths to include the `<p/>` element
+from any *dictionaries* element contain either a `<figure/>` (which we have
+encountered earlier when we were practising our graph approach to search for
+inclusions between `<entry/>` and `<entryFree/>` and dismissed as not useful in
+general), a `<stage/>` (reserved for stage directions in dramatic works) or a
+`<state/>` (used to describe a temporary quality in a person or place), again
+not even close to what we want. The paths to either `<head/>` or `<title/>` are
+similarly disappointing. Again, changing `<entry/>` for `<entryFree/>` returns
+the exact same candidates. Although this is not a thorough proof that none of
+these elements could fulfill our purpose, the fact that no element in this
+module appears as an obvious good solution is a serious hint to keep looking
+somewhere else.
+
+###### Widening the search
+
+We hence widen our search to include elements outside the *dictionaries* module
+which could be used to encode our sections and subsections, under the same
+constraint as before to try and find a composite solution that would remain
+under the `<entry/>` element even if resorting to subcomponents outside of the
+dedicated module. Only three elements are returned: `<figure/>`, `<metamark/>`
+and `<note/>`.
+
+The first one, as we have repeatedly underlined, is meant for graphic
+information and is not suitable for text content in general.
+
+The purpose of `<metamark/>` is to transcribe the editing marks that may appear
+on a particular primary source in order to alter the normal flow of the text and
+suggest an alternative reading (deletion, insertion, reordering: this is about a
+human editing the text on a given physical copy of it); it is unfortunately of
+no use to encode a section of an article.
+
+The only element that might at least resemble what we are looking for is the
+last one, `<note/>`. It is meant to contain text, is about explaining something
+and seems general enough (not specific to a given genre, or to the occurrence of
+a particular object on the page). Unfortunately, its semantics still seems a bit
+off compared to our need. The documentation describes it as an "additional
+comment" which appears "out of the main textual stream" whereas the long
+developments in articles are the very matter of the text of encyclopedias, not
+mere remarks in the margins or at the foot of pages.
+
+##### Implementation
+
+The above remarks explain why the *dictionaries* module is unable to represent
+encyclopedias, where the notion of "meaning" is less central than in
+dictionaries and where discourse with nested structures of arbitrary depth can
+occur. Even composite encodings using elements outside of the *dictionaries*
+module under an `<entry/>` element do not meet our requirements. Since the
+*core* module of course accommodates these structures by means of the `<div/>`,
+`<head/>` and `<p/>` elements, which have the additional advantage of carrying
+less semantic payload than `<sense/>` or `<def/>`, we devise an encoding scheme
+based on them, which we recommend to other projects aiming at representing
+encyclopedias.
+
+To remain consistent with the above remarks we will only concern ourselves with
+what happens at the level of each article, right under the `<body/>` element.
+Everything related to metadata happens as expected in the file's `<teiHeader/>`,
+which is well enough equipped to handle it. In order to present our scheme
+throughout the following section we will be progressively encoding a reference
+article, "Cathète" from tome 9 reproduced in Figure @fig:cathete-photo.
+
+![La Grande Encyclopédie, tome 9, article "Cathète"](figure/article/LGE/cathète_t9.png){#fig:cathete-photo}
+
+###### The scheme
+
+Remaining within the *core* module for the structure, almost all useful elements
+are available and our encoding scheme merely quotes the official documentation.
+Each article is represented by a `<div/>`. We suggest setting an `xml:id`
+attribute on it, whose value is the headword of the entry normalised to
+lowercase, stripped of spaces and with all non-alphanumerical characters
+replaced by a dash (`'-'`) to avoid issues with the XML encoding. The headword
+should be unique in the whole corpus, or made so by suffixing a number
+representing its rank among the various occurrences (even when there is only
+one, for the sake of regularity). Figure @fig:cathete-xml-0 illustrates this
+choice for the container element on the article "Cathète" previously displayed.
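+
+As a minimal sketch of this normalisation, the function below makes several
+assumptions that the text leaves open: "stripping spaces" is read as trimming
+surrounding whitespace, the rank is appended after a dash, and accented
+letters are kept since they count as alphanumerical. The name `article_id` is
+hypothetical and the output is not claimed to match the identifiers produced
+by `soprano`.
+

```python
def article_id(headword: str, rank: int) -> str:
    """Build an identifier from a headword and its rank among homonyms."""
    # Assumption: "stripping spaces" means trimming surrounding whitespace.
    trimmed = headword.strip().lower()
    # Every character that is neither a letter nor a digit becomes a dash.
    normalised = "".join(ch if ch.isalnum() else "-" for ch in trimmed)
    # Assumption: the rank is appended after a dash, even for unique headwords.
    return f"{normalised}-{rank}"


examples = [article_id("Cathète", 0), article_id(" SANJO Sanetomi ", 0)]
```

+
+Under these assumptions `examples` is `["cathète-0", "sanjo-sanetomi-0"]`.
+Interior spaces end up as dashes because they are not alphanumerical.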
+
+![The container `div` element for article "Cathète"](figure/article/LGE/cathète_0.png){#fig:cathete-xml-0}
+
+Inside this element should be a `<head/>` enclosing the headword of the article.
+The usual sub-`<hi/>` elements are available within `<head/>` if the headword is
+highlighted by any special typographic means such as bold, small capitals, etc.
+The one disappointment of the encoding scheme we are defining in this chapter is
+the lack of support for a proper way to encode subject indicators.
+
+The best candidate we have found so far was `<usg/>` from the *dictionaries*
+module but it is not available directly under a `<head/>` element. All inclusion
+paths from the latter to the former of length less than or equal to 3 contain
+irrelevant elements (`<cit/>`, `<figure/>`, `<castList/>` and `<nym/>`) so it
+must be discarded. The next best elements appear to be `<term/>` (not very
+accurate) and `<rs/>` ("referring string", quite general semantics but a
+possible match, since subject indicators refer to a given domain of knowledge,
+although all the examples in the documentation refer to concrete persons,
+places or objects, not to the abstract objects that mathematics or poetry are).
+
+For this reason, we do not recommend any special encoding of the subject
+indicator but leave it open to each particular context. Subject indicators are
+often abbreviated, so an `<abbr/>` may apply. In *La Grande Encyclopédie*,
+biographies are not labeled by a knowledge domain but usually include the first
+name of the person when it is known, so in that case an element like
+`<persName/>` is still appropriate. This choice applied to the same article
+"Cathète" produces Figure @fig:cathete-xml-1.
+
+![Encoding the head word of article "Cathète"](figure/article/LGE/cathète_1.png){#fig:cathete-xml-1}
+
+We then propose to wrap each different meaning in a separate `<div/>` with the
+`type` attribute set to `sense` to refer to the `<sense/>` element that would
+have been used within the *dictionaries* module. The `<div/>`s should be
+numbered according to the order they appear in with the `n` attribute, starting
+from `0` as shown in Figure @fig:cathete-xml-2.
+
+![The empty structure for the only meaning of the word "Cathète"](figure/article/LGE/cathète_2.png){#fig:cathete-xml-2}
+
+In addition, each line within the article must start with a `<lb/>` to mark its
+beginning, including before the `<head/>` element as demonstrated by Figure
+@fig:cathete-xml-3. Although this setup may seem surprising, it underlines the
+fact that in the dense layout of encyclopedias, the carriage return separating
+two articles is meaningful. Stating each new line explicitly keeps enough
+information to reconstruct a faithful facsimile but it also has the advantage of
+highlighting the fact that even though the definition is cut from the headword
+by being in a separate XML element, they still occur on the same line, which is
+a typographic choice usually made both in encyclopedias and dictionaries where
+space is at a premium.
+
+To complete the structure, the various sections and subsections occurring
+within the article body may be nested as usual with `<div/>` and sub-`<div/>`s.
+Each of them can be titled with its own `<head/>` element and filled with `<p/>`
+elements for paragraphs.
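+
+To make the overall shape of the scheme concrete, here is a hedged sketch that
+builds the skeleton of an article with Python's standard
+`xml.etree.ElementTree`. The identifier `cathete-0` and the paragraph text are
+placeholders, `article_skeleton` is a hypothetical helper, and the `<lb/>`
+elements are left out for brevity; this is not the output of `soprano`.
+

```python
import xml.etree.ElementTree as ET

# Qualified name of the xml:id attribute; ElementTree serialises it as "xml:id".
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"


def article_skeleton(identifier, headword, paragraphs):
    """Build an article <div/> holding a <head/> and one sense <div/> of <p/>s."""
    article = ET.Element("div", {XML_ID: identifier})
    ET.SubElement(article, "head").text = headword
    # The first (here, only) meaning is numbered 0, as the scheme recommends.
    sense = ET.SubElement(article, "div", {"type": "sense", "n": "0"})
    for text in paragraphs:
        ET.SubElement(sense, "p").text = text
    return article


# Placeholder content: the paragraph is NOT the real text of the article.
article = article_skeleton("cathete-0", "CATHÈTE", ["(placeholder paragraph)"])
serialised = ET.tostring(article, encoding="unicode")
```

+
+Deeper sections would be added the same way, by attaching further `div`
+elements with their own `head` under the sense `div`.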
+
+![A complete encoding of article "Cathète"](figure/article/LGE/cathète_3.png){#fig:cathete-xml-3}
+
+Some articles such as "Boumerang" have figures with captions, as illustrated by
+Figure @fig:boumerang-photo, which should be encoded the standard way by
+`<figure/>` and `<figDesc/>` as in Figure @fig:boumerang-xml.
+
+![La Grande Encyclopédie, tome 7, article "Boumerang"](figure/article/LGE/boumerang_t7.png){height=300px #fig:boumerang-photo}
+
+![Encoding the figure in article "Boumerang" and its captions](figure/article/LGE/boumerang.png){#fig:boumerang-xml}
+
+Another issue arising from giving up on `<entry/>` is the unavailability of the
+`<xr/>` element, not allowed under any of the *core* elements we use but which
+is useful to represent cross-references occurring in encyclopedias as well as in
+dictionaries, for example in article "Gelocus" (see Figure @fig:gelocus-photo).
+We prefer to use the `<ref/>` element instead which is available in the context
+of a `<p/>`. Its `target` attribute should be set to the `xml:id` of the
+article it points to, prefixed with a `'#'` as shown in Figure @fig:gelocus-xml.
+Another solution would have been to introduce a `<dictScrap/>` element for the
+sole purpose of placing an `<xr/>` but we advocate against it on account of the
+verbosity it would add to the encoding and the fact that it implicitly suggests
+that the previous context was not the one of a dictionary.
+
+![La Grande Encyclopédie, tome 18, article "Gelocus"](figure/article/LGE/gelocus_t18.png){#fig:gelocus-photo}
+
+![Encoding the cross-references in article "Gelocus"](figure/article/LGE/gelocus.png){#fig:gelocus-xml}
+
+A typical page of an encyclopedia also features peritext elements, giving
+information to the reader about the current page number along with the headwords
+of the first and last articles appearing on the page. Those can be encoded by
+`<fw/>` elements ("forme work"), whose `place` and `type` attributes should be
+set to position them on the page and identify their function if it has been
+recognised (those short elements on the border of pages are the ones typically
+prone to suffer damage or be misread by the OCR).
+
+Finally there are other TEI elements useful to represent "events" in the flow of
+the text, like the beginning of a new column of text or of a new page. Figure
+@fig:alcala-photo shows the top left of the last page of the first tome of *La
+Grande Encyclopédie* which features peritext elements while marking the
+beginning of a new page. The usual appropriate elements (`<pb/>` for page
+beginning, `<cb/>` for column beginning) may and should be used with our
+encoding scheme as demonstrated by Figure @fig:alcala-xml.
+
+![La Grande Encyclopédie, tome 1, article "Alcala-de-Hénarès"](figure/article/LGE/last_page_top_left_t1.png){width=350px #fig:alcala-photo}
+
+![Encoding the beginning of a page in article "Alcala-de-Hénarès"](figure/article/LGE/alcala.png){#fig:alcala-xml}
+
+###### Currently implemented
+
+The reference implementation for this encoding scheme is the program
+soprano[^soprano] developed within the scope of project DISCO-LGE to
+automatically identify individual articles in the flow of raw text from the
+columns and to encode them into XML-TEI files. Though this software has already
+been used to produce the first TEI version of *La Grande Encyclopédie*, it does
+not yet follow the above specification perfectly. Figure
+@fig:cathete-xml-current shows the encoded version of article "Cathète" it
+currently produces:
+
+[^soprano]: [https://gitlab.huma-num.fr/disco-lge/soprano](https://gitlab.huma-num.fr/disco-lge/soprano)
+
+![The current encoding of article "Cathète" produced by `soprano`](figure/article/LGE/cathète_current.png){#fig:cathete-xml-current}
+
+The headword detection system is not able to capture the subject indicators yet
+so they appear outside of the `<head/>` element. No work is performed either to
+expand abbreviations and encode them as such, or to distinguish between domain
+and people names.
+
+Likewise, since the detection of titles at the beginning of each section is not
+complete, no structure analysis can be performed at the moment on the textual
+development inside the article and it is left unstructured, directly under the
+entry's `<div/>` element instead of under a set of nested `<div/>` elements. The
+paragraphs are not yet identified and for this reason not encoded.
+
+However, the figures and their captions are already handled correctly when they
+occur. The encoder also keeps track of the current lines, pages, and columns and
+inserts the corresponding empty elements (`<lb/>`, `<pb/>` or `<cb/>`). It
+numbers pages so that the numbering corresponding to the physical pages is
+available, as opposed to the "high-level" page numbers inserted by the editors.
+The latter start with an offset because the first, blank or almost empty pages
+at the beginning of each book do not have a number. They sometimes also have
+gaps when a full-page geographical map is inserted, since those are printed
+separately on a different folio which remains outside of the textual numbering
+system. The place at which these layout-related elements occur is determined by
+the place where the OCR software detected them and by the reordering performed
+by `soprano` when inferring the reading order before segmenting the articles.
+
+###### The constraints of automated processing
+
+Encyclopedias are particularly long books, spanning numerous tomes and
+containing several tens of thousands of articles. The *Encyclopédie* comprises
+over 74k articles and *La Grande Encyclopédie* certainly more than 100k (the latest
+version produced by `soprano` created 160k articles, but their segmentation is
+still not perfect: although some article beginnings remain undetected, the very
+long and deeply-structured articles are unduly split into many parts, resulting
+overall in an overestimate of the total number).
+
+XML-TEI is a very broad tool useful for very different applications. Some
+elements like `<unclear/>` or `<factuality/>` can encode subtle semantic
+information (for the second one, adjacent to a notion as elusive as truth)
+which requires a very deep understanding of a text in its entirety and about
+which even some human experts may disagree.
+
+For these reasons, a central concern in the design of our encoding scheme was to
+remain within the boundaries of information that can be described objectively
+and extracted automatically by an algorithm. Most of the tags presented above
+contain information about the positions of the elements or their relation to one
+another. Those with an additional semantic implication like `<head/>` can be
+inferred simply from their position and the frequent use of a special typography
+like bold or upper-case characters.
+
+The case of cross-references is particular and may appear as a counter-example
+to the main principle on which our scheme is based. Actually, the process of
+linking from an article to another one is so frequent (in dictionaries as well
+as in encyclopedias) that it generally escapes the scope of regular discourse to
+take a special and often fixed form, inside parentheses and after a special
+token which invites the reader to perform the redirection. In *La Grande
+Encyclopédie*, virtually all the redirections appear within parentheses and
+start with the verb "voir" abbreviated as a single, capital "V." as illustrated
+above in the article "Gelocus" (see again Figure @fig:gelocus-photo). To the
+extent of our knowledge this holds for all of them: special cases may of course
+exist, but they are rare enough that we have not found any yet.
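+
+A fixed surface form such as this one lends itself to a regular expression.
+The sketch below assumes the redirection is written exactly as an opening
+parenthesis, "V.", some whitespace, the referenced words and a closing
+parenthesis; the real typography may vary (OCR noise, several references in
+one parenthesis) and the sample sentence is made up, not quoted from the
+encyclopedia.
+

```python
import re

# Assumed surface form: "(V. Headword)"; the exact punctuation is a guess.
CROSS_REFERENCE = re.compile(r"\(V\.\s+([^()]+?)\)")


def find_cross_references(text):
    """Return the strings following "V." in each parenthesised redirection."""
    return [match.strip() for match in CROSS_REFERENCE.findall(text)]


# Made-up sentence for illustration, not a quotation from the encyclopedia.
sample = "Genre de ruminants (V. Cerf). Texte (voir plus loin) et renvoi (V. Chevrotain)."
found = find_cross_references(sample)
```

+
+Here `found` is `["Cerf", "Chevrotain"]`: the ordinary parenthesis
+"(voir plus loin)" is ignored because it does not start with the capital
+abbreviation.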
+
+Although this has not been implemented yet either, we hope to be able to detect
+and exploit those patterns to correctly encode cross-references. Getting the
+`target` attributes right is certainly more difficult to achieve and may require
+processing the articles in several steps, to first discover all the existing
+headwords — and hence article IDs — before trying to match the words following
+"V." with them. Since our automated encoder handles tomes separately and since
+references may cross the boundaries of tomes, it cannot wait for the target of a
+cross-reference to be discovered by keeping the articles in memory before
+outputting them.
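+
+The two-step processing suggested above could be sketched as follows. The
+data layout (tomes as lists of `(article_id, headword, references)` tuples),
+the case-insensitive matching and the function name `resolve_targets` are all
+assumptions made for the example; nothing of the sort is implemented in
+`soprano` yet.
+

```python
def resolve_targets(tomes):
    """Two-pass resolution. `tomes` is a list of tomes, each a list of
    (article_id, headword, referenced_headwords) tuples (a made-up layout)."""
    # Pass 1: index every headword of every tome before resolving anything.
    index = {}
    for tome in tomes:
        for article_id, headword, _ in tome:
            # Keep the first article found for a given headword.
            index.setdefault(headword.lower(), article_id)
    # Pass 2: turn each referenced headword into a '#id' target, or None.
    resolved = {}
    for tome in tomes:
        for article_id, _, references in tome:
            resolved[article_id] = [
                "#" + index[ref.lower()] if ref.lower() in index else None
                for ref in references
            ]
    return resolved


# Made-up data: the reference points to an article of a later tome.
tomes = [
    [("gelocus-0", "Gelocus", ["Cerf", "Inconnu"])],
    [("cerf-0", "Cerf", [])],
]
targets = resolve_targets(tomes)
```

+
+Because the index is complete before the second pass starts, the reference
+from "Gelocus" resolves to `#cerf-0` although its target lies in another
+tome, while the unknown headword yields `None`.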
+
+This is in line with the last important aspect of our encoder. Although many
+lexicographers may deem our encoding too shallow, it has the advantage of not
+requiring complex data structures to be kept in memory for a long time. The
+algorithm implementing it in `soprano` outputs elements as soon as it can, for
+instance the empty elements already discussed above. For articles, it pushes
+lines onto a stack and flushes it each time it encounters the beginning of the
+following article. This allows the amount of memory required to remain
+reasonable and even lets tomes be processed in parallel on most modern machines.
+Thus, even though each tome takes over three minutes, the total processing time
+for the whole of *La Grande Encyclopédie* can be lowered to around forty minutes
+on a machine with 16 GB of RAM, instead of over one hour and a half.
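+
+The push-and-flush strategy can be illustrated by a short generator. This is
+a sketch in Python of the general idea only, not the algorithm as implemented
+in `soprano`: `segment_articles` and the crude `looks_like_headword` test
+(first word entirely in upper case) are hypothetical stand-ins, and the
+sample lines are made up.
+

```python
def segment_articles(lines, is_article_start):
    """Yield articles (lists of lines), flushing the buffer at each new start."""
    stack = []
    for line in lines:
        if is_article_start(line) and stack:
            yield stack  # the previous article is complete: output it
            stack = []
        stack.append(line)
    if stack:
        yield stack  # flush the last article at the end of the tome


def looks_like_headword(line):
    """Crude stand-in for headword detection: first word in upper case."""
    words = line.split()
    return bool(words) and words[0].isupper()


# Made-up lines for illustration, not the real text of the articles.
lines = ["CATHÈTE. Ligne droite", "suite du texte", "CATHÉTER. Instrument", "fin"]
articles = list(segment_articles(lines, looks_like_headword))
```

+
+Only the lines of the article being read are held in memory at any time. As
+each tome is processed independently, several tomes could be dispatched to
+worker processes, for instance with `concurrent.futures.ProcessPoolExecutor`
+from the Python standard library.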
+
+
--- a/Corpus/text.sh
+++ b/Corpus/text.sh
+#!/bin/sh
+
+source ./chapter.sh 'Préparation et enrichissement du corpus'
+
+cat Corpus/Formats_et_états.md
+cat Corpus/Domaines.md
+cat Corpus/Annotation.md
--- a/Glossaire/OCR.md
+++ b/Glossaire/OCR.md
+OCR
+
+:	*Optical Character Recognition*, reconnaissance optique de caractères, est
+le procédé par lequel un logiciel extrait du texte, c'est à dire une suite de
+caractères compréhensibles par la machine et traitables ensuite par des moyens
+automatiques, à partir d'une image.
+
--- a/Glossaire.md
+++ b/Glossaire.md
-# Glossaire {-}
-
-OCR
-
-:	*Optical Character Recognition*, reconnaissance optique de caractères, est
-le procédé par lequel un logiciel extrait du texte, c'est à dire une suite de
-caractères compréhensibles par la machine et traitables ensuite par des moyens
-automatiques, à partir d'une image.
-
 OLR

 :	*Optical Layout Recognition*, reconnaissance optique de la disposition de la

--- a/Glossaire/text.sh
+++ b/Glossaire/text.sh
+#!/bin/sh
+
+[ -n "${HEADER_INCLUDED}" ] || source ./header.sh 2
+
+echo '# Glossaire {-}'
+
+cat Glossaire/OCR.md
+cat Glossaire/OLR.md
--- a/Géographie/Contours.md
+++ b/Géographie/Contours.md
+## Tracer le contours de la géographie
+
+### Établir une correspondance
+
+Empiriquement:
+    + avec TXM, Lexicoscope, etc., est-ce qu'on voit les mêmes propriétés
+    + machine learning
+
+### La biographie cachée
+
+
--- a/Géographie/ENE.md
+++ b/Géographie/ENE.md
+## Entités Nommées Étendues
+
+Intro Édl'A: c'est coûteux d'annoter @ortiz_suarez_data-driven_2022
+
+### Travaux sur les GNNs
+
+Qu'est-ce qu'on en a retiré ?
+
+
--- a/Géographie.md
+++ b/Géographie.md
-# Identifier et problématiser la géographie
-
-## Relation entre spatial et géographique
-
-> questionnement d'une frontière même
-
-(structuration de la géographie)
-
-## Tracer le contours de la géographie
-
-### Établir une correspondance
-
-Empiriquement:
-    + avec TXM, Lexicoscope, etc., est-ce qu'on voit les mêmes propriétés
-    + machine learning
-
-### La biographie cachée
-
-## Variété des genres discursifs au sein des articles
-
-
 ## Relations entre les domaines de connaissances

 ### Erreurs de classification
@@ -735,11 +714,4 @@ differences we have underlined show that size alone cannot explain their
 distribution in detail. The model does seem to identify some classes
 more easily because of distinctive lexical patterns.

-## Entités Nommées Étendues
-
-Intro Édl'A: c'est coûteux d'annoter @ortiz_suarez_data-driven_2022
-
-### Travaux sur les GNNs
-
-Qu'est-ce qu'on en a retiré ?

--- a/Géographie/Spatial_et_géographie.md
+++ b/Géographie/Spatial_et_géographie.md
+## Relation entre spatial et géographique
+
+-> questionnement d'une frontière même
+
+(structuration de la géographie)
+
+
--- a/Géographie/Variété_des_genres_discursifs.md
+++ b/Géographie/Variété_des_genres_discursifs.md
+## Variété des genres discursifs au sein des articles
+
+
+