diff --git a/Bibliographie.md b/Bibliographie.md
deleted file mode 100644
index b87ff7921d254227b9893be02a7b3f06ceec9b55..0000000000000000000000000000000000000000
--- a/Bibliographie.md
+++ /dev/null
@@ -1 +0,0 @@
-# Bibliography
diff --git a/Conclusion.md b/Conclusion.md
deleted file mode 100644
index 7d9035eb20b43744c18e25c5892274f2fdb8252f..0000000000000000000000000000000000000000
--- a/Conclusion.md
+++ /dev/null
@@ -1,5 +0,0 @@
-# Conclusion {-}
-
-## Regrets
-
-## Souhaits
diff --git a/Conclusion/Regrets.md b/Conclusion/Regrets.md
new file mode 100644
index 0000000000000000000000000000000000000000..937b53d7d56191395d49431b11f9a6ff7f91bbc2
--- /dev/null
+++ b/Conclusion/Regrets.md
@@ -0,0 +1,3 @@
+## Regrets
+
+
diff --git a/Conclusion/Souhaits.md b/Conclusion/Souhaits.md
new file mode 100644
index 0000000000000000000000000000000000000000..f65313cc8c3f75ff70ebf890ddf4c6a2f85bb527
--- /dev/null
+++ b/Conclusion/Souhaits.md
@@ -0,0 +1,2 @@
+## Souhaits
+
diff --git a/Conclusion/text.sh b/Conclusion/text.sh
new file mode 100755
index 0000000000000000000000000000000000000000..795bfb5fc6a46469fb6ca41a8c79ead71568832b
--- /dev/null
+++ b/Conclusion/text.sh
@@ -0,0 +1,6 @@
+#!/bin/sh
+
+. ./chapter.sh 'Conclusion {-}'
+
+cat Conclusion/Regrets.md
+cat Conclusion/Souhaits.md
diff --git "a/Contrastes/Centralit\303\251.md" "b/Contrastes/Centralit\303\251.md"
new file mode 100644
index 0000000000000000000000000000000000000000..52206935d34835daab5e3d2d2b9e0f8738591919
--- /dev/null
+++ "b/Contrastes/Centralit\303\251.md"
@@ -0,0 +1,7 @@
+## Statistiques
+
+### Mesure de centralité
+
+(DKE)
+
+
diff --git a/Contrastes.md "b/Contrastes/Lexicom\303\251trie.md"
similarity index 63%
rename from Contrastes.md
rename to "Contrastes/Lexicom\303\251trie.md"
index b0c0d1c3dc143ab6554f878b7c4e28c656e8fed3..1338f67480e7986369542ea6c493fbbf9f6b83f2 100644
--- a/Contrastes.md
+++ "b/Contrastes/Lexicom\303\251trie.md"
@@ -1,6 +1,3 @@
-
-# Études contrastives
-
 ## Analyse lexico-grammaticale (Lexicométrie, Textométrique, ?…)
 
 ### Contrastes Internes
@@ -19,11 +16,4 @@ Np vs. Nc
 
 #### Adjectifs préférés
 
-## Phraséologie, discours disciplinaires et Arbres Lexico-syntaxiques Récurrents
-
-## Statistiques
-
-### Mesure de centralité
-
-(DKE)
 
diff --git "a/Contrastes/Phras\303\251ologie.md" "b/Contrastes/Phras\303\251ologie.md"
new file mode 100644
index 0000000000000000000000000000000000000000..6f1f4c2391b76c241cf8dc9a9291fe9eed0e1d7d
--- /dev/null
+++ "b/Contrastes/Phras\303\251ologie.md"
@@ -0,0 +1,3 @@
+## Phraséologie, discours disciplinaires et Arbres Lexico-syntaxiques Récurrents
+
+
diff --git a/Contrastes/text.sh b/Contrastes/text.sh
new file mode 100755
index 0000000000000000000000000000000000000000..b688ad4fbb72bcbd00aebb622143d7b7837782dc
--- /dev/null
+++ b/Contrastes/text.sh
@@ -0,0 +1,7 @@
+#!/bin/sh
+
+. ./chapter.sh 'Études contrastives'
+
+cat Contrastes/Lexicométrie.md
+cat Contrastes/Phraséologie.md
+cat Contrastes/Centralité.md
diff --git a/Corpus/Annotation.md b/Corpus/Annotation.md
new file mode 100644
index 0000000000000000000000000000000000000000..a4ed6cebca3a0d92445d78e8c1449cbe30ec4dcf
--- /dev/null
+++ b/Corpus/Annotation.md
@@ -0,0 +1,17 @@
+## Annotation en parties de discours et syntaxe
+
+### Jeu d'étiquettes
+
+We use the [tag set]() of the
+[PRESTO](http://presto.ens-lyon.fr/) project.
+
+On second thought, Stanza works well too, with the
+[UPOS](https://universaldependencies.org/docs/u/pos/) tags.
+
+### Chaînes de traitement
+
+- PRESTO
+- Stanza
+
+
+
diff --git a/Corpus.md b/Corpus/Domaines.md
similarity index 50%
rename from Corpus.md
rename to Corpus/Domaines.md
index b36424d96699ab037a1aff74c5a9a6688cf300f8..ad3226dba65f4207dcc1915b567054d8068451e9 100644
--- a/Corpus.md
+++ b/Corpus/Domaines.md
@@ -1,726 +1,3 @@
-# Préparation et enrichissement du corpus
-
-## Formats et états des textes
-
-### L'Encyclopédie
-
-In common parlance, the terms "dictionaries" and "encyclopedias" are used as
-near synonyms to refer to books compiling vast amounts of knowledge into lists
-of definitions ordered alphabetically. Their similarity is even visible in the
-way they are coordinated in the full title of the *Encyclopédie* which is
-probably the most famous work of the genre and a symbol of the Age of
-Enlightenment. If the word "encyclopedia" is nowadays part of our vocabulary, it
-was much more unusual and in fact controversial when Diderot and d'Alembert
-decided to use it in the title of their book.
-
-The definition given by Furetière in his *Dictionnaire Universel* in 1690 is
-still close to its Greek etymology: a "ring of all knowledge", from *κύκλος*,
-"circle", and *παιδεία*, "knowledge". This meaning is the one used for instance
-by Rabelais in *Pantagruel*, when he has Thaumaste declare that Panurge opened
-to him "le vray puys et abisme de Encyclopedie" ("the true well and abyss of
-Encyclopedia"). At the time the word still mostly refers to the abstract concept
-of mastering all knowledge at once. Furetière adds that it is a quality one
-is unlikely to possess, and even seems to condemn its pursuit as a form of
-hubris: "C'est une témérité à un homme de vouloir posséder l'Encyclopédie"
-("it is reckless for a man to want to possess Encyclopedia").
-
-Beyond this moral reproach, the concept that pleased Rabelais was somewhat dated
-at the end of the 17^th^ century and attacked in the
-*Dictionnaire Universel François et Latin*, commonly referred to as the
-*Dictionnaire de Trevoux*, as utterly "burlesque" ("parodic"). The entry for
-"Encyclopédie" remained unchanged in the four editions issued between 1721 and
-1752, mocking the use of the word and discouraging readers from pursuing it. To
-that end, it quotes a poem by Pibrac encouraging people to specialise in
-only one discipline lest they fail to reach perfection, an
-argument that resembles the saying "Jack of all trades, master of none". It
-is all the more interesting that the definition remains unaltered until 1752,
-one year after the publication of the first volume of the *Encyclopédie*. The
-Jesuits who edited the *Dictionnaire de Trevoux* frowned upon the project of the
-*Encyclopédie* which they managed to get banned the same year by the Council of
-State on the charge of attempting to destroy the royal authority, inspiring
-rebellion and corrupting morality in general. There is much more at stake than
-words here, but the attempt to deprecate the word itself is part of their fight
-against the philosophers of the Enlightenment.
-
-The attacks do not go unanswered by Diderot, who starts the very definition of
-the word "Encyclopédie" in the *Encyclopédie* itself with a strong rebuttal. He
-directly dismisses the concerns expressed in the *Dictionnaire de Trevoux* as
-mere self-doubt that their authors should not generalise to everyone, then leaves
-the main point to a Latin quote by Chancellor Bacon [@lojkine2013], who argues
-that a collaborative work can achieve much more than any talented man could:
-what could not possibly be within reach of a single man within a single
-lifetime may be achieved by a common effort throughout generations.
-
-History hints that Diderot's opponents took his defence of the feasibility of
-the project quite seriously, considering the fact that they got the
-*Encyclopédie*'s privileges revoked again six years after its publication
-was resumed [@moureau2001]. As a consequence, the remaining ten volumes
-containing the text of the articles had to be published illegally until 1765,
-thanks to the secret protection of Malesherbes who — despite being head of royal
-censorship — saved the manuscripts from destruction. They were printed secretly
-outside of Paris and the books were (falsely) labeled as coming from Neufchâtel.
-Following the high demand from the booksellers who feared they would lose the
-money they had invested in the project, a special privilege was issued for the
-volumes containing the plates, which were released publicly from 1762 to 1772.
-
-In any case, in their last edition in 1771 the authors of the *Dictionnaire de
-Trevoux* had no choice but to acknowledge the success of the encyclopedic
-projects of the 18^th^ century. In this version, the definition
-was entirely reworked, mildly stating that good encyclopedias are difficult to
-make because of the amount of knowledge necessary and work needed to keep up
-with scientific progress instead of calling the effort a parody. It credits
-Chambers's *Cyclopædia* for being a decent attempt before referring anonymously
-though quite explicitly to Diderot and d'Alembert's project by naming the
-collective "Une Société de gens de Lettres" and writing that it started in 1751.
-Even more importantly, two new entries were added after it: one for the
-adjective "encyclopédique" and another one for the noun "encyclopédiste",
-silently admitting how the project had changed its time and the relation to
-knowledge itself.
-
-#### Contexte de l'œuvre
-
-#### Versions disponibles
-
-ARTFL[^ARTFL] offers a version of it.
-
-[^ARTFL]: [http://artfl-project.uchicago.edu/](http://artfl-project.uchicago.edu/)
-
-#### Traitements
-
-### La Grande Encyclopédie
-
-#### Contexte de l'œuvre
-
-*La Grande Encyclopédie, Inventaire raisonné des Sciences, des Lettres et des
-Arts par une Société de savants et de gens de lettres* (hereafter *LGE*) was
-published in France between 1885 and 1902 by a team of more than two hundred
-specialists organised into eleven sections. The text comprises 31 volumes of
-about 1200 pages each and was, according to @jacquet_pfau2015, the last major
-French encyclopedic undertaking to follow in the footsteps of its prestigious
-ancestor, the *EDdA*, published some 130 years earlier.
-
-The full title of the work already shows its intention to claim descent from
-the *EDdA* and to bring it up to date [@jacquet_pfau_actualiser_2022].
-
-#### Versions disponibles
-
-A digital version of this work was produced by the BnF and put
-online[^LGE-V1] in 2007. Based on an undated reprint of the original
-edition, it comprises one image per page of the work, scanned in greyscale
-at a resolution of 300x300 pixels per inch. From these image files, a partial
-version of the text was extracted by applying an optical character
-recognition program ([@=OCR]). This version has a
-number of limitations that prevented a full study of the text
-by automatic means such as textometry.
-
-[^LGE-V1]: [http://catalogue.bnf.fr/ark:/12148/cb377013071](http://catalogue.bnf.fr/ark:/12148/cb377013071)
-
-First, the text is incomplete: although many volumes have been OCRed,
-some, such as volumes 5 and 18, have not been processed and no
-text is available for these volumes on the Gallica
-website[^LGE-V1-liste-des-tomes]. This makes an exhaustive study impossible,
-but nothing suggests that a study based on the available volumes would suffer
-from any particular bias, since the OCRed volumes do not seem to have been
-chosen according to a precise logic: the missing ones are not, for example,
-contiguous, nor at the beginning or the end of the work. Second, this "plain
-text" version (in reality an HTML page with minimal markup) carries only
-very superficial annotation and, in particular, is not segmented into articles.
-This is a major obstacle to the use we want to make of it, since
-the unit of study of our corpus is the article, which makes it possible both to
-carry out a contrastive study by grouping articles by domain of knowledge or
-by author and to observe the structure of domains by comparing, between two
-encyclopedias, which articles were kept or not and, where applicable, whether
-the domain of knowledge associated with them is the same. Finally, errors
-in the detection of the page layout ([@=OLR]) significantly
-obscure the text by introducing local permutations of its content,
-which sometimes mix fragments of articles together (considerably
-complicating the segmentation of the text into articles) and in all cases
-damage the structure of sentences, introducing errors
-into the later stages of morpho-syntactic and syntactic annotation that
-we need to apply to the text in order to do textometry.
-
-[^LGE-V1-liste-des-tomes]: [https://gallica.bnf.fr/ark:/12148/bpt6k246407#](https://gallica.bnf.fr/ark:/12148/bpt6k246407#)
-
-To remedy these shortcomings, the CollEx Persée project
-DISCO-LGE[^DISCO-LGE] undertook to redigitise this encyclopedia in
-partnership with the BnF and to produce a version encoded in XML-TEI. This
-new version was made from photographs of an original
-copy[^LGE-V2] held at the Bibliothèque de l'Arsenal in Paris[^bib-arsenal].
-
-[^DISCO-LGE]: [https://www.collexpersee.eu/projet/disco-lge/](https://www.collexpersee.eu/projet/disco-lge/)
-[^LGE-V2]: [https://catalogue.bnf.fr/ark:/12148/cb41651490t](https://catalogue.bnf.fr/ark:/12148/cb41651490t)
-[^bib-arsenal]: [https://www.bnf.fr/fr/arsenal](https://www.bnf.fr/fr/arsenal)
-
-This project ended in August 2021 and resulted in the publication on Nakala[^nakala],
-the data repository of the Très Grande Infrastructure de Recherche Huma-Num,
-of a new version of the work in several formats.
-
-[^nakala]: [https://nakala.fr/](https://nakala.fr/)
-
-#### Encodage
-
-##### Structure du module *dictionaries*
-
-**Definitions**
-
-By iterating several times the operation of moving on that graph along one edge,
-that is, by considering the transitive closure of the relation "be connected by
-an edge" we define *inclusion paths* which allow us to explore which elements
-may be nested under which other.
-
-The nodes visited along the way represent the intermediate XML elements to
-construct a valid XML tree according to the TEI schema. Given the top-down
-semantics of those trees, we call the length of an inclusion path its *depth*.
-
-The ability for an element to contain itself corresponds directly to loops on
-the graph (that is an edge from a node to itself) as can be illustrated by the
-`<abbr/>` element: an `<abbr/>` element (abbreviation) can directly contain
-another one.
-
-The generalisation of this to inclusion paths of any length greater than one is
-usually called a cycle and we may be tempted in our context to refine this and
-name them *inclusion cycles*. The `<address/>` element provides us with an
-example for this configuration: although an `<address/>` element may not
-directly contain another one, it may contain a `<geogName/>` which, in turn, may
-contain a new `<address/>` element. From a graph theory perspective, we can say
-that it admits an inclusion cycle of length two.
-
-**Applications**
-
-Using classical, well-known methods such as Dijkstra's algorithm [@dijkstra59]
-allows us to explore the shortest inclusion paths that exist between elements.
-Though particular caution should be applied because there is no guarantee that
-the shortest path is meaningful in general, it at least provides us with an
-efficient way to check whether a given element may or may not be nested at all
-under another one and gives a lower bound on the length of the path to expect.
-Of course the accuracy of this heuristic decreases as the length of the
-intended, meaningful path between two nodes, the one that a human specialist
-of the TEI framework would build, increases.
-
-This is still very useful when taking into account the fact that TEI modules are
-merely "bags" to group the elements and provide hints to human encoders about
-the tools they might need but have no implication on the inclusion paths between
-elements which cross module boundaries freely. The general graph formalism
-enables us to describe complex filtering patterns and to implement queries to
-look for them among the elements exhaustively by algorithmic means even when the
-shortest-path approach is not enough.
-
-For instance, it lets one find that although `<pos/>` may not be directly
-included within `<entry/>` elements to include information about the
-part-of-speech of the word that an article defines, the correct way to do so is
-through a `<form/>` or a `<gramGrp/>`.
-
-On the other hand, trying to discover the shortest inclusion path to `<pos/>`
-from the `<TEI/>` root of the document yields a `<standOff/>`, an element
-dedicated to store contextual data that accompanies but is not part of the text,
-not unlike an annex, and widely unrelated to the context of encoding an
-encyclopedia.
-
-A last relevant example on the use of these methods can be given by querying the
-shortest inclusion path of a `<pos/>` under the `<body/>` of the document: it
-yields an inclusion directly through `<entryFree/>` (with an inclusion path of
-length 2), which unlike `<entry/>` accepts it as a direct child node. Possibly
-not what we want depending on the regularity of the articles we are encoding and
-the occurrence of other grammatical information such as `<case/>` or `<gen/>` to
-justify the use of the `<gramGrp/>`, but searching exhaustively for paths up to
-length 3 returns as expected the path through `<entry/>`, among others. Overall,
-we get a good general idea: `<pos/>` does not need to be nested very deep, it
-can appear quite near the "surface" of article entries.
-
-##### Limites
-
-###### The `<entry/>` element
-
-The central element of the *dictionaries* module is the `<entry/>` element meant
-to encode one single entry in a dictionary, that is to say a head word
-associated to its definition. It is the natural way in from the `<body/>`
-element to the dictionary module: indeed, although `<body/>` may also contain
-`<entryFree/>` or `<superEntry/>` elements, the former is a relaxed version of
-`<entry/>` while the latter is a device to group several related entries
-together. Both can contain an `<entry/>` directly while no obvious inclusion
-exists the other way around: most (> 96.2%) of the inclusion paths of
-"reasonable" depth (which we define as strictly less than 5, that is twice the
-average shortest depth between any two nodes) include either `<figure/>` or
-`<castList/>`, two very specific elements which should not need to appear in an
-article in general, showing that the purpose of `<entry/>` is not to contain an
-`<entryFree/>` or `<superEntry/>`. Hence, not only the semantics conveyed by the
-documentation but also the structure of the elements graph point to `<entry/>`
-as the natural top-most element for an article. This somewhat contrived example
-is meant to further demonstrate the application of a graph-centred approach to
-understanding the inner workings of the XML-TEI schema.
-
-###### Information about the headword itself
-
-Once a block for an article is created, it may contain elements useful to
-represent several of its features. Its written and spoken forms are usually
-encoded by `<form/>` elements. Grammatical information like the `<case/>`,
-`<gen/>` or `<number/>` and `<pers/>` can be contained within a `<gramGrp/>`,
-along with information about the categories it belongs to like `<iType/>` for
-its inflection class in languages with a declension system or `<pos/>` for its
-part-of-speech. The `<etym/>` element is made to hold the etymology of an entry.
-In the case when there are alternative spellings in varieties of the language or
-if the spelling has changed over time, `<usg/>` can be used.
-
-These examples are by no means an exhaustive list; the complete set provides
-the encoder with a toolbox to describe all the information related to the form
-under which the entry is found and seems general enough to accommodate the
-structure of any book indexing entries by words.
-
-###### Cross-references
-
-A common feature shared by dictionaries and encyclopedias is the ability to
-connect entries together by using a word or short phrase as the link, referring
-the reader to the related concept. These are known as cross-references and can
-appear either when the definition of a term is adjacent to another one or to
-catch alternative spellings where some readers might expect to find the word and
-redirect them to the form chosen as the reference. In XML-TEI, this is done with
-the `<xr/>` element. It usually contains the whole phrase performing the
-redirection, with an imperative locution like "please see […]".
-
-The "active" part of the cross-reference, that is the very word within the
-`<xr/>` that is considered to be the link or, to make a modern-day HTML
-metaphor, the region that would be clickable, is represented by a `<ref/>`
-element. Though it is not specific to the *dictionaries* module, we include it
-in this description of the toolbox because it is particularly useful in the
-context of dictionaries. This element may have a target attribute which points
-to the other resource to be accessed by the interested reader.
-
-###### Definitions
-
-The remaining part of entries is also usually the largest and represents the
-content associated to the headword by the entry. In a dictionary, that is its
-meaning.
-
-The `<sense/>` element is a valid child for `<entry/>` and groups together a
-definition of the term with `<def/>`, usage examples with `<usg/>` (another use
-of this versatile element) and other high-level information such as translations
-in other languages. Both `<def/>` and `<usg/>` elements may appear directly
-under the `<entry/>`.
-
-###### Structural remarks
-
-Before concluding this description of the *dictionaries* module from the
-perspective of someone trying to concretely encode a particular dictionary or
-encyclopedia, we make use of the graph approach again to highlight some of its
-aspects in terms of inclusion structure.
-
-First, it is remarkable that all elements in the *dictionaries* module have a
-cyclic inclusion path, that is to say, there is an inclusion path from each
-element of this module to itself. Although having such a cycle is a widespread
-property in the remainder of XML-TEI elements shared by 73.8% of them (411 out
-of the 557 elements in the other modules), all 33 elements of the *dictionaries*
-module having one is far above this average. In addition, the cycles appear to
-be rather short, with an average length of 2.00 versus 2.50 in the rest of the
-population. This observation is all the more surprising considering the fact
-that the *dictionaries* module contains short "leaf" elements like `<pos/>`
-which should not obviously need to admit cycles since one rather expects them to
-contain only one word, like `<pos>adj</pos>` in the example given in the
-official documentation. Among those (shortest) cycles, 20 include the `<cit/>`
-element made to group quotations with a bibliographic reference to their source
-which should clearly be unnecessary to encode an article in the general case.
-
-Secondly, although we have seen examples of connections from this module to the
-rest of the XML-TEI, especially to the *core* module (see the case of the
-`<ref/>` element above), the *dictionaries* module appears somewhat isolated
-from important structural elements like `<head/>` or `<div/>`. Indeed, computing
-all the paths from either `<entry/>` or `<sense/>` elements to the latter of
-length at most 5 by a systematic traversal of the graph yields
-exclusively paths (respectively 9042 and 39093 of them) containing either a
-`<floatingText/>` or an `<app/>` element. The first one, as its name aptly
-suggests, is used to encode text that does not quite fit the regular flow of the
-document, as for example in the context of an embedded narrative. Both examples
-displayed in the online documentation feature a `<body/>` as direct child of
-`<floatingText/>`, neatly separating its content as independent. The purpose of
-the second one, although its name — short for apparatus — is less clear, is to
-wrap together several versions of the same excerpts, for instance when there are
-several possible readings of an unclear group of words in a manuscript, or when
-the encoder is trying to compile a single version of a piece of work from
-several sources which disagree over some passage. In both cases, it appears
-obvious that it is not something that is expected to occur naturally in the
-course of an article in general.
-
-Thus, despite a rather dense internal connectivity, the *dictionaries* module
-fails to provide encoders with a device to represent recursively nesting
-structures like `<div/>`.
-
-The situation regarding subject indicators is hardly better outside of the
-module. The `<domain/>` element despite its name belongs exclusively in the
-header of a document and focuses on the social context of the text, not on the
-knowledge area it covers. The `<interp/>` element, despite its name, is not so
-much about labeling something as an interpretation to give to a context (which
-subject indicators could be if you consider that, placed at the beginning, they
-are used to direct the mind frame of the readers towards a particular subject).
-Instead, the documentation clearly presents it as a tool for annotators of a
-document, whose text content is not part of the original document but some
-additional result of an analysis performed in the context of the encoding, used
-only through references in XML attributes.
-
-This point, although not the most concerning, still remains the hardest to
-address but all things considered the `<usg/>` element stands out as the most
-relevant.
-
-###### The notion of meaning
-
-Notwithstanding the correct way to represent domains of knowledge, their extent
-itself raises concerns regarding the *dictionaries* module. Indeed, among the
-vast collection of domains covered in encyclopedias in general and in *La Grande
-Encyclopédie* in particular are historical articles and biographies. If the
-notion of meaning can appear at least ill-fitting for a text describing a series
-of historical events, one may still argue that it groups them into a concept and
-associates it to the name of the event. But when it comes to relating the life
-of a person, describing their relation to events and other persons comes out
-even further from the notion of meaning. Entries such as the one about SANJO
-Sanetomi (see Figure @fig:sanjo) do not constitute a *definition*.
-
-![Beginning of the article relating the life of SANJO Sanetomi, in La Grande Encyclopédie, tome 29](figure/article/LGE/sanjo_t29.png){#fig:sanjo}
-
-Moreover, encyclopedias, because of all that they have inherited from the
-philosophical Enlightenment, are not only spaces designed to assert, they also
-intrinsically include an interrogative component. Some articles lay down the
-basis required to understand the complexity of an issue and invite the reader to
-consider it without providing a definitive answer, going as far as to explicitly
-use question marks as in the article "Action" displayed in Figure @fig:action.
-
-![Excerpt from article "Action", in La Grande Encyclopédie, tome 1](figure/article/LGE/action_t1.png){#fig:action}
-
-In this extract, the author devises a hypothetical situation to illustrate how
-difficult it is to draw the line between two supposedly mutually exclusive
-subcategories of legal actions. The whole point of the passage is to convey the
-idea that the term eludes definition; wrapping it in a `<sense/>`, or worse, a
-`<def/>` element would be an utter misnomer.
-
-As a result, the use of `<sense/>` and `<def/>` is not appropriate for
-encyclopedic content in general.
-
-###### Nested structures
-
-The final difficulty can be considered as a partial consequence of the previous
-one on the structure of articles. The difficulty to define complex concepts is
-the very reason why authors approach their subjects from various angles,
-circumnavigating it as a best approximation. This strategy favours long,
-structured developments with sections and subsections covering the multiple
-aspects of the topic: from a historical, political, scientific point of view…
-The longest articles, such as article "Europe" shown in Figure @fig:europe, can
-thus span several dozens of pages. They can contain substructures with titles on
-at least three levels (for instance, an `a)` under a `1)` under a `I.`), each of
-which is in turn generally developed over several paragraphs.
-
-![La Grande Encyclopédie, tome 16, article "Europe", spanning from p.782 to p.846, that is 64 pages, and ending after a bibliography longer than one column of text](figure/article/LGE/europe_t16.png){#fig:europe}
-
-The nested structure that we have just highlighted of course demands a nesting
-structure to accommodate it. More precisely it guides our search for XML elements
-by giving us several constraints: we are looking for a pair of elements, the
-first representing a (sub)section must be able to include both itself and the
-second element, which does not have any special constraint except the one to
-have a semantics compatible with our purpose of using it to represent section
-titles. In addition, the first element must be able to contain several `<p/>`
-elements, `<p/>` being the reference element to encode paragraphs according to
-the XML-TEI documentation.
-
-We have seen that the *dictionaries* module was equipped with a questionable but
-possible element for subject domains. However, it does not include any element
-for section titles. In the rest of the TEI specification, the elements `<head/>`
-and `<title/>` — the latter with the possibility to set its `type` attribute to
-`sub` — stand out as the best candidates for the semantics condition on the
-second element.
-
-##### Choix
-
-###### Candidates in the *dictionaries* module
-
-Filtering the content of the module to keep only the elements which can at the
-same time contain themselves, be included under `<entry/>` and include a `<p/>`
-and either the `<head/>` or `<title/>` elements yields absolutely no candidates.
-It is remarkable that even replacing the `<entry/>` element for the root of each
-article with an `<entryFree/>`, an element supposed to relax some constraints to
-accommodate more unusual structures in dictionaries, does not bring any
-improvement.
-
-The lack of results from these simple queries forces us to somewhat relax the
-constraints on the encoding we are willing to use. We can for instance make the
-assumption that the occurrence of an intermediate element could be needed between
-the element wrapping the whole article and the recursing one used to encode each
-section. This "section" element could also need a companion element to be able
-to include itself, or, to formalise it in terms of graph theory, we could relax
-the condition that this element admits a loop to consider instead cycles of a
-given (small, this still needs to represent a fairly direct inclusion) length to
-be enough. We simultaneously extend the maximum depth of the inclusion paths we
-are looking for between `<entry/>`, the pair of elements and the `<p/>` element.
-
-By setting this depth to 3, that is, by accepting one intermediate element to
-occur in the middle of each one of the inclusion paths that define the structure
-required to encode encyclopedic discourse, we find 21 elements but none of them
-stand out as an obvious good solution: all paths to include the `<p/>` element
-from any *dictionaries* element contain either a `<figure/>` (which we have
-encountered earlier when we were practising our graph approach to search for
-inclusions between `<entry/>` and `<entryFree/>` and dismissed as not useful in
-general), a `<stage/>` (reserved for stage directions in dramatic works) or a
-`<state/>` (used to describe a temporary quality in a person or place), again
-not even close to what we want. The paths to either `<head/>` or `<title/>` are
-similarly disappointing. Again, changing `<entry/>` for `<entryFree/>` returns
-the exact same candidates. If that is not a thorough proof that none of these
-elements could fulfill our purpose, it is a fact that no element in this module
-appears as an obvious good solution and a serious hint to keep looking somewhere
-else.
-
-###### Widening the search
-
-We hence widen our search to include elements outside the *dictionaries* module
-which could be used to encode our sections and subsections, under the same
-constraint as before to try and find a composite solution that would remain
-under the `<entry/>` element even if resorting to subcomponents outside of the
-dedicated module. Only three elements are returned: `<figure/>`, `<metamark/>`
-and `<note/>`.
-
-The first one as we have repeatedly underlined is meant for graphic information
-and is not suitable for text content in general.
-
-The purpose of `<metamark/>` is to transcribe the edition marks than may appear
-on a particular primary source in order to alter the normal flow of the text and
-suggest an alternative reading (deletion, insertion, reordering, this is about a
-human editing the text from a given physical copy of it), but it is
-unfortunately of no use to encode a section of an article.
-
-The first element that might at least resemble what we are looking for is the
-last one, `<note/>`. It is meant to contain text, is about explaning something
-and seems general enough (not specific to a given genre, or to the occurrence of
-a particular object on the page). Unfortunately, its semantics still seems a bit
-off compared to our need. The documentation describes it as an "additional
-comment" which appears "out of the main textual stream" whereas the long
-developments in articles are the very matter of the text of encyclopedias, not
-mere remarks in the margins or at the foot of pages.
-
-##### Implémentation
-
-The above remarks explain why the *dictionary* module is unable to represent
-encyclopedias, where the notion of "meaning" is less central that in
-dictionaries and where discourse with nested structures of arbitrary depth can
-occur. Even composite encodings using elements outside of the *dictionaries*
-module under an `<entry/>` element do not meet our requirements. Since the
-*core* module of course accomodates these structures by means of the `<div/>`,
-`<head/>` and `<p/>` elements which have the additional advantage of carrying
-less semantical payload than `<sense/>` or `<def/>` we devise an encoding scheme
-using them which we recommend using for other projects aiming at representing
-encyclopedias.
-
-To remain consistent with the above remarks we will only concern ourselves with
-what happens at the level of each article, right under the `<body/>` element.
-Everything related to metadata happens as expected in the file's `<teiHeader/>`
-which is well-enough equiped to handle them. In order to present our scheme
-throughout the following section we will be progressively encoding a reference
-article, "Cathète" from tome 9 reproduced in Figure @fig:cathete-photo.
-
-![La Grande Encyclopédie, tome 9, article "Cathète"](figure/article/LGE/cathète_t9.png){#fig:cathete-photo}
-
-###### The scheme
-
-Remaining within the *core* module for the structure, almost all useful elements
-are available and our encoding scheme merely quotes the official documentation.
-Each article is represented by a `<div/>`. We suggest setting an `xml:id`
-attribute on it with the head word of the entry — unique in the whole corpus, or
-made so by suffixing a number representing its rank among the various
-occurrences, even when there's only one for the sake of regularity — as its
-value, normalised to lowercase, stripping spaces and replacing all
-non-alphanumerical characters by a dash (`'-'`) to avoid issues with the XML
-encoding. Figure @fig:cathete-xml-0 illustrates this choice for the container
-element on the article "Cathète" previously displayed.
-
-![The container `div` element for article "Cathète"](figure/article/LGE/cathète_0.png){#fig:cathete-xml-0}
-
-Inside this element should be a `<head/>` enclosing the headword of the article.
-The usual sub-`<hi/>` elements are available within `<head/>` if the headword is
-highlighted by any special typographic means such as bold, small capitals, etc.
-The one disappointment of the encoding scheme we are defining in this chapter is
-the lack of support for a proper way to encode subject indicators.
-
-The best candidate we have found so far was `<usg/>` from the *dictionaries*
-module but it is not available directly under a `<head/>` element. All inclusion
-paths from the latter to the former of length less than or equal to 3 contain
-irrelevant elements (`<cit/>`, `<figure/>`, `<castList/>` and `<nym/>`) so it
-must be discarded. The next best elements appear to be `<term/>` (not very
-accurate) and `<rs/>` ("referring string", quite a general semantics but a
-possible match — subject indicators refer to a given domain of knowledge —
-although all the examples in the documentation refer to concrete persons,
-places or object, not to the abstract objects that mathematics or poetry are).
-
-For this reason, we do not recommend any special encoding of the subject
-indicator but leave it open to each particular context: they are often
-abbreviated so an `<abbr/>` may apply, in *La Grande Encyclopédie*, biographies
-are not labeled by a knowledge domain but usually include the first name of the
-person when it is known so in that case an element like `<persName/>` is still
-appropriate. This choice applied to the same article "Cathète" produces Figure
-@fig:cathete-xml-1.
-
-![Encoding the head word of article "Cathète"](figure/article/LGE/cathète_1.png){#fig:cathete-xml-1}
-
-We then propose to wrap each different meaning in a separate `<div/>` with the
-`type` attribute set to `sense` to refer to the `<sense/>` element that would
-have been used within the *core* module. The `<div/>`s should be numbered
-according to the order they appear in with the `n` attribute starting from `0`
-as shown in Figure @fig:cathete-xml-2.
-
-![The empty structure for the only meaning of the word "Cathète"](figure/article/LGE/cathète_2.png){#fig:cathete-xml-2}
-
-In addition, each line within the article must start with a `<lb/>` to mark its
-beginning including before the `<head/>` element as demonstrated by Figure
-@fig:cathete-xml-3, which, although a surprising setup, underlines the fact that
-in the dense layout of encyclopedias, the carriage return separating two
-articles is meaningful. Stating each new line explicitly keeps enough
-information to reconstruct a faithful facsimile but it also has the advantage of
-highlighting the fact than even though the definition is cut from the headword
-by being in a separate XML element, they still occur on the same line, which is
-a typographic choice usually made both in encyclopedias and dictionaries where
-space is at a premium. .
-
-To complete the structure, the various sections and subsections occurring
-within the article body may be nested as usual with `<div/>` and sub-`<div/>`s,
-filled with `<p/>` for paragraphs which can each be titled with `<head/>`
-elements local to each `<div/>`.
-
-![A complete encoding of article "Cathète"](figure/article/LGE/cathète_3.png){#fig:cathete-xml-3}
-
-Some articles such as "Boumerang" have figures with captions, as illustrated by
-Figure @fig:boumerang-photo, which should be encoded the standard way by
-`<figure/>` and `<figDesc/>` as in Figure @fig:boumerang-xml.
-
-![La Grande Encyclopédie, tome 7, article "Boumerang"](figure/article/LGE/boumerang_t7.png){height=300px #fig:boumerang-photo}
-
-![Encoding the figure in article "Boumerang" and its captions](figure/article/LGE/boumerang.png){#fig:boumerang-xml}
-
-Another issue arising from giving up on `<entry/>` is the unavailability of the
-`<xr/>` element, not allowed under any of the *core* elements we use but which
-is useful to represent cross-references occurring in encyclopedias as well as in
-dictionaries, for example in article "Gelocus" (see Figure @fig:gelocus-photo).
-We prefer to use the `<ref/>` element instead which is available in the context
-of a `<p/>`. Its `target` attribute should be set to the `xml:id` of the
-article it points to, prefixed with a `'#'` as shown in Figure @fig:gelocus-xml.
-Another solution would have been to introduce a `<dictScrap/>` element for the
-sole purpose of placing an `<xr/>` but we advocate against it on account of the
-verbosity it would add to the encoding and the fact that it implicitly suggests
-that the previous context was not the one of a dictionary.
-
-![La Grande Encyclopédie, tome 18, article "Gelocus"](figure/article/LGE/gelocus_t18.png){#fig:gelocus-photo}
-
-![Encoding the cross-references in article "Gelocus"](figure/article/LGE/gelocus.png){#fig:gelocus-xml}
-
-A typical page of an encyclopedia also features peritext elements, giving
-information to the reader about the current page number along with the headwords
-of the first and last articles appearing on the page. Those can be encoded by
-`<fw/>` elements ("forme work") which `place` and `type` attributes should be
-set to position them on the page and identify their function if it has been
-recognised (those short elements on the border of pages are the ones typically
-prone to suffer damages or be misread by the OCR).
-
-Finally there are other TEI elements useful to represent "events" in the flow of
-the text, like the beginning of a new column of text or of a new page. Figure
-@fig:alcala-photo shows the top left of the last page of the first tome of *La
-Grande Encyclopédie* which features peritext elements while marking the
-beginning of a new page. The usual appropriate elements (`<pb/>` for page
-beginning, `<cb/>` for column beginning) may and should be used with our
-encoding scheme as demonstrated by Figure @fig:alcala-xml.
-
-![La Grande Encyclopédie, tome 1, article "Alcala-de-Hénarès"](figure/article/LGE/last_page_top_left_t1.png){width=350px #fig:alcala-photo}
-
-![Encoding the beginning of a page in article "Alcala-de-Hénarès"](figure/article/LGE/alcala.png){#fig:alcala-xml}
-
-###### Currently implemented
-
-The reference implementation for this encoding scheme is the program
-soprano[^soprano] developed within the scope of project DISCO-LGE to
-automatically identify individual articles in the flow of raw text from the
-columns and to encode them into XML-TEI files. Though this software has already
-been used to produce the first TEI version of *La Grande Encyclopédie*, it does
-not yet follow the above specification perfectly. Figure
-@fig:cathete-xml-current shows the encoded version of article "Cathète" it
-currently produces:
-
-[^soprano]: [https://gitlab.huma-num.fr/disco-lge/soprano](https://gitlab.huma-num.fr/disco-lge/soprano)
-
-![The current encoding of article "Cathète" produced by `soprano`](figure/article/LGE/cathète_current.png){#fig:cathete-xml-current}
-
-The headword detection system is not able to capture the subject indicators yet
-so it appears outside of the `<head/>` element. No work is performed either to
-expand abbreviations and encode them as such, or to distinguish between domain
-and people names.
-
-Likewise, since the detection of titles at the beginning of each section is not
-complete, no structure analysis can be performed at the moment on the textual
-development inside the article and it is left unstructured, directly under the
-entry's `<div/>` element instead of under a set of nested `<div/>` elements. The
-paragraphs are not yet identified and for this reason not encoded.
-
-However, the figures and their captions are already handled correctly when they
-occur. The encoder also keeps track of the current lines, pages, and columns and
-inserts the corresponding empty elements (`<lb/>`, `<pb/>` or `<cb/>`) and
-numbers pages so that the numbering corresponding to the physical pages are
-available, as compared to the "high-level" pages numbers inserted by the
-editors, which start with an offset because the first, blank or almost empty
-pages at the beginning of each book do not have a number and which sometimes have
-gaps when a full-page geographical map is inserted since those are printed
-separately on a different folio which remains outside of the textual numbering
-system. The place at which these layout-related elements occur is determined by
-the place where the OCR software detected them and by the reordering performed
-by `soprano` when inferring the reading order before segmenting the articles.
-
-###### The constraints of automated processing
-
-Encyclopedias are particularly long books, spanning numerous tomes and
-containing several tenths of thousands of articles. The *Encyclopédie* comprises
-over 74k articles and *La Grande Encyclopédie* certainly more than 100k (the latest
-version produced by `soprano` created 160k articles, but their segmentation is
-still not perfect and if some article beginning remain undetected, all the very
-long and deeply-structured articles are unduly split into many parts, resulting
-globally in an overestimation of the total number).
-
-XML-TEI is a very broad tool useful for very different applications. Some
-elements like `<unclear/>` or `<factuality/>` can encode subtle semantics
-information (for the second one, adjacent to a notion as elusive as truth)
-which requires a very deep understanding of a text in its entirety and about
-which even some human experts may disagree.
-
-For these reasons, a central concern in the design of our encoding scheme was to
-remain within the boundaries of information that can be described objectively
-and extracted automatically by an algorithm. Most of the tags presented above
-contain information about the positions of the elements or their relation to one
-another. Those with an additional semantics implication like `<head/>` can be
-inferred simply from their position and the frequent use of a special typography
-like bold or upper-case characters.
-
-The case of cross-references is particular and may appear as a counter-example
-to the main principle on which our scheme is based. Actually, the process of
-linking from an article to another one is so frequent (in dictionaries as well
-as in encyclopedias) that it generally escapes the scope of regular discourse to
-take a special and often fixed form, inside parenthesis and after a special
-token which invites the reader to perform the redirection. In *La Grande
-Encyclopédie*, virtually all the redirections (that is, to the extent of our
-knowledge, absolutely all of them though of course some special case may exist,
-but they are statistically rare enough that we have not found any yet) appear
-within parenthesis, and start with the verb "voir" abbreviated as a single,
-capital "V." as illustrated above in the article "Gelocus" (see again Figure
-@fig:gelocus-photo).
-
-Although this has not been implemented yet either, we hope to be able to detect
-and exploit those patterns to correctly encode cross-references. Getting the
-`target` attributes right is certainly more difficult to achieve and may require
-processing the articles in several steps, to first discover all the existing
-headwords — and hence article IDs — before trying to match the words following
-"V." with them. Since our automated encoder handles tomes separately and since
-references may cross the boundaries of tomes, it cannot wait for the target of a
-cross-reference to be discovered by keeping the articles in memory before
-outputting them.
-
-This is in line with the last important aspect of our encoder. If many
-lexicographers may deem our encoding too shallow, it has the advantage of not
-requiring to keep too complex datastructures in memory for a long time. The
-algorithm implementing it in `soprano` outputs elements as soon as it can, for
-instance the empty elements already discussed above. For articles, it pushes
-lines onto a stack and flushes it each time it encounters the beginning of the
-following article. This allows the amount of memory required to remain
-reasonable and even lets them be parallelised on most modern machines. Thus,
-even taking over three minutes per tome, the total processing time can be
-lowered to around forty minutes on a machine with 16Go of RAM for the whole of
-*La Grande Encyclopédie* instead of over one hour and a half.
-
 ## Les domaines
 
 ### Systèmes de domaines
@@ -1499,19 +776,4 @@ TODO Comment être plus maligne dans l'association ?
 TODO Grammaire des articles
 
 
-## Annotation en parties de discours et syntaxe
-
-### Jeu d'étiquettes
-
-Nous utilisons le [jeu d'étiquettes]() du projet
-[PRESTO](http://presto.ens-lyon.fr/)
-
-Alors non en fait Stanza c'est bien aussi avec les
-[UPOS](https://universaldependencies.org/docs/u/pos/)
-
-### Chaînes de traitement
-
-- PRESTO
-- Stanza
-
 
diff --git "a/Corpus/Formats_et_\303\251tats.md" "b/Corpus/Formats_et_\303\251tats.md"
new file mode 100644
index 0000000000000000000000000000000000000000..26db38a4fb83c2b36e3e217be08c8e0c87626efc
--- /dev/null
+++ "b/Corpus/Formats_et_\303\251tats.md"
@@ -0,0 +1,722 @@
+## Formats et états des textes
+
+### L'Encyclopédie
+
+In common parlance, the terms "dictionaries" and "encyclopedias" are used as
+near synonyms to refer to books compiling vast amounts of knowledge into lists
+of definitions ordered alphabetically. Their similarity is even visible in the
+way they are coordinated in the full title of the *Encyclopédie* which is
+probably the most famous work of the genre and a symbol of the Age of
+Enlightenment. If the word "encyclopedia" is nowadays part of our vocabulary, it
+was much more unusual and in fact controversial when Diderot and d'Alembert
+decided to use it in the title of their book.
+
+The definition given by Furetière in his *Dictionnaire Universel* in 1690 is
+still close to its Greek etymology: a "ring of all knowledges", from *κύκλος*,
+"circle", and *παιδεία*, "knowledge". This meaning is the one used for instance
+by Rabelais in *Pantagruel*, when he has Thaumaste declare that Panurge opened
+to him "le vray puys et abisme de Encyclopedie" ("the true well and abyss of
+Encyclopedia"). At the time the word still mostly refers to the abstract concept
+of mastering all knowledges at once. Furetière adds that it's a quality one
+is unlikely to possess, and even seems to condemn its search as a form of
+hubris: "C'est une témérité à un homme de vouloir posséder l'Encyclopédie"
+("it is a recklessness for a man to want to possess Encyclopedia").
+
+Beyond this moral reproach, the concept that pleased Rabelais was somewhat dated
+at the end of the 17^th^ century and attacked in the
+*Dictionnaire Universel François et Latin*, commonly referred to as the
+*Dictionnaire de Trevoux*, as utterly "burlesque" ("parodic"). The entry for
+"Encyclopédie" remained unchanged in the four editions issued between 1721 and
+1752, mocking the use of the word and discouraging its readers from pursuing
+it. To that end, it quotes a poem by Pibrac encouraging people to specialise in
+only one discipline lest they fail to reach perfection, an argument that
+resembles the saying "Jack of all trades, master of none". It
+is all the more interesting that the definition remains unaltered until 1752,
+one year after the publication of the first volume of the *Encyclopédie*. The
+Jesuits who edited the *Dictionnaire de Trevoux* frowned upon the project of the
+*Encyclopédie* which they managed to get banned the same year by the Council of
+State on the charge of attempting to destroy the royal authority, inspiring
+rebellion and corrupting morality in general. There is much more at stake than
+words here, but the attempt to deprecate the word itself is part of their fight
+against the philosophers of the Enlightenment.
+
+The attacks do not go unanswered by Diderot, who opens the very definition of
+the word "Encyclopédie" in the *Encyclopédie* itself with a strong rebuttal. He
+directly dismisses the concerns expressed in the *Dictionnaire de Trevoux* as
+mere self-doubt that its authors should not generalise to everyone, then leaves
+the main point to a Latin quote by Chancellor Bacon [@lojkine2013], who argues
+that a collaborative work can achieve much more than any talented man could:
+what may not be within reach of a single man, within a single lifetime, may be
+achieved by a common effort throughout generations.
+
+History hints that Diderot's opponents took his defence of the feasibility of
+the project quite seriously, considering the fact that they had the
+*Encyclopédie*'s privileges revoked again six years after its publication
+was resumed [@moureau2001]. As a consequence, the remaining ten volumes
+containing the text of the articles had to be published illegally until 1765,
+thanks to the secret protection of Malesherbes who — despite being head of royal
+censorship — saved the manuscripts from destruction. They were printed secretly
+outside of Paris and the books were (falsely) labeled as coming from Neufchâtel.
+Following the high demand from the booksellers who feared they would lose the
+money they had invested in the project, a special privilege was issued for the
+volumes containing the plates, which were released publicly from 1762 to 1772.
+
+In any case, in their last edition in 1771 the authors of the *Dictionnaire de
+Trevoux* had no choice but to acknowledge the success of the encyclopedic
+projects of the 18^th^ century. In this version, the definition
+was entirely reworked, mildly stating that good encyclopedias are difficult to
+make because of the amount of knowledge necessary and work needed to keep up
+with scientific progress instead of calling the effort a parody. It credits
+Chambers's *Cyclopædia* for being a decent attempt before referring anonymously
+though quite explicitly to Diderot and d'Alembert's project by naming the
+collective "Une Société de gens de Lettres" and writing that it started in 1751.
+Even more importantly, two new entries were added after it: one for the
+adjective "encyclopédique" and another one for the noun "encyclopédiste",
+silently admitting how the project had changed its time and the relation to
+knowledge itself.
+
+#### Contexte de l'œuvre
+
+#### Versions disponibles
+
+The ARTFL project[^ARTFL] offers a version of it.
+
+[^ARTFL]: [http://artfl-project.uchicago.edu/](http://artfl-project.uchicago.edu/)
+
+#### Traitements
+
+### La Grande Encyclopédie
+
+#### Contexte de l'œuvre
+
+*La Grande Encyclopédie, Inventaire raisonné des Sciences, des Lettres et des
+Arts par une Société de savants et de gens de lettres* (hereafter *LGE*) was
+published in France between 1885 and 1902 by a team of more than two hundred
+specialists organised into eleven sections. The text comprises 31 tomes of about
+1200 pages each and was, according to @jacquet_pfau2015, the last major French
+encyclopedic undertaking to follow in the footsteps of its prestigious ancestor,
+the *EDdA*, published about 130 years earlier.
+
+The full title of the work already shows its intent to claim descent from the
+*EDdA* and to update it [@jacquet_pfau_actualiser_2022].
+
+#### Versions disponibles
+
+A digital version of this work was produced by the BnF and put
+online[^LGE-V1] in 2007. Based on an undated reprint of the original
+edition, it comprises one image per page of the work, digitised in
+greyscale at a resolution of 300x300 pixels per inch. From these image
+files, a partial version of the text was extracted by applying an optical
+character recognition program ([@=OCR]). This version has a number of
+limitations that prevented a comprehensive study of the text by automatic
+means such as textometry.
+
+[^LGE-V1]: [http://catalogue.bnf.fr/ark:/12148/cb377013071](http://catalogue.bnf.fr/ark:/12148/cb377013071)
+
+First, the text is incomplete: although a large number of tomes have been
+OCRised, some, such as tomes 5 or 18, were not processed and no text is
+available for these volumes on the Gallica
+website[^LGE-V1-liste-des-tomes]. This state makes an exhaustive study
+impossible, but nothing suggests that a study based on the available
+tomes would be subject to any particular bias, since the OCRised tomes do
+not seem to have been chosen according to a precise logic: the missing
+ones are not, for instance, contiguous, nor at the beginning or the end of
+the work. Second, this "plain text" version (in reality an HTML page with
+minimal markup) only carries a very superficial annotation and in
+particular is not segmented into articles. This is a major obstacle to the
+use we want to make of it, since the unit of study of our corpus is the
+article, which allows both conducting a contrastive study by grouping
+articles by knowledge domain or by author, and observing the structure of
+domains by comparing between two encyclopedias which articles were kept or
+not and, where applicable, whether the knowledge domain associated with
+them is the same. Finally, errors in the detection of the page layout
+([@=OLR]) significantly obscure the text by locally permuting its content,
+which sometimes mixes pieces of articles together (clearly complicating
+the segmentation of the text into articles) and in all cases damages the
+structure of sentences, which introduces errors into the later stages of
+morpho-syntactic and syntactic annotation that we need to apply to the
+text in order to perform textometry.
+
+[^LGE-V1-liste-des-tomes]: [https://gallica.bnf.fr/ark:/12148/bpt6k246407#](https://gallica.bnf.fr/ark:/12148/bpt6k246407#)
+
+In order to overcome these flaws, the CollEx Persée project
+DISCO-LGE[^DISCO-LGE] set out to digitise this encyclopedia anew in
+partnership with the BnF and to produce a version encoded in XML-TEI. This
+new version was made from photographs of an original copy[^LGE-V2] held at
+the Bibliothèque de l'Arsenal in Paris[^bib-arsenal].
+
+[^DISCO-LGE]: [https://www.collexpersee.eu/projet/disco-lge/](https://www.collexpersee.eu/projet/disco-lge/)
+[^LGE-V2]: [https://catalogue.bnf.fr/ark:/12148/cb41651490t](https://catalogue.bnf.fr/ark:/12148/cb41651490t)
+[^bib-arsenal]: [https://www.bnf.fr/fr/arsenal](https://www.bnf.fr/fr/arsenal)
+
+This project ended in August 2021 and led to the publication on
+Nakala[^nakala], the data repository of the Très Grande Infrastructure de
+Recherche Huma-Num, of a new version of the work in several formats.
+
+[^nakala]: [https://nakala.fr/](https://nakala.fr/)
+
+#### Encodage
+
+##### Structure du module *dictionaries*
+
+**Definitions**
+
+By iterating several times the operation of moving on that graph along one edge,
+that is, by considering the transitive closure of the relation "be connected by
+an edge", we define *inclusion paths* which allow us to explore which elements
+may be nested under which others.
+
+The nodes visited along the way represent the intermediate XML elements needed
+to construct a valid XML tree according to the TEI schema. Given the top-down
+semantics of those trees, we call the length of an inclusion path its *depth*.
+
+The ability for an element to contain itself corresponds directly to loops on
+the graph (that is an edge from a node to itself) as can be illustrated by the
+`<abbr/>` element: an `<abbr/>` element (abbreviation) can directly contain
+another one.
+
+The generalisation of this to inclusion paths of any length greater than one is
+usually called a cycle and we may be tempted in our context to refine this and
+name them *inclusion cycles*. The `<address/>` element provides us with an
+example for this configuration: although an `<address/>` element may not
+directly contain another one, it may contain a `<geogName/>` which, in turn, may
+contain a new `<address/>` element. From a graph theory perspective, we can say
+that it admits an inclusion cycle of length two.
+
+**Applications**
+
+Using classical, well-known methods such as Dijkstra's algorithm [@dijkstra59]
+allows us to explore the shortest inclusion paths that exist between elements.
+Though particular caution should be applied because there is no guarantee that
+the shortest path is meaningful in general, it at least provides us with an
+efficient way to check whether a given element may or may not be nested at all
+under another one and gives a lower bound on the length of the path to expect.
+Of course the accuracy of this heuristic decreases as the length of the
+intended, meaningful path between two nodes, the one a human specialist of the
+TEI framework would build, increases.
+
+This is still very useful when taking into account the fact that TEI modules are
+merely "bags" to group the elements and provide hints to human encoders about
+the tools they might need but have no implication on the inclusion paths between
+elements which cross module boundaries freely. The general graph formalism
+enables us to describe complex filtering patterns and to implement queries to
+look for them among the elements exhaustively by algorithmic means even when the
+shortest-path approach is not enough.
+
+For instance, it lets one find that although `<pos/>` may not be directly
+included within `<entry/>` elements to include information about the
+part-of-speech of the word that an article defines, the correct way to do so is
+through a `<form/>` or a `<gramGrp/>`.
+
+On the other hand, trying to discover the shortest inclusion path to `<pos/>`
+from the `<TEI/>` root of the document yields a `<standOff/>`, an element
+dedicated to storing contextual data that accompanies but is not part of the
+text, not unlike an annex, and largely unrelated to the context of encoding an
+encyclopedia.
+
+A last relevant example on the use of these methods can be given by querying the
+shortest inclusion path of a `<pos/>` under the `<body/>` of the document: it
+yields an inclusion directly through `<entryFree/>` (with an inclusion path of
+length 2), which unlike `<entry/>` accepts it as a direct child node. Possibly
+not what we want depending on the regularity of the articles we are encoding and
+the occurrence of other grammatical information such as `<case/>` or `<gen/>` to
+justify the use of the `<gramGrp/>`, but searching exhaustively for paths up to
+length 3 returns as expected the path through `<entry/>`, among others. Overall,
+we get a good general idea: `<pos/>` does not need to be nested very deep, it
+can appear quite near the "surface" of article entries.
+
+##### Limites
+
+###### The `<entry/>` element
+
+The central element of the *dictionaries* module is the `<entry/>` element meant
+to encode one single entry in a dictionary, that is to say a head word
+associated with its definition. It is the natural way in from the `<body/>`
+element to the *dictionaries* module: indeed, although `<body/>` may also contain
+`<entryFree/>` or `<superEntry/>` elements, the former is a relaxed version of
+`<entry/>` while the latter is a device to group several related entries
+together. Both can contain an `<entry/>` directly while no obvious inclusion
+exists the other way around: most (> 96.2%) of the inclusion paths of
+"reasonable" depth (which we define as strictly less than 5, that is twice the
+average shortest depth between any two nodes) either include `<figure/>` or
+`<castList/>`, two very specific elements which should not need to appear in an
+article in general, showing that the purpose of `<entry/>` is not to contain an
+`<entryFree/>` or `<superEntry/>`. Hence, not only the semantics conveyed by the
+documentation but also the structure of the elements graph evidence `<entry/>`
+as the natural top-most element for an article. This somewhat contrived example
+hopes to further demonstrate the application of a graph-centred approach to
+understand the inner workings of the XML-TEI schema.
+
+###### Information about the headword itself
+
+Once a block for an article is created, it may contain elements useful to
+represent various of its features. Its written and spoken forms are usually
+encoded by `<form/>` elements. Grammatical information like the `<case/>`,
+`<gen/>` or `<number/>` and `<pers/>` can be contained within a `<gramGrp/>`,
+along with information about the categories it belongs to like `<iType/>` for
+its inflection class in languages with a declension system or `<pos/>` for its
+part-of-speech. The `<etym/>` element is made to hold the etymology of an entry.
+In the case when there are alternative spellings in varieties of the language or
+if the spelling has changed over time, `<usg/>` can be used.
+
+All these examples are by no means an exhaustive list; the complete set provides
+the encoder with a toolbox to describe all the information related to the form
+the entry is found at and seems general enough to accommodate the structure of
+any book indexing entries by words.
+
+###### Cross-references
+
+A common feature shared by dictionaries and encyclopedias is the ability to
+connect entries together by using a word or short phrase as the link, referring
+the reader to the related concept. These are known as cross-references and can
+appear either when the definition of a term is adjacent to another one or to
+catch alternative spellings where some readers might expect to find the word and
+redirect them to the form chosen as the reference. In XML-TEI, this is done with
+the `<xr/>` element. It usually contains the whole phrase performing the
+redirection, with an imperative locution like "please see […]".
+
+The "active" part of the cross-reference, that is the very word within the
+`<xr/>` that is considered to be the link or, to make a modern-day HTML
+metaphor, the region that would be clickable, is represented by a `<ref/>`
+element. Though it is not specific to the *dictionaries* module, we include it
+in this description of the toolbox because it is particularly useful in the
+context of dictionaries. This element may have a target attribute which points
+to the other resource to be accessed by the interested reader.
+
+###### Definitions
+
+The remaining part of entries is also usually the largest and represents the
+content associated to the headword by the entry. In a dictionary, that is its
+meaning.
+
+The `<sense/>` element is a valid child for `<entry/>` and groups together a
+definition of the term with `<def/>`, usage examples with `<usg/>` (another use
+of this versatile element) and other high-level information such as translations
+in other languages. Both `<def/>` and `<usg/>` elements may appear directly
+under the `<entry/>`.
+
+###### Structural remarks
+
+Before concluding this description of the *dictionaries* module from the
+perspective of someone trying to concretely encode a particular dictionary or
+encyclopedia, we make use of the graph approach again to evidence some of its
+aspects in terms of inclusion structure.
+
+First, it is remarkable that all elements in the *dictionaries* module have a
+cyclic inclusion path, that is to say, there is an inclusion path from each
+element of this module to itself. Although having such a cycle is a widespread
+property in the remainder of XML-TEI elements shared by 73.8% of them (411 out
+of the 557 elements in the other modules), all 33 elements of the *dictionaries*
+module having one is far above this average. In addition, the cycles appear to
+be rather short, with an average length of 2.00 versus 2.50 in the rest of the
+population. This observation is all the more surprising considering the fact
+that the *dictionaries* module contains short "leaf" elements like `<pos/>`
+which should not obviously need to admit cycles since one rather expects them to
+contain only one word, like `<pos>adj</pos>` in the example given in the
+official documentation. Among those (shortest) cycles, 20 include the `<cit/>`
+element made to group quotations with a bibliographic reference to their source
+which should clearly be unnecessary to encode an article in the general case.
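+
+The cycle statistics above rely on finding, for each element, the shortest
+inclusion path leading back to itself. A minimal sketch of that computation is
+given below, assuming the graph is stored as a dictionary of direct inclusions;
+the graph `TOY` and its node names are purely abstract placeholders, not TEI
+elements, and `shortest_cycle_length` is a name of our own.
+
+```python
+from collections import deque
+
+def shortest_cycle_length(graph, node):
+    """Length (in edges) of the shortest inclusion path from `node` back to itself."""
+    queue = deque((child, 1) for child in graph.get(node, []))
+    seen = set()
+    while queue:
+        current, length = queue.popleft()
+        if current == node:
+            return length
+        if current in seen:
+            continue
+        seen.add(current)
+        queue.extend((child, length + 1) for child in graph.get(current, []))
+    return None  # no cycle goes through `node`
+
+# Abstract toy graph: "a" contains itself, "b" and "c" contain each other,
+# "d" can only reach the b/c cycle without belonging to any cycle.
+TOY = {"a": ["a"], "b": ["c"], "c": ["b"], "d": ["b"]}
+
+lengths = {node: shortest_cycle_length(TOY, node) for node in TOY}
+cyclic = [length for length in lengths.values() if length is not None]
+print(lengths)
+print(len(cyclic) / len(TOY), sum(cyclic) / len(cyclic))
+```
+
+The last line prints the share of elements admitting a cycle and the average
+length of their shortest cycle, the two quantities compared in the text.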
+
+Secondly, although we have seen examples of connections from this module to the
+rest of the XML-TEI, especially to the *core* module (see the case of the
+`<ref/>` element above), the *dictionaries* module appears somewhat isolated
+from important structural elements like `<head/>` or `<div/>`. Indeed, computing
+all the paths from either `<entry/>` or `<sense/>` elements to the latter of
+length less than or equal to 5 by a systematic traversal of the graph yields
+exclusively paths (respectively 9042 and 39093 of them) containing either a
+`<floatingText/>` or an `<app/>` element. The first one, as its name aptly
+suggests, is used to encode text that does not quite fit the regular flow of the
+document, as for example in the context of an embedded narrative. Both examples
+displayed in the online documentation feature a `<body/>` as direct child of
+`<floatingText/>`, neatly separating its content as independent. The purpose of
+the second one, although its name — short for apparatus — is less clear, is to
+wrap together several versions of the same excerpts, for instance when there are
+several possible readings of an unclear group of words in a manuscript, or when
+the encoder is trying to compile a single version of a piece of work from
+several sources which disagree over some passage. In both cases, it appears
+obvious that it is not something that is expected to occur naturally in the
+course of an article in general.
+
+Thus, despite a rather dense internal connectivity, the *dictionaries* module
+fails to provide encoders with a device to represent recursively nesting
+structures like `<div/>`.
+
+The situation regarding subject indicators is hardly better outside of the
+module. The `<domain/>` element despite its name belongs exclusively in the
+header of a document and focuses on the social context of the text, not on the
+knowledge area it covers. The name of the `<interp/>` element suggests that it
+could label the interpretation to give to a context (which subject indicators
+could be if you consider that, placed at the beginning, they are used to direct
+the mind frame of the readers towards a particular subject). However, the
+documentation clearly presents it as a tool for annotators of a document, whose
+text content is not part of the original document but the additional result of
+an analysis performed in the context of the encoding, and which is used only
+through references in XML attributes.
+
+This point, although not the most concerning, still remains the hardest to
+address but all things considered the `<usg/>` element stands out as the most
+relevant.
+
+###### The notion of meaning
+
+Notwithstanding the correct way to represent domains of knowledge, their extent
+itself raises concerns regarding the *dictionaries* module. Indeed, among the
+vast collection of domains covered in encyclopedias in general and in *La Grande
+Encyclopédie* in particular are historical articles and biographies. If the
+notion of meaning can appear at least ill-fitting for a text describing a series
+of historical events, one may still argue that it groups them into a concept and
+associates it to the name of the event. But when it comes to relating the life
+of a person, describing their relation to events and other persons comes out
+even further from the notion of meaning. Entries such as the one about SANJO
+Sanetomi (see Figure @fig:sanjo) do not constitute a *definition*.
+
+![Beginning of the article relating the life of SANJO Sanetomi, in La Grande Encyclopédie, tome 29](figure/article/LGE/sanjo_t29.png){#fig:sanjo}
+
+Moreover, encyclopedias, because of all that they have inherited from the
+philosophical Enlightenment, are not only spaces designed to assert, they also
+intrinsically include an interrogative component. Some articles lay down the
+basis required to understand the complexity of an issue and invite the reader to
+consider it without providing a definitive answer, going as far as to explicitly
+use question marks as in the article "Action" displayed in Figure @fig:action.
+
+![Excerpt from article "Action", in La Grande Encyclopédie, tome 1](figure/article/LGE/action_t1.png){#fig:action}
+
+In this extract, the author devises a hypothetical situation to illustrate how
+difficult it is to draw the line between two supposedly mutually exclusive
+subcategories of legal actions. The whole point of the passage is to convey the
+idea that the term eludes definition: wrapping it in a `<sense/>` or, worse, a
+`<def/>` element would be an utter misnomer.
+
+As a result, the use of `<sense/>` and `<def/>` is not appropriate for
+encyclopedic content in general.
+
+###### Nested structures
+
+The final difficulty can be considered as a partial consequence of the previous
+one on the structure of articles. The difficulty to define complex concepts is
+the very reason why authors approach their subjects from various angles,
+circumnavigating it as a best approximation. This strategy favours long,
+structured developments with sections and subsections covering the multiple
+aspects of the topic: from a historical, political, scientific point of view…
+The longest articles, such as article "Europe" shown in Figure @fig:europe, can
+thus span several dozens of pages. They can contain substructures with titles on
+at least three levels (for instance, a `a)` under a `1)` under a `I.`), each of
+which is in turn generally developed over several paragraphs.
+
+![La Grande Encyclopédie, tome 16, article "Europe", spanning from p.782 to p.846, that is 64 pages, and ending after a bibliography longer than one column of text](figure/article/LGE/europe_t16.png){#fig:europe}
+
+The nested structure that we have just evidenced demands of course a nesting
+structure to accommodate it. More precisely it guides our search of XML elements
+by giving us several constraints: we are looking for a pair of elements, the
+first representing a (sub)section must be able to include both itself and the
+second element, which does not have any special constraint except the one to
+have a semantics compatible with our purpose of using it to represent section
+titles. In addition, the first element must be able to contain several `<p/>`
+elements, `<p/>` being the reference element to encode paragraphs according to
+the XML-TEI documentation.
+
+We have seen that the *dictionaries* module was equipped with a questionable but
+possible element for subject domains. However, it does not include any element
+for section titles. In the rest of the TEI specification, the elements `<head/>`
+and `<title/>` (the latter with the possibility to set its `type` attribute to
+`sub`) stand out as the best candidates for the semantic condition on the
+second element.
+
+##### Choices
+
+###### Candidates in the *dictionaries* module
+
+Filtering the content of the module to keep only the elements which can at the
+same time contain themselves, be included under `<entry/>` and include a `<p/>`
+and either the `<head/>` or `<title/>` elements yields absolutely no candidates.
+It is remarkable that even replacing the `<entry/>` element at the root of each
+article with an `<entryFree/>`, an element supposed to relax some constraints to
+accommodate more unusual structures in dictionaries, does not bring any
+improvement.
+
+The lack of results from these simple queries forces us to somewhat release the
+constraints on the encoding we are willing to use. We can for instance make the
+assumption that the occurrence of an intermediate element could be needed between
+the element wrapping the whole article and the recursing one used to encode each
+section. This "section" element could also need a companion element to be able
+to include itself, or, to formalise it in terms of graph theory, we could relax
+the condition that this element admits a loop to consider instead cycles of a
+given (small, this still needs to represent a fairly direct inclusion) length to
+be enough. We simultaneously extend the maximum depth of the inclusion paths we
+are looking for between `<entry/>`, the pair of elements and the `<p/>` element.
+
+By setting this depth to 3, that is, by accepting one intermediate element to
+occur in the middle of each one of the inclusion paths that define the structure
+required to encode encyclopedic discourse, we find 21 elements but none of them
+stands out as an obvious good solution: all paths to include the `<p/>` element
+from any *dictionaries* element either contain a `<figure/>` (which we have
+encountered earlier when we were practising our graph approach to search for
+inclusions between `<entry/>` and `<entryFree/>` and dismissed as not useful in
+general), a `<stage/>` (reserved to stage direction in dramatic works) or a
+`<state/>` (used to describe a temporary quality in a person or place), again
+not even close to what we want. The paths to either `<head/>` or `<title/>` are
+similarly disappointing. Again, changing `<entry/>` for `<entryFree/>` returns
+the exact same candidates. If that is not a thorough proof that none of these
+elements could fulfill our purpose, it is a fact that no element in this module
+appears as an obvious good solution and a serious hint to keep looking somewhere
+else.
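+
+The filtering described in this section amounts to a conjunction of bounded
+reachability tests on the inclusion graph. The sketch below illustrates the idea
+under our own assumptions: lengths are counted in edges, the function names are
+ours, and the graph `TOY` with its elements `x` and `y` is a made-up example,
+not part of the TEI schema.
+
+```python
+def reachable_within(graph, source, target, max_length):
+    """True if `target` can be included under `source` in 1 to `max_length` steps."""
+    frontier = {source}
+    for _ in range(max_length):
+        frontier = {child for node in frontier for child in graph.get(node, [])}
+        if target in frontier:
+            return True
+    return False
+
+def section_candidates(graph, elements, root="entry", depth=1):
+    """Elements usable as (sub)sections: self-nesting, under `root`, holding
+    paragraphs and a title, each inclusion taking at most `depth` steps."""
+    return [
+        element for element in elements
+        if reachable_within(graph, element, element, depth)
+        and reachable_within(graph, root, element, depth)
+        and reachable_within(graph, element, "p", depth)
+        and any(reachable_within(graph, element, title, depth)
+                for title in ("head", "title"))
+    ]
+
+# Made-up graph: "x" and "y" only nest through each other.
+TOY = {"entry": ["x"], "x": ["y", "p"], "y": ["x", "head"]}
+
+print(section_candidates(TOY, ["x", "y"], depth=1))
+print(section_candidates(TOY, ["x", "y"], depth=2))
+```
+
+With direct inclusions only (`depth=1`) no candidate survives on this toy graph,
+whereas allowing one intermediate element (`depth=2`) lets both pass, which is
+the kind of relaxation discussed above.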
+
+###### Widening the search
+
+We hence widen our search to include elements outside the *dictionaries* module
+which could be used to encode our sections and subsections, under the same
+constraint as before to try and find a composite solution that would remain
+under the `<entry/>` element even if resorting to subcomponents outside of the
+dedicated module. Only three elements are returned: `<figure/>`, `<metamark/>`
+and `<note/>`.
+
+The first one as we have repeatedly underlined is meant for graphic information
+and is not suitable for text content in general.
+
+The purpose of `<metamark/>` is to transcribe the edition marks that may appear
+on a particular primary source in order to alter the normal flow of the text and
+suggest an alternative reading (deletion, insertion, reordering, this is about a
+human editing the text from a given physical copy of it), but it is
+unfortunately of no use to encode a section of an article.
+
+The first element that might at least resemble what we are looking for is the
+last one, `<note/>`. It is meant to contain text, is about explaining something
+and seems general enough (not specific to a given genre, or to the occurrence of
+a particular object on the page). Unfortunately, its semantics still seems a bit
+off compared to our need. The documentation describes it as an "additional
+comment" which appears "out of the main textual stream" whereas the long
+developments in articles are the very matter of the text of encyclopedias, not
+mere remarks in the margins or at the foot of pages.
+
+##### Implementation
+
+The above remarks explain why the *dictionaries* module is unable to represent
+encyclopedias, where the notion of "meaning" is less central than in
+dictionaries and where discourse with nested structures of arbitrary depth can
+occur. Even composite encodings using elements outside of the *dictionaries*
+module under an `<entry/>` element do not meet our requirements. Since the
+*core* module of course accommodates these structures by means of the `<div/>`,
+`<head/>` and `<p/>` elements, which have the additional advantage of carrying
+a lighter semantic payload than `<sense/>` or `<def/>`, we devise an encoding
+scheme using them, which we recommend for other projects aiming at representing
+encyclopedias.
+
+To remain consistent with the above remarks we will only concern ourselves with
+what happens at the level of each article, right under the `<body/>` element.
+Everything related to metadata happens as expected in the file's `<teiHeader/>`
+which is well-enough equipped to handle them. In order to present our scheme
+throughout the following section we will be progressively encoding a reference
+article, "Cathète" from tome 9 reproduced in Figure @fig:cathete-photo.
+
+![La Grande Encyclopédie, tome 9, article "Cathète"](figure/article/LGE/cathète_t9.png){#fig:cathete-photo}
+
+###### The scheme
+
+Remaining within the *core* module for the structure, almost all useful elements
+are available and our encoding scheme merely quotes the official documentation.
+Each article is represented by a `<div/>`. We suggest setting an `xml:id`
+attribute on it with the head word of the entry — unique in the whole corpus, or
+made so by suffixing a number representing its rank among the various
+occurrences, even when there's only one for the sake of regularity — as its
+value, normalised to lowercase, stripping spaces and replacing all
+non-alphanumerical characters by a dash (`'-'`) to avoid issues with the XML
+encoding. Figure @fig:cathete-xml-0 illustrates this choice for the container
+element on the article "Cathète" previously displayed.
+
+![The container `div` element for article "Cathète"](figure/article/LGE/cathète_0.png){#fig:cathete-xml-0}
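+
+The normalisation rules for the `xml:id` value can be sketched as a small
+function. This is a hedged illustration rather than the code used in the
+project: the name `article_id` is ours, and we assume that "stripping spaces"
+means trimming leading and trailing whitespace, that the rank starts at `0` and
+that it is separated from the headword by a dash.
+
+```python
+import re
+
+def article_id(headword, rank=0):
+    """Build an `xml:id` candidate from a headword, following the rules above."""
+    normalised = headword.strip().lower()
+    # \W matches any character that is not a Unicode letter, digit or
+    # underscore; "_" is added to the class since it is not alphanumerical.
+    normalised = re.sub(r"[\W_]", "-", normalised)
+    # Assumption: the rank is always appended, even for a unique headword.
+    return f"{normalised}-{rank}"
+
+print(article_id("Cathète"))
+print(article_id("Alcala-de-Hénarès", rank=1))
+```
+
+Accented letters are kept since they are alphanumerical characters; a headword
+starting with a digit would still need special care, because an XML identifier
+cannot begin with one.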
+
+Inside this element should be a `<head/>` enclosing the headword of the article.
+The usual sub-`<hi/>` elements are available within `<head/>` if the headword is
+highlighted by any special typographic means such as bold, small capitals, etc.
+The one disappointment of the encoding scheme we are defining in this chapter is
+the lack of support for a proper way to encode subject indicators.
+
+The best candidate we have found so far was `<usg/>` from the *dictionaries*
+module but it is not available directly under a `<head/>` element. All inclusion
+paths from the latter to the former of length less than or equal to 3 contain
+irrelevant elements (`<cit/>`, `<figure/>`, `<castList/>` and `<nym/>`) so it
+must be discarded. The next best elements appear to be `<term/>` (not very
+accurate) and `<rs/>` ("referring string", quite a general semantics but a
+possible match — subject indicators refer to a given domain of knowledge —
+although all the examples in the documentation refer to concrete persons,
+places or object, not to the abstract objects that mathematics or poetry are).
+
+For this reason, we do not recommend any special encoding of the subject
+indicator but leave it open to each particular context: they are often
+abbreviated so an `<abbr/>` may apply; in *La Grande Encyclopédie*, biographies
+are not labeled by a knowledge domain but usually include the first name of the
+person when it is known, so in that case an element like `<persName/>` is still
+appropriate. This choice applied to the same article "Cathète" produces Figure
+@fig:cathete-xml-1.
+
+![Encoding the head word of article "Cathète"](figure/article/LGE/cathète_1.png){#fig:cathete-xml-1}
+
+We then propose to wrap each different meaning in a separate `<div/>` with the
+`type` attribute set to `sense` to refer to the `<sense/>` element that would
+have been used within the *dictionaries* module. The `<div/>`s should be numbered
+according to the order they appear in with the `n` attribute starting from `0`
+as shown in Figure @fig:cathete-xml-2.
+
+![The empty structure for the only meaning of the word "Cathète"](figure/article/LGE/cathète_2.png){#fig:cathete-xml-2}
+
+In addition, each line within the article must start with a `<lb/>` to mark its
+beginning including before the `<head/>` element as demonstrated by Figure
+@fig:cathete-xml-3, which, although a surprising setup, underlines the fact that
+in the dense layout of encyclopedias, the carriage return separating two
+articles is meaningful. Stating each new line explicitly keeps enough
+information to reconstruct a faithful facsimile but it also has the advantage of
+highlighting the fact that even though the definition is cut from the headword
+by being in a separate XML element, they still occur on the same line, which is
+a typographic choice usually made both in encyclopedias and dictionaries where
+space is at a premium.
+
+To complete the structure, the various sections and subsections occurring
+within the article body may be nested as usual with `<div/>` and sub-`<div/>`s,
+filled with `<p/>` for paragraphs which can each be titled with `<head/>`
+elements local to each `<div/>`.
+
+![A complete encoding of article "Cathète"](figure/article/LGE/cathète_3.png){#fig:cathete-xml-3}
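+
+As a complement to the figures, the overall shape of the scheme can be
+reproduced programmatically. The following sketch uses Python's standard
+`xml.etree.ElementTree` module; the function `encode_article`, the identifier
+and the placeholder text are assumptions made for the illustration, and the
+`<lb/>` elements inside paragraphs are omitted for brevity.
+
+```python
+import xml.etree.ElementTree as ET
+
+# Qualified name of the `xml:id` attribute (the `xml` prefix is predefined).
+XML_ID = "{http://www.w3.org/XML/1998/namespace}id"
+
+def encode_article(identifier, headword, paragraphs):
+    """Build the container <div/>, its <head/> and one <div type="sense"/>."""
+    article = ET.Element("div", {XML_ID: identifier})
+    ET.SubElement(article, "lb")  # the line on which the article starts
+    head = ET.SubElement(article, "head")
+    head.text = headword
+    sense = ET.SubElement(article, "div", {"type": "sense", "n": "0"})
+    for text in paragraphs:
+        ET.SubElement(sense, "p").text = text
+    return article
+
+# Placeholder content, not the actual text of the article.
+article = encode_article("cathète-0", "CATHÈTE", ["(placeholder text)"])
+print(ET.tostring(article, encoding="unicode"))
+```
+
+Nested sections would follow the same pattern, each one being a further
+`<div/>` with its own `<head/>` and `<p/>` children.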
+
+Some articles such as "Boumerang" have figures with captions, as illustrated by
+Figure @fig:boumerang-photo, which should be encoded the standard way by
+`<figure/>` and `<figDesc/>` as in Figure @fig:boumerang-xml.
+
+![La Grande Encyclopédie, tome 7, article "Boumerang"](figure/article/LGE/boumerang_t7.png){height=300px #fig:boumerang-photo}
+
+![Encoding the figure in article "Boumerang" and its captions](figure/article/LGE/boumerang.png){#fig:boumerang-xml}
+
+Another issue arising from giving up on `<entry/>` is the unavailability of the
+`<xr/>` element, not allowed under any of the *core* elements we use but which
+is useful to represent cross-references occurring in encyclopedias as well as in
+dictionaries, for example in article "Gelocus" (see Figure @fig:gelocus-photo).
+We prefer to use the `<ref/>` element instead which is available in the context
+of a `<p/>`. Its `target` attribute should be set to the `xml:id` of the
+article it points to, prefixed with a `'#'` as shown in Figure @fig:gelocus-xml.
+Another solution would have been to introduce a `<dictScrap/>` element for the
+sole purpose of placing an `<xr/>` but we advocate against it on account of the
+verbosity it would add to the encoding and the fact that it implicitly suggests
+that the previous context was not the one of a dictionary.
+
+![La Grande Encyclopédie, tome 18, article "Gelocus"](figure/article/LGE/gelocus_t18.png){#fig:gelocus-photo}
+
+![Encoding the cross-references in article "Gelocus"](figure/article/LGE/gelocus.png){#fig:gelocus-xml}
+
+A typical page of an encyclopedia also features peritext elements, giving
+information to the reader about the current page number along with the headwords
+of the first and last articles appearing on the page. Those can be encoded by
+`<fw/>` elements ("forme work") whose `place` and `type` attributes should be
+set to position them on the page and identify their function if it has been
+recognised (those short elements on the border of pages are the ones typically
+prone to suffer damage or be misread by the OCR).
+
+Finally there are other TEI elements useful to represent "events" in the flow of
+the text, like the beginning of a new column of text or of a new page. Figure
+@fig:alcala-photo shows the top left of the last page of the first tome of *La
+Grande Encyclopédie* which features peritext elements while marking the
+beginning of a new page. The usual appropriate elements (`<pb/>` for page
+beginning, `<cb/>` for column beginning) may and should be used with our
+encoding scheme as demonstrated by Figure @fig:alcala-xml.
+
+![La Grande Encyclopédie, tome 1, article "Alcala-de-Hénarès"](figure/article/LGE/last_page_top_left_t1.png){width=350px #fig:alcala-photo}
+
+![Encoding the beginning of a page in article "Alcala-de-Hénarès"](figure/article/LGE/alcala.png){#fig:alcala-xml}
+
+###### Currently implemented
+
+The reference implementation for this encoding scheme is the program
+soprano[^soprano] developed within the scope of project DISCO-LGE to
+automatically identify individual articles in the flow of raw text from the
+columns and to encode them into XML-TEI files. Though this software has already
+been used to produce the first TEI version of *La Grande Encyclopédie*, it does
+not yet follow the above specification perfectly. Figure
+@fig:cathete-xml-current shows the encoded version of article "Cathète" it
+currently produces:
+
+[^soprano]: [https://gitlab.huma-num.fr/disco-lge/soprano](https://gitlab.huma-num.fr/disco-lge/soprano)
+
+![The current encoding of article "Cathète" produced by `soprano`](figure/article/LGE/cathète_current.png){#fig:cathete-xml-current}
+
+The headword detection system is not able to capture the subject indicators yet
+so they appear outside of the `<head/>` element. No work is performed either to
+expand abbreviations and encode them as such, or to distinguish between domain
+and people names.
+
+Likewise, since the detection of titles at the beginning of each section is not
+complete, no structure analysis can be performed at the moment on the textual
+development inside the article and it is left unstructured, directly under the
+entry's `<div/>` element instead of under a set of nested `<div/>` elements. The
+paragraphs are not yet identified and for this reason not encoded.
+
+However, the figures and their captions are already handled correctly when they
+occur. The encoder also keeps track of the current lines, pages, and columns and
+inserts the corresponding empty elements (`<lb/>`, `<pb/>` or `<cb/>`). It
+numbers pages so that the numbering corresponding to the physical pages is
+available, as compared to the "high-level" page numbers inserted by the
+editors. The latter start with an offset because the first, blank or almost
+empty pages at the beginning of each book do not have a number, and sometimes
+have gaps when a full-page geographical map is inserted, since those are printed
+separately on a different folio which remains outside of the textual numbering
+system. The place at which these layout-related elements occur is determined by
+the place where the OCR software detected them and by the reordering performed
+by `soprano` when inferring the reading order before segmenting the articles.
+
+###### The constraints of automated processing
+
+Encyclopedias are particularly long books, spanning numerous tomes and
+containing several tens of thousands of articles. The *Encyclopédie* comprises
+over 74k articles and *La Grande Encyclopédie* certainly more than 100k (the latest
+version produced by `soprano` created 160k articles, but their segmentation is
+still not perfect: while some article beginnings remain undetected, all the very
+long and deeply-structured articles are unduly split into many parts, resulting
+globally in an overestimation of the total number).
+
+XML-TEI is a very broad tool useful for very different applications. Some
+elements like `<unclear/>` or `<factuality/>` can encode subtle semantic
+information (for the second one, adjacent to a notion as elusive as truth)
+which requires a very deep understanding of a text in its entirety and about
+which even some human experts may disagree.
+
+For these reasons, a central concern in the design of our encoding scheme was to
+remain within the boundaries of information that can be described objectively
+and extracted automatically by an algorithm. Most of the tags presented above
+contain information about the positions of the elements or their relation to one
+another. Those with an additional semantic implication like `<head/>` can be
+inferred simply from their position and the frequent use of a special typography
+like bold or upper-case characters.
+
+The case of cross-references is particular and may appear as a counter-example
+to the main principle on which our scheme is based. Actually, the process of
+linking from an article to another one is so frequent (in dictionaries as well
+as in encyclopedias) that it generally escapes the scope of regular discourse to
+take a special and often fixed form, inside parentheses and after a special
+token which invites the reader to perform the redirection. In *La Grande
+Encyclopédie*, virtually all the redirections (that is, to the extent of our
+knowledge, absolutely all of them though of course some special case may exist,
+but they are statistically rare enough that we have not found any yet) appear
+within parentheses, and start with the verb "voir" abbreviated as a single,
+capital "V." as illustrated above in the article "Gelocus" (see again Figure
+@fig:gelocus-photo).
+
+Although this has not been implemented yet either, we hope to be able to detect
+and exploit those patterns to correctly encode cross-references. Getting the
+`target` attributes right is certainly more difficult to achieve and may require
+processing the articles in several steps, to first discover all the existing
+headwords — and hence article IDs — before trying to match the words following
+"V." with them. Since our automated encoder handles tomes separately and since
+references may cross the boundaries of tomes, it cannot wait for the target of a
+cross-reference to be discovered by keeping the articles in memory before
+outputting them.
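+
+The two-step strategy outlined above (collect all article IDs first, then
+resolve the words following "V.") could look like the sketch below. It is not
+implemented in `soprano`; the regular expression assumes the simplest surface
+form `(V. Headword)`, the helper `make_id` and the sample sentence are invented
+for the illustration, and the XML is produced by plain string formatting
+without any escaping.
+
+```python
+import re
+
+# Assumed surface form: an opening parenthesis, "V.", then the target headword.
+CROSS_REFERENCE = re.compile(r"\(V\.\s+([^)]+)\)")
+
+def link_cross_references(text, known_ids, make_id):
+    """Wrap the target of each "(V. ...)" in a <ref/> when its article is known."""
+    def replace(match):
+        target = make_id(match.group(1))
+        if target not in known_ids:
+            return match.group(0)  # leave unresolved references untouched
+        return f'(V. <ref target="#{target}">{match.group(1)}</ref>)'
+    return CROSS_REFERENCE.sub(replace, text)
+
+# First pass (assumed already done): the set of all article IDs in the corpus.
+known_ids = {"exemple-0"}
+make_id = lambda headword: headword.strip().lower() + "-0"
+
+print(link_cross_references("Texte (V. Exemple) et (V. Inconnu).", known_ids, make_id))
+```
+
+Only the reference whose target was discovered during the first pass is turned
+into a `<ref/>`; the other one is left as plain text for later inspection.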
+
+This is in line with the last important aspect of our encoder. Although many
+lexicographers may deem our encoding too shallow, it has the advantage of not
+requiring complex data structures to be kept in memory for a long time. The
+algorithm implementing it in `soprano` outputs elements as soon as it can, for
+instance the empty elements already discussed above. For articles, it pushes
+lines onto a stack and flushes it each time it encounters the beginning of the
+following article. This allows the amount of memory required to remain
+reasonable and even lets the tomes be processed in parallel on most modern
+machines. Thus, even taking over three minutes per tome, the total processing
+time can be lowered to around forty minutes on a machine with 16 GB of RAM for
+the whole of *La Grande Encyclopédie* instead of over one hour and a half.
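+
+The "push lines, flush on the next article" behaviour can be illustrated by a
+short generator. This is a sketch of the principle only, not the actual
+implementation in `soprano`: the predicate `starts_article` stands for the
+headword detection, which is out of scope here, and the sample lines are
+invented.
+
+```python
+def segment_articles(lines, starts_article):
+    """Yield articles (lists of lines), keeping only the current one in memory."""
+    stack = []
+    for line in lines:
+        if starts_article(line) and stack:
+            yield stack  # flush the article that has just ended
+            stack = []
+        stack.append(line)
+    if stack:
+        yield stack  # flush the last article of the input
+
+# Hypothetical detection rule: an article starts with a word in capitals.
+def starts_article(line):
+    return line.split(" ", 1)[0].isupper()
+
+sample = ["ALPHA first line", "continued", "BETA other article"]
+for article in segment_articles(sample, starts_article):
+    print(article)
+```
+
+Since nothing is shared between tomes in such a scheme, each tome could be
+handed to a separate process, for instance with `multiprocessing.Pool`.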
+
+
diff --git a/Corpus/text.sh b/Corpus/text.sh
new file mode 100755
index 0000000000000000000000000000000000000000..1206b755369d803cfa908d868493c6f1eb8c883b
--- /dev/null
+++ b/Corpus/text.sh
@@ -0,0 +1,7 @@
+#!/bin/sh
+
+source ./chapter.sh 'Préparation et enrichissement du corpus'
+
+cat Corpus/Formats_et_états.md
+cat Corpus/Domaines.md
+cat Corpus/Annotation.md
diff --git a/Glossaire/OCR.md b/Glossaire/OCR.md
new file mode 100644
index 0000000000000000000000000000000000000000..9c0ac9758384ff41716ef93d5587f47d091d0092
--- /dev/null
+++ b/Glossaire/OCR.md
@@ -0,0 +1,7 @@
+OCR
+
+:	*Optical Character Recognition* is the process by which a piece of
+software extracts text, that is to say a sequence of characters that the
+machine can understand and that can then be processed by automatic means,
+from an image.
+
diff --git a/Glossaire.md b/Glossaire/OLR.md
similarity index 73%
rename from Glossaire.md
rename to Glossaire/OLR.md
index 2798cb54bba2e3bd5cd32e139099006456be56e6..9e4665549db43d1f312186708a0c2822a4cfa5e3 100644
--- a/Glossaire.md
+++ b/Glossaire/OLR.md
@@ -1,12 +1,3 @@
-# Glossaire {-}
-
-OCR
-
-:	*Optical Character Recognition*, reconnaissance optique de caractères, est
-le procédé par lequel un logiciel extrait du texte, c'est à dire une suite de
-caractères compréhensibles par la machine et traitables ensuite par des moyens
-automatiques, à partir d'une image.
-
 OLR
 
 :	*Optical Layout Recognition*, reconnaissance optique de la disposition de la
diff --git a/Glossaire/text.sh b/Glossaire/text.sh
new file mode 100755
index 0000000000000000000000000000000000000000..fd4527981dcd1dfa0268ad07b5e13f66e1d25db4
--- /dev/null
+++ b/Glossaire/text.sh
@@ -0,0 +1,8 @@
+#!/bin/sh
+
+[ -n "${HEADER_INCLUDED}" ] || source ./header.sh 2
+
+echo '# Glossaire {-}'
+
+cat Glossaire/OCR.md
+cat Glossaire/OLR.md
diff --git "a/G\303\251ographie/Contours.md" "b/G\303\251ographie/Contours.md"
new file mode 100644
index 0000000000000000000000000000000000000000..878c6f5c86648d16ca26b0404042485803cbe19b
--- /dev/null
+++ "b/G\303\251ographie/Contours.md"
@@ -0,0 +1,11 @@
+## Tracing the outline of geography
+
+### Establishing a correspondence
+
+Empirically:
+    + with TXM, Lexicoscope, etc., do we see the same properties
+    + machine learning
+
+### The hidden biography
+
+
diff --git "a/G\303\251ographie/ENE.md" "b/G\303\251ographie/ENE.md"
new file mode 100644
index 0000000000000000000000000000000000000000..2a76f9049f0fa35d395ab11e3ece2b93f4f7fa4a
--- /dev/null
+++ "b/G\303\251ographie/ENE.md"
@@ -0,0 +1,9 @@
+## Entités Nommées Étendues
+
+Intro Édl'A: c'est coûteux d'annoter @ortiz_suarez_data-driven_2022
+
+### Travaux sur les GNNs
+
+Qu'est-ce qu'on en a retiré ?
+
+
diff --git "a/G\303\251ographie.md" "b/G\303\251ographie/Relations_entre_domaines.md"
similarity index 98%
rename from "G\303\251ographie.md"
rename to "G\303\251ographie/Relations_entre_domaines.md"
index c35139f1185a6d61a2b9af53663bf6b5f1fe7516..992c20b94a278ae6f650578767b4532af6b361ae 100644
--- "a/G\303\251ographie.md"
+++ "b/G\303\251ographie/Relations_entre_domaines.md"
@@ -1,24 +1,3 @@
-# Identifier et problématiser la géographie
-
-## Relation entre spatial et géographique
-
--> questionnement d'une frontière même
-
-(structuration de la géographie)
-
-## Tracer le contours de la géographie
-
-### Établir une correspondance
-
-Empiriquement:
-    + avec TXM, Lexicoscope, etc., est-ce qu'on voit les mêmes propriétés
-    + machine learning
-
-### La biographie cachée
-
-## Variété des genres discursifs au sein des articles
-
-
 ## Relations entre les domaines de connaissances
 
 ### Erreurs de classification
@@ -735,11 +714,4 @@ differences we have underlined show that size alone cannot explain their
 distribution in detail. The model does seem to identify some classes
 more easily because of distinctive lexical patterns.
 
-## Entités Nommées Étendues
-
-Intro Édl'A: c'est coûteux d'annoter @ortiz_suarez_data-driven_2022
-
-### Travaux sur les GNNs
-
-Qu'est-ce qu'on en a retiré ?
 
diff --git "a/G\303\251ographie/Spatial_et_g\303\251ographie.md" "b/G\303\251ographie/Spatial_et_g\303\251ographie.md"
new file mode 100644
index 0000000000000000000000000000000000000000..c8b326668fd2c86828d5818d29467bf609012aae
--- /dev/null
+++ "b/G\303\251ographie/Spatial_et_g\303\251ographie.md"
@@ -0,0 +1,7 @@
+## Relation entre spatial et géographique
+
+-> questionnement d'une frontière même
+
+(structuration de la géographie)
+
+
diff --git "a/G\303\251ographie/Vari\303\251t\303\251_des_genres_discursifs.md" "b/G\303\251ographie/Vari\303\251t\303\251_des_genres_discursifs.md"
new file mode 100644
index 0000000000000000000000000000000000000000..2fc3e16ad990651255309895ba0c7641ac3db91a
--- /dev/null
+++ "b/G\303\251ographie/Vari\303\251t\303\251_des_genres_discursifs.md"
@@ -0,0 +1,4 @@
+## Variété des genres discursifs au sein des articles
+
+
+
diff --git "a/G\303\251ographie/text.sh" "b/G\303\251ographie/text.sh"
new file mode 100755
index 0000000000000000000000000000000000000000..f6d893f45ccc8699725108482822a2940fbfe627
--- /dev/null
+++ "b/G\303\251ographie/text.sh"
@@ -0,0 +1,9 @@
+#!/usr/bin/env bash
+
+source ./chapter.sh 'Identifier et problématiser la géographie'
+
+cat Géographie/Spatial_et_géographie.md
+cat Géographie/Contours.md
+cat Géographie/Variété_des_genres_discursifs.md
+cat Géographie/Relations_entre_domaines.md
+cat Géographie/ENE.md
diff --git a/Introduction.md b/Introduction.md
deleted file mode 100644
index 136c48f8eb3dafedbd610b16dae7979fe465fb58..0000000000000000000000000000000000000000
--- a/Introduction.md
+++ /dev/null
@@ -1,57 +0,0 @@
----
-title: Méthodes et outils pour l'étude diachronique des discours géographiques dans les encyclopédies françaises
-author: Alice \textsc{Brenon}
-documentclass: report
-classoptions:
-    - french
-    - a4paper
-    - 11pt
-numbersections: true
-header-includes:
-    - \setcounter{tocdepth}{2}
-    - \setcounter{secnumdepth}{2}
-    - \usepackage{textalpha}
-    - \usepackage{geometry}
-    - \usepackage{caption}
-    - \usepackage{subcaption}
----
-
-\tableofcontents
-
-\newpage
-
-# Introduction {-}
-
-## Cadre de cette thèse
-
-### Le genre encyclopédique
-
-L'«esprit encyclopédique» [@Macary1973_MACLDU]
-
-Les précurseurs de l'EDdA : Basnage [@galleron_tenir_2022] dont est issu le
-Trevoux [@le_guern_caief_0571_5865_1983_num_35_1_2402], qui se posera comme un
-farouche opposant de l'EDdA [@morin_rde_0769_0886_1989_num_7_1_1034]. L'EDdA ne
-devait initialement être qu'une traduction de Chambers
-[@kafker_andre_francois_2016].
-
-### La géographie, une science en recomposition
-
-Période intermédiaire marquée par une professionnalisation
-[@rey_professionnalisation_2022] de l'encyclopédisme
-
-### Le projet GÉODE
-
-Notre corpus de 4 encyclopédies que nous avons choisies -> celles que j'ai pu
-regarder et pourquoi
-
-## Contributions
-
-### Version numérique structurée de LGE
-
-Segmentation (premier résultat par rapport à la version de base — "baseline" —
-de fin de Collex-Persée, pour tâche de segmentation). Visée patrimoniale, outil
-pour les chercheur·ses en SHS (recherche par vedette).
-
-### La biographie dans l'EDdA
-
-### Motifs discursifs géographiques
diff --git a/Introduction/Cadre.md b/Introduction/Cadre.md
new file mode 100644
index 0000000000000000000000000000000000000000..8aa8d47d80a4c4e52d1bfbabed9849a3663e011c
--- /dev/null
+++ b/Introduction/Cadre.md
@@ -0,0 +1,22 @@
+## Cadre de cette thèse
+
+### Le genre encyclopédique
+
+L'«esprit encyclopédique» [@Macary1973_MACLDU]
+
+Les précurseurs de l'EDdA : Basnage [@galleron_tenir_2022] dont est issu le
+Trevoux [@le_guern_caief_0571_5865_1983_num_35_1_2402], qui se posera comme un
+farouche opposant de l'EDdA [@morin_rde_0769_0886_1989_num_7_1_1034]. L'EDdA ne
+devait initialement être qu'une traduction de Chambers
+[@kafker_andre_francois_2016].
+
+### La géographie, une science en recomposition
+
+Période intermédiaire marquée par une professionnalisation
+[@rey_professionnalisation_2022] de l'encyclopédisme
+
+### Le projet GÉODE
+
+Notre corpus de 4 encyclopédies que nous avons choisies -> celles que j'ai pu
+regarder et pourquoi
+
diff --git a/Introduction/Contributions.md b/Introduction/Contributions.md
new file mode 100644
index 0000000000000000000000000000000000000000..5e3338a04ef595b987f2705995613769014bcaf9
--- /dev/null
+++ b/Introduction/Contributions.md
@@ -0,0 +1,11 @@
+## Contributions
+
+### Version numérique structurée de LGE
+
+Segmentation (premier résultat par rapport à la version de base — "baseline" —
+de fin de Collex-Persée, pour tâche de segmentation). Visée patrimoniale, outil
+pour les chercheur·ses en SHS (recherche par vedette).
+
+### La biographie dans l'EDdA
+
+### Motifs discursifs géographiques
diff --git a/Introduction/text.sh b/Introduction/text.sh
new file mode 100755
index 0000000000000000000000000000000000000000..dac89bb3c2a8712ff2156c07b521fd8c4633abab
--- /dev/null
+++ b/Introduction/text.sh
@@ -0,0 +1,7 @@
+#!/usr/bin/env bash
+
+source ./chapter.sh "Introduction {-}"
+
+cat Introduction/Cadre.md
+cat Introduction/Contributions.md
+
diff --git a/Makefile b/Makefile
index f9d650d9eb386bc6ca09dd87dae25c1104dd46bf..7044dbcabf2340d6ad620d9c749fca7f52d9243c 100644
--- a/Makefile
+++ b/Makefile
@@ -1,21 +1,33 @@
 DOCUMENT = Manuscrit
-CHAPTERS = Introduction ÉdlA Corpus Géographie Contrastes Conclusion Glossaire Bibliographie
-SOURCES = $(CHAPTERS:%=%.md)
+CHAPTERS = Introduction ÉdlA Corpus Géographie Contrastes Conclusion Glossaire
+SOURCES = $(CHAPTERS:%=%/text.sh)
 BIBLIOGRAPHY = biblio.bib
 SNIPPETS = $(wildcard src/*.md)
 GRAPHS = $(wildcard src/*.gv)
-#PICTURES = $(action_t1 arbre boumerang_t7 cathète_t9 europe_t16 gelocus_t18 last_page_top_left_t1 sanjo_t29:%=article/%)
-#FIGURES = $(PICTURES:%=figure/%.png) $(GRAPHS:src/%.gv=figure/%.png) $(SNIPPETS:src/%.md=figure/%.png)
-FIGURES = $(shell sed -n 's@.*(\(figure/.*.\(png\|jpe?g\)\)).*@\1@p' $(SOURCES))
+FIGURES = $(shell find $(CHAPTERS) -type f -name '*.md' -exec cat '{}' \; | sed -n 's@.*(\(figure/.*\.\(png\|jpe\?g\)\)).*@\1@p')
 CSL = apa.csl
 FILTERS = pandoc-fignos
+LUA_FILTERS = ./filters/with-bibliography.lua
+WITH_FILTERS = $(FILTERS:%=--filter %) $(LUA_FILTERS:%=--lua-filter %)
 FILTER_SCRIPTS = glossary
 SCRIPTS = $(FILTER_SCRIPTS:%=scripts/%)
 
+DEPENDENCIES=$(SCRIPTS) $(FIGURES) $(BIBLIOGRAPHY)
+
+.SECONDEXPANSION:
+
+sources = $(shell find $(1) -type f -name '*.md')
+chapter-sources = $(call sources,$*)
+
 all: $(DOCUMENT).pdf
 
-$(DOCUMENT).pdf: $(SCRIPTS) $(FIGURES) $(BIBLIOGRAPHY) $(SOURCES)
-	cat $(SOURCES) $(SCRIPTS:%=| %) | pandoc $(FILTERS:%=--filter %) --citeproc --bibliography=$(BIBLIOGRAPHY) --csl=$(CSL) -o $@
+$(CHAPTERS:%=%.pdf):
+
+$(DOCUMENT).pdf: $(DOCUMENT).sh $(foreach chapter,$(CHAPTERS),$(call sources,$(chapter))) $(SCRIPTS) $(FIGURES) $(BIBLIOGRAPHY)
+	./$(DOCUMENT).sh $(SCRIPTS:%=| %) | pandoc $(WITH_FILTERS) -o $@
+
+%.pdf: %/text.sh $${chapter-sources} $(SCRIPTS) $(FIGURES) $(BIBLIOGRAPHY)
+	$< $(SCRIPTS:%=| %) | pandoc $(WITH_FILTERS) -o $@
 
 figure/%.png: src/%.gv
 	dot -Tpng $< -o $@
diff --git a/Manuscrit.sh b/Manuscrit.sh
new file mode 100755
index 0000000000000000000000000000000000000000..01c7075fc61ceed42ce33e777fbd1a329667a5ce
--- /dev/null
+++ b/Manuscrit.sh
@@ -0,0 +1,18 @@
+#!/bin/sh
+
+. ./header.sh
+
+cat <<EOF
+\\tableofcontents
+\\newpage
+EOF
+
+
+Introduction/text.sh
+ÉdlA/text.sh
+Corpus/text.sh
+Géographie/text.sh
+Contrastes/text.sh
+Conclusion/text.sh
+Glossaire/text.sh
+
diff --git a/chapter.sh b/chapter.sh
new file mode 100644
index 0000000000000000000000000000000000000000..b5159b91c78d20290be9fd3d9fc50203a4650b68
--- /dev/null
+++ b/chapter.sh
@@ -0,0 +1,5 @@
+[ -n "${HEADER_INCLUDED}" ] || source ./header.sh 2
+
+echo "# ${1}"
+printf '%s\n' '\etocsettocstyle{\rule{\linewidth}{\tocrulewidth}\vskip0.5\baselineskip}{\rule{\linewidth}{\tocrulewidth}}'
+printf '%s\n' '\localtableofcontents'
diff --git a/filters/with-bibliography.lua b/filters/with-bibliography.lua
new file mode 100644
index 0000000000000000000000000000000000000000..e242a9d7527cdfea4fe4e2a5d3911fc5196915a8
--- /dev/null
+++ b/filters/with-bibliography.lua
@@ -0,0 +1,20 @@
+function Pandoc(doc)
+	local level = doc.meta['bibliography-level']
+	if level == nil then
+		level = 1
+	else
+		level = tonumber(pandoc.utils.stringify(level))
+	end
+
+	local title = doc.meta['bibliography-title']
+	if title == nil or title == '' then
+		error("The bibliography-title metadata parameter hasn't been defined")
+	end
+
+	doc.blocks:extend({pandoc.Header(
+		level,
+		title,
+		{id = 'bibliography', class = 'unnumbered'}
+	)})
+	return pandoc.utils.citeproc(doc)
+end
diff --git a/header.sh b/header.sh
new file mode 100644
index 0000000000000000000000000000000000000000..ea9e184fb69a993f75fc310cef7b5b2db485155e
--- /dev/null
+++ b/header.sh
@@ -0,0 +1,9 @@
+echo '---'
+if [ -z "${1}" ]
+then
+	cat manuscrit.yml
+else
+	sed 1,${1}d manuscrit.yml
+fi
+echo '---'
+export HEADER_INCLUDED=Y
diff --git a/manuscrit.yml b/manuscrit.yml
new file mode 100644
index 0000000000000000000000000000000000000000..8a4864de98061ac06265a5edc3a52817b343bcb8
--- /dev/null
+++ b/manuscrit.yml
@@ -0,0 +1,24 @@
+title: Méthodes et outils pour l'étude diachronique des discours géographiques dans les encyclopédies françaises
+author: Alice \textsc{Brenon}
+documentclass: report
+classoptions:
+    - a4paper
+    - 11pt
+numbersections: true
+bibliography: biblio.bib
+csl: apa.csl
+link-citations: true
+bibliography-level: 1
+bibliography-title: Bibliographie
+header-includes:
+    - \usepackage[french]{babel}
+    - \setcounter{tocdepth}{2}
+    - \setcounter{secnumdepth}{2}
+    - \usepackage{textalpha}
+    - \usepackage{geometry}
+    - \usepackage{caption}
+    - \usepackage{subcaption}
+    - \usepackage{etoc}
+    - \newlength\tocrulewidth
+    - \setlength{\tocrulewidth}{1.5pt}
+
diff --git "a/\303\211dlA.md" "b/\303\211dlA.md"
deleted file mode 100644
index 9948b96da2ed2f4f51574b30e38427a342a9d5a2..0000000000000000000000000000000000000000
--- "a/\303\211dlA.md"
+++ /dev/null
@@ -1,392 +0,0 @@
-# État de l'art
-
-## Textométrie
-
-### Cadre
-
-Origine via l'«École Française» de Benzécri [@benzecri__analyse_1973] tout à
-fait du côté mathématique / statistiques. Initialement, ça ne concerne que les
-mots bruts (les formes), puis la technologie permet de traiter du texte annoté
-(morpho-syntaxe puis syntaxe), faisant émerger la linguistique de corpus
-[@nazarenko_hal_00619268].
-
-Différentes modèles de distribution statistique des mots sont employées: khi2,
-loi de Poisson. @lafon_sur_1980 propose l'emploi d'une loi hypergéométrique
-(choix qui restera dans la conception de TXM [@heiden2010]).
-
-L'ouvrage fondateur traite de l'utilisation des corpus annotés en commentant une
-étude de discours de Mitterrand [@Labb1983FranoisM] \(un précurseur du corpus
-des Vœux de TXM [@heiden2010] ?), puis des dimensions transversales et de l'usage
-contrastif dans le cadre d'études diachroniques et enfin traite de la
-constitution des corpus eux-même. L'horizon est à l'époque le million de mots
-(notre corpus parallèle, 8 millions de tokens).
-
-### Contrastes
-
-Sur la constitution des corpus @pincemin_heterogeneite_2012 avertit qu'il est
-plus qu'un agglomérat de textes, tout en mentionnant une approche *WAC*
-privilégiant les volumes sur une construction délibérée. Notre étude se situe un
-peu entre les deux j'imagine ? Pas de place pour des textes non-encyclopédiques
-pour contraster, et un peu les articles qu'on peut récupérer dans l'état dans
-lequel on peut les récupérer.
-
-@laramee_production_2017 emploie une démarche contrastive pour faire opposer les
-tomes de l'EDdA et mettre en évidence le rôle des différents auteurs.
-
-### Arbre lexico-syntaxiques récurrents
-
-On commence à mentionner dans @nazarenko_hal_00619268 des «stéréotypes»
-
-Ils sont basés sur les notions de collocations @fellbaum_idioms_2007 puis de motif
-@longree_les_2008
-
-sont un processus récursif et permettent de s'abstraire des réalisation de
-surface contigentes à une langue @tutin_routines_2016
-
-### Possibilités
-
-Des tournures de phrases peuvent être liées à des genres, ce qui peut être
-révélé par une étude contrastive @kraif_constructions_2016,
-@gonon_phraseologismes_2020 similaire à notre objectif.
-
-## La place de la géographie
-
-
-## Genre textuel
-
-### Saisir la notion de genre
-
-@beauvisage_2001 explore le genre policier ($\rightarrow$ à lire pour voir s'il
-y a une caractérisation intéressante de la notion de «genre»)
-
-### Le cas de la lexicographie
-
-Les dictionnaires entretiennent une relation étroite avec la notion de
-collocations et de phraséologismes: les entrées sont d'autant plus utiles qu'elles tiennent compte des
-phraséologismes existant dans la langue, des modèles de langue
-
-@zhu_discours_2022 s'intéresse à la structure propre aux dictionnaires qui met
-en relation un terme et une définition. @loiseau_dictionnaires_2011
-
-If encyclopedias are thus historically more recent than dictionaries they also
-depart from the latter on their approach. The purpose of dictionaries from their
-origin is to collect words, to make an exhaustive inventory of the terms used in
-a domain or in a language in order to associate a *definition* to them, be it a
-translation in another language for a foreign language dictionary or a phrase
-explaining it for other dictionaries. As such, they are collections of *signs*
-and remain within the linguistic level of things. Entries in a dictionary often
-feature information such as the part of speech, the pronunciation or the
-etymology of the word they define.
-
-The entry for "Dictionnaire" in the *Encyclopédie* distinguishes between three
-types of dictionaries: one to define *words*, the second to define *facts* and
-the last one to define *things*, corresponding to the distinction between
-language, history, and science and arts dictionaries although according to its
-author, d'Alembert, each has to be of more than just one kind to be really good.
-In the full title of the *Encyclopédie*, the concept is more or less equated by
-means of the coordinating conjunction "ou" to a *Dictionnaire raisonné*,
-"reasoned dictionary", introducing the idea of encyclopedias as dictionaries
-with additional structure and a philosophical dimension.
-
-Back to the "Encyclopédie" article we read that a dictionary remaining strictly
-at the language level, a vocabulary, can be seen as the empty frame required for
-an encyclopedic dictionary that will fill it with additional depth. Given how
-d'Alembert insists on the importance of brevity for a clear definition in the
-"Dictionnaire de Langues" entry, it is clear that the *encyclopédistes* did not
-consider encyclopedias superior to dictionaries but really as a new subgenre
-departing from them in terms of purpose.
-
-The first immediately visible feature that sets encyclopedias apart from
-dictionaries and can be found in the *Encyclopédie* as well as in *La Grande
-Encyclopédie* is the presence of subject indicators at the beginning of articles
-right after the headword which organise them into a domain classification
-system. Those generally cover a broad range of subjects from scientific
-disciplines to litterature, and extending to political subjects and law.
-
-No element in the *dictionaries* module is explicitely designed for the purpose
-of encoding these indicators. As we have seen above, the elements set is geared
-towards the words themselves instead of the concept they represent. The closest
-tool for what we need is found in the `<usg/>` element used with a specific
-`type` attribute set to `dom` for "domain". Indeed several examples from the
-documentation encode subject indicators very similar to the ones found in
-encyclopedias within this element, but the match is not perfect either: all
-appear within one of multiple senses, as if to clarify each context in which the
-word can be used, as expected from the element's name, "usage". In
-encyclopedias, if the domain indicator does in certain cases help to distinguish
-between several entries sharing the same headword, the concept itself has
-evolved beyond this mere distinction. Looking back at the *Encyclopédie*, the
-adjective *raisonné* in the rest of the title directly introduces a notion of
-structure that links back to the "Systême figuré des connoissances humaines"
-[@blanchard2002] which schematic structure is shown in Figure
-@fig:systeme_figure. The authors have devised a branching system to classify all
-knowledge, and the occurrence at the beginning of articles, more than a tool to
-clear up possible ambiguities also points the reader to the correct place in
-this mind map.
-
-!["Systême figuré des connoissances humaines", the taxonomy at the heart of the Encyclopédie](figure/systême_figuré.png){#fig:systeme_figure}
-
-The situation regarding subject indicators is hardly better outside of the
-module. The `<domain/>` element despite its name belongs exclusively in the
-header of a document and focuses on the social context of the text, not on the
-knowledge area it covers. The `<interp/>` despite its name is not so much about
-labeling something as an interpretation to give to a context (which subject
-indicators could be if you consider that, placed at the beginning, they are used
-to direct the mind frame of the readers towards a particular subject). However,
-the documentation clearly demonstrates it as a tool for annotators of a
-document, which text content is not part of the original document but some
-additional result of an analysis performed in the context of the encoding, used
-only throughout references in XML attributes.
-
-This point, although not the most concerning, still remains the hardest to
-address but all things considered the `<usg/>` element stands out as the most
-relevant.
-
-### Discours scientifique
-
-Étudié sous l'angle des ALR par @ji_hal_01956323
-
-## Diachronie
-
-### Diachronie
-
-@diwersy_ressources_2017 s'attache à montrer les difficultés rencontrées en
-français sur la période XVIème -> XVIIème (graphie, ordre des mots, tokenization).
-
-@mayaffre_explorer_2019 montre un usage possible 
-
-@mouhouche_etude_2014 application à la didactique et épistémologie. Étude de la
-terminologie en physique, verbe résonner, de l'origine accoustique jusqu'à
-l'application aux planètes pour véhiculer les notions d'accord et de transfert
-d'énergie. Pas de textométrie mais une analyse qualitative d'occurrences. Voire
-peut-être une référence à Gaston Bachelard sur la notion d'*obstacle verbal*
-(Bachelard 1928).
-
-## Encodage XML-TEI
-
-### Module *dictionaries*
-
-The XML-TEI standard has a modular structure consisting of optional parts each
-covering specific needs such as the physical features of a source document, the
-transcription of oral corpora or particular requirements for textual domains
-like poetry, or, in our case, dictionaries. After describing why the dedicated
-module was a natural candidate to meet our needs, we formalise tools from
-graph theory to browse the specifications of this standard in a rational way and
-explore this module in detail.
-
-### A good starting point
-
-Data produced in the context of a project such as DISCO-LGE cannot be useful to
-future scientific projects unless it is *interoperable* and *reusable*. These
-are the two last key aspects of the FAIR[^FAIR] principles (*findability*,
-*accessibility*, *interoperability* and *reusability*) which we strive to follow
-as a guideline for efficient and quality research. It entails using standard
-formats and a standard for encoding historical texts in the context of digital
-humanities is XML-TEI, collectively developped by the *Text Encoding Initiative*
-consortium which publishes a set of technical specifications under the form of
-XML schemas, along with a range of tools to handle them and training resources.
-
-[^FAIR]: [https://www.go-fair.org/fair-principles/](https://www.go-fair.org/fair-principles/)
-
-The *dictionaries* module has been leveraged to encode dictionaries in projects
-NENUFAR[^NENUFAR] and BASNUM[^BASNUM] to encode respectively the *Petit Larousse
-Illustré* published by Pierre Larousse in 1905 [@bohbot2018], roughly
-contemporary to our target encyclopedia and the *Dictionnaire Universel* by
-Furetière, or rather its second edition edited by Henri Basnage de Beauval, an
-encyclopedic dictionary from the very early 18^th^ century [@williams2017].
-These successes made it a good starting point for our own encoding but the
-former does not have the encyclopedic dimension our corpus has and the latter is
-a much older text which had a tremendous influence on the european encyclopedic
-effort of the 18^th^ century but is not as clearly separated from the
-dictionaric stem as *La Grande Encyclopédie* is. For these reasons, we could not
-directly reuse the encoding schemes used in these projects and had to explore
-the XML-TEI schema systematically to devise our own.
-
-[^NENUFAR]: [https://cahier.hypotheses.org/nenufar](https://cahier.hypotheses.org/nenufar)
-[^BASNUM]: [https://anr.fr/Projet-ANR-18-CE38-0003](https://anr.fr/Projet-ANR-18-CE38-0003)
-
-The XML-TEI specification contains 590 elements, which are each documented on
-the consortium's website in the online reference pages. With an average of
-almost 80 possible child elements (79.91) within any given element, manually
-browsing such an massive network can prove quite difficult as the number of
-combinations sharply increases with each step.
-
-We transform the problem by representing this network as a directed graph, using
-elements of XML-TEI as nodes and placing edges if the destination node may be
-contained within the source node according to the schema. Please note that the
-word "element" is here used with the same meaning as in the TEI documentation to
-refer to the conceptual device characterised by a given tag name such as `p` or
-`div` and not to a particular instance of them that may occur in a given
-document. Figure @fig:dictionaries-subgraph, by using this transformation to
-display the *dictionaries* module, hints at the overall complexity of the whole
-specification.
-
-![The subgraph of the *dictionaries* module](figure/dictionaries.png){#fig:dictionaries-subgraph}
-
-### Application à la lexicographie
-
-The previous section about the structure of the *dictionaries* module and the
-features found in encyclopedias follows quite closely our own journey trying to
-encode first manually then by automatic means the articles of our corpus. This
-back and forth between trying to find patterns in the graph which reflects the patterns
-found in the text and questioning the relevance of the results explains the
-choice we ended up making but also the alternatives we have considered.
-
-#### Bend the semantics
-
-Several times, the issue of the semantics of some elements which posess the
-properties we need came up. This is the case for instance of the `<sense/>` and
-`<node/>` elements. It is very tempting to bend their documented semantics or to
-consider that their inclusion properties is part of what defines them, and hence
-justifies their ways in creative ways not directly recommended by the TEI
-specifications.
-
-This is the approach followed by project BASNUM[^BASNUM]. In the articles
-encoded for this project, `<note/>` elements are nested and used to structure
-the encyclopedic developments that occur in the articles.
-
-We have chosen not to follow the same path in the name of the FAIR principles to
-avoid the emergence of a custom usage differing from the documented one.
-
-#### Custom schema
-
-The other major reason behind our choice was the inclusion rules which exist
-between TEI elements and pushed us to look for different combinations. Another
-valid approach would have consisted in changing the structure of the inclusion
-graph itself, that is to say modify the rules. If `<entry/>` is the perfect
-element to encode article themselves, all that is really missing is the ability
-to accomodate nested structures with the `<div/>` element. This would also have
-the advantage of recovering the `<usg/>` and `<xr/>` elements which we have
-recognised as useful and which we lose as part of the tradeoff to get nested
-sections. Generating customised TEI schemas is made really easy with tools like
-ROMA[^ROMA], which we used to preview our change and suggest it to the TEI
-community.
-
-[^ROMA]: [https://roma.tei-c.org/](https://roma.tei-c.org/)
-
-Despite it not getting a wide adhesion, some suggested it could be used locally
-within the scope of project DISCO-LGE. However we chose not to do so, partially
-for the same reasons of interoperability as the previous scenario, but also for
-reasons of sturdiness in front of future evolutions. Making sure the alternative
-schema would remain useful entails to maintain it, regenerating it should the
-schema format evolve, with the risk that the tools to edit it might change or
-stop being maintained.
-
-## Traitement Automatique de la Langue
-
-### Étiquetage morpho-syntaxique
-
-### Classification
-
-#### Related work {#sec:relatedworks}
-
-Document classification is a general problem in text analysis.
-Classification might mean assigning documents to a topic (infrastructure
-or foreign policy), a type of content (news or advertisement), or a type
-of author/speaker (Labor or Conservative). Text corpora similar to
-encyclopedias include collections of political speeches (like *Hansard*
-for the UK, the US Congressional Record, or the *Archives
-parlementaires* for France). Here we survey existing literature that
-classifies large historical text corpora using different methods.
-
-##### Classifying encylopedias
-
-In exploring methods for classifying *EDdA* articles, we follow in the
-footsteps of the ARTFL project. In their 2009 paper Hornton et al
-[@horton2009mining] tested Naive Bayesian classification on two
-tasks: 1) classifying the originally unclassified articles and 2)
-applying this model on the already classified articles to compare the
-results. This second task also enabled them to explore which words were
-most important for the classification result. While the paper did not
-include a formal evaluation of the performance of the model, it did
-offer an important close reading for a selection of the results. Later,
-Roe et al [@roe2016discourses] used Latent Dirichlet Allocation (LDA)
-topic modelling to analyze automatically-identified groups of articles,
-and to compare these to the original classes. This research posited that
-the LDA-identified topics could be understood as discourses that were
-woven throughout *EDdA* and did not always neatly map onto original
-classes. Our work is motivated by this earlier research. We aim to
-establish a baseline for the classification task which can be improved
-on in the future, and which can be compared when using different
-classification metadata to fine-tune models (e.g. original classes,
-ARTFL simplified *normclasses*, or ENCCRE domain ensembles.
-
-We also take inspiration from researchers working with other
-encyclopedias. The Nineteenth-Century Knowledge project explored
-rule-based and ML methods[^8] to index 400k articles across 4 editions
-of the Encyclopedia Britannica [@grabus_representing_2019].[^9] Because
-Britannica editors did not use the same article classes over time,
-matching articles with Library of Congress Subject Headings enables
-cross-edition comparison and therefore improved discovery.
-
-##### Classifying other texts
-
-Beyond encyclopedias, humanities research has largely used text
-classification for subject or genre detection ("is this historical
-fiction or biography?\") and author/group identification ("was this
-speech given by a Labour or Conservative MP?\"
-[@peterson_classification_2018]).
-
-The popularity of LDA topic modeling for assessing the content of large
-text data is at least in part explained by the fact that it does not
-require pre-existing metadata or new annotations describing documents or
-document sections that can be used as training data: it is quicker to
-implement. In her analysis of British parliamentary speeches (Hansard),
-Guldi [@guldi_parliaments_2019] employs topic modeling to "critically
-search\" for "tensions and turning points\" in political debates in the
-UK. Baron et al [@barron_individuals_2018] use topic modeling as a
-jumping off point from which to measure the "novelty\" and "transience\"
-of speeches made during the first years of the French Revolution. This
-is useful because, while the speeches are usually attributed to a
-specific deputy and are dated, there is no other metadata about each
-speech.
-
-Using both LDA and other ML models, Underwood examines the history and
-instability of literary genre
-[@underwood2018historical; @underwood_life_2016; @underwood2020machine]
-and finds that computational methods are useful because they can
-"register and compare blurry family resemblances that might be difficult
-to define verbally without reductiveness\" (6) [@underwood_life_2016].
-Such a quantitative, predictive approach to text classification enables
-computational humanities research to think through the results in a
-different kind of interpretative environment.
-
-What does this all mean for encyclopedias written in eighteenth-century
-France, and how does it impact our experiment design and interpretation?
-First, we emphasize again that encyclopedia classes are, like genre,
-culturally-constructed categories that change over time (even within the
-volumes of one publication!). Second, our ability to recreate these
-classes using models sheds light on the extent to which they hold fast
-to certain linguistic features and points us to specific subsets of the
-work that conform or do not conform to the predictions (e.g., by
-evaluating true positives vs. false positives).
-
-##### Working in French
-
-Our research uses texts written in French with a smattering of other
-languages (especially Latin and Greek) during the eighteenth century
-[@bender2019rule]. We use some language-dependent methods on language
-models pre-trained on French documents. For example, we use the French
-version of FastText with CNN and LSTM experiment, but also multilingual
-BERT and CamemBERT. It can no longer be said that French is a
-low-resource language in Natural Language Processing, but lack of
-linguistic diversity in NLP still plays a role in experiment design.
-Perhaps even more important is the historical nature of our texts. We
-therefore still face hurdles in model performance that do not exist when
-one is working with short, modern, English texts
-[@galina_russell_geographical_2014; @spence_towards_2021]. The
-experiments below focus specifically on methods for French texts: in
-expanding this research to enyclopedias in other languages, including
-English, different considerations would necessarily be required.
-
-### Topic-modeling
-
-ÉCRIT, À PRENDRE DE DKE
-
-+
-
-COMPLÉTER AVEC recherches sur Structural Topic-Modeling
-
-### NER
-
-À FAIRE
-
diff --git "a/\303\211dlA/Diachronie.md" "b/\303\211dlA/Diachronie.md"
new file mode 100644
index 0000000000000000000000000000000000000000..3ea3ec36e6735100d9d24e25be1218373c146459
--- /dev/null
+++ "b/\303\211dlA/Diachronie.md"
@@ -0,0 +1,17 @@
+## Diachrony
+
+### Diachrony
+
+@diwersy_ressources_2017 sets out to show the difficulties encountered in
+French over the 16th to 17th century period (spelling, word order, tokenization).
+
+@mayaffre_explorer_2019 shows a possible use
+
+@mouhouche_etude_2014: an application to didactics and epistemology. A study of
+terminology in physics, following the verb *résonner* from its acoustic origin to
+its application to planets, where it conveys the notions of agreement (*accord*)
+and energy transfer. No textometry, but a qualitative analysis of occurrences.
+Possibly also a reference to Gaston Bachelard on the notion of *obstacle verbal*
+(Bachelard 1928).
+
+
diff --git "a/\303\211dlA/Genre_textuel.md" "b/\303\211dlA/Genre_textuel.md"
new file mode 100644
index 0000000000000000000000000000000000000000..f7fcba367071a0e17ec2f3d26211c0d8e509407e
--- /dev/null
+++ "b/\303\211dlA/Genre_textuel.md"
@@ -0,0 +1,93 @@
+## Textual genre
+
+### Grasping the notion of genre
+
+@beauvisage_2001 explores the detective genre ($\rightarrow$ to be read to see whether
+it offers an interesting characterisation of the notion of "genre")
+
+### The case of lexicography
+
+Dictionaries are closely tied to the notions of
+collocation and phraseologism: entries are all the more useful when they take into account the
+phraseologisms existing in the language, the language models
+
+@zhu_discours_2022 looks at the structure specific to dictionaries, which
+relates a term to a definition. @loiseau_dictionnaires_2011
+
+If encyclopedias are thus historically more recent than dictionaries, they also
+depart from the latter in their approach. The purpose of dictionaries from their
+origin is to collect words, to make an exhaustive inventory of the terms used in
+a domain or in a language in order to associate a *definition* to them, be it a
+translation in another language for a foreign language dictionary or a phrase
+explaining it for other dictionaries. As such, they are collections of *signs*
+and remain within the linguistic level of things. Entries in a dictionary often
+feature information such as the part of speech, the pronunciation or the
+etymology of the word they define.
+
+The entry for "Dictionnaire" in the *Encyclopédie* distinguishes between three
+types of dictionaries: one to define *words*, the second to define *facts* and
+the last one to define *things*, corresponding to the distinction between
+language, history, and science and arts dictionaries, although according to its
+author, d'Alembert, each has to be of more than just one kind to be really good.
+In the full title of the *Encyclopédie*, the concept is more or less equated by
+means of the coordinating conjunction "ou" to a *Dictionnaire raisonné*,
+"reasoned dictionary", introducing the idea of encyclopedias as dictionaries
+with additional structure and a philosophical dimension.
+
+Returning to the "Encyclopédie" article, we read that a dictionary remaining strictly
+at the language level, a vocabulary, can be seen as the empty frame required for
+an encyclopedic dictionary that will fill it with additional depth. Given how
+d'Alembert insists on the importance of brevity for a clear definition in the
+"Dictionnaire de Langues" entry, it is clear that the *encyclopédistes* did not
+consider encyclopedias superior to dictionaries but really as a new subgenre
+departing from them in terms of purpose.
+
+The first immediately visible feature that sets encyclopedias apart from
+dictionaries and can be found in the *Encyclopédie* as well as in *La Grande
+Encyclopédie* is the presence of subject indicators at the beginning of articles
+right after the headword which organise them into a domain classification
+system. These generally cover a broad range of subjects, from scientific
+disciplines to literature, extending to political subjects and law.
+
+No element in the *dictionaries* module is explicitly designed for the purpose
+of encoding these indicators. As we have seen above, the element set is geared
+towards the words themselves rather than the concepts they represent. The closest
+tool for what we need is found in the `<usg/>` element used with a specific
+`type` attribute set to `dom` for "domain". Indeed several examples from the
+documentation encode subject indicators very similar to the ones found in
+encyclopedias within this element, but the match is not perfect either: all
+appear within one of multiple senses, as if to clarify each context in which the
+word can be used, as expected from the element's name, "usage". In
+encyclopedias, if the domain indicator does in certain cases help to distinguish
+between several entries sharing the same headword, the concept itself has
+evolved beyond this mere distinction. Looking back at the *Encyclopédie*, the
+adjective *raisonné* in the rest of the title directly introduces a notion of
+structure that links back to the "Systême figuré des connoissances humaines"
+[@blanchard2002], whose schematic structure is shown in Figure
+@fig:systeme_figure. The authors have devised a branching system to classify all
+knowledge, and the indicator at the beginning of articles, more than a tool to
+clear up possible ambiguities, also points the reader to the correct place in
+this mind map.
+
+!["Systême figuré des connoissances humaines", the taxonomy at the heart of the Encyclopédie](figure/systême_figuré.png){#fig:systeme_figure}
+
+The situation regarding subject indicators is hardly better outside of the
+module. The `<domain/>` element, despite its name, belongs exclusively in the
+header of a document and focuses on the social context of the text, not on the
+knowledge area it covers. The `<interp/>` element could seem suited to
+labeling an interpretation to give to a context (which subject
+indicators could be if one considers that, placed at the beginning, they
+direct the reader's frame of mind towards a particular subject). However,
+the documentation clearly presents it as a tool for annotators of a
+document, whose text content is not part of the original document but an
+additional result of an analysis performed in the context of the encoding, used
+only through references in XML attributes.
+
+This point, although not the most concerning, still remains the hardest to
+address, but all things considered, the `<usg/>` element stands out as the most
+relevant.
+
+### Scientific discourse
+
+Studied from the angle of recurring lexico-syntactic trees (ALR) by @ji_hal_01956323
+
diff --git "a/\303\211dlA/G\303\251ographie.md" "b/\303\211dlA/G\303\251ographie.md"
new file mode 100644
index 0000000000000000000000000000000000000000..d9c95467afcf0bc882893cd8053efb124e26c883
--- /dev/null
+++ "b/\303\211dlA/G\303\251ographie.md"
@@ -0,0 +1,4 @@
+## The place of geography
+
+
+
diff --git "a/\303\211dlA/TAL.md" "b/\303\211dlA/TAL.md"
new file mode 100644
index 0000000000000000000000000000000000000000..840d76dfa866b9538e04431081ef21a249db8498
--- /dev/null
+++ "b/\303\211dlA/TAL.md"
@@ -0,0 +1,120 @@
+## Natural Language Processing
+
+### Part-of-speech tagging
+
+### Classification
+
+#### Related work {#sec:relatedworks}
+
+Document classification is a general problem in text analysis.
+Classification might mean assigning documents to a topic (infrastructure
+or foreign policy), a type of content (news or advertisement), or a type
+of author/speaker (Labour or Conservative). Text corpora similar to
+encyclopedias include collections of political speeches (like *Hansard*
+for the UK, the US Congressional Record, or the *Archives
+parlementaires* for France). Here we survey existing literature that
+classifies large historical text corpora using different methods.
+
+##### Classifying encyclopedias
+
+In exploring methods for classifying *EDdA* articles, we follow in the
+footsteps of the ARTFL project. In their 2009 paper, Horton et al.
+[@horton2009mining] tested Naive Bayesian classification on two
+tasks: 1) classifying the originally unclassified articles and 2)
+applying this model on the already classified articles to compare the
+results. This second task also enabled them to explore which words were
+most important for the classification result. While the paper did not
+include a formal evaluation of the performance of the model, it did
+offer an important close reading for a selection of the results. Later,
+Roe et al [@roe2016discourses] used Latent Dirichlet Allocation (LDA)
+topic modelling to analyze automatically-identified groups of articles,
+and to compare these to the original classes. This research posited that
+the LDA-identified topics could be understood as discourses that were
+woven throughout *EDdA* and did not always neatly map onto original
+classes. Our work is motivated by this earlier research. We aim to
+establish a baseline for the classification task which can be improved
+on in the future, and which can be compared when using different
+classification metadata to fine-tune models (e.g. original classes,
+ARTFL simplified *normclasses*, or ENCCRE domain ensembles).
+
+We also take inspiration from researchers working with other
+encyclopedias. The Nineteenth-Century Knowledge project explored
+rule-based and ML methods[^8] to index 400k articles across 4 editions
+of the Encyclopedia Britannica [@grabus_representing_2019].[^9] Because
+Britannica editors did not use the same article classes over time,
+matching articles with Library of Congress Subject Headings enables
+cross-edition comparison and therefore improved discovery.
+
+##### Classifying other texts
+
+Beyond encyclopedias, humanities research has largely used text
+classification for subject or genre detection ("is this historical
+fiction or biography?") and author/group identification ("was this
+speech given by a Labour or Conservative MP?"
+[@peterson_classification_2018]).
+
+The popularity of LDA topic modeling for assessing the content of large
+text data is at least in part explained by the fact that it does not
+require pre-existing metadata or new annotations describing documents or
+document sections that can be used as training data: it is quicker to
+implement. In her analysis of British parliamentary speeches (Hansard),
+Guldi [@guldi_parliaments_2019] employs topic modeling to "critically
+search" for "tensions and turning points" in political debates in the
+UK. Barron et al. [@barron_individuals_2018] use topic modeling as a
+jumping-off point from which to measure the "novelty" and "transience"
+of speeches made during the first years of the French Revolution. This
+is useful because, while the speeches are usually attributed to a
+specific deputy and are dated, there is no other metadata about each
+speech.
+
+Using both LDA and other ML models, Underwood examines the history and
+instability of literary genre
+[@underwood2018historical; @underwood_life_2016; @underwood2020machine]
+and finds that computational methods are useful because they can
+"register and compare blurry family resemblances that might be difficult
+to define verbally without reductiveness" (6) [@underwood_life_2016].
+Such a quantitative, predictive approach to text classification enables
+computational humanities research to think through the results in a
+different kind of interpretative environment.
+
+What does this all mean for encyclopedias written in eighteenth-century
+France, and how does it impact our experiment design and interpretation?
+First, we emphasize again that encyclopedia classes are, like genre,
+culturally-constructed categories that change over time (even within the
+volumes of one publication!). Second, our ability to recreate these
+classes using models sheds light on the extent to which they hold fast
+to certain linguistic features and points us to specific subsets of the
+work that conform or do not conform to the predictions (e.g., by
+evaluating true positives vs. false positives).
+
+##### Working in French
+
+Our research uses texts written in French with a smattering of other
+languages (especially Latin and Greek) during the eighteenth century
+[@bender2019rule]. We use some language-dependent methods that rely on language
+models pre-trained on French documents. For example, we use the French
+version of FastText in our CNN and LSTM experiments, but also multilingual
+BERT and CamemBERT. It can no longer be said that French is a
+low-resource language in Natural Language Processing, but lack of
+linguistic diversity in NLP still plays a role in experiment design.
+Perhaps even more important is the historical nature of our texts. We
+therefore still face hurdles in model performance that do not exist when
+one is working with short, modern, English texts
+[@galina_russell_geographical_2014; @spence_towards_2021]. The
+experiments below focus specifically on methods for French texts: in
+expanding this research to encyclopedias in other languages, including
+English, different considerations would necessarily be required.
+
+### Topic-modeling
+
+WRITTEN, TO BE TAKEN FROM DKE
+
++
+
+TO BE COMPLETED WITH research on Structural Topic Modeling
+
+### NER
+
+TODO
+
+
diff --git "a/\303\211dlA/Textom\303\251trie.md" "b/\303\211dlA/Textom\303\251trie.md"
new file mode 100644
index 0000000000000000000000000000000000000000..90f9592f0956814b5f570f83ecd70891cadd2211
--- /dev/null
+++ "b/\303\211dlA/Textom\303\251trie.md"
@@ -0,0 +1,50 @@
+## Textometry
+
+### Framework
+
+The field originates in Benzécri's "French School" [@benzecri__analyse_1973], firmly
+on the mathematical and statistical side. Initially it only dealt with raw
+words (word forms); later, technology made it possible to process annotated text
+(morpho-syntax, then syntax), giving rise to corpus linguistics
+[@nazarenko_hal_00619268].
+
+Different statistical distribution models for words are used: chi-squared,
+Poisson distribution. @lafon_sur_1980 proposes using a hypergeometric distribution
+(a choice that carried over into the design of TXM [@heiden2010]).
+
+The founding work covers the use of annotated corpora by commenting on a
+study of Mitterrand's speeches [@Labb1983FranoisM] \(a precursor of TXM's
+*Vœux* corpus [@heiden2010]?), then cross-cutting dimensions and contrastive
+use in the context of diachronic studies, and finally the building of the
+corpora themselves. At the time, the horizon was one million words
+(our parallel corpus has 8 million tokens).
+
+### Contrasts
+
+On corpus building, @pincemin_heterogeneite_2012 warns that a corpus is
+more than an agglomerate of texts, while mentioning a *WAC* approach
+that favours volume over deliberate construction. Our study presumably sits
+somewhere between the two? There is no room for non-encyclopedic texts
+to contrast with, and we more or less take the articles we can retrieve, in the
+state in which we can retrieve them.
+
+@laramee_production_2017 uses a contrastive approach to set the volumes of
+the EDdA against one another and highlight the role of the different authors.
+
+### Recurring lexico-syntactic trees
+
+Mentions of "stereotypes" begin to appear in @nazarenko_hal_00619268
+
+They build on the notions of collocation @fellbaum_idioms_2007, then of *motif*
+@longree_les_2008
+
+They are a recursive process and make it possible to abstract away from
+surface realisations contingent on a given language @tutin_routines_2016
+
+### Possibilities
+
+Turns of phrase can be tied to genres, which a contrastive study can
+reveal @kraif_constructions_2016,
+@gonon_phraseologismes_2020, similar to our objective.
+
+
diff --git "a/\303\211dlA/XML-TEI.md" "b/\303\211dlA/XML-TEI.md"
new file mode 100644
index 0000000000000000000000000000000000000000..96e25785f5505d5ae150e430352e8eefbe08d400
--- /dev/null
+++ "b/\303\211dlA/XML-TEI.md"
@@ -0,0 +1,111 @@
+## XML-TEI encoding
+
+### Module *dictionaries*
+
+The XML-TEI standard has a modular structure consisting of optional parts each
+covering specific needs such as the physical features of a source document, the
+transcription of oral corpora or particular requirements for textual domains
+like poetry, or, in our case, dictionaries. After describing why the dedicated
+module was a natural candidate to meet our needs, we formalise tools from
+graph theory to browse the specifications of this standard in a rational way and
+explore this module in detail.
+
+### A good starting point
+
+Data produced in the context of a project such as DISCO-LGE cannot be useful to
+future scientific projects unless it is *interoperable* and *reusable*. These
+are the last two key aspects of the FAIR[^FAIR] principles (*findability*,
+*accessibility*, *interoperability* and *reusability*), which we strive to follow
+as a guideline for efficient and quality research. This entails using standard
+formats, and a standard for encoding historical texts in the context of digital
+humanities is XML-TEI, collectively developed by the *Text Encoding Initiative*
+consortium, which publishes a set of technical specifications in the form of
+XML schemas, along with a range of tools to handle them and training resources.
+
+[^FAIR]: [https://www.go-fair.org/fair-principles/](https://www.go-fair.org/fair-principles/)
+
+The *dictionaries* module has been leveraged in projects
+NENUFAR[^NENUFAR] and BASNUM[^BASNUM] to encode respectively the *Petit Larousse
+Illustré* published by Pierre Larousse in 1905 [@bohbot2018], roughly
+contemporary to our target encyclopedia, and the *Dictionnaire Universel* by
+Furetière, or rather its second edition edited by Henri Basnage de Beauval, an
+encyclopedic dictionary from the very early 18^th^ century [@williams2017].
+These successes made it a good starting point for our own encoding, but the
+former does not have the encyclopedic dimension our corpus has, and the latter is
+a much older text which had a tremendous influence on the European encyclopedic
+effort of the 18^th^ century but is not as clearly separated from the
+dictionary stem as *La Grande Encyclopédie* is. For these reasons, we could not
+directly reuse the encoding schemes used in these projects and had to explore
+the XML-TEI schema systematically to devise our own.
+
+[^NENUFAR]: [https://cahier.hypotheses.org/nenufar](https://cahier.hypotheses.org/nenufar)
+[^BASNUM]: [https://anr.fr/Projet-ANR-18-CE38-0003](https://anr.fr/Projet-ANR-18-CE38-0003)
+
+The XML-TEI specification contains 590 elements, which are each documented on
+the consortium's website in the online reference pages. With an average of
+almost 80 possible child elements (79.91) within any given element, manually
+browsing such a massive network can prove quite difficult as the number of
+combinations sharply increases with each step.
+
+We transform the problem by representing this network as a directed graph, using
+elements of XML-TEI as nodes and placing edges if the destination node may be
+contained within the source node according to the schema. Please note that the
+word "element" is here used with the same meaning as in the TEI documentation to
+refer to the conceptual device characterised by a given tag name such as `p` or
+`div` and not to a particular instance of them that may occur in a given
+document. Figure @fig:dictionaries-subgraph, by using this transformation to
+display the *dictionaries* module, hints at the overall complexity of the whole
+specification.
+
+![The subgraph of the *dictionaries* module](figure/dictionaries.png){#fig:dictionaries-subgraph}
+
+### Application to lexicography
+
+The previous section about the structure of the *dictionaries* module and the
+features found in encyclopedias follows quite closely our own journey trying to
+encode first manually then by automatic means the articles of our corpus. This
+back and forth between trying to find patterns in the graph which reflect the patterns
+found in the text and questioning the relevance of the results explains the
+choice we ended up making but also the alternatives we have considered.
+
+#### Bend the semantics
+
+Several times, the issue of the semantics of some elements which possess the
+properties we need came up. This is the case for instance of the `<sense/>` and
+`<note/>` elements. It is very tempting to bend their documented semantics or to
+consider that their inclusion properties are part of what defines them, and hence
+justify using them in creative ways not directly recommended by the TEI
+specifications.
+
+This is the approach followed by project BASNUM[^BASNUM]. In the articles
+encoded for this project, `<note/>` elements are nested and used to structure
+the encyclopedic developments that occur in the articles.
+
+We have chosen not to follow the same path in the name of the FAIR principles to
+avoid the emergence of a custom usage differing from the documented one.
+
+#### Custom schema
+
+The other major reason behind our choice was the inclusion rules which exist
+between TEI elements and pushed us to look for different combinations. Another
+valid approach would have consisted in changing the structure of the inclusion
+graph itself, that is to say modifying the rules. If `<entry/>` is the perfect
+element to encode the articles themselves, all that is really missing is the ability
+to accommodate nested structures with the `<div/>` element. This would also have
+the advantage of recovering the `<usg/>` and `<xr/>` elements which we have
+recognised as useful and which we lose as part of the tradeoff to get nested
+sections. Generating customised TEI schemas is made really easy with tools like
+ROMA[^ROMA], which we used to preview our change and suggest it to the TEI
+community.
+
+[^ROMA]: [https://roma.tei-c.org/](https://roma.tei-c.org/)
+
+Although the proposal did not gain wide support, some suggested it could be used locally
+within the scope of project DISCO-LGE. However, we chose not to do so, partially
+for the same reasons of interoperability as in the previous scenario, but also for
+reasons of robustness in the face of future changes. Making sure the alternative
+schema would remain useful entails maintaining it and regenerating it should the
+schema format evolve, with the risk that the tools to edit it might change or
+stop being maintained.
+
+
diff --git "a/\303\211dlA/text.sh" "b/\303\211dlA/text.sh"
new file mode 100755
index 0000000000000000000000000000000000000000..47c073f00b1ca4f6c3d73892a7112c4353a979fd
--- /dev/null
+++ "b/\303\211dlA/text.sh"
@@ -0,0 +1,10 @@
+#!/bin/sh
+
+source ./chapter.sh "État de l'Art"
+
+cat ÉdlA/Textométrie.md
+cat ÉdlA/Géographie.md
+cat ÉdlA/Genre_textuel.md
+cat ÉdlA/Diachronie.md
+cat ÉdlA/XML-TEI.md
+cat ÉdlA/TAL.md