diff --git a/Bibliographie.md b/Bibliographie.md deleted file mode 100644 index b87ff7921d254227b9893be02a7b3f06ceec9b55..0000000000000000000000000000000000000000 --- a/Bibliographie.md +++ /dev/null @@ -1 +0,0 @@ -# Bibliography diff --git a/Conclusion.md b/Conclusion.md deleted file mode 100644 index 7d9035eb20b43744c18e25c5892274f2fdb8252f..0000000000000000000000000000000000000000 --- a/Conclusion.md +++ /dev/null @@ -1,5 +0,0 @@ -# Conclusion {-} - -## Regrets - -## Souhaits diff --git a/Conclusion/Regrets.md b/Conclusion/Regrets.md new file mode 100644 index 0000000000000000000000000000000000000000..937b53d7d56191395d49431b11f9a6ff7f91bbc2 --- /dev/null +++ b/Conclusion/Regrets.md @@ -0,0 +1,3 @@ +## Regrets + + diff --git a/Conclusion/Souhaits.md b/Conclusion/Souhaits.md new file mode 100644 index 0000000000000000000000000000000000000000..f65313cc8c3f75ff70ebf890ddf4c6a2f85bb527 --- /dev/null +++ b/Conclusion/Souhaits.md @@ -0,0 +1,2 @@ +## Souhaits + diff --git a/Conclusion/text.sh b/Conclusion/text.sh new file mode 100755 index 0000000000000000000000000000000000000000..795bfb5fc6a46469fb6ca41a8c79ead71568832b --- /dev/null +++ b/Conclusion/text.sh @@ -0,0 +1,6 @@ +#!/bin/sh + +source ./chapter.sh 'Conclusion {-}' + +cat Conclusion/Regrets.md +cat Conclusion/Souhaits.md diff --git "a/Contrastes/Centralit\303\251.md" "b/Contrastes/Centralit\303\251.md" new file mode 100644 index 0000000000000000000000000000000000000000..52206935d34835daab5e3d2d2b9e0f8738591919 --- /dev/null +++ "b/Contrastes/Centralit\303\251.md" @@ -0,0 +1,7 @@ +## Statistiques + +### Mesure de centralité + +(DKE) + + diff --git a/Contrastes.md "b/Contrastes/Lexicom\303\251trie.md" similarity index 63% rename from Contrastes.md rename to "Contrastes/Lexicom\303\251trie.md" index b0c0d1c3dc143ab6554f878b7c4e28c656e8fed3..1338f67480e7986369542ea6c493fbbf9f6b83f2 100644 --- a/Contrastes.md +++ "b/Contrastes/Lexicom\303\251trie.md" @@ -1,6 +1,3 @@ - -# Études contrastives - ## Analyse 
lexico-grammaticale (Lexicométrie, Textométrique, ?…) ### Contrastes Internes @@ -19,11 +16,4 @@ Np vs. Nc #### Adjectifs préférés -## Phraséologie, discours disciplinaires et Arbres Lexico-syntaxiques Récurrents - -## Statistiques - -### Mesure de centralité - -(DKE) diff --git "a/Contrastes/Phras\303\251ologie.md" "b/Contrastes/Phras\303\251ologie.md" new file mode 100644 index 0000000000000000000000000000000000000000..6f1f4c2391b76c241cf8dc9a9291fe9eed0e1d7d --- /dev/null +++ "b/Contrastes/Phras\303\251ologie.md" @@ -0,0 +1,3 @@ +## Phraséologie, discours disciplinaires et Arbres Lexico-syntaxiques Récurrents + + diff --git a/Contrastes/text.sh b/Contrastes/text.sh new file mode 100755 index 0000000000000000000000000000000000000000..b688ad4fbb72bcbd00aebb622143d7b7837782dc --- /dev/null +++ b/Contrastes/text.sh @@ -0,0 +1,7 @@ +#!/bin/sh + +source ./chapter.sh 'Études contrastives' + +cat Contrastes/Lexicométrie.md +cat Contrastes/Phraséologie.md +cat Contrastes/Centralité.md diff --git a/Corpus/Annotation.md b/Corpus/Annotation.md new file mode 100644 index 0000000000000000000000000000000000000000..a4ed6cebca3a0d92445d78e8c1449cbe30ec4dcf --- /dev/null +++ b/Corpus/Annotation.md @@ -0,0 +1,17 @@ +## Annotation en parties de discours et syntaxe + +### Jeu d'étiquettes + +Nous utilisons le [jeu d'étiquettes]() du projet +[PRESTO](http://presto.ens-lyon.fr/) + +Alors non en fait Stanza c'est bien aussi avec les +[UPOS](https://universaldependencies.org/docs/u/pos/) + +### Chaînes de traitement + +- PRESTO +- Stanza + + + diff --git a/Corpus.md b/Corpus/Domaines.md similarity index 50% rename from Corpus.md rename to Corpus/Domaines.md index b36424d96699ab037a1aff74c5a9a6688cf300f8..ad3226dba65f4207dcc1915b567054d8068451e9 100644 --- a/Corpus.md +++ b/Corpus/Domaines.md @@ -1,726 +1,3 @@ -# Préparation et enrichissement du corpus - -## Formats et états des textes - -### L'Encyclopédie - -In common parlance, the terms "dictionaries" and "encyclopedias" are used as 
-near synonyms to refer to books compiling vast amounts of knowledge into lists -of definitions ordered alphabetically. Their similarity is even visible in the -way they are coordinated in the full title of the *Encyclopédie*, which is -probably the most famous work of the genre and a symbol of the Age of -Enlightenment. If the word "encyclopedia" is nowadays part of our vocabulary, it -was much more unusual and in fact controversial when Diderot and d'Alembert -decided to use it in the title of their book. - -The definition given by Furetière in his *Dictionnaire Universel* in 1690 is -still close to its Greek etymology: a "ring of all knowledge", from *κύκλος*, -"circle", and *παιδεία*, "knowledge". This meaning is the one used for instance -by Rabelais in *Pantagruel*, when he has Thaumaste declare that Panurge opened -to him "le vray puys et abisme de Encyclopedie" ("the true well and abyss of -Encyclopedia"). At the time the word still mostly referred to the abstract concept -of mastering all knowledge at once. Furetière adds that it is a quality one -is unlikely to possess, and even seems to condemn its pursuit as a form of -hubris: "C'est une témérité à un homme de vouloir posséder l'Encyclopédie" -("it is reckless for a man to want to possess Encyclopedia"). - -Beyond this moral reproach, the concept that pleased Rabelais was somewhat dated -at the end of the 17^th^ century and was attacked in the -*Dictionnaire Universel François et Latin*, commonly referred to as the -*Dictionnaire de Trevoux*, as utterly "burlesque" ("parodic"). The entry for -"Encyclopédie" remained unchanged in the four editions issued between 1721 and -1752, mocking the use of the word and discouraging its readers from pursuing it. To -that end, it quotes a poem by Pibrac encouraging people to specialise in -only one discipline lest they should not reach perfection, based on an -argument that resembles the saying "Jack of all trades, master of none". 
It -is all the more interesting that the definition remained unaltered until 1752, -one year after the publication of the first volume of the *Encyclopédie*. The -Jesuits who edited the *Dictionnaire de Trevoux* frowned upon the project of the -*Encyclopédie*, which they managed to get banned the same year by the Council of -State on the charge of attempting to destroy the royal authority, inspiring -rebellion and corrupting morality in general. There is much more at stake than -words here, but the attempt to deprecate the word itself is part of their fight -against the philosophers of the Enlightenment. - -Diderot did not ignore these attacks: he starts the very definition of -the word "Encyclopédie" in the *Encyclopédie* itself with a strong rebuttal. He -directly dismisses the concerns expressed in the *Dictionnaire de Trevoux* as -mere self-doubt that their authors should not generalise to everyone, then leaves -the main point to a Latin quote by Chancellor Bacon [@lojkine2013], who argues -that a collaborative work can achieve much more than any talented man could: -what may not be within reach of a single man, within a single -lifetime, may be achieved by a common effort throughout generations. - -History hints that Diderot's opponents took his defence of the feasibility of -the project quite seriously, considering the fact that they had the -*Encyclopédie*'s privileges revoked again six years after its publication -was resumed [@moureau2001]. As a consequence, the remaining ten volumes -containing the text of the articles had to be published illegally until 1765, -thanks to the secret protection of Malesherbes who, despite being head of royal -censorship, saved the manuscripts from destruction. They were printed secretly -outside of Paris and the books were (falsely) labeled as coming from Neufchâtel. 
-Following the high demand from the booksellers who feared they would lose the -money they had invested in the project, a special privilege was issued for the -volumes containing the plates, which were released publicly from 1762 to 1772. - -In any case, in their last edition in 1771 the authors of the *Dictionnaire de -Trevoux* had no choice but to acknowledge the success of the encyclopedic -projects of the 18^th^ century. In this version, the definition -was entirely reworked, mildly stating that good encyclopedias are difficult to -make because of the amount of knowledge necessary and the work needed to keep up -with scientific progress, instead of calling the effort a parody. It credits -Chambers's *Cyclopædia* for being a decent attempt before referring anonymously -though quite explicitly to Diderot and d'Alembert's project by naming the -collective "Une Société de gens de Lettres" and writing that it started in 1751. -Even more importantly, two new entries were added after it: one for the -adjective "encyclopédique" and another one for the noun "encyclopédiste", -silently admitting how the project had changed its time and the relation to -knowledge itself. - -#### Contexte de l'œuvre - -#### Versions disponibles - -The ARTFL[^ARTFL] provides a version of it. - -[^ARTFL]: [http://artfl-project.uchicago.edu/](http://artfl-project.uchicago.edu/) - -#### Traitements - -### La Grande Encyclopédie - -#### Contexte de l'œuvre - -*La Grande Encyclopédie, Inventaire raisonné des Sciences, des Lettres et des -Arts par une Société de savants et de gens de lettres* (hereafter *LGE*) was -published in France between 1885 and 1902 by a team of more than two hundred -specialists organised into eleven sections. The text comprises 31 volumes of about -1200 pages each and was, according to @jacquet_pfau2015, the last major French -encyclopedic undertaking to follow in the footsteps of its prestigious -ancestor, the *EDdA*, published about 130 years earlier. 
- -The full title of the work already shows its intended filiation with the *EDdA* -and its ambition to update it [@jacquet_pfau_actualiser_2022]. - -#### Versions disponibles - -A digital version of this work was produced by the BnF and put -online[^LGE-V1] in 2007. Based on an undated reprint of the original -edition, it comprises one image per page of the work, scanned in greyscale -at a resolution of 300x300 pixels per inch. From these image files, a -partial version of the text was derived by applying an optical character -recognition program ([@=OCR]). This version has a number of limitations which -prevented a study of the full text by automatic means such as textometry. - -[^LGE-V1]: [http://catalogue.bnf.fr/ark:/12148/cb377013071](http://catalogue.bnf.fr/ark:/12148/cb377013071) - -First, the text is incomplete: although a large number of volumes have been OCRed, -some, such as volumes 5 and 18, have not been processed and no -text is available for them on the Gallica -website[^LGE-V1-liste-des-tomes]. This makes an exhaustive study impossible, -but nothing suggests that a study based on the available volumes would be subject -to any particular bias, since the OCRed volumes do not seem to have been -chosen according to a specific logic: the missing ones are, for instance, neither -contiguous nor located at the beginning or the end of the work. Second, this "plain -text" version (actually an HTML page with minimal markup) carries only a -very superficial annotation and, in particular, is not segmented into articles. 
-This is a major obstacle to our intended use, since the unit of study of our -corpus is the article, which makes it possible both to conduct a contrastive -study by grouping articles by knowledge domain or by author and to observe the -structure of the domains by comparing which articles were kept or not between two -encyclopedias and, where they were, whether the knowledge domain associated with -them is the same. Finally, errors in the detection of the page -layout ([@=OLR]) significantly obscure the text by locally permuting its -content, which sometimes mixes fragments of different articles together (making -the segmentation of the text into articles considerably harder) and in every case -damages the structure of sentences, which introduces errors into the later -morphosyntactic and syntactic annotation steps that we need to apply to the -text in order to perform textometry. - -[^LGE-V1-liste-des-tomes]: [https://gallica.bnf.fr/ark:/12148/bpt6k246407#](https://gallica.bnf.fr/ark:/12148/bpt6k246407#) - -To overcome these shortcomings, the CollEx Persée project -DISCO-LGE[^DISCO-LGE] set out to digitise this encyclopedia anew in -partnership with the BnF and to produce a version encoded in XML-TEI. This -new version was made from photographs of an original -copy[^LGE-V2] held at the Bibliothèque de l'Arsenal in Paris[^bib-arsenal]. 
- -[^DISCO-LGE]: [https://www.collexpersee.eu/projet/disco-lge/](https://www.collexpersee.eu/projet/disco-lge/) -[^LGE-V2]: [https://catalogue.bnf.fr/ark:/12148/cb41651490t](https://catalogue.bnf.fr/ark:/12148/cb41651490t) -[^bib-arsenal]: [https://www.bnf.fr/fr/arsenal](https://www.bnf.fr/fr/arsenal) - -This project ended in August 2021 and resulted in the publication on Nakala[^nakala], -the data repository of the Très Grande Infrastructure de Recherche Huma-Num, -of a new version of the work in several formats. - -[^nakala]: [https://nakala.fr/](https://nakala.fr/) - -#### Encodage - -##### Structure du module *dictionaries* - -**Definitions** - -By iterating several times the operation of moving on that graph along one edge, -that is, by considering the transitive closure of the relation "be connected by -an edge", we define *inclusion paths* which allow us to explore which elements -may be nested under which others. - -The nodes visited along the way represent the intermediate XML elements needed to -construct a valid XML tree according to the TEI schema. Given the top-down -semantics of those trees, we call the length of an inclusion path its *depth*. - -The ability for an element to contain itself corresponds directly to loops on -the graph (that is, an edge from a node to itself), as can be illustrated by the -`<abbr/>` element: an `<abbr/>` element (abbreviation) can directly contain -another one. - -The generalisation of this to inclusion paths of any length greater than one is -usually called a cycle and we may be tempted in our context to refine this and -name them *inclusion cycles*. The `<address/>` element provides us with an -example of this configuration: although an `<address/>` element may not -directly contain another one, it may contain a `<geogName/>` which, in turn, may -contain a new `<address/>` element. From a graph theory perspective, we can say -that it admits an inclusion cycle of length two. 
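
As a rough illustration of the definitions of loops and inclusion cycles, the sketch below stores a few inclusion relations in a Python dictionary and measures the shortest cycle through an element. It only contains the three relations quoted in the text for `<abbr/>`, `<address/>` and `<geogName/>`. It is a toy subset, not the TEI schema, and the names `INCLUSIONS`, `has_loop` and `shortest_cycle_length` are hypothetical.

```python
from collections import deque

# Toy inclusion graph: an edge parent -> child means "parent may directly
# contain child". Only the relations quoted in the text are listed; this is
# an assumption-laden sample, NOT the real TEI schema.
INCLUSIONS = {
    "abbr": {"abbr"},
    "address": {"geogName"},
    "geogName": {"address"},
}


def has_loop(graph, element):
    """True when the element may directly contain itself."""
    return element in graph.get(element, set())


def shortest_cycle_length(graph, element):
    """Length of the shortest inclusion path from element back to itself.

    Returns None when no such path exists. A loop counts as a cycle of
    length 1.
    """
    queue = deque((child, 1) for child in graph.get(element, set()))
    seen = set()
    while queue:
        node, depth = queue.popleft()
        if node == element:
            return depth
        if node in seen:
            continue
        seen.add(node)
        queue.extend((child, depth + 1) for child in graph.get(node, set()))
    return None
```

With this sample, `<abbr/>` has a loop while `<address/>` only comes back to itself through `<geogName/>`, which matches the inclusion cycle of length two mentioned above.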
- -**Applications** - -Using classical, well-known methods such as Dijkstra's algorithm [@dijkstra59] -allows us to explore the shortest inclusion paths that exist between elements. -Though particular caution should be applied because there is no guarantee that -the shortest path is meaningful in general, it at least provides us with an -efficient way to check whether a given element may or may not be nested at all under -another one and gives a lower bound on the length of the path to expect. Of -course, the accuracy of this heuristic decreases as the intended, meaningful path -between two nodes, the one that a human specialist of the TEI framework would build, -gets longer. - -This is still very useful when taking into account the fact that TEI modules are -merely "bags" to group the elements and provide hints to human encoders about -the tools they might need but have no bearing on the inclusion paths between -elements, which cross module boundaries freely. The general graph formalism -enables us to describe complex filtering patterns and to implement queries to -look for them among the elements exhaustively by algorithmic means even when the -shortest-path approach is not enough. - -For instance, it lets one find that `<pos/>`, which carries information about the -part-of-speech of the word that an article defines, may not be directly -included within `<entry/>` elements: the correct way to include it is -through a `<form/>` or a `<gramGrp/>`. - -On the other hand, trying to discover the shortest inclusion path to `<pos/>` -from the `<TEI/>` root of the document yields a path through `<standOff/>`, an element -dedicated to storing contextual data that accompanies but is not part of the text, -not unlike an annex, and largely unrelated to the context of encoding an -encyclopedia. 
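
The shortest-path queries discussed in this section can be sketched as follows. Because every inclusion edge has the same cost, a plain breadth-first search returns the same shortest paths as Dijkstra's algorithm, so this sketch uses BFS instead. The graph is a hand-written sample limited to inclusions mentioned in this section (`<entry/>` reaching `<pos/>` through `<form/>` or `<gramGrp/>`, `<entryFree/>` accepting `<pos/>` directly); it is not the full TEI schema, and `SAMPLE_GRAPH` and `shortest_inclusion_path` are hypothetical names.

```python
from collections import deque

# Hand-written sample of inclusion edges (parent -> children), restricted to
# relations mentioned in this section. This is an assumption for the sake of
# the example, not an extract of the real TEI schema.
SAMPLE_GRAPH = {
    "body": ["entry", "entryFree"],
    "entry": ["form", "gramGrp"],
    "entryFree": ["pos"],
    "form": ["pos"],
    "gramGrp": ["pos"],
}


def shortest_inclusion_path(graph, source, target):
    """Return one shortest inclusion path as a list of element names.

    Breadth-first search is enough here since all edges are unweighted.
    Returns None when target cannot be nested under source.
    """
    if source == target:
        return [source]
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for child in graph.get(node, ()):
            if child in parents:
                continue
            parents[child] = node
            if child == target:
                # Walk back up to the source to rebuild the path.
                path = [child]
                while parents[path[-1]] is not None:
                    path.append(parents[path[-1]])
                return path[::-1]
            queue.append(child)
    return None
```

On this sample, the path from `body` to `pos` goes through `entryFree`, which shows how the shortest path is only a lower bound and not necessarily the encoding one wants.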
- -A last relevant example of the use of these methods can be given by querying the -shortest inclusion path of a `<pos/>` under the `<body/>` of the document: it -yields an inclusion directly through `<entryFree/>` (with an inclusion path of -length 2), which unlike `<entry/>` accepts it as a direct child node. This is possibly -not what we want depending on the regularity of the articles we are encoding and -the occurrence of other grammatical information such as `<case/>` or `<gen/>` to -justify the use of the `<gramGrp/>`, but searching exhaustively for paths up to -length 3 returns as expected the path through `<entry/>`, among others. Overall, -we get a good general idea: `<pos/>` does not need to be nested very deep; it -can appear quite near the "surface" of article entries. - -##### Limites - -###### The `<entry/>` element - -The central element of the *dictionaries* module is the `<entry/>` element meant -to encode one single entry in a dictionary, that is to say a headword -associated with its definition. It is the natural way in from the `<body/>` -element to the dictionary module: indeed, although `<body/>` may also contain -`<entryFree/>` or `<superEntry/>` elements, the former is a relaxed version of -`<entry/>` while the latter is a device to group several related entries -together. Both can contain an `<entry/>` directly while no obvious inclusion -exists the other way around: most (> 96.2%) of the inclusion paths of -"reasonable" depth (which we define as strictly less than 5, that is twice the -average shortest depth between any two nodes) include either `<figure/>` or -`<castList/>`, two very specific elements which should not need to appear in an -article in general, showing that the purpose of `<entry/>` is not to contain an -`<entryFree/>` or `<superEntry/>`. Hence, not only the semantics conveyed by the -documentation but also the structure of the element graph point to `<entry/>` -as the natural top-most element for an article. 
This somewhat contrived example -aims to further demonstrate the application of a graph-centred approach to -understanding the inner workings of the XML-TEI schema. - -###### Information about the headword itself - -Once a block for an article is created, it may contain elements useful to -represent several of its features. Its written and spoken forms are usually -encoded by `<form/>` elements. Grammatical information like the `<case/>`, -`<gen/>`, `<number/>` and `<pers/>` can be contained within a `<gramGrp/>`, -along with information about the categories it belongs to like `<iType/>` for -its inflection class in languages with a declension system or `<pos/>` for its -part-of-speech. The `<etym/>` element is made to hold the etymology of an entry. -When there are alternative spellings in varieties of the language or -when the spelling has changed over time, `<usg/>` can be used. - -These examples are by no means an exhaustive list; the complete set provides -the encoder with a toolbox to describe all the information related to the form -under which the entry is found and seems general enough to accommodate the structure of -any book indexing entries by words. - -###### Cross-references - -A common feature shared by dictionaries and encyclopedias is the ability to -connect entries together by using a word or short phrase as the link, referring -the reader to the related concept. These links are known as cross-references and can -appear either when the definition of a term is adjacent to another one or to -catch alternative spellings where some readers might expect to find the word and -redirect them to the form chosen as the reference. In XML-TEI, this is done with -the `<xr/>` element. It usually contains the whole phrase performing the -redirection, with an imperative locution like "please see […]". 
- -The "active" part of the cross-reference, that is the very word within the -`<xr/>` that is considered to be the link or, to make a modern-day HTML -metaphor, the region that would be clickable, is represented by a `<ref/>` -element. Though it is not specific to the *dictionaries* module, we include it -in this description of the toolbox because it is particularly useful in the -context of dictionaries. This element may have a `target` attribute which points -to the other resource to be accessed by the interested reader. - -###### Definitions - -The remaining part of an entry is also usually the largest and represents the -content associated with the headword by the entry. In a dictionary, that is its -meaning. - -The `<sense/>` element is a valid child for `<entry/>` and groups together a -definition of the term with `<def/>`, usage examples with `<usg/>` (another use -of this versatile element) and other high-level information such as translations -into other languages. Both `<def/>` and `<usg/>` elements may also appear directly -under the `<entry/>`. - -###### Structural remarks - -Before concluding this description of the *dictionaries* module from the -perspective of someone trying to concretely encode a particular dictionary or -encyclopedia, we make use of the graph approach again to highlight some of its -aspects in terms of inclusion structure. - -First, it is remarkable that all elements in the *dictionaries* module have a -cyclic inclusion path, that is to say, there is an inclusion path from each -element of this module to itself. Although having such a cycle is a widespread -property in the remainder of XML-TEI elements, shared by 73.8% of them (411 out -of the 557 elements in the other modules), all 33 elements of the *dictionaries* -module having one is far above this average. In addition, the cycles appear to -be rather short, with an average length of 2.00 versus 2.50 in the rest of the -population. 
This observation is all the more surprising considering the fact -that the *dictionaries* module contains short "leaf" elements like `<pos/>` -which should not obviously need to admit cycles since one rather expects them to -contain only one word, like `<pos>adj</pos>` in the example given in the -official documentation. Among those (shortest) cycles, 20 include the `<cit/>` -element, made to group quotations with a bibliographic reference to their source, -which should clearly be unnecessary to encode an article in the general case. - -Secondly, although we have seen examples of connections from this module to the -rest of the XML-TEI, especially to the *core* module (see the case of the -`<ref/>` element above), the *dictionaries* module appears somewhat isolated -from important structural elements like `<head/>` or `<div/>`. Indeed, computing -all the paths of length less than or equal to 5 from either `<entry/>` or `<sense/>` elements to the latter -by a systematic traversal of the graph yields -exclusively paths (respectively 9042 and 39093 of them) containing either a -`<floatingText/>` or an `<app/>` element. The first one, as its name aptly -suggests, is used to encode text that does not quite fit the regular flow of the -document, as for example in the context of an embedded narrative. Both examples -displayed in the online documentation feature a `<body/>` as direct child of -`<floatingText/>`, neatly separating its content as independent. The purpose of -the second one, although its name (short for apparatus) is less clear, is to -wrap together several versions of the same excerpt, for instance when there are -several possible readings of an unclear group of words in a manuscript, or when -the encoder is trying to compile a single version of a piece of work from -several sources which disagree over some passage. In both cases, it appears -obvious that it is not something that is expected to occur naturally in the -course of an article in general. 
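
The systematic traversal mentioned above can be sketched as a bounded enumeration of inclusion paths, followed by a check on the elements each path goes through. The graph below is a deliberately tiny sample: the edge from `sense` to `floatingText` is a placeholder standing in for whatever intermediate elements the real schema requires, so the counts it produces have nothing to do with the 9042 and 39093 paths reported in the text. `TOY_GRAPH` and `inclusion_paths` are hypothetical names.

```python
# Tiny sample graph (parent -> children). The "sense" -> "floatingText" edge
# is a PLACEHOLDER assumption made for illustration only; the other edges are
# relations mentioned in the text or basic TEI nesting (body/div).
TOY_GRAPH = {
    "entry": ["sense"],
    "sense": ["floatingText"],
    "floatingText": ["body"],
    "body": ["div"],
    "div": ["div"],
}


def inclusion_paths(graph, source, target, max_length):
    """Yield every inclusion path from source to target of at most
    max_length edges. Elements may repeat along a path, so cycles are
    explored up to the length bound."""
    stack = [[source]]
    while stack:
        path = stack.pop()
        if len(path) - 1 >= max_length:
            continue  # extending this path would exceed the bound
        for child in graph.get(path[-1], ()):
            extended = path + [child]
            if child == target:
                yield extended
            stack.append(extended)


def all_go_through(paths, elements):
    """True when every path contains at least one of the given elements."""
    return all(any(e in path for e in elements) for path in paths)
```

A query such as `all_go_through(inclusion_paths(TOY_GRAPH, "entry", "div", 5), {"floatingText", "app"})` then expresses the kind of filtering described above on this sample.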
- -Thus, despite a rather dense internal connectivity, the *dictionaries* module -fails to provide encoders with a device to represent recursively nesting -structures like `<div/>`. - -The situation regarding subject indicators is hardly better outside of the -module. The `<domain/>` element, despite its name, belongs exclusively in the -header of a document and focuses on the social context of the text, not on the -knowledge area it covers. The `<interp/>` element could, judging by its name, be about -labeling something as an interpretation to give to a context (which subject -indicators could be if one considers that, placed at the beginning, they are used -to direct the readers' frame of mind towards a particular subject). However, -the documentation clearly presents it as a tool for annotators of a -document, whose text content is not part of the original document but some -additional result of an analysis performed in the context of the encoding, used -only through references in XML attributes. - -This point, although not the most concerning, still remains the hardest to -address but, all things considered, the `<usg/>` element stands out as the most -relevant. - -###### The notion of meaning - -Notwithstanding the correct way to represent domains of knowledge, their extent -itself raises concerns regarding the *dictionaries* module. Indeed, among the -vast collection of domains covered in encyclopedias in general and in *La Grande -Encyclopédie* in particular are historical articles and biographies. If the -notion of meaning can appear at least ill-fitting for a text describing a series -of historical events, one may still argue that it groups them into a concept and -associates it with the name of the event. But when it comes to relating the life -of a person, describing their relation to events and other persons strays -even further from the notion of meaning. 
Entries such as the one about SANJO -Sanetomi (see Figure @fig:sanjo) do not constitute a *definition*. - -{#fig:sanjo} - -Moreover, encyclopedias, because of all that they have inherited from the -philosophical Enlightenment, are not only spaces designed to assert, they also -intrinsically include an interrogative component. Some articles lay down the -basis required to understand the complexity of an issue and invite the reader to -consider it without providing a definitive answer, going as far as to explicitly -use question marks as in the article "Action" displayed in Figure @fig:action. - -{#fig:action} - -In this extract, the author devises a hypothetical situation to illustrate how -difficult it is to draw the line between two supposedly mutually exclusive -subcategories of legal actions. The whole point of the passage is to convey the -idea that the term eludes definition; wrapping it in a `<sense/>` or, worse, a -`<def/>` element would be an utter misnomer. - -As a result, the use of `<sense/>` and `<def/>` is not appropriate for -encyclopedic content in general. - -###### Nested structures - -The final difficulty can be considered as a partial consequence of the previous -one on the structure of articles. The difficulty of defining complex concepts is -the very reason why authors approach their subjects from various angles, -circling around them as a best approximation. This strategy favours long, -structured developments with sections and subsections covering the multiple -aspects of the topic: from a historical, political, scientific point of view… -The longest articles, such as article "Europe" shown in Figure @fig:europe, can -thus span several dozens of pages. They can contain substructures with titles on -at least three levels (for instance, an `a)` under a `1)` under a `I.`), each of -which is in turn generally developed over several paragraphs. 
- -{#fig:europe} - -The nested structure that we have just described of course demands a nesting -structure to accommodate it. More precisely, it guides our search for XML elements -by giving us several constraints: we are looking for a pair of elements, the -first representing a (sub)section must be able to include both itself and the -second element, which does not have any special constraint except that of -having semantics compatible with our purpose of using it to represent section -titles. In addition, the first element must be able to contain several `<p/>` -elements, `<p/>` being the reference element to encode paragraphs according to -the XML-TEI documentation. - -We have seen that the *dictionaries* module was equipped with a questionable but -possible element for subject domains. However, it does not include any element -for section titles. In the rest of the TEI specification, the elements `<head/>` -and `<title/>` (the latter with the possibility to set its `type` attribute to -`sub`) stand out as the best candidates for the semantics condition on the -second element. - -##### Choix - -###### Candidates in the *dictionaries* module - -Filtering the content of the module to keep only the elements which can at the -same time contain themselves, be included under `<entry/>` and include a `<p/>` -and either the `<head/>` or `<title/>` elements yields absolutely no candidates. -It is remarkable that even replacing the `<entry/>` element at the root of each -article with an `<entryFree/>`, an element supposed to relax some constraints to -accommodate more unusual structures in dictionaries, does not bring any -improvement. - -The lack of results from these simple queries forces us to somewhat relax the -constraints on the encoding we are willing to use. We can for instance make the -assumption that an intermediate element could be needed between -the element wrapping the whole article and the recursing one used to encode each -section. 
This "section" element could also need a companion element to be able -to include itself, or, to formalise it in terms of graph theory, we could relax -the condition that this element admits a loop and consider instead cycles of a -given length (small, since this still needs to represent a fairly direct inclusion) to -be enough. We simultaneously extend the maximum depth of the inclusion paths we -are looking for between `<entry/>`, the pair of elements and the `<p/>` element. - -By setting this depth to 3, that is, by accepting one intermediate element to -occur in the middle of each one of the inclusion paths that define the structure -required to encode encyclopedic discourse, we find 21 elements but none of them -stands out as an obvious good solution: all paths to include the `<p/>` element -from any *dictionaries* element contain either a `<figure/>` (which we have -encountered earlier when we were practising our graph approach to search for -inclusions between `<entry/>` and `<entryFree/>` and dismissed as not useful in -general), a `<stage/>` (reserved for stage directions in dramatic works) or a -`<state/>` (used to describe a temporary quality in a person or place), again -not even close to what we want. The paths to either `<head/>` or `<title/>` are -similarly disappointing. Again, changing `<entry/>` for `<entryFree/>` returns -the exact same candidates. If that is not a thorough proof that none of these -elements could fulfill our purpose, it is a fact that no element in this module -appears as an obvious good solution and a serious hint to keep looking somewhere -else. - -###### Widening the search - -We hence widen our search to include elements outside the *dictionaries* module -which could be used to encode our sections and subsections, under the same -constraint as before, to try and find a composite solution that would remain -under the `<entry/>` element even if resorting to subcomponents outside of the -dedicated module. 
Only three elements are returned: `<figure/>`, `<metamark/>` -and `<note/>`. - -The first one, as we have repeatedly underlined, is meant for graphic information -and is not suitable for text content in general. - -The purpose of `<metamark/>` is to transcribe the editing marks that may appear -on a particular primary source in order to alter the normal flow of the text and -suggest an alternative reading (deletion, insertion, reordering: this is about a -human editing the text from a given physical copy of it), but it is -unfortunately of no use to encode a section of an article. - -The first element that might at least resemble what we are looking for is the -last one, `<note/>`. It is meant to contain text, is about explaining something -and seems general enough (not specific to a given genre, or to the occurrence of -a particular object on the page). Unfortunately, its semantics still seems a bit -off compared to our need. The documentation describes it as an "additional -comment" which appears "out of the main textual stream" whereas the long -developments in articles are the very matter of the text of encyclopedias, not -mere remarks in the margins or at the foot of pages. - -##### Implementation - -The above remarks explain why the *dictionaries* module is unable to represent -encyclopedias, where the notion of "meaning" is less central than in -dictionaries and where discourse with nested structures of arbitrary depth can -occur. Even composite encodings using elements outside of the *dictionaries* -module under an `<entry/>` element do not meet our requirements. Since the -*core* module of course accommodates these structures by means of the `<div/>`, -`<head/>` and `<p/>` elements, which have the additional advantage of carrying -less semantic payload than `<sense/>` or `<def/>`, we devise an encoding scheme -using them, which we recommend for other projects aiming at representing -encyclopedias.
- -To remain consistent with the above remarks we will only concern ourselves with -what happens at the level of each article, right under the `<body/>` element. -Everything related to metadata happens as expected in the file's `<teiHeader/>` -which is well enough equipped to handle it. In order to present our scheme -throughout the following section we will be progressively encoding a reference -article, "Cathète" from tome 9, reproduced in Figure @fig:cathete-photo. - -{#fig:cathete-photo} - -###### The scheme - -Remaining within the *core* module for the structure, almost all useful elements -are available and our encoding scheme merely quotes the official documentation. -Each article is represented by a `<div/>`. We suggest setting an `xml:id` -attribute on it with the headword of the entry as its value (unique in the whole corpus, or -made so by suffixing a number representing its rank among the various -occurrences, even when there is only one, for the sake of regularity), -normalised to lowercase, stripping spaces and replacing all -non-alphanumerical characters by a dash (`'-'`) to avoid issues with the XML -encoding. Figure @fig:cathete-xml-0 illustrates this choice for the container -element on the article "Cathète" previously displayed. - -{#fig:cathete-xml-0} - -Inside this element should be a `<head/>` enclosing the headword of the article. -The usual sub-`<hi/>` elements are available within `<head/>` if the headword is -highlighted by any special typographic means such as bold, small capitals, etc. -The one disappointment of the encoding scheme we are defining in this chapter is -the lack of support for a proper way to encode subject indicators. - -The best candidate we have found so far was `<usg/>` from the *dictionaries* -module but it is not available directly under a `<head/>` element.
All inclusion -paths from the latter to the former of length less than or equal to 3 contain -irrelevant elements (`<cit/>`, `<figure/>`, `<castList/>` and `<nym/>`) so it -must be discarded. The next best elements appear to be `<term/>` (not very -accurate) and `<rs/>` ("referring string", with quite general semantics but a -possible match since subject indicators refer to a given domain of knowledge, -although all the examples in the documentation refer to concrete persons, -places or objects, not to the abstract objects that mathematics or poetry are). - -For this reason, we do not recommend any special encoding of the subject -indicator but leave it open to each particular context: they are often -abbreviated so an `<abbr/>` may apply. In *La Grande Encyclopédie*, biographies -are not labeled by a knowledge domain but usually include the first name of the -person when it is known, so in that case an element like `<persName/>` is still -appropriate. This choice applied to the same article "Cathète" produces Figure -@fig:cathete-xml-1. - -{#fig:cathete-xml-1} - -We then propose to wrap each different meaning in a separate `<div/>` with the -`type` attribute set to `sense` to refer to the `<sense/>` element that would -have been used within the *dictionaries* module. The `<div/>`s should be numbered -according to the order they appear in with the `n` attribute starting from `0` -as shown in Figure @fig:cathete-xml-2. - -{#fig:cathete-xml-2} - -In addition, each line within the article must start with a `<lb/>` to mark its -beginning, including before the `<head/>` element as demonstrated by Figure -@fig:cathete-xml-3, which, although a surprising setup, underlines the fact that -in the dense layout of encyclopedias, the carriage return separating two -articles is meaningful.
Stating each new line explicitly keeps enough -information to reconstruct a faithful facsimile but it also has the advantage of -highlighting the fact that even though the definition is cut from the headword -by being in a separate XML element, they still occur on the same line, which is -a typographic choice usually made both in encyclopedias and dictionaries where -space is at a premium. - -To complete the structure, the various sections and subsections occurring -within the article body may be nested as usual with `<div/>` and sub-`<div/>`s, -each of which can be titled with a local `<head/>` element and -filled with `<p/>` elements for paragraphs. - -{#fig:cathete-xml-3} - -Some articles such as "Boumerang" have figures with captions, as illustrated by -Figure @fig:boumerang-photo, which should be encoded the standard way by -`<figure/>` and `<figDesc/>` as in Figure @fig:boumerang-xml. - -{height=300px #fig:boumerang-photo} - -{#fig:boumerang-xml} - -Another issue arising from giving up on `<entry/>` is the unavailability of the -`<xr/>` element, not allowed under any of the *core* elements we use but which -is useful to represent cross-references occurring in encyclopedias as well as in -dictionaries, for example in article "Gelocus" (see Figure @fig:gelocus-photo). -We prefer to use the `<ref/>` element instead, which is available in the context -of a `<p/>`. Its `target` attribute should be set to the `xml:id` of the -article it points to, prefixed with a `'#'` as shown in Figure @fig:gelocus-xml. -Another solution would have been to introduce a `<dictScrap/>` element for the -sole purpose of placing an `<xr/>` but we advise against it on account of the -verbosity it would add to the encoding and the fact that it implicitly suggests -that the previous context was not that of a dictionary.
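As a rough illustration of the two identifier conventions proposed above (the normalised `xml:id` set on each article `<div/>` and the `'#'`-prefixed `target` of `<ref/>`), here is a minimal Python sketch. The function names `article_id` and `ref_target`, the `-<rank>` suffix format, the example ranks and the decision to count accented letters as alphanumerical are all assumptions made for the example; they are not taken from the scheme itself nor from `soprano`.

```python
def article_id(headword: str, rank: int) -> str:
    """Build a candidate xml:id from a headword (hypothetical helper).

    Assumptions: surrounding spaces are stripped, accented letters count as
    alphanumerical, and the rank among homonymous entries is appended after
    a dash, even when the headword occurs only once.
    """
    base = headword.strip().lower()
    # Replace each non-alphanumerical character (inner spaces included)
    # by a dash, one character at a time.
    normalised = "".join(c if c.isalnum() else "-" for c in base)
    return f"{normalised}-{rank}"


def ref_target(headword: str, rank: int) -> str:
    """Value for the target attribute of a <ref/> pointing to that article."""
    return "#" + article_id(headword, rank)


print(article_id("Cathète", 0))
print(ref_target("Gelocus", 0))
```

Whether accented letters should be kept or transliterated in identifiers is left open here; the sketch simply keeps them.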
- -{#fig:gelocus-photo} - -{#fig:gelocus-xml} - -A typical page of an encyclopedia also features peritext elements, giving -information to the reader about the current page number along with the headwords -of the first and last articles appearing on the page. Those can be encoded by -`<fw/>` elements ("forme work") which `place` and `type` attributes should be -set to position them on the page and identify their function if it has been -recognised (those short elements on the border of pages are the ones typically -prone to suffer damages or be misread by the OCR). - -Finally there are other TEI elements useful to represent "events" in the flow of -the text, like the beginning of a new column of text or of a new page. Figure -@fig:alcala-photo shows the top left of the last page of the first tome of *La -Grande Encyclopédie* which features peritext elements while marking the -beginning of a new page. The usual appropriate elements (`<pb/>` for page -beginning, `<cb/>` for column beginning) may and should be used with our -encoding scheme as demonstrated by Figure @fig:alcala-xml. - -{width=350px #fig:alcala-photo} - -{#fig:alcala-xml} - -###### Currently implemented - -The reference implementation for this encoding scheme is the program -soprano[^soprano] developed within the scope of project DISCO-LGE to -automatically identify individual articles in the flow of raw text from the -columns and to encode them into XML-TEI files. Though this software has already -been used to produce the first TEI version of *La Grande Encyclopédie*, it does -not yet follow the above specification perfectly. Figure -@fig:cathete-xml-current shows the encoded version of article "Cathète" it -currently produces: - -[^soprano]: [https://gitlab.huma-num.fr/disco-lge/soprano](https://gitlab.huma-num.fr/disco-lge/soprano) - -{#fig:cathete-xml-current} - -The headword detection system is not able to capture the subject indicators yet -so it appears outside of the `<head/>` element. 
No work is performed either to -expand abbreviations and encode them as such, or to distinguish between domain -and people names. - -Likewise, since the detection of titles at the beginning of each section is not -complete, no structure analysis can be performed at the moment on the textual -development inside the article and it is left unstructured, directly under the -entry's `<div/>` element instead of under a set of nested `<div/>` elements. The -paragraphs are not yet identified and for this reason not encoded. - -However, the figures and their captions are already handled correctly when they -occur. The encoder also keeps track of the current lines, pages, and columns and -inserts the corresponding empty elements (`<lb/>`, `<pb/>` or `<cb/>`). It -numbers pages so that the numbering corresponding to the physical pages is -available, as opposed to the "high-level" page numbers inserted by the -editors. The latter start with an offset because the first, blank or almost empty -pages at the beginning of each book do not have a number, and they sometimes have -gaps when a full-page geographical map is inserted, since those are printed -separately on a different folio which remains outside of the textual numbering -system. The place at which these layout-related elements occur is determined by -the place where the OCR software detected them and by the reordering performed -by `soprano` when inferring the reading order before segmenting the articles. - -###### The constraints of automated processing - -Encyclopedias are particularly long books, spanning numerous tomes and -containing several tens of thousands of articles.
The *Encyclopédie* comprises -over 74k articles and *La Grande Encyclopédie* certainly more than 100k (the latest -version produced by `soprano` created 160k articles, but their segmentation is -still not perfect: although some article beginnings remain undetected, the very -long and deeply structured articles are all unduly split into many parts, resulting -overall in an overestimation of the total number). - -XML-TEI is a very broad tool useful for very different applications. Some -elements like `<unclear/>` or `<factuality/>` can encode subtle semantic -information (for the second one, adjacent to a notion as elusive as truth) -which requires a very deep understanding of a text in its entirety and about -which even some human experts may disagree. - -For these reasons, a central concern in the design of our encoding scheme was to -remain within the boundaries of information that can be described objectively -and extracted automatically by an algorithm. Most of the tags presented above -contain information about the positions of the elements or their relation to one -another. Those with an additional semantic implication like `<head/>` can be -inferred simply from their position and the frequent use of a special typography -like bold or upper-case characters. - -The case of cross-references is particular and may appear as a counter-example -to the main principle on which our scheme is based. In fact, the process of -linking from an article to another one is so frequent (in dictionaries as well -as in encyclopedias) that it generally escapes the scope of regular discourse to -take a special and often fixed form, inside parentheses and after a special -token which invites the reader to perform the redirection.
In *La Grande -Encyclopédie*, virtually all the redirections (that is, to the extent of our -knowledge, absolutely all of them though of course some special cases may exist, -but they are statistically rare enough that we have not found any yet) appear -within parentheses, and start with the verb "voir" abbreviated as a single, -capital "V." as illustrated above in the article "Gelocus" (see again Figure -@fig:gelocus-photo). - -Although this has not been implemented yet either, we hope to be able to detect -and exploit those patterns to correctly encode cross-references. Getting the -`target` attributes right is certainly more difficult to achieve and may require -processing the articles in several steps, to first discover all the existing -headwords (and hence article IDs) before trying to match the words following -"V." with them. Since our automated encoder handles tomes separately and since -references may cross the boundaries of tomes, it cannot wait for the target of a -cross-reference to be discovered by keeping the articles in memory before -outputting them. - -This is in line with the last important aspect of our encoder. Although many -lexicographers may deem our encoding too shallow, it has the advantage of not -requiring overly complex data structures to be kept in memory for a long time. The -algorithm implementing it in `soprano` outputs elements as soon as it can, for -instance the empty elements already discussed above. For articles, it pushes -lines onto a stack and flushes it each time it encounters the beginning of the -following article. This allows the amount of memory required to remain -reasonable and even lets several tomes be processed in parallel on most modern machines. Thus, -even taking over three minutes per tome, the total processing time can be -lowered to around forty minutes on a machine with 16 GB of RAM for the whole of -*La Grande Encyclopédie* instead of over one hour and a half.
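The buffering strategy described above can be sketched as follows. This is a simplified Python illustration and not the actual `soprano` code: the names `segment_articles` and `toy_start` are made up for the example, and the upper-case test used to spot a new article is a hypothetical stand-in for the real headword detection.

```python
from typing import Callable, Iterable, Iterator, List


def segment_articles(
    lines: Iterable[str],
    is_article_start: Callable[[str], bool],
) -> Iterator[List[str]]:
    """Yield one block of lines per article, flushing the buffer as soon as
    the beginning of the next article is met, so that only the current
    article is ever held in memory. Lines found before the first detected
    headword are emitted as a first block."""
    buffer: List[str] = []
    for line in lines:
        if is_article_start(line) and buffer:
            yield buffer
            buffer = []
        buffer.append(line)
    if buffer:  # flush the last article at the end of the tome
        yield buffer


def toy_start(line: str) -> bool:
    """Hypothetical rule: the first word of the line is fully upper-case."""
    first = line.split(" ", 1)[0].strip(".,")
    return len(first) > 1 and first.isupper()


tome = ["ALPHA. first line", "second line", "BETA. only line"]
for article in segment_articles(tome, toy_start):
    print(article)
```

Since each call only consumes the lines of a single tome, several tomes could in principle be handed to separate worker processes, which is the property the paragraph above relies on.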
- ## Domains ### Domain systems @@ -1499,19 +776,4 @@ TODO How can we be smarter in the association? TODO Grammar of articles -## Part-of-speech and syntactic annotation - -### Tagset - -We use the [tagset]() of the -[PRESTO](http://presto.ens-lyon.fr/) project - -Then again, Stanza actually works well too, with the -[UPOS](https://universaldependencies.org/docs/u/pos/) tags - -### Processing pipelines - -- PRESTO -- Stanza - diff --git "a/Corpus/Formats_et_\303\251tats.md" "b/Corpus/Formats_et_\303\251tats.md" new file mode 100644 index 0000000000000000000000000000000000000000..26db38a4fb83c2b36e3e217be08c8e0c87626efc --- /dev/null +++ "b/Corpus/Formats_et_\303\251tats.md" @@ -0,0 +1,722 @@ +## Formats and states of the texts + +### L'Encyclopédie + +In common parlance, the terms "dictionaries" and "encyclopedias" are used as +near synonyms to refer to books compiling vast amounts of knowledge into lists +of definitions ordered alphabetically. Their similarity is even visible in the +way they are coordinated in the full title of the *Encyclopédie* which is +probably the most famous work of the genre and a symbol of the Age of +Enlightenment. If the word "encyclopedia" is nowadays part of our vocabulary, it +was much more unusual and in fact controversial when Diderot and d'Alembert +decided to use it in the title of their book. + +The definition given by Furetière in his *Dictionnaire Universel* in 1690 is +still close to its Greek etymology: a "ring of all knowledge", from *κύκλος*, +"circle", and *παιδεία*, "knowledge". This meaning is the one used for instance +by Rabelais in *Pantagruel*, when he has Thaumaste declare that Panurge opened +to him "le vray puys et abisme de Encyclopedie" ("the true well and abyss of +Encyclopedia"). At the time the word still mostly refers to the abstract concept +of mastering all knowledge at once.
Furetière adds that it is a quality one +is unlikely to possess, and even seems to condemn its pursuit as a form of +hubris: "C'est une témérité à un homme de vouloir posséder l'Encyclopédie" +("it is a recklessness for a man to want to possess Encyclopedia"). + +Beyond this moral reproach, the concept that pleased Rabelais was somewhat dated +at the end of the 17^th^ century and attacked in the +*Dictionnaire Universel François et Latin*, commonly referred to as the +*Dictionnaire de Trevoux*, as utterly "burlesque" ("parodic"). The entry for +"Encyclopédie" remained unchanged in the four editions issued between 1721 and +1752, mocking the use of the word and discouraging its readers from pursuing it. To +that end, it quotes a poem by Pibrac encouraging people to specialise in +only one discipline lest they should not reach perfection, based on an +argumentation that resembles the saying "Jack of all trades, master of none". It +is all the more interesting that the definition remains unaltered until 1752, +one year after the publication of the first volume of the *Encyclopédie*. The +Jesuits who edited the *Dictionnaire de Trevoux* frowned upon the project of the +*Encyclopédie* which they managed to get banned the same year by the Council of +State on the charge of attempting to destroy the royal authority, inspiring +rebellion and corrupting morality in general. There is much more at stake than +words here, but the attempt to deprecate the word itself is part of their fight +against the philosophers of the Enlightenment. + +The attacks do not remain ignored by Diderot who starts the very definition of +the word "Encyclopédie" in the *Encyclopédie* itself with a strong rebuttal.
He +directly dismisses the concerns expressed in the *Dictionnaire de Trevoux* as +mere self-doubt that their authors should not generalise to everyone, then leaves +the main point to a Latin quote by Chancellor Bacon [@lojkine2013], who argues +that a collaborative work can achieve much more than any talented man could: +what could possibly not be within reach of a single man, within a single +lifetime, may be achieved by a common effort throughout generations. + +History hints that Diderot's opponents took his defence of the feasibility of +the project quite seriously, considering the fact that they got the +*Encyclopédie*'s privileges revoked again six years after its publication +was resumed [@moureau2001]. As a consequence, the remaining ten volumes +containing the text of the articles had to be published illegally until 1765, +thanks to the secret protection of Malesherbes who, despite being head of royal +censorship, saved the manuscripts from destruction. They were printed secretly +outside of Paris and the books were (falsely) labeled as coming from Neufchâtel. +Following the high demand from the booksellers who feared they would lose the +money they had invested in the project, a special privilege was issued for the +volumes containing the plates, which were released publicly from 1762 to 1772. + +In any case, in their last edition in 1771 the authors of the *Dictionnaire de +Trevoux* had no choice but to acknowledge the success of the encyclopedic +projects of the 18^th^ century. In this version, the definition +was entirely reworked, mildly stating that good encyclopedias are difficult to +make because of the amount of knowledge necessary and work needed to keep up +with scientific progress instead of calling the effort a parody.
It credits +Chambers's *Cyclopædia* for being a decent attempt before referring anonymously +though quite explicitly to Diderot and d'Alembert's project by naming the +collective "Une Société de gens de Lettres" and writing that it started in 1751. +Even more importantly, two new entries were added after it: one for the +adjective "encyclopédique" and another one for the noun "encyclopédiste", +silently admitting how the project had changed its time and the relation to +knowledge itself. + +#### Context of the work + +#### Available versions + +The ARTFL[^ARTFL] offers a version of it. + +[^ARTFL]: [http://artfl-project.uchicago.edu/](http://artfl-project.uchicago.edu/) + +#### Processing + +### La Grande Encyclopédie + +#### Context of the work + +*La Grande Encyclopédie, Inventaire raisonné des Sciences, des Lettres et des +Arts par une Société de savants et de gens de lettres* (hereafter *LGE*) was +published in France between 1885 and 1902 by a team of more than two hundred +specialists organised into eleven sections. The text comprises 31 tomes of about +1200 pages each and was, according to @jacquet_pfau2015, the last major French +encyclopedic endeavour to follow in the footsteps of its prestigious +ancestor, the *EDdA*, published about 130 years earlier. + +The full title of the work already shows its claim of filiation with the *EDdA*, +and a will to update the *EDdA* [@jacquet_pfau_actualiser_2022]. + +#### Available versions + +A digital version of this work was produced by the BnF and put +online[^LGE-V1] in 2007. Based on an undated reprint of the original +edition, it comprises one image per page of the work, digitised in +greyscale at a resolution of 300x300 pixels per inch. From these image files, a +partial version of the text was obtained by applying an +optical character recognition ([@=OCR]) program.
This version has a +number of limitations which prevented a comprehensive study of the text +by automatic means such as textometry. + +[^LGE-V1]: [http://catalogue.bnf.fr/ark:/12148/cb377013071](http://catalogue.bnf.fr/ark:/12148/cb377013071) + +First, the text is incomplete: although a large number of tomes have been OCRed, +some, for example tomes 5 and 18, were not processed and no +text is available for these volumes on the Gallica +website[^LGE-V1-liste-des-tomes]. This state of affairs makes an exhaustive study impossible, +but nothing suggests that a study based on the available tomes would be subject +to any particular bias, since the OCRed tomes do not seem to have been +chosen according to any precise logic: the missing ones are not, for instance, +contiguous, nor located at the beginning or at the end of the work. Second, this "plain +text" version (actually an HTML page with minimal markup) carries only a +very superficial annotation and, in particular, is not segmented into articles. +This is a major obstacle to the use we want to make of it, since +the unit of study of our corpus is the article, which makes it possible both to conduct +a contrastive study by grouping articles by knowledge domain or +by author and to observe the structure of the domains by comparing, between two +encyclopedias, which articles have been kept or not and, where applicable, whether the +knowledge domain associated with them is the same.
Finally, errors +in the detection of the page layout ([@=OLR]) significantly obscure +the text by performing local permutations of its content, +which sometimes mix pieces of articles together (which +clearly complicates the segmentation of the text into articles) and in all cases +damage the structure of sentences, which introduces errors +in the later morphosyntactic and syntactic annotation phases that +we need to apply to the text in order to do textometry. + +[^LGE-V1-liste-des-tomes]: [https://gallica.bnf.fr/ark:/12148/bpt6k246407#](https://gallica.bnf.fr/ark:/12148/bpt6k246407#) + +In order to remedy these flaws, the CollEx Persée project +DISCO-LGE[^DISCO-LGE] undertook to redigitise this encyclopedia in +partnership with the BnF and to produce a version of it encoded in XML-TEI. This +new version was made from photographs of an original +copy[^LGE-V2] held at the Bibliothèque de l'Arsenal in Paris[^bib-arsenal]. + +[^DISCO-LGE]: [https://www.collexpersee.eu/projet/disco-lge/](https://www.collexpersee.eu/projet/disco-lge/) +[^LGE-V2]: [https://catalogue.bnf.fr/ark:/12148/cb41651490t](https://catalogue.bnf.fr/ark:/12148/cb41651490t) +[^bib-arsenal]: [https://www.bnf.fr/fr/arsenal](https://www.bnf.fr/fr/arsenal) + +This project ended in August 2021 and resulted in the publication on Nakala[^nakala], +the data repository of the Très Grande Infrastructure de Recherche Huma-Num, +of a new version of the work in various formats. + +[^nakala]: [https://nakala.fr/](https://nakala.fr/) + +#### Encoding + +##### Structure of the *dictionaries* module + +**Definitions** + +By iterating several times the operation of moving on that graph along one edge, +that is, by considering the transitive closure of the relation "be connected by +an edge", we define *inclusion paths* which allow us to explore which elements +may be nested under which others.
+ +The nodes visited along the way represent the intermediate XML elements needed to +construct a valid XML tree according to the TEI schema. Given the top-down +semantics of those trees, we call the length of an inclusion path its *depth*. + +The ability for an element to contain itself corresponds directly to loops on +the graph (that is, an edge from a node to itself) as can be illustrated by the +`<abbr/>` element: an `<abbr/>` element (abbreviation) can directly contain +another one. + +The generalisation of this to inclusion paths of any length greater than one is +usually called a cycle and we may be tempted in our context to refine this and +name them *inclusion cycles*. The `<address/>` element provides us with an +example of this configuration: although an `<address/>` element may not +directly contain another one, it may contain a `<geogName/>` which, in turn, may +contain a new `<address/>` element. From a graph theory perspective, we can say +that it admits an inclusion cycle of length two. + +**Applications** + +Using classical, well-known methods such as Dijkstra's algorithm [@dijkstra59] +allows us to explore the shortest inclusion paths that exist between elements. +Although particular caution should be applied because there is no guarantee that +the shortest path is meaningful in general, it at least provides us with an +efficient way to check whether a given element may or may not be nested at all under +another one and gives a lower bound on the length of the path to expect. Of +course the accuracy of this heuristic decreases as the length of the intended, +meaningful path between two nodes, the one that a human specialist of the TEI +framework would build, increases.
+ +This is still very useful when taking into account the fact that TEI modules are +merely "bags" to group the elements and provide hints to human encoders about +the tools they might need but have no implication on the inclusion paths between +elements which cross module boundaries freely. The general graph formalism +enables us to describe complex filtering patterns and to implement queries to +look for them among the elements exhaustively by algorithmic means even when the +shortest-path approach is not enough. + +For instance, it lets one find that although `<pos/>` may not be directly +included within `<entry/>` elements to include information about the +part-of-speech of the word that an article defines, the correct way to do so is +through a `<form/>` or a `<gramGrp/>`. + +On the other hand, trying to discover the shortest inclusion path to `<pos/>` +from the `<TEI/>` root of the document yields a `<standOff/>`, an element +dedicated to store contextual data that accompanies but is not part of the text, +not unlike an annex, and widely unrelated to the context of encoding an +encyclopedia. + +A last relevant example on the use of these methods can be given by querying the +shortest inclusion path of a `<pos/>` under the `<body/>` of the document: it +yields an inclusion directly through `<entryFree/>` (with an inclusion path of +length 2), which unlike `<entry/>` accepts it as a direct child node. Possibly +not what we want depending on the regularity of the articles we are encoding and +the occurrence of other grammatical information such as `<case/>` or `<gen/>` to +justify the use of the `<gramGrp/>`, but searching exhaustively for paths up to +length 3 returns as expected the path through `<entry/>`, among others. Overall, +we get a good general idea: `<pos/>` does not need to be nested very deep, it +can appear quite near the "surface" of article entries. 
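To make the two kinds of queries discussed above more concrete, here is a small Python sketch run on a toy fragment of the inclusion graph. Only the handful of edges mentioned in the text are included, so this is by no means the real TEI schema; the names `INCLUSIONS`, `shortest_inclusion_path` and `inclusion_paths` are introduced for the illustration only. Since all edges have the same weight, a breadth-first search is used in place of Dijkstra's algorithm and finds paths of the same (minimal) length.

```python
from collections import deque
from typing import Dict, Iterator, List, Optional

# Toy fragment of the inclusion graph (parent -> possible children),
# restricted to the inclusions mentioned in the text.
INCLUSIONS: Dict[str, List[str]] = {
    "body": ["entry", "entryFree"],
    "entry": ["form", "gramGrp"],
    "entryFree": ["pos"],
    "form": ["pos"],
    "gramGrp": ["pos"],
    "abbr": ["abbr"],            # loop
    "address": ["geogName"],     # inclusion cycle of length two
    "geogName": ["address"],
}


def shortest_inclusion_path(
    graph: Dict[str, List[str]], source: str, target: str
) -> Optional[List[str]]:
    """Breadth-first search for a shortest inclusion path, or None."""
    queue = deque([[source]])
    visited = {source}
    while queue:
        path = queue.popleft()
        for child in graph.get(path[-1], []):
            if child == target:
                return path + [child]
            if child not in visited:
                visited.add(child)
                queue.append(path + [child])
    return None


def inclusion_paths(
    graph: Dict[str, List[str]], source: str, target: str, max_depth: int
) -> Iterator[List[str]]:
    """Exhaustively enumerate inclusion paths of depth <= max_depth that do
    not visit the same intermediate element twice."""
    def walk(path: List[str]) -> Iterator[List[str]]:
        if len(path) - 1 >= max_depth:
            return
        for child in graph.get(path[-1], []):
            if child == target:
                yield path + [child]
            elif child not in path:
                yield from walk(path + [child])
    yield from walk([source])


print(shortest_inclusion_path(INCLUSIONS, "body", "pos"))
print(list(inclusion_paths(INCLUSIONS, "body", "pos", 3)))
print(shortest_inclusion_path(INCLUSIONS, "address", "address"))
```

On this toy graph the shortest path from `body` to `pos` goes through `entryFree`, while the exhaustive search up to depth 3 also returns the paths through `entry`, mirroring the observations made above on the full schema.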
+ +##### Limitations + +###### The `<entry/>` element + +The central element of the *dictionaries* module is the `<entry/>` element meant +to encode one single entry in a dictionary, that is to say a headword +associated with its definition. It is the natural way in from the `<body/>` +element to the *dictionaries* module: indeed, although `<body/>` may also contain +`<entryFree/>` or `<superEntry/>` elements, the former is a relaxed version of +`<entry/>` while the latter is a device to group several related entries +together. Both can contain an `<entry/>` directly while no obvious inclusion +exists the other way around: most (> 96.2%) of the inclusion paths of +"reasonable" depth (which we define as strictly less than 5, that is twice the +average shortest depth between any two nodes) either include `<figure/>` or +`<castList/>`, two very specific elements which should not need to appear in an +article in general, showing that the purpose of `<entry/>` is not to contain an +`<entryFree/>` or `<superEntry/>`. Hence, not only the semantics conveyed by the +documentation but also the structure of the elements graph evidence `<entry/>` +as the natural top-most element for an article. This somewhat contrived example +hopes to further demonstrate the application of a graph-centred approach to +understand the inner workings of the XML-TEI schema. + +###### Information about the headword itself + +Once a block for an article is created, it may contain elements useful to +represent various of its features. Its written and spoken forms are usually +encoded by `<form/>` elements. Grammatical information like the `<case/>`, +`<gen/>` or `<number/>` and `<per/>` can be contained within a `<gramGrp/>`, +along with information about the categories it belongs to like `<iType/>` for +its inflection class in languages with a declension system or `<pos/>` for its +part-of-speech. The `<etym/>` element is made to hold the etymology of an entry.
+In the case when there are alternative spellings in varieties of the language or +if the spelling has changed over time, `<usg/>` can be used. + +All these examples are by no means an exhaustive list; the complete set provides +the encoder with a toolbox to describe all the information related to the form +under which the entry is found and seems general enough to accommodate the structure of +any book indexing entries by words. + +###### Cross-references + +A common feature shared by dictionaries and encyclopedias is the ability to +connect entries together by using a word or short phrase as the link, referring +the reader to the related concept. These are known as cross-references and can +appear either when the definition of a term is adjacent to another one or to +catch alternative spellings where some readers might expect to find the word and +redirect them to the form chosen as the reference. In XML-TEI, this is done with +the `<xr/>` element. It usually contains the whole phrase performing the +redirection, with an imperative locution like "please see […]". + +The "active" part of the cross-reference, that is the very word within the +`<xr/>` that is considered to be the link or, to make a modern-day HTML +metaphor, the region that would be clickable, is represented by a `<ref/>` +element. Though it is not specific to the *dictionaries* module, we include it +in this description of the toolbox because it is particularly useful in the +context of dictionaries. This element may have a `target` attribute which points +to the other resource to be accessed by the interested reader. + +###### Definitions + +The remaining part of entries is also usually the largest and represents the +content associated with the headword by the entry. In a dictionary, that is its +meaning.
+ +The `<sense/>` element is a valid child of `<entry/>` and groups together a +definition of the term with `<def/>`, usage examples with `<usg/>` (another use +of this versatile element) and other high-level information such as translations +into other languages. Both `<def/>` and `<usg/>` elements may appear directly +under the `<entry/>`. + +###### Structural remarks + +Before concluding this description of the *dictionaries* module from the +perspective of someone trying to concretely encode a particular dictionary or +encyclopedia, we make use of the graph approach again to highlight some of its +aspects in terms of inclusion structure. + +First, it is remarkable that all elements in the *dictionaries* module have a +cyclic inclusion path, that is to say, there is an inclusion path from each +element of this module to itself. Although having such a cycle is a widespread +property in the remainder of XML-TEI elements, shared by 73.8% of them (411 out +of the 557 elements in the other modules), all 33 elements of the *dictionaries* +module having one is far above this average. In addition, the cycles appear to +be rather short, with an average length of 2.00 versus 2.50 in the rest of the +population. This observation is all the more surprising considering the fact +that the *dictionaries* module contains short "leaf" elements like `<pos/>` +which should obviously not need to admit cycles since one rather expects them to +contain only one word, like `<pos>adj</pos>` in the example given in the +official documentation. Among those (shortest) cycles, 20 include the `<cit/>` +element, made to group quotations with a bibliographic reference to their source, +which should clearly be unnecessary to encode an article in the general case.
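
The cycle measurement described above can be illustrated with a minimal sketch. It assumes the schema has already been loaded as a plain adjacency mapping; the graph below is a toy one whose edges are invented for illustration and are not taken from the TEI schema, and the function name is hypothetical rather than the tooling actually used for the figures quoted in this section.

```python
from collections import deque


def shortest_cycle_length(graph, start):
    """Length of the shortest inclusion path from `start` back to itself, or None."""
    # Breadth-first search starting from the direct children of `start`.
    queue = deque((child, 1) for child in graph.get(start, ()))
    seen = set()
    while queue:
        node, depth = queue.popleft()
        if node == start:
            return depth
        if node in seen:
            continue
        seen.add(node)
        for child in graph.get(node, ()):
            queue.append((child, depth + 1))
    return None


# Toy inclusion graph: an edge a -> b means "<a/> may directly contain <b/>".
# These edges are invented for the example, NOT the real TEI content models.
TOY_GRAPH = {
    "entry": ["sense", "form"],
    "sense": ["sense", "cit"],
    "cit": ["pos"],
    "pos": ["cit"],
    "form": [],
}

cycles = {name: shortest_cycle_length(TOY_GRAPH, name) for name in TOY_GRAPH}
```

On this toy graph, an element that may contain itself directly gets a cycle of length 1, while an element with no path back to itself gets `None`; averaging the non-`None` values gives the kind of figure compared between modules above.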
+ +Secondly, although we have seen examples of connections from this module to the +rest of the XML-TEI, especially to the *core* module (see the case of the +`<ref/>` element above), the *dictionaries* module appears somewhat isolated +from important structural elements like `<head/>` or `<div/>`. Indeed, computing +all the paths from either `<entry/>` or `<sense/>` elements to the latter of +length less than or equal to 5 by a systematic traversal of the graph yields +exclusively paths (respectively 9042 and 39093 of them) containing either a +`<floatingText/>` or an `<app/>` element. The first one, as its name aptly +suggests, is used to encode text that does not quite fit the regular flow of the +document, as for example in the context of an embedded narrative. Both examples +displayed in the online documentation feature a `<body/>` as direct child of +`<floatingText/>`, neatly separating its content as independent. The purpose of +the second one, although its name — short for apparatus — is less clear, is to +wrap together several versions of the same excerpt, for instance when there are +several possible readings of an unclear group of words in a manuscript, or when +the encoder is trying to compile a single version of a piece of work from +several sources which disagree over some passage. In both cases, it appears +obvious that it is not something that is expected to occur naturally in the +course of an article in general. + +Thus, despite a rather dense internal connectivity, the *dictionaries* module +fails to provide encoders with a device to represent recursively nesting +structures like `<div/>`. + +The situation regarding subject indicators is hardly better outside of the +module. The `<domain/>` element, despite its name, belongs exclusively in the +header of a document and focuses on the social context of the text, not on the +knowledge area it covers.
The `<interp/>` element, despite its name, is not so much about +labeling something as an interpretation to give to a context (which subject +indicators could be if one considers that, placed at the beginning, they are used +to direct the mind frame of the readers towards a particular subject). Instead, +the documentation clearly presents it as a tool for annotators of a +document, whose text content is not part of the original document but some +additional result of an analysis performed in the context of the encoding, used +only through references in XML attributes. + +This point, although not the most concerning, still remains the hardest to +address but all things considered the `<usg/>` element stands out as the most +relevant. + +###### The notion of meaning + +Notwithstanding the correct way to represent domains of knowledge, their extent +itself raises concerns regarding the *dictionaries* module. Indeed, among the +vast collection of domains covered in encyclopedias in general and in *La Grande +Encyclopédie* in particular are historical articles and biographies. If the +notion of meaning can appear at least ill-fitting for a text describing a series +of historical events, one may still argue that it groups them into a concept and +associates it with the name of the event. But when it comes to relating the life +of a person, describing their relation to events and other persons strays +even further from the notion of meaning. Entries such as the one about SANJO +Sanetomi (see Figure @fig:sanjo) do not constitute a *definition*. + +{#fig:sanjo} + +Moreover, encyclopedias, because of all that they have inherited from the +philosophical Enlightenment, are not only spaces designed to assert, they also +intrinsically include an interrogative component.
Some articles lay down the +basis required to understand the complexity of an issue and invite the reader to +consider it without providing a definitive answer, going as far as to explicitly +use question marks as in the article "Action" displayed in Figure @fig:action. + +{#fig:action} + +In this extract, the author devises a hypothetical situation to illustrate how +difficult it is to draw the line between two supposedly mutually exclusive +subcategories of legal actions. The whole point of the passage is to convey the +idea that the term eludes definition; wrapping it in a `<sense/>`, or worse, a +`<def/>` element would be an utter misnomer. + +As a result, the use of `<sense/>` and `<def/>` is not appropriate for +encyclopedic content in general. + +###### Nested structures + +The final difficulty can be considered as a partial consequence of the previous +one on the structure of articles. The difficulty of defining complex concepts is +the very reason why authors approach their subjects from various angles, +circumnavigating them as a best approximation. This strategy favours long, +structured developments with sections and subsections covering the multiple +aspects of the topic: from a historical, political, scientific point of view… +The longest articles, such as article "Europe" shown in Figure @fig:europe, can +thus span several dozens of pages. They can contain substructures with titles on +at least three levels (for instance, a `a)` under a `1)` under a `I.`), each of +which is in turn generally developed over several paragraphs. + +{#fig:europe} + +The nested structure that we have just evidenced demands of course a nesting +structure to accommodate it.
More precisely, it guides our search for XML elements +by giving us several constraints: we are looking for a pair of elements, the +first representing a (sub)section must be able to include both itself and the +second element, which does not have any special constraint except the one to +have a semantics compatible with our purpose of using it to represent section +titles. In addition, the first element must be able to contain several `<p/>` +elements, `<p/>` being the reference element to encode paragraphs according to +the XML-TEI documentation. + +We have seen that the *dictionaries* module was equipped with a questionable but +possible element for subject domains. However, it does not include any element +for section titles. In the rest of the TEI specification, the elements `<head/>` +and `<title/>` — the latter with the possibility to set its `type` attribute to +`sub` — stand out as the best candidates for the semantics condition on the +second element. + +##### Choices + +###### Candidates in the *dictionaries* module + +Filtering the content of the module to keep only the elements which can at the +same time contain themselves, be included under `<entry/>` and include a `<p/>` +and either the `<head/>` or `<title/>` elements yields absolutely no candidates. +It is remarkable that even replacing the `<entry/>` element for the root of each +article with an `<entryFree/>`, an element supposed to relax some constraints to +accommodate more unusual structures in dictionaries, does not bring any +improvement. + +The lack of results from these simple queries forces us to somewhat relax the +constraints on the encoding we are willing to use. We can for instance make the +assumption that the occurrence of an intermediate element could be needed between +the element wrapping the whole article and the recursing one used to encode each +section.
This "section" element could also need a companion element to be able +to include itself, or, to formalise it in terms of graph theory, we could relax +the condition that this element admits a loop to consider instead cycles of a +given (small, since this still needs to represent a fairly direct inclusion) length to +be enough. We simultaneously extend the maximum depth of the inclusion paths we +are looking for between `<entry/>`, the pair of elements and the `<p/>` element. + +By setting this depth to 3, that is, by accepting one intermediate element to +occur in the middle of each one of the inclusion paths that define the structure +required to encode encyclopedic discourse, we find 21 elements but none of them +stands out as an obvious good solution: all paths to include the `<p/>` element +from any *dictionaries* element contain either a `<figure/>` (which we have +encountered earlier when we were practising our graph approach to search for +inclusions between `<entry/>` and `<entryFree/>` and dismissed as not useful in +general), a `<stage/>` (reserved to stage direction in dramatic works) or a +`<state/>` (used to describe a temporary quality in a person or place), again +not even close to what we want. The paths to either `<head/>` or `<title/>` are +similarly disappointing. Again, changing `<entry/>` for `<entryFree/>` returns +the exact same candidates. If that is not a thorough proof that none of these +elements could fulfil our purpose, it is a fact that no element in this module +appears as an obvious good solution and a serious hint to keep looking somewhere +else. + +###### Widening the search + +We hence widen our search to include elements outside the *dictionaries* module +which could be used to encode our sections and subsections, under the same +constraint as before to try and find a composite solution that would remain +under the `<entry/>` element even if resorting to subcomponents outside of the +dedicated module.
Only three elements are returned: `<figure/>`, `<metamark/>` +and `<note/>`. + +The first one, as we have repeatedly underlined, is meant for graphic information +and is not suitable for text content in general. + +The purpose of `<metamark/>` is to transcribe the editing marks that may appear +on a particular primary source in order to alter the normal flow of the text and +suggest an alternative reading (deletion, insertion, reordering, this is about a +human editing the text from a given physical copy of it), but it is +unfortunately of no use to encode a section of an article. + +The first element that might at least resemble what we are looking for is the +last one, `<note/>`. It is meant to contain text, is about explaining something +and seems general enough (not specific to a given genre, or to the occurrence of +a particular object on the page). Unfortunately, its semantics still seems a bit +off compared to our need. The documentation describes it as an "additional +comment" which appears "out of the main textual stream" whereas the long +developments in articles are the very matter of the text of encyclopedias, not +mere remarks in the margins or at the foot of pages. + +##### Implementation + +The above remarks explain why the *dictionaries* module is unable to represent +encyclopedias, where the notion of "meaning" is less central than in +dictionaries and where discourse with nested structures of arbitrary depth can +occur. Even composite encodings using elements outside of the *dictionaries* +module under an `<entry/>` element do not meet our requirements. Since the +*core* module of course accommodates these structures by means of the `<div/>`, +`<head/>` and `<p/>` elements, which have the additional advantage of carrying +less semantic payload than `<sense/>` or `<def/>`, we devise an encoding scheme +using them which we recommend using for other projects aiming at representing +encyclopedias.
+ +To remain consistent with the above remarks we will only concern ourselves with +what happens at the level of each article, right under the `<body/>` element. +Everything related to metadata happens as expected in the file's `<teiHeader/>` +which is well enough equipped to handle them. In order to present our scheme +throughout the following section we will be progressively encoding a reference +article, "Cathète" from tome 9 reproduced in Figure @fig:cathete-photo. + +{#fig:cathete-photo} + +###### The scheme + +Remaining within the *core* module for the structure, almost all useful elements +are available and our encoding scheme merely quotes the official documentation. +Each article is represented by a `<div/>`. We suggest setting an `xml:id` +attribute on it with the headword of the entry — unique in the whole corpus, or +made so by suffixing a number representing its rank among the various +occurrences, even when there's only one for the sake of regularity — as its +value, normalised to lowercase, stripping spaces and replacing all +non-alphanumeric characters by a dash (`'-'`) to avoid issues with the XML +encoding. Figure @fig:cathete-xml-0 illustrates this choice for the container +element on the article "Cathète" previously displayed. + +{#fig:cathete-xml-0} + +Inside this element should be a `<head/>` enclosing the headword of the article. +The usual sub-`<hi/>` elements are available within `<head/>` if the headword is +highlighted by any special typographic means such as bold, small capitals, etc. +The one disappointment of the encoding scheme we are defining in this chapter is +the lack of support for a proper way to encode subject indicators. + +The best candidate we have found so far was `<usg/>` from the *dictionaries* +module but it is not available directly under a `<head/>` element.
All inclusion +paths from the latter to the former of length less than or equal to 3 contain +irrelevant elements (`<cit/>`, `<figure/>`, `<castList/>` and `<nym/>`) so it +must be discarded. The next best elements appear to be `<term/>` (not very +accurate) and `<rs/>` ("referring string", quite a general semantics but a +possible match — subject indicators refer to a given domain of knowledge — +although all the examples in the documentation refer to concrete persons, +places or objects, not to the abstract objects that mathematics or poetry are). + +For this reason, we do not recommend any special encoding of the subject +indicator but leave it open to each particular context: they are often +abbreviated so an `<abbr/>` may apply; in *La Grande Encyclopédie*, biographies +are not labeled by a knowledge domain but usually include the first name of the +person when it is known so in that case an element like `<persName/>` is still +appropriate. This choice applied to the same article "Cathète" produces Figure +@fig:cathete-xml-1. + +{#fig:cathete-xml-1} + +We then propose to wrap each different meaning in a separate `<div/>` with the +`type` attribute set to `sense` to refer to the `<sense/>` element that would +have been used within the *dictionaries* module. The `<div/>`s should be numbered +according to the order they appear in with the `n` attribute starting from `0` +as shown in Figure @fig:cathete-xml-2. + +{#fig:cathete-xml-2} + +In addition, each line within the article must start with a `<lb/>` to mark its +beginning including before the `<head/>` element as demonstrated by Figure +@fig:cathete-xml-3, which, although a surprising setup, underlines the fact that +in the dense layout of encyclopedias, the carriage return separating two +articles is meaningful.
Stating each new line explicitly keeps enough +information to reconstruct a faithful facsimile but it also has the advantage of +highlighting the fact that even though the definition is cut from the headword +by being in a separate XML element, they still occur on the same line, which is +a typographic choice usually made both in encyclopedias and dictionaries where +space is at a premium. + +To complete the structure, the various sections and subsections occurring +within the article body may be nested as usual with `<div/>` and sub-`<div/>`s, +filled with `<p/>` for paragraphs which can each be titled with `<head/>` +elements local to each `<div/>`. + +{#fig:cathete-xml-3} + +Some articles such as "Boumerang" have figures with captions, as illustrated by +Figure @fig:boumerang-photo, which should be encoded the standard way by +`<figure/>` and `<figDesc/>` as in Figure @fig:boumerang-xml. + +{height=300px #fig:boumerang-photo} + +{#fig:boumerang-xml} + +Another issue arising from giving up on `<entry/>` is the unavailability of the +`<xr/>` element, not allowed under any of the *core* elements we use but which +is useful to represent cross-references occurring in encyclopedias as well as in +dictionaries, for example in article "Gelocus" (see Figure @fig:gelocus-photo). +We prefer to use the `<ref/>` element instead which is available in the context +of a `<p/>`. Its `target` attribute should be set to the `xml:id` of the +article it points to, prefixed with a `'#'` as shown in Figure @fig:gelocus-xml. +Another solution would have been to introduce a `<dictScrap/>` element for the +sole purpose of placing an `<xr/>` but we advocate against it on account of the +verbosity it would add to the encoding and the fact that it implicitly suggests +that the previous context was not the one of a dictionary.
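
The `target` convention described above, combined with the `xml:id` normalisation proposed earlier for each article's `<div/>`, can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, the rank is assumed to be appended after a dash, and accented letters are assumed to count as alphanumeric (which is how Python's `str.isalnum` behaves); the actual encoder may make different choices.

```python
def article_id(headword, rank=0):
    """Build an xml:id from a headword: lowercase, no spaces, dashes elsewhere."""
    # Lowercase the headword and strip every whitespace character.
    compact = "".join(headword.lower().split())
    # Replace anything that is not a letter or a digit with a dash.
    dashed = "".join(c if c.isalnum() else "-" for c in compact)
    # Assumption: the rank is always appended, separated by a dash.
    return f"{dashed}-{rank}"


def ref_target(headword, rank=0):
    """Value for the target attribute of a <ref/> pointing to that article."""
    return "#" + article_id(headword, rank)


cathete_id = article_id("Cathète")
gelocus_target = ref_target("Gelocus")
```

Keeping the rank suffix even for unique headwords means a `target` can be computed from the headword alone as long as the rank is known, without first checking whether homonyms exist.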
+ +{#fig:gelocus-photo} + +{#fig:gelocus-xml} + +A typical page of an encyclopedia also features peritext elements, giving +information to the reader about the current page number along with the headwords +of the first and last articles appearing on the page. Those can be encoded by +`<fw/>` elements ("forme work") whose `place` and `type` attributes should be +set to position them on the page and identify their function if it has been +recognised (those short elements on the border of pages are the ones typically +prone to suffer damage or be misread by the OCR). + +Finally there are other TEI elements useful to represent "events" in the flow of +the text, like the beginning of a new column of text or of a new page. Figure +@fig:alcala-photo shows the top left of the last page of the first tome of *La +Grande Encyclopédie* which features peritext elements while marking the +beginning of a new page. The usual appropriate elements (`<pb/>` for page +beginning, `<cb/>` for column beginning) may and should be used with our +encoding scheme as demonstrated by Figure @fig:alcala-xml. + +{width=350px #fig:alcala-photo} + +{#fig:alcala-xml} + +###### Currently implemented + +The reference implementation for this encoding scheme is the program +soprano[^soprano] developed within the scope of project DISCO-LGE to +automatically identify individual articles in the flow of raw text from the +columns and to encode them into XML-TEI files. Though this software has already +been used to produce the first TEI version of *La Grande Encyclopédie*, it does +not yet follow the above specification perfectly. Figure +@fig:cathete-xml-current shows the encoded version of article "Cathète" it +currently produces: + +[^soprano]: [https://gitlab.huma-num.fr/disco-lge/soprano](https://gitlab.huma-num.fr/disco-lge/soprano) + +{#fig:cathete-xml-current} + +The headword detection system is not able to capture the subject indicators yet +so they appear outside of the `<head/>` element.
No work is performed either to +expand abbreviations and encode them as such, or to distinguish between domain +and people names. + +Likewise, since the detection of titles at the beginning of each section is not +complete, no structure analysis can be performed at the moment on the textual +development inside the article and it is left unstructured, directly under the +entry's `<div/>` element instead of under a set of nested `<div/>` elements. The +paragraphs are not yet identified and for this reason not encoded. + +However, the figures and their captions are already handled correctly when they +occur. The encoder also keeps track of the current lines, pages, and columns and +inserts the corresponding empty elements (`<lb/>`, `<pb/>` or `<cb/>`) and +numbers pages so that the numbering corresponding to the physical pages is +available, as compared to the "high-level" page numbers inserted by the +editors, which start with an offset because the first, blank or almost empty +pages at the beginning of each book do not have a number and which sometimes have +gaps when a full-page geographical map is inserted since those are printed +separately on a different folio which remains outside of the textual numbering +system. The place at which these layout-related elements occur is determined by +the place where the OCR software detected them and by the reordering performed +by `soprano` when inferring the reading order before segmenting the articles. + +###### The constraints of automated processing + +Encyclopedias are particularly long books, spanning numerous tomes and +containing several tens of thousands of articles.
The *Encyclopédie* comprises +over 74k articles and *La Grande Encyclopédie* certainly more than 100k (the latest +version produced by `soprano` created 160k articles, but their segmentation is +still not perfect and while some article beginnings remain undetected, all the very +long and deeply structured articles are unduly split into many parts, resulting +overall in an overestimation of the total number). + +XML-TEI is a very broad tool useful for very different applications. Some +elements like `<unclear/>` or `<factuality/>` can encode subtle semantic +information (for the second one, adjacent to a notion as elusive as truth) +which requires a very deep understanding of a text in its entirety and about +which even some human experts may disagree. + +For these reasons, a central concern in the design of our encoding scheme was to +remain within the boundaries of information that can be described objectively +and extracted automatically by an algorithm. Most of the tags presented above +contain information about the positions of the elements or their relation to one +another. Those with an additional semantic implication like `<head/>` can be +inferred simply from their position and the frequent use of a special typography +like bold or upper-case characters. + +The case of cross-references is particular and may appear as a counter-example +to the main principle on which our scheme is based. Actually, the process of +linking from an article to another one is so frequent (in dictionaries as well +as in encyclopedias) that it generally escapes the scope of regular discourse to +take a special and often fixed form, inside parentheses and after a special +token which invites the reader to perform the redirection.
In *La Grande +Encyclopédie*, virtually all the redirections (that is, to the extent of our +knowledge, absolutely all of them though of course some special cases may exist, +but they are statistically rare enough that we have not found any yet) appear +within parentheses, and start with the verb "voir" abbreviated as a single, +capital "V." as illustrated above in the article "Gelocus" (see again Figure +@fig:gelocus-photo). + +Although this has not been implemented yet either, we hope to be able to detect +and exploit those patterns to correctly encode cross-references. Getting the +`target` attributes right is certainly more difficult to achieve and may require +processing the articles in several steps, to first discover all the existing +headwords — and hence article IDs — before trying to match the words following +"V." with them. Since our automated encoder handles tomes separately and since +references may cross the boundaries of tomes, it cannot wait for the target of a +cross-reference to be discovered by keeping the articles in memory before +outputting them. + +This is in line with the last important aspect of our encoder. Although many +lexicographers may deem our encoding too shallow, it has the advantage of not +requiring overly complex data structures to be kept in memory for a long time. The +algorithm implementing it in `soprano` outputs elements as soon as it can, for +instance the empty elements already discussed above. For articles, it pushes +lines onto a stack and flushes it each time it encounters the beginning of the +following article. This allows the amount of memory required to remain +reasonable and even lets the tomes be processed in parallel on most modern machines. Thus, +even taking over three minutes per tome, the total processing time can be +lowered to around forty minutes on a machine with 16 GB of RAM for the whole of +*La Grande Encyclopédie* instead of over one hour and a half.
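
The push-and-flush strategy described above can be sketched as a generator. This is a minimal illustration, not `soprano`'s actual code: the function names are hypothetical, the headword heuristic (a first token written in upper case) is a deliberately crude assumption, and the sample lines are placeholders rather than text from the encyclopedia.

```python
def segment(lines, is_article_start):
    """Yield each article as a list of lines, holding at most one article in memory."""
    pending = []
    for line in lines:
        # A new article begins: flush the lines accumulated for the previous one.
        if is_article_start(line) and pending:
            yield pending
            pending = []
        pending.append(line)
    # Flush whatever remains once the input is exhausted.
    if pending:
        yield pending


def looks_like_headword(line):
    # Hypothetical heuristic: the first token has at least two characters
    # and is entirely in upper case.
    first = line.split(" ", 1)[0].strip(".,")
    return len(first) > 1 and first.isupper()


# Placeholder lines, not actual content from the corpus.
sample = ["CATHÈTE. first line", "second line", "GELOCUS. single line"]
articles = list(segment(sample, looks_like_headword))
```

Because each article is emitted as soon as the next headword is seen, memory use is bounded by the longest article rather than by the size of a tome, which is also what makes it cheap to run one such process per tome.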
+ + diff --git a/Corpus/text.sh b/Corpus/text.sh new file mode 100755 index 0000000000000000000000000000000000000000..1206b755369d803cfa908d868493c6f1eb8c883b --- /dev/null +++ b/Corpus/text.sh @@ -0,0 +1,7 @@ +#!/bin/sh + +source ./chapter.sh 'Préparation et enrichissement du corpus' + +cat Corpus/Formats_et_états.md +cat Corpus/Domaines.md +cat Corpus/Annotation.md diff --git a/Glossaire/OCR.md b/Glossaire/OCR.md new file mode 100644 index 0000000000000000000000000000000000000000..9c0ac9758384ff41716ef93d5587f47d091d0092 --- /dev/null +++ b/Glossaire/OCR.md @@ -0,0 +1,7 @@ +OCR + +: *Optical Character Recognition*, reconnaissance optique de caractères, est +le procédé par lequel un logiciel extrait du texte, c'est à dire une suite de +caractères compréhensibles par la machine et traitables ensuite par des moyens +automatiques, à partir d'une image. + diff --git a/Glossaire.md b/Glossaire/OLR.md similarity index 73% rename from Glossaire.md rename to Glossaire/OLR.md index 2798cb54bba2e3bd5cd32e139099006456be56e6..9e4665549db43d1f312186708a0c2822a4cfa5e3 100644 --- a/Glossaire.md +++ b/Glossaire/OLR.md @@ -1,12 +1,3 @@ -# Glossaire {-} - -OCR - -: *Optical Character Recognition*, reconnaissance optique de caractères, est -le procédé par lequel un logiciel extrait du texte, c'est à dire une suite de -caractères compréhensibles par la machine et traitables ensuite par des moyens -automatiques, à partir d'une image. 
- OLR : *Optical Layout Recognition*, reconnaissance optique de la disposition de la diff --git a/Glossaire/text.sh b/Glossaire/text.sh new file mode 100755 index 0000000000000000000000000000000000000000..fd4527981dcd1dfa0268ad07b5e13f66e1d25db4 --- /dev/null +++ b/Glossaire/text.sh @@ -0,0 +1,8 @@ +#!/bin/sh + +[ -n "${HEADER_INCLUDED}" ] || source ./header.sh 2 + +echo '# Glossaire {-}' + +cat Glossaire/OCR.md +cat Glossaire/OLR.md diff --git "a/G\303\251ographie/Contours.md" "b/G\303\251ographie/Contours.md" new file mode 100644 index 0000000000000000000000000000000000000000..878c6f5c86648d16ca26b0404042485803cbe19b --- /dev/null +++ "b/G\303\251ographie/Contours.md" @@ -0,0 +1,11 @@ +## Tracer le contours de la géographie + +### Établir une correspondance + +Empiriquement: + + avec TXM, Lexicoscope, etc., est-ce qu'on voit les mêmes propriétés + + machine learning + +### La biographie cachée + + diff --git "a/G\303\251ographie/ENE.md" "b/G\303\251ographie/ENE.md" new file mode 100644 index 0000000000000000000000000000000000000000..2a76f9049f0fa35d395ab11e3ece2b93f4f7fa4a --- /dev/null +++ "b/G\303\251ographie/ENE.md" @@ -0,0 +1,9 @@ +## Entités Nommées Étendues + +Intro Édl'A: c'est coûteux d'annoter @ortiz_suarez_data-driven_2022 + +### Travaux sur les GNNs + +Qu'est-ce qu'on en a retiré ? 
+ + diff --git "a/G\303\251ographie.md" "b/G\303\251ographie/Relations_entre_domaines.md" similarity index 98% rename from "G\303\251ographie.md" rename to "G\303\251ographie/Relations_entre_domaines.md" index c35139f1185a6d61a2b9af53663bf6b5f1fe7516..992c20b94a278ae6f650578767b4532af6b361ae 100644 --- "a/G\303\251ographie.md" +++ "b/G\303\251ographie/Relations_entre_domaines.md" @@ -1,24 +1,3 @@ -# Identifier et problématiser la géographie - -## Relation entre spatial et géographique - --> questionnement d'une frontière même - -(structuration de la géographie) - -## Tracer le contours de la géographie - -### Établir une correspondance - -Empiriquement: - + avec TXM, Lexicoscope, etc., est-ce qu'on voit les mêmes propriétés - + machine learning - -### La biographie cachée - -## Variété des genres discursifs au sein des articles - - ## Relations entre les domaines de connaissances ### Erreurs de classification @@ -735,11 +714,4 @@ differences we have underlined show that size alone cannot explain their distribution in detail. The model does seem to identify some classes more easily because of distinctive lexical patterns. -## Entités Nommées Étendues - -Intro Édl'A: c'est coûteux d'annoter @ortiz_suarez_data-driven_2022 - -### Travaux sur les GNNs - -Qu'est-ce qu'on en a retiré ? 
diff --git "a/G\303\251ographie/Spatial_et_g\303\251ographie.md" "b/G\303\251ographie/Spatial_et_g\303\251ographie.md" new file mode 100644 index 0000000000000000000000000000000000000000..c8b326668fd2c86828d5818d29467bf609012aae --- /dev/null +++ "b/G\303\251ographie/Spatial_et_g\303\251ographie.md" @@ -0,0 +1,7 @@ +## Relation entre spatial et géographique + +-> questionnement d'une frontière même + +(structuration de la géographie) + + diff --git "a/G\303\251ographie/Vari\303\251t\303\251_des_genres_discursifs.md" "b/G\303\251ographie/Vari\303\251t\303\251_des_genres_discursifs.md" new file mode 100644 index 0000000000000000000000000000000000000000..2fc3e16ad990651255309895ba0c7641ac3db91a --- /dev/null +++ "b/G\303\251ographie/Vari\303\251t\303\251_des_genres_discursifs.md" @@ -0,0 +1,4 @@ +## Variété des genres discursifs au sein des articles + + + diff --git "a/G\303\251ographie/text.sh" "b/G\303\251ographie/text.sh" new file mode 100755 index 0000000000000000000000000000000000000000..f6d893f45ccc8699725108482822a2940fbfe627 --- /dev/null +++ "b/G\303\251ographie/text.sh" @@ -0,0 +1,9 @@ +#!/bin/sh + +source ./chapter.sh 'Identifier et problématiser la géographie' + +cat Géographie/Spatial_et_géographie.md +cat Géographie/Contours.md +cat Géographie/Variété_des_genres_discursifs.md +cat Géographie/Relations_entre_domaines.md +cat Géographie/ENE.md diff --git a/Introduction.md b/Introduction.md deleted file mode 100644 index 136c48f8eb3dafedbd610b16dae7979fe465fb58..0000000000000000000000000000000000000000 --- a/Introduction.md +++ /dev/null @@ -1,57 +0,0 @@ ---- -title: Méthodes et outils pour l'étude diachronique des discours géographiques dans les encyclopédies françaises -author: Alice \textsc{Brenon} -documentclass: report -classoptions: - - french - - a4paper - - 11pt -numbersections: true -header-includes: - - \setcounter{tocdepth}{2} - - \setcounter{secnumdepth}{2} - - \usepackage{textalpha} - - \usepackage{geometry} - - \usepackage{caption} - - 
\usepackage{subcaption} ---- - -\tableofcontents - -\newpage - -# Introduction {-} - -## Cadre de cette thèse - -### Le genre encyclopédique - -L'«esprit encyclopédique» [@Macary1973_MACLDU] - -Les précurseurs de l'EDdA : Basnage [@galleron_tenir_2022] dont est issu le -Trevoux [@le_guern_caief_0571_5865_1983_num_35_1_2402], qui se posera comme un -farouche opposant de l'EDdA [@morin_rde_0769_0886_1989_num_7_1_1034]. L'EDdA ne -devait initialement être qu'une traduction de Chambers -[@kafker_andre_francois_2016]. - -### La géographie, une science en recomposition - -Période intermédiaire marquée par une professionnalisation -[@rey_professionnalisation_2022] de l'encyclopédisme - -### Le projet GÉODE - -Notre corpus de 4 encyclopédies que nous avons choisies -> celles que j'ai pu -regarder et pourquoi - -## Contributions - -### Version numérique structurée de LGE - -Segmentation (premier résultat par rapport à la version de base — "baseline" — -de fin de Collex-Persée, pour tâche de segmentation). Visée patrimoniale, outil -pour les chercheur·ses en SHS (recherche par vedette). - -### La biographie dans l'EDdA - -### Motifs discursifs géographiques diff --git a/Introduction/Cadre.md b/Introduction/Cadre.md new file mode 100644 index 0000000000000000000000000000000000000000..8aa8d47d80a4c4e52d1bfbabed9849a3663e011c --- /dev/null +++ b/Introduction/Cadre.md @@ -0,0 +1,22 @@ +## Cadre de cette thèse + +### Le genre encyclopédique + +L'«esprit encyclopédique» [@Macary1973_MACLDU] + +Les précurseurs de l'EDdA : Basnage [@galleron_tenir_2022] dont est issu le +Trevoux [@le_guern_caief_0571_5865_1983_num_35_1_2402], qui se posera comme un +farouche opposant de l'EDdA [@morin_rde_0769_0886_1989_num_7_1_1034]. L'EDdA ne +devait initialement être qu'une traduction de Chambers +[@kafker_andre_francois_2016]. 
+ +### La géographie, une science en recomposition + +Période intermédiaire marquée par une professionnalisation +[@rey_professionnalisation_2022] de l'encyclopédisme + +### Le projet GÉODE + +Notre corpus de 4 encyclopédies que nous avons choisies -> celles que j'ai pu +regarder et pourquoi + diff --git a/Introduction/Contributions.md b/Introduction/Contributions.md new file mode 100644 index 0000000000000000000000000000000000000000..5e3338a04ef595b987f2705995613769014bcaf9 --- /dev/null +++ b/Introduction/Contributions.md @@ -0,0 +1,11 @@ +## Contributions + +### Version numérique structurée de LGE + +Segmentation (premier résultat par rapport à la version de base — "baseline" — +de fin de Collex-Persée, pour tâche de segmentation). Visée patrimoniale, outil +pour les chercheur·ses en SHS (recherche par vedette). + +### La biographie dans l'EDdA + +### Motifs discursifs géographiques diff --git a/Introduction/text.sh b/Introduction/text.sh new file mode 100755 index 0000000000000000000000000000000000000000..dac89bb3c2a8712ff2156c07b521fd8c4633abab --- /dev/null +++ b/Introduction/text.sh @@ -0,0 +1,7 @@ +#!/bin/sh + +source ./chapter.sh "Introduction {-}" + +cat Introduction/Cadre.md +cat Introduction/Contributions.md + diff --git a/Makefile b/Makefile index f9d650d9eb386bc6ca09dd87dae25c1104dd46bf..7044dbcabf2340d6ad620d9c749fca7f52d9243c 100644 --- a/Makefile +++ b/Makefile @@ -1,21 +1,33 @@ DOCUMENT = Manuscrit -CHAPTERS = Introduction ÉdlA Corpus Géographie Contrastes Conclusion Glossaire Bibliographie -SOURCES = $(CHAPTERS:%=%.md) +CHAPTERS = Introduction ÉdlA Corpus Géographie Contrastes Conclusion Glossaire +SOURCES = $(CHAPTERS:%=%/text.sh) BIBLIOGRAPHY = biblio.bib SNIPPETS = $(wildcard src/*.md) GRAPHS = $(wildcard src/*.gv) -#PICTURES = $(action_t1 arbre boumerang_t7 cathète_t9 europe_t16 gelocus_t18 last_page_top_left_t1 sanjo_t29:%=article/%) -#FIGURES = $(PICTURES:%=figure/%.png) $(GRAPHS:src/%.gv=figure/%.png) $(SNIPPETS:src/%.md=figure/%.png) 
-FIGURES = $(shell sed -n 's@.*(\(figure/.*.\(png\|jpe?g\)\)).*@\1@p' $(SOURCES)) +FIGURES = $(shell find $(CHAPTERS) -type f -name '*.md' -exec cat '{}' \; | sed -n 's@.*(\(figure/.*\.\(png\|jpe\?g\)\)).*@\1@p') CSL = apa.csl FILTERS = pandoc-fignos +LUA_FILTERS = ./filters/with-bibliography.lua +WITH_FILTERS = $(FILTERS:%=--filter %) $(LUA_FILTERS:%=--lua-filter %) FILTER_SCRIPTS = glossary SCRIPTS = $(FILTER_SCRIPTS:%=scripts/%) +DEPENDENCIES=$(SCRIPTS) $(FIGURES) $(BIBLIOGRAPHY) + +.SECONDEXPANSION: + +sources = $(shell find $(1) -type f -name '*.md') +chapter-sources = $(call sources,$*) + all: $(DOCUMENT).pdf -$(DOCUMENT).pdf: $(SCRIPTS) $(FIGURES) $(BIBLIOGRAPHY) $(SOURCES) - cat $(SOURCES) $(SCRIPTS:%=| %) | pandoc $(FILTERS:%=--filter %) --citeproc --bibliography=$(BIBLIOGRAPHY) --csl=$(CSL) -o $@ +$(CHAPTERS:%=%.pdf): + +$(DOCUMENT).pdf: $(DOCUMENT).sh $(foreach chapter,$(CHAPTERS),$(call sources,$(chapter))) $(SCRIPTS) $(FIGURES) $(BIBLIOGRAPHY) + ./$(DOCUMENT).sh $(SCRIPTS:%=| %) | pandoc $(WITH_FILTERS) -o $@ + +%.pdf: %/text.sh $${chapter-sources} $(SCRIPTS) $(FIGURES) $(BIBLIOGRAPHY) + $< $(SCRIPTS:%=| %) | pandoc $(WITH_FILTERS) -o $@ figure/%.png: src/%.gv dot -Tpng $< -o $@ diff --git a/Manuscrit.sh b/Manuscrit.sh new file mode 100755 index 0000000000000000000000000000000000000000..01c7075fc61ceed42ce33e777fbd1a329667a5ce --- /dev/null +++ b/Manuscrit.sh @@ -0,0 +1,18 @@ +#!/bin/sh + +. 
./header.sh + +cat <<EOF +\\tableofcontents +\\newpage +EOF + + +Introduction/text.sh +ÉdlA/text.sh +Corpus/text.sh +Géographie/text.sh +Contrastes/text.sh +Conclusion/text.sh +Glossaire/text.sh + diff --git a/chapter.sh b/chapter.sh new file mode 100644 index 0000000000000000000000000000000000000000..b5159b91c78d20290be9fd3d9fc50203a4650b68 --- /dev/null +++ b/chapter.sh @@ -0,0 +1,5 @@ +[ -n "${HEADER_INCLUDED}" ] || source ./header.sh 2 + +echo "# ${1}" +echo '\etocsettocstyle{\rule{\linewidth}{\tocrulewidth}\vskip0.5\baselineskip}{\rule{\linewidth}{\tocrulewidth}}' +echo '\localtableofcontents' diff --git a/filters/with-bibliography.lua b/filters/with-bibliography.lua new file mode 100644 index 0000000000000000000000000000000000000000..e242a9d7527cdfea4fe4e2a5d3911fc5196915a8 --- /dev/null +++ b/filters/with-bibliography.lua @@ -0,0 +1,20 @@ +function Pandoc(doc) + level = doc.meta['bibliography-level'] + if level == nil then + level = 1 + else + level = level[1].text + end + + title = doc.meta['bibliography-title'] + if title == nil or title == '' then + error("The bibliography-title metadata parameter hasn't been defined") + end + + doc.blocks:extend({pandoc.Header( + level, + title, + {id = 'bibliography', class = 'unnumbered'} + )}) + return pandoc.utils.citeproc(doc) +end diff --git a/header.sh b/header.sh new file mode 100644 index 0000000000000000000000000000000000000000..ea9e184fb69a993f75fc310cef7b5b2db485155e --- /dev/null +++ b/header.sh @@ -0,0 +1,9 @@ +echo '---' +if [ -z "${1}" ] +then + cat manuscrit.yml +else + sed 1,${1}d manuscrit.yml +fi +echo '---' +export HEADER_INCLUDED=Y diff --git a/manuscrit.yml b/manuscrit.yml new file mode 100644 index 0000000000000000000000000000000000000000..8a4864de98061ac06265a5edc3a52817b343bcb8 --- /dev/null +++ b/manuscrit.yml @@ -0,0 +1,24 @@ +title: Méthodes et outils pour l'étude diachronique des discours géographiques dans les encyclopédies françaises +author: Alice \textsc{Brenon} +documentclass: report 
+classoptions: + - a4paper + - 11pt +numbersections: true +bibliography: biblio.bib +csl: apa.csl +link-citations: true +bibliography-level: 1 +bibliography-title: Bibliographie +header-includes: + - \usepackage[french]{babel} + - \setcounter{tocdepth}{2} + - \setcounter{secnumdepth}{2} + - \usepackage{textalpha} + - \usepackage{geometry} + - \usepackage{caption} + - \usepackage{subcaption} + - \usepackage{etoc} + - \newlength\tocrulewidth + - \setlength{\tocrulewidth}{1.5pt} + diff --git "a/\303\211dlA.md" "b/\303\211dlA.md" deleted file mode 100644 index 9948b96da2ed2f4f51574b30e38427a342a9d5a2..0000000000000000000000000000000000000000 --- "a/\303\211dlA.md" +++ /dev/null @@ -1,392 +0,0 @@ -# État de l'art - -## Textométrie - -### Cadre - -Origine via l'«École Française» de Benzécri [@benzecri__analyse_1973] tout à -fait du côté mathématique / statistiques. Initialement, ça ne concerne que les -mots bruts (les formes), puis la technologie permet de traiter du texte annoté -(morpho-syntaxe puis syntaxe), faisant émerger la linguistique de corpus -[@nazarenko_hal_00619268]. - -Différentes modèles de distribution statistique des mots sont employées: khi2, -loi de Poisson. @lafon_sur_1980 propose l'emploi d'une loi hypergéométrique -(choix qui restera dans la conception de TXM [@heiden2010]). - -L'ouvrage fondateur traite de l'utilisation des corpus annotés en commentant une -étude de discours de Mitterrand [@Labb1983FranoisM] \(un précurseur du corpus -des Vœux de TXM [@heiden2010] ?), puis des dimensions transversales et de l'usage -contrastif dans le cadre d'études diachroniques et enfin traite de la -constitution des corpus eux-même. L'horizon est à l'époque le million de mots -(notre corpus parallèle, 8 millions de tokens). - -### Contrastes - -Sur la constitution des corpus @pincemin_heterogeneite_2012 avertit qu'il est -plus qu'un agglomérat de textes, tout en mentionnant une approche *WAC* -privilégiant les volumes sur une construction délibérée. 
Notre étude se situe un -peu entre les deux j'imagine ? Pas de place pour des textes non-encyclopédiques -pour contraster, et un peu les articles qu'on peut récupérer dans l'état dans -lequel on peut les récupérer. - -@laramee_production_2017 emploie une démarche contrastive pour faire opposer les -tomes de l'EDdA et mettre en évidence le rôle des différents auteurs. - -### Arbre lexico-syntaxiques récurrents - -On commence à mentionner dans @nazarenko_hal_00619268 des «stéréotypes» - -Ils sont basés sur les notions de collocations @fellbaum_idioms_2007 puis de motif -@longree_les_2008 - -sont un processus récursif et permettent de s'abstraire des réalisation de -surface contigentes à une langue @tutin_routines_2016 - -### Possibilités - -Des tournures de phrases peuvent être liées à des genres, ce qui peut être -révélé par une étude contrastive @kraif_constructions_2016, -@gonon_phraseologismes_2020 similaire à notre objectif. - -## La place de la géographie - - -## Genre textuel - -### Saisir la notion de genre - -@beauvisage_2001 explore le genre policier ($\rightarrow$ à lire pour voir s'il -y a une caractérisation intéressante de la notion de «genre») - -### Le cas de la lexicographie - -Les dictionnaires entretiennent une relation étroite avec la notion de -collocations et de phraséologismes: les entrées sont d'autant plus utiles qu'elles tiennent compte des -phraséologismes existant dans la langue, des modèles de langue - -@zhu_discours_2022 s'intéresse à la structure propre aux dictionnaires qui met -en relation un terme et une définition. @loiseau_dictionnaires_2011 - -If encyclopedias are thus historically more recent than dictionaries they also -depart from the latter on their approach. 
The purpose of dictionaries from their -origin is to collect words, to make an exhaustive inventory of the terms used in -a domain or in a language in order to associate a *definition* to them, be it a -translation in another language for a foreign language dictionary or a phrase -explaining it for other dictionaries. As such, they are collections of *signs* -and remain within the linguistic level of things. Entries in a dictionary often -feature information such as the part of speech, the pronunciation or the -etymology of the word they define. - -The entry for "Dictionnaire" in the *Encyclopédie* distinguishes between three -types of dictionaries: one to define *words*, the second to define *facts* and -the last one to define *things*, corresponding to the distinction between -language, history, and science and arts dictionaries although according to its -author, d'Alembert, each has to be of more than just one kind to be really good. -In the full title of the *Encyclopédie*, the concept is more or less equated by -means of the coordinating conjunction "ou" to a *Dictionnaire raisonné*, -"reasoned dictionary", introducing the idea of encyclopedias as dictionaries -with additional structure and a philosophical dimension. - -Back to the "Encyclopédie" article we read that a dictionary remaining strictly -at the language level, a vocabulary, can be seen as the empty frame required for -an encyclopedic dictionary that will fill it with additional depth. Given how -d'Alembert insists on the importance of brevity for a clear definition in the -"Dictionnaire de Langues" entry, it is clear that the *encyclopédistes* did not -consider encyclopedias superior to dictionaries but really as a new subgenre -departing from them in terms of purpose. 
- -The first immediately visible feature that sets encyclopedias apart from -dictionaries and can be found in the *Encyclopédie* as well as in *La Grande -Encyclopédie* is the presence of subject indicators at the beginning of articles -right after the headword which organise them into a domain classification -system. Those generally cover a broad range of subjects from scientific -disciplines to litterature, and extending to political subjects and law. - -No element in the *dictionaries* module is explicitely designed for the purpose -of encoding these indicators. As we have seen above, the elements set is geared -towards the words themselves instead of the concept they represent. The closest -tool for what we need is found in the `<usg/>` element used with a specific -`type` attribute set to `dom` for "domain". Indeed several examples from the -documentation encode subject indicators very similar to the ones found in -encyclopedias within this element, but the match is not perfect either: all -appear within one of multiple senses, as if to clarify each context in which the -word can be used, as expected from the element's name, "usage". In -encyclopedias, if the domain indicator does in certain cases help to distinguish -between several entries sharing the same headword, the concept itself has -evolved beyond this mere distinction. Looking back at the *Encyclopédie*, the -adjective *raisonné* in the rest of the title directly introduces a notion of -structure that links back to the "Systême figuré des connoissances humaines" -[@blanchard2002] which schematic structure is shown in Figure -@fig:systeme_figure. The authors have devised a branching system to classify all -knowledge, and the occurrence at the beginning of articles, more than a tool to -clear up possible ambiguities also points the reader to the correct place in -this mind map. - -{#fig:systeme_figure} - -The situation regarding subject indicators is hardly better outside of the -module. 
The `<domain/>` element despite its name belongs exclusively in the -header of a document and focuses on the social context of the text, not on the -knowledge area it covers. The `<interp/>` despite its name is not so much about -labeling something as an interpretation to give to a context (which subject -indicators could be if you consider that, placed at the beginning, they are used -to direct the mind frame of the readers towards a particular subject). However, -the documentation clearly demonstrates it as a tool for annotators of a -document, which text content is not part of the original document but some -additional result of an analysis performed in the context of the encoding, used -only throughout references in XML attributes. - -This point, although not the most concerning, still remains the hardest to -address but all things considered the `<usg/>` element stands out as the most -relevant. - -### Discours scientifique - -Étudié sous l'angle des ALR par @ji_hal_01956323 - -## Diachronie - -### Diachronie - -@diwersy_ressources_2017 s'attache à montrer les difficultés rencontrées en -français sur la période XVIème -> XVIIème (graphie, ordre des mots, tokenization). - -@mayaffre_explorer_2019 montre un usage possible - -@mouhouche_etude_2014 application à la didactique et épistémologie. Étude de la -terminologie en physique, verbe résonner, de l'origine accoustique jusqu'à -l'application aux planètes pour véhiculer les notions d'accord et de transfert -d'énergie. Pas de textométrie mais une analyse qualitative d'occurrences. Voire -peut-être une référence à Gaston Bachelard sur la notion d'*obstacle verbal* -(Bachelard 1928). - -## Encodage XML-TEI - -### Module *dictionaries* - -The XML-TEI standard has a modular structure consisting of optional parts each -covering specific needs such as the physical features of a source document, the -transcription of oral corpora or particular requirements for textual domains -like poetry, or, in our case, dictionaries. 
After describing why the dedicated -module was a natural candidate to meet our needs, we formalise tools from -graph theory to browse the specifications of this standard in a rational way and -explore this module in detail. - -### A good starting point - -Data produced in the context of a project such as DISCO-LGE cannot be useful to -future scientific projects unless it is *interoperable* and *reusable*. These -are the two last key aspects of the FAIR[^FAIR] principles (*findability*, -*accessibility*, *interoperability* and *reusability*) which we strive to follow -as a guideline for efficient and quality research. It entails using standard -formats and a standard for encoding historical texts in the context of digital -humanities is XML-TEI, collectively developped by the *Text Encoding Initiative* -consortium which publishes a set of technical specifications under the form of -XML schemas, along with a range of tools to handle them and training resources. - -[^FAIR]: [https://www.go-fair.org/fair-principles/](https://www.go-fair.org/fair-principles/) - -The *dictionaries* module has been leveraged to encode dictionaries in projects -NENUFAR[^NENUFAR] and BASNUM[^BASNUM] to encode respectively the *Petit Larousse -Illustré* published by Pierre Larousse in 1905 [@bohbot2018], roughly -contemporary to our target encyclopedia and the *Dictionnaire Universel* by -Furetière, or rather its second edition edited by Henri Basnage de Beauval, an -encyclopedic dictionary from the very early 18^th^ century [@williams2017]. -These successes made it a good starting point for our own encoding but the -former does not have the encyclopedic dimension our corpus has and the latter is -a much older text which had a tremendous influence on the european encyclopedic -effort of the 18^th^ century but is not as clearly separated from the -dictionaric stem as *La Grande Encyclopédie* is. 
For these reasons, we could not -directly reuse the encoding schemes used in these projects and had to explore -the XML-TEI schema systematically to devise our own. - -[^NENUFAR]: [https://cahier.hypotheses.org/nenufar](https://cahier.hypotheses.org/nenufar) -[^BASNUM]: [https://anr.fr/Projet-ANR-18-CE38-0003](https://anr.fr/Projet-ANR-18-CE38-0003) - -The XML-TEI specification contains 590 elements, which are each documented on -the consortium's website in the online reference pages. With an average of -almost 80 possible child elements (79.91) within any given element, manually -browsing such an massive network can prove quite difficult as the number of -combinations sharply increases with each step. - -We transform the problem by representing this network as a directed graph, using -elements of XML-TEI as nodes and placing edges if the destination node may be -contained within the source node according to the schema. Please note that the -word "element" is here used with the same meaning as in the TEI documentation to -refer to the conceptual device characterised by a given tag name such as `p` or -`div` and not to a particular instance of them that may occur in a given -document. Figure @fig:dictionaries-subgraph, by using this transformation to -display the *dictionaries* module, hints at the overall complexity of the whole -specification. - -{#fig:dictionaries-subgraph} - -### Application à la lexicographie - -The previous section about the structure of the *dictionaries* module and the -features found in encyclopedias follows quite closely our own journey trying to -encode first manually then by automatic means the articles of our corpus. This -back and forth between trying to find patterns in the graph which reflects the patterns -found in the text and questioning the relevance of the results explains the -choice we ended up making but also the alternatives we have considered. 
- -#### Bend the semantics - -Several times, the issue of the semantics of some elements which posess the -properties we need came up. This is the case for instance of the `<sense/>` and -`<node/>` elements. It is very tempting to bend their documented semantics or to -consider that their inclusion properties is part of what defines them, and hence -justifies their ways in creative ways not directly recommended by the TEI -specifications. - -This is the approach followed by project BASNUM[^BASNUM]. In the articles -encoded for this project, `<note/>` elements are nested and used to structure -the encyclopedic developments that occur in the articles. - -We have chosen not to follow the same path in the name of the FAIR principles to -avoid the emergence of a custom usage differing from the documented one. - -#### Custom schema - -The other major reason behind our choice was the inclusion rules which exist -between TEI elements and pushed us to look for different combinations. Another -valid approach would have consisted in changing the structure of the inclusion -graph itself, that is to say modify the rules. If `<entry/>` is the perfect -element to encode article themselves, all that is really missing is the ability -to accomodate nested structures with the `<div/>` element. This would also have -the advantage of recovering the `<usg/>` and `<xr/>` elements which we have -recognised as useful and which we lose as part of the tradeoff to get nested -sections. Generating customised TEI schemas is made really easy with tools like -ROMA[^ROMA], which we used to preview our change and suggest it to the TEI -community. - -[^ROMA]: [https://roma.tei-c.org/](https://roma.tei-c.org/) - -Despite it not getting a wide adhesion, some suggested it could be used locally -within the scope of project DISCO-LGE. 
However we chose not to do so, partially -for the same reasons of interoperability as the previous scenario, but also for -reasons of sturdiness in front of future evolutions. Making sure the alternative -schema would remain useful entails to maintain it, regenerating it should the -schema format evolve, with the risk that the tools to edit it might change or -stop being maintained. - -## Traitement Automatique de la Langue - -### Étiquetage morpho-syntaxique - -### Classification - -#### Related work {#sec:relatedworks} - -Document classification is a general problem in text analysis. -Classification might mean assigning documents to a topic (infrastructure -or foreign policy), a type of content (news or advertisement), or a type -of author/speaker (Labor or Conservative). Text corpora similar to -encyclopedias include collections of political speeches (like *Hansard* -for the UK, the US Congressional Record, or the *Archives -parlementaires* for France). Here we survey existing literature that -classifies large historical text corpora using different methods. - -##### Classifying encylopedias - -In exploring methods for classifying *EDdA* articles, we follow in the -footsteps of the ARTFL project. In their 2009 paper Hornton et al -[@horton2009mining] tested Naive Bayesian classification on two -tasks: 1) classifying the originally unclassified articles and 2) -applying this model on the already classified articles to compare the -results. This second task also enabled them to explore which words were -most important for the classification result. While the paper did not -include a formal evaluation of the performance of the model, it did -offer an important close reading for a selection of the results. Later, -Roe et al [@roe2016discourses] used Latent Dirichlet Allocation (LDA) -topic modelling to analyze automatically-identified groups of articles, -and to compare these to the original classes. 
This research posited that -the LDA-identified topics could be understood as discourses that were -woven throughout *EDdA* and did not always neatly map onto original -classes. Our work is motivated by this earlier research. We aim to -establish a baseline for the classification task which can be improved -on in the future, and which can be compared when using different -classification metadata to fine-tune models (e.g. original classes, -ARTFL simplified *normclasses*, or ENCCRE domain ensembles. - -We also take inspiration from researchers working with other -encyclopedias. The Nineteenth-Century Knowledge project explored -rule-based and ML methods[^8] to index 400k articles across 4 editions -of the Encyclopedia Britannica [@grabus_representing_2019].[^9] Because -Britannica editors did not use the same article classes over time, -matching articles with Library of Congress Subject Headings enables -cross-edition comparison and therefore improved discovery. - -##### Classifying other texts - -Beyond encyclopedias, humanities research has largely used text -classification for subject or genre detection ("is this historical -fiction or biography?\") and author/group identification ("was this -speech given by a Labour or Conservative MP?\" -[@peterson_classification_2018]). - -The popularity of LDA topic modeling for assessing the content of large -text data is at least in part explained by the fact that it does not -require pre-existing metadata or new annotations describing documents or -document sections that can be used as training data: it is quicker to -implement. In her analysis of British parliamentary speeches (Hansard), -Guldi [@guldi_parliaments_2019] employs topic modeling to "critically -search\" for "tensions and turning points\" in political debates in the -UK. 
Baron et al [@barron_individuals_2018] use topic modeling as a -jumping off point from which to measure the "novelty\" and "transience\" -of speeches made during the first years of the French Revolution. This -is useful because, while the speeches are usually attributed to a -specific deputy and are dated, there is no other metadata about each -speech. - -Using both LDA and other ML models, Underwood examines the history and -instability of literary genre -[@underwood2018historical; @underwood_life_2016; @underwood2020machine] -and finds that computational methods are useful because they can -"register and compare blurry family resemblances that might be difficult -to define verbally without reductiveness\" (6) [@underwood_life_2016]. -Such a quantitative, predictive approach to text classification enables -computational humanities research to think through the results in a -different kind of interpretative environment. - -What does this all mean for encyclopedias written in eighteenth-century -France, and how does it impact our experiment design and interpretation? -First, we emphasize again that encyclopedia classes are, like genre, -culturally-constructed categories that change over time (even within the -volumes of one publication!). Second, our ability to recreate these -classes using models sheds light on the extent to which they hold fast -to certain linguistic features and points us to specific subsets of the -work that conform or do not conform to the predictions (e.g., by -evaluating true positives vs. false positives). - -##### Working in French - -Our research uses texts written in French with a smattering of other -languages (especially Latin and Greek) during the eighteenth century -[@bender2019rule]. We use some language-dependent methods on language -models pre-trained on French documents. For example, we use the French -version of FastText with CNN and LSTM experiment, but also multilingual -BERT and CamemBERT. 
It can no longer be said that French is a -low-resource language in Natural Language Processing, but lack of -linguistic diversity in NLP still plays a role in experiment design. -Perhaps even more important is the historical nature of our texts. We -therefore still face hurdles in model performance that do not exist when -one is working with short, modern, English texts -[@galina_russell_geographical_2014; @spence_towards_2021]. The -experiments below focus specifically on methods for French texts: in -expanding this research to enyclopedias in other languages, including -English, different considerations would necessarily be required. - -### Topic-modeling - -ÉCRIT, À PRENDRE DE DKE - -+ - -COMPLÉTER AVEC recherches sur Structural Topic-Modeling - -### NER - -À FAIRE - diff --git "a/\303\211dlA/Diachronie.md" "b/\303\211dlA/Diachronie.md" new file mode 100644 index 0000000000000000000000000000000000000000..3ea3ec36e6735100d9d24e25be1218373c146459 --- /dev/null +++ "b/\303\211dlA/Diachronie.md" @@ -0,0 +1,17 @@ +## Diachronie + +### Diachronie + +@diwersy_ressources_2017 s'attache à montrer les difficultés rencontrées en +français sur la période XVIème -> XVIIème (graphie, ordre des mots, tokenization). + +@mayaffre_explorer_2019 montre un usage possible + +@mouhouche_etude_2014 application à la didactique et épistémologie. Étude de la +terminologie en physique, verbe résonner, de l'origine accoustique jusqu'à +l'application aux planètes pour véhiculer les notions d'accord et de transfert +d'énergie. Pas de textométrie mais une analyse qualitative d'occurrences. Voire +peut-être une référence à Gaston Bachelard sur la notion d'*obstacle verbal* +(Bachelard 1928). 
+ + diff --git "a/\303\211dlA/Genre_textuel.md" "b/\303\211dlA/Genre_textuel.md" new file mode 100644 index 0000000000000000000000000000000000000000..f7fcba367071a0e17ec2f3d26211c0d8e509407e --- /dev/null +++ "b/\303\211dlA/Genre_textuel.md" @@ -0,0 +1,93 @@ +## Genre textuel + +### Saisir la notion de genre + +@beauvisage_2001 explore le genre policier ($\rightarrow$ à lire pour voir s'il +y a une caractérisation intéressante de la notion de «genre») + +### Le cas de la lexicographie + +Les dictionnaires entretiennent une relation étroite avec la notion de +collocations et de phraséologismes: les entrées sont d'autant plus utiles qu'elles tiennent compte des +phraséologismes existant dans la langue, des modèles de langue + +@zhu_discours_2022 s'intéresse à la structure propre aux dictionnaires qui met +en relation un terme et une définition. @loiseau_dictionnaires_2011 + +If encyclopedias are thus historically more recent than dictionaries they also +depart from the latter on their approach. The purpose of dictionaries from their +origin is to collect words, to make an exhaustive inventory of the terms used in +a domain or in a language in order to associate a *definition* to them, be it a +translation in another language for a foreign language dictionary or a phrase +explaining it for other dictionaries. As such, they are collections of *signs* +and remain within the linguistic level of things. Entries in a dictionary often +feature information such as the part of speech, the pronunciation or the +etymology of the word they define. + +The entry for "Dictionnaire" in the *Encyclopédie* distinguishes between three +types of dictionaries: one to define *words*, the second to define *facts* and +the last one to define *things*, corresponding to the distinction between +language, history, and science and arts dictionaries although according to its +author, d'Alembert, each has to be of more than just one kind to be really good. 
+In the full title of the *Encyclopédie*, the concept is more or less equated by +means of the coordinating conjunction "ou" to a *Dictionnaire raisonné*, +"reasoned dictionary", introducing the idea of encyclopedias as dictionaries +with additional structure and a philosophical dimension. + +Back in the "Encyclopédie" article, we read that a dictionary remaining strictly +at the language level, a vocabulary, can be seen as the empty frame required for +an encyclopedic dictionary that will fill it with additional depth. Given how +d'Alembert insists on the importance of brevity for a clear definition in the +"Dictionnaire de Langues" entry, it is clear that the *encyclopédistes* did not +consider encyclopedias superior to dictionaries but really as a new subgenre +departing from them in terms of purpose. + +The first immediately visible feature that sets encyclopedias apart from +dictionaries and can be found in the *Encyclopédie* as well as in *La Grande +Encyclopédie* is the presence of subject indicators at the beginning of articles +right after the headword which organise them into a domain classification +system. Those generally cover a broad range of subjects from scientific +disciplines to literature, extending to political subjects and law. + +No element in the *dictionaries* module is explicitly designed for the purpose +of encoding these indicators. As we have seen above, the element set is geared +towards the words themselves instead of the concepts they represent. The closest +tool for what we need is found in the `<usg/>` element used with a specific +`type` attribute set to `dom` for "domain". Indeed several examples from the +documentation encode subject indicators very similar to the ones found in +encyclopedias within this element, but the match is not perfect either: all +appear within one of multiple senses, as if to clarify each context in which the +word can be used, as expected from the element's name, "usage". 
In +encyclopedias, if the domain indicator does in certain cases help to distinguish +between several entries sharing the same headword, the concept itself has +evolved beyond this mere distinction. Looking back at the *Encyclopédie*, the +adjective *raisonné* in the rest of the title directly introduces a notion of +structure that links back to the "Systême figuré des connoissances humaines" +[@blanchard2002] whose schematic structure is shown in Figure +@fig:systeme_figure. The authors have devised a branching system to classify all +knowledge, and the indicator at the beginning of articles, more than a tool to +clear up possible ambiguities, also points the reader to the correct place in +this mind map. + +{#fig:systeme_figure} + +The situation regarding subject indicators is hardly better outside of the +module. The `<domain/>` element, despite its name, belongs exclusively in the +header of a document and focuses on the social context of the text, not on the +knowledge area it covers. The `<interp/>` element is not so much about +labeling something as an interpretation to give to a context (which subject +indicators could be if you consider that, placed at the beginning, they are used +to direct the readers' frame of mind towards a particular subject). However, +the documentation clearly demonstrates it as a tool for annotators of a +document, whose text content is not part of the original document but some +additional result of an analysis performed in the context of the encoding, used +only through references in XML attributes. + +This point, although not the most concerning, still remains the hardest to +address but all things considered the `<usg/>` element stands out as the most +relevant. 
+ +### Scientific discourse + +Studied from the perspective of recurring lexico-syntactic trees (ALR) by @ji_hal_01956323 + diff --git "a/\303\211dlA/G\303\251ographie.md" "b/\303\211dlA/G\303\251ographie.md" new file mode 100644 index 0000000000000000000000000000000000000000..d9c95467afcf0bc882893cd8053efb124e26c883 --- /dev/null +++ "b/\303\211dlA/G\303\251ographie.md" @@ -0,0 +1,4 @@ +## The place of geography + + + diff --git "a/\303\211dlA/TAL.md" "b/\303\211dlA/TAL.md" new file mode 100644 index 0000000000000000000000000000000000000000..840d76dfa866b9538e04431081ef21a249db8498 --- /dev/null +++ "b/\303\211dlA/TAL.md" @@ -0,0 +1,120 @@ +## Natural Language Processing + +### Morphosyntactic tagging + +### Classification + +#### Related work {#sec:relatedworks} + +Document classification is a general problem in text analysis. +Classification might mean assigning documents to a topic (infrastructure +or foreign policy), a type of content (news or advertisement), or a type +of author/speaker (Labour or Conservative). Text corpora similar to +encyclopedias include collections of political speeches (like *Hansard* +for the UK, the US Congressional Record, or the *Archives +parlementaires* for France). Here we survey existing literature that +classifies large historical text corpora using different methods. + +##### Classifying encyclopedias + +In exploring methods for classifying *EDdA* articles, we follow in the +footsteps of the ARTFL project. In their 2009 paper, Horton et al. +[@horton2009mining] tested Naive Bayesian classification on two +tasks: 1) classifying the originally unclassified articles and 2) +applying this model to the already classified articles to compare the +results. This second task also enabled them to explore which words were +most important for the classification result. While the paper did not +include a formal evaluation of the performance of the model, it did +offer an important close reading for a selection of the results.
Later, +Roe et al. [@roe2016discourses] used Latent Dirichlet Allocation (LDA) +topic modelling to analyze automatically identified groups of articles, +and to compare these to the original classes. This research posited that +the LDA-identified topics could be understood as discourses that were +woven throughout *EDdA* and did not always neatly map onto original +classes. Our work is motivated by this earlier research. We aim to +establish a baseline for the classification task which can be improved +on in the future, and which can be compared when using different +classification metadata to fine-tune models (e.g. original classes, +ARTFL simplified *normclasses*, or ENCCRE domain ensembles). + +We also take inspiration from researchers working with other +encyclopedias. The Nineteenth-Century Knowledge project explored +rule-based and ML methods[^8] to index 400k articles across 4 editions +of the Encyclopedia Britannica [@grabus_representing_2019].[^9] Because +Britannica editors did not use the same article classes over time, +matching articles with Library of Congress Subject Headings enables +cross-edition comparison and therefore improved discovery. + +##### Classifying other texts + +Beyond encyclopedias, humanities research has largely used text +classification for subject or genre detection ("is this historical +fiction or biography?") and author/group identification ("was this +speech given by a Labour or Conservative MP?" +[@peterson_classification_2018]). + +The popularity of LDA topic modeling for assessing the content of large +text data is at least in part explained by the fact that it does not +require pre-existing metadata or new annotations describing documents or +document sections that can be used as training data: it is quicker to +implement.
In her analysis of British parliamentary speeches (Hansard), +Guldi [@guldi_parliaments_2019] employs topic modeling to "critically +search" for "tensions and turning points" in political debates in the +UK. Barron et al. [@barron_individuals_2018] use topic modeling as a +jumping-off point from which to measure the "novelty" and "transience" +of speeches made during the first years of the French Revolution. This +is useful because, while the speeches are usually attributed to a +specific deputy and are dated, there is no other metadata about each +speech. + +Using both LDA and other ML models, Underwood examines the history and +instability of literary genre +[@underwood2018historical; @underwood_life_2016; @underwood2020machine] +and finds that computational methods are useful because they can +"register and compare blurry family resemblances that might be difficult +to define verbally without reductiveness" (6) [@underwood_life_2016]. +Such a quantitative, predictive approach to text classification enables +computational humanities research to think through the results in a +different kind of interpretative environment. + +What does this all mean for encyclopedias written in eighteenth-century +France, and how does it impact our experiment design and interpretation? +First, we emphasize again that encyclopedia classes are, like genre, +culturally constructed categories that change over time (even within the +volumes of one publication!). Second, our ability to recreate these +classes using models sheds light on the extent to which they hold fast +to certain linguistic features and points us to specific subsets of the +work that conform or do not conform to the predictions (e.g., by +evaluating true positives vs. false positives). + +##### Working in French + +Our research uses texts written during the eighteenth century in French, +with a smattering of other languages (especially Latin and Greek) +[@bender2019rule].
We use some language-dependent methods based on language +models pre-trained on French documents. For example, we use the French +version of FastText in our CNN and LSTM experiments, but also multilingual +BERT and CamemBERT. It can no longer be said that French is a +low-resource language in Natural Language Processing, but lack of +linguistic diversity in NLP still plays a role in experiment design. +Perhaps even more important is the historical nature of our texts. We +therefore still face hurdles in model performance that do not exist when +one is working with short, modern, English texts +[@galina_russell_geographical_2014; @spence_towards_2021]. The +experiments below focus specifically on methods for French texts: in +expanding this research to encyclopedias in other languages, including +English, different considerations would be required. + +### Topic-modeling + +WRITTEN, TO BE TAKEN FROM DKE + ++ + +TO BE COMPLETED WITH research on Structural Topic-Modeling + +### NER + +TODO + + diff --git "a/\303\211dlA/Textom\303\251trie.md" "b/\303\211dlA/Textom\303\251trie.md" new file mode 100644 index 0000000000000000000000000000000000000000..90f9592f0956814b5f570f83ecd70891cadd2211 --- /dev/null +++ "b/\303\211dlA/Textom\303\251trie.md" @@ -0,0 +1,50 @@ +## Textometry + +### Framework + +It originates in the "French School" of Benzécri [@benzecri__analyse_1973], squarely +on the mathematical and statistical side. Initially, it only deals with +raw words (forms); technology then makes it possible to process annotated text +(morphosyntax, then syntax), giving rise to corpus linguistics +[@nazarenko_hal_00619268]. + +Different statistical models of word distribution are used: chi-squared, +Poisson distribution. @lafon_sur_1980 proposes using a hypergeometric distribution +(a choice that carries over into the design of TXM [@heiden2010]).
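
The hypergeometric model mentioned above can be illustrated with a minimal sketch. It assumes a corpus of `T` tokens in which a word occurs `f` times, and a part of `t` tokens in which the word occurs `k` times. The function names and the toy figures are hypothetical and chosen for illustration only; this is not TXM's implementation.

```python
from math import comb


def hypergeom_pmf(k: int, T: int, t: int, f: int) -> float:
    """Probability of exactly k occurrences of a word in a part of t tokens,
    when the word occurs f times in a corpus of T tokens."""
    return comb(f, k) * comb(T - f, t - k) / comb(T, t)


def over_representation(k: int, T: int, t: int, f: int) -> float:
    """Probability of at least k occurrences in the part.
    A small value suggests the word is over-represented there."""
    return sum(hypergeom_pmf(i, T, t, f) for i in range(k, min(f, t) + 1))


# Toy, made-up figures: a 10-token corpus, a 5-token part,
# and a word seen 4 times overall, 3 of them in the part.
p = over_representation(3, T=10, t=5, f=4)
```

Exact binomial coefficients keep the sketch short, but they become costly on a corpus of several million tokens, where a log-space computation would probably be preferable.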
+ +The founding work deals with the use of annotated corpora by commenting on a +study of Mitterrand's speeches [@Labb1983FranoisM] \(a precursor of TXM's +*Vœux* corpus [@heiden2010]?), then with cross-cutting dimensions and contrastive +use in diachronic studies, and finally with the +construction of corpora themselves. The horizon at the time was one million words +(our parallel corpus: 8 million tokens). + +### Contrasts + +On corpus construction, @pincemin_heterogeneite_2012 warns that a corpus is +more than an agglomeration of texts, while mentioning a *WAC* approach +that favours volume over deliberate construction. Our study sits somewhere +between the two, I suppose? There is no room for non-encyclopedic texts +to contrast with, and we more or less take the articles we can retrieve, in the state in +which we can retrieve them. + +@laramee_production_2017 uses a contrastive approach to set the +volumes of the EDdA against one another and highlight the role of the different authors. + +### Recurring lexico-syntactic trees + +@nazarenko_hal_00619268 begins to mention "stereotypes" + +They are based on the notions of collocation @fellbaum_idioms_2007 and then of motif +@longree_les_2008 + +They are a recursive process and make it possible to abstract away from surface +realisations contingent on a given language @tutin_routines_2016 + +### Possibilities + +Turns of phrase can be tied to genres, which can be +revealed by a contrastive study @kraif_constructions_2016, +@gonon_phraseologismes_2020 similar to our objective.
+ + diff --git "a/\303\211dlA/XML-TEI.md" "b/\303\211dlA/XML-TEI.md" new file mode 100644 index 0000000000000000000000000000000000000000..96e25785f5505d5ae150e430352e8eefbe08d400 --- /dev/null +++ "b/\303\211dlA/XML-TEI.md" @@ -0,0 +1,111 @@ +## XML-TEI encoding + +### The *dictionaries* module + +The XML-TEI standard has a modular structure consisting of optional parts each +covering specific needs such as the physical features of a source document, the +transcription of oral corpora or particular requirements for textual domains +like poetry, or, in our case, dictionaries. After describing why the dedicated +module was a natural candidate to meet our needs, we formalise tools from +graph theory to browse the specifications of this standard in a rational way and +explore this module in detail. + +### A good starting point + +Data produced in the context of a project such as DISCO-LGE cannot be useful to +future scientific projects unless it is *interoperable* and *reusable*. These +are the last two key aspects of the FAIR[^FAIR] principles (*findability*, +*accessibility*, *interoperability* and *reusability*) which we strive to follow +as a guideline for efficient and quality research. This entails using standard +formats, and a standard for encoding historical texts in the context of digital +humanities is XML-TEI, collectively developed by the *Text Encoding Initiative* +consortium, which publishes a set of technical specifications in the form of +XML schemas, along with a range of tools to handle them and training resources.
+ +[^FAIR]: [https://www.go-fair.org/fair-principles/](https://www.go-fair.org/fair-principles/) + +The *dictionaries* module has been used in projects +NENUFAR[^NENUFAR] and BASNUM[^BASNUM] to encode, respectively, the *Petit Larousse +Illustré* published by Pierre Larousse in 1905 [@bohbot2018], roughly +contemporary with our target encyclopedia, and the *Dictionnaire Universel* by +Furetière, or rather its second edition edited by Henri Basnage de Beauval, an +encyclopedic dictionary from the very early 18^th^ century [@williams2017]. +These successes made it a good starting point for our own encoding, but the +former does not have the encyclopedic dimension our corpus has and the latter is +a much older text which had a tremendous influence on the European encyclopedic +effort of the 18^th^ century but is not as clearly separated from the +dictionaric stem as *La Grande Encyclopédie* is. For these reasons, we could not +directly reuse the encoding schemes used in these projects and had to explore +the XML-TEI schema systematically to devise our own. + +[^NENUFAR]: [https://cahier.hypotheses.org/nenufar](https://cahier.hypotheses.org/nenufar) +[^BASNUM]: [https://anr.fr/Projet-ANR-18-CE38-0003](https://anr.fr/Projet-ANR-18-CE38-0003) + +The XML-TEI specification contains 590 elements, which are each documented on +the consortium's website in the online reference pages. With an average of +almost 80 possible child elements (79.91) within any given element, manually +browsing such a massive network can prove quite difficult as the number of +combinations sharply increases with each step. + +We transform the problem by representing this network as a directed graph, using +elements of XML-TEI as nodes and placing an edge whenever the destination node may be +contained within the source node according to the schema.
Please note that the +word "element" is here used with the same meaning as in the TEI documentation to +refer to the conceptual device characterised by a given tag name such as `p` or +`div` and not to a particular instance of them that may occur in a given +document. Figure @fig:dictionaries-subgraph, by using this transformation to +display the *dictionaries* module, hints at the overall complexity of the whole +specification. + +{#fig:dictionaries-subgraph} + +### Application to lexicography + +The previous section about the structure of the *dictionaries* module and the +features found in encyclopedias follows quite closely our own journey trying to +encode first manually then by automatic means the articles of our corpus. This +back and forth between trying to find patterns in the graph which reflect the patterns +found in the text and questioning the relevance of the results explains the +choice we ended up making but also the alternatives we have considered. + +#### Bend the semantics + +Several times, the issue of the semantics of some elements which possess the +properties we need came up. This is the case for instance of the `<sense/>` and +`<note/>` elements. It is very tempting to bend their documented semantics or to +consider that their inclusion properties are part of what defines them, and hence +justify using them in creative ways not directly recommended by the TEI +specifications. + +This is the approach followed by project BASNUM[^BASNUM]. In the articles +encoded for this project, `<note/>` elements are nested and used to structure +the encyclopedic developments that occur in the articles. + +We have chosen not to follow the same path in the name of the FAIR principles to +avoid the emergence of a custom usage differing from the documented one. + +#### Custom schema + +The other major reason behind our choice was the inclusion rules which exist +between TEI elements and pushed us to look for different combinations.
Another +valid approach would have consisted in changing the structure of the inclusion +graph itself, that is to say modifying the rules. If `<entry/>` is the perfect +element to encode articles themselves, all that is really missing is the ability +to accommodate nested structures with the `<div/>` element. This would also have +the advantage of recovering the `<usg/>` and `<xr/>` elements which we have +recognised as useful and which we lose as part of the tradeoff to get nested +sections. Generating customised TEI schemas is made really easy with tools like +ROMA[^ROMA], which we used to preview our change and suggest it to the TEI +community. + +[^ROMA]: [https://roma.tei-c.org/](https://roma.tei-c.org/) + +Although the suggestion did not gain wide support, some suggested it could be used locally +within the scope of project DISCO-LGE. However we chose not to do so, partially +for the same reasons of interoperability as the previous scenario, but also for +reasons of robustness against future changes. Making sure the alternative +schema would remain useful entails maintaining it, regenerating it should the +schema format evolve, with the risk that the tools to edit it might change or +stop being maintained. + + diff --git "a/\303\211dlA/text.sh" "b/\303\211dlA/text.sh" new file mode 100755 index 0000000000000000000000000000000000000000..47c073f00b1ca4f6c3d73892a7112c4353a979fd --- /dev/null +++ "b/\303\211dlA/text.sh" @@ -0,0 +1,10 @@ +#!/bin/sh + +source ./chapter.sh "État de l'Art" + +cat ÉdlA/Textométrie.md +cat ÉdlA/Géographie.md +cat ÉdlA/Genre_textuel.md +cat ÉdlA/Diachronie.md +cat ÉdlA/XML-TEI.md +cat ÉdlA/TAL.md