Commit bd86a2db authored by Alice Brenon
Restructure repos to make it more modular, add filter to handle bibliography and a weird .SECONDEXPANSION trick to the Makefile
parent 0ad7d16f
# Bibliography
# Conclusion {-}
## Regrets
## Wishes
#!/bin/sh
source ./chapter.sh 'Conclusion {-}'
cat Conclusion/Regrets.md
cat Conclusion/Souhaits.md
## Statistics
### Centrality measure
(DKE)
# Contrastive studies
## Lexico-grammatical analysis (lexicometry, textometry, ?…)
### Internal contrasts
......@@ -19,11 +16,4 @@ Np vs. Nc
#### Preferred adjectives
## Phraseology, disciplinary discourse and Recurrent Lexico-Syntactic Trees
#!/bin/sh
source ./chapter.sh 'Études contrastives'
cat Contrastes/Lexicométrie.md
cat Contrastes/Phraséologie.md
cat Contrastes/Centralité.md
## Part-of-speech and syntactic annotation
### Tagset
We use the [tagset]() of the
[PRESTO](http://presto.ens-lyon.fr/) project.
Well, actually, Stanza works well too, with the
[UPOS](https://universaldependencies.org/docs/u/pos/) tags.
### Processing pipelines
- PRESTO
- Stanza
# Corpus preparation and enrichment
## Formats and states of the texts
### L'Encyclopédie
In common parlance, the terms "dictionaries" and "encyclopedias" are used as
near synonyms to refer to books compiling vast amounts of knowledge into lists
of definitions ordered alphabetically. Their similarity is even visible in the
way they are coordinated in the full title of the *Encyclopédie* which is
probably the most famous work of the genre and a symbol of the Age of
Enlightenment. If the word "encyclopedia" is nowadays part of our vocabulary, it
was much more unusual and in fact controversial when Diderot and d'Alembert
decided to use it in the title of their book.
The definition given by Furetière in his *Dictionnaire Universel* in 1690 is
still close to its Greek etymology: a "ring of all knowledge", from *κύκλος*,
"circle", and *παιδεία*, "knowledge". This meaning is the one used for instance
by Rabelais in *Pantagruel*, when he has Thaumaste declare that Panurge opened
to him "le vray puys et abisme de Encyclopedie" ("the true well and abyss of
Encyclopedia"). At the time the word still mostly refers to the abstract concept
of mastering all knowledge at once. Furetière adds that it is a quality one
is unlikely to possess, and even seems to condemn its pursuit as a form of
hubris: "C'est une témérité à un homme de vouloir posséder l'Encyclopédie"
("it is a recklessness for a man to want to possess Encyclopedia").
Beyond this moral reproach, the concept that pleased Rabelais was somewhat dated
at the end of the 17^th^ century and attacked in the
*Dictionnaire Universel François et Latin*, commonly referred to as the
*Dictionnaire de Trevoux*, as utterly "burlesque" ("parodic"). The entry for
"Encyclopédie" remained unchanged in the four editions issued between 1721 and
1752, mocking the use of the word and discouraging readers from pursuing it. To
that end, it quotes a poem by Pibrac encouraging people to specialise in
only one discipline lest they fail to reach perfection, based on an
argument that resembles the saying "Jack of all trades, master of none". It
is all the more interesting that the definition remains unaltered until 1752,
one year after the publication of the first volume of the *Encyclopédie*. The
Jesuits who edited the *Dictionnaire de Trevoux* frowned upon the project of the
*Encyclopédie*, which they managed to get banned the same year by the Council of
State on the charge of attempting to destroy royal authority, inspiring
rebellion and corrupting morality in general. There is much more at stake than
words here, but the attempt to deprecate the word itself is part of their fight
against the philosophers of the Enlightenment.
The attacks do not go unanswered by Diderot, who starts the very definition of
the word "Encyclopédie" in the *Encyclopédie* itself with a strong rebuttal. He
directly dismisses the concerns expressed in the *Dictionnaire de Trevoux* as
mere self-doubt that their authors should not generalise to everyone, then leaves
the main point to a Latin quote by chancellor Bacon [@lojkine2013], who argues
that a collaborative work can achieve much more than any talented man could:
what could possibly not be within reach of a single man, within a single
lifetime may be achieved by a common effort throughout generations.
History hints that Diderot's opponents took his defence of the feasibility of
the project quite seriously, considering the fact that they had the
*Encyclopédie*'s privileges revoked again six years after its publication
was resumed [@moureau2001]. As a consequence, the remaining ten volumes
containing the text of the articles had to be published illegally until 1765,
thanks to the secret protection of Malesherbes who, despite being head of royal
censorship, saved the manuscripts from destruction. They were printed secretly
outside of Paris and the books were (falsely) labeled as coming from Neufchâtel.
Following the high demand from the booksellers who feared they would lose the
money they had invested in the project, a special privilege was issued for the
volumes containing the plates, which were released publicly from 1762 to 1772.
In any case, in their last edition in 1771 the authors of the *Dictionnaire de
Trevoux* had no choice but to acknowledge the success of the encyclopedic
projects of the 18^th^ century. In this version, the definition
was entirely reworked, mildly stating that good encyclopedias are difficult to
make because of the amount of knowledge necessary and work needed to keep up
with scientific progress instead of calling the effort a parody. It credits
Chambers's *Cyclopædia* for being a decent attempt before referring anonymously
though quite explicitly to Diderot and d'Alembert's project by naming the
collective "Une Société de gens de Lettres" and writing that it started in 1751.
Even more importantly, two new entries were added after it: one for the
adjective "encyclopédique" and another one for the noun "encyclopédiste",
silently admitting how the project had changed its time and the relation to
knowledge itself.
#### Context of the work
#### Available versions
The ARTFL[^ARTFL] offers a version of it.
[^ARTFL]: [http://artfl-project.uchicago.edu/](http://artfl-project.uchicago.edu/)
#### Processing
### La Grande Encyclopédie
#### Context of the work
*La Grande Encyclopédie, Inventaire raisonné des Sciences, des Lettres et des
Arts par une Société de savants et de gens de lettres* (hereafter *LGE*) was
published in France between 1885 and 1902 by a team of more than two hundred
specialists organised into eleven sections. The text comprises 31 tomes of about
1200 pages each and was, according to @jacquet_pfau2015, the last major French
encyclopedic undertaking to follow in the footsteps of its prestigious
ancestor, the *EDdA*, published some 130 years earlier.
The full title of the work already shows its intent to claim a filiation with
the *EDdA* and to update it [@jacquet_pfau_actualiser_2022].
#### Available versions
A digital version of this work was produced by the BnF and put
online[^LGE-V1] in 2007. Based on an undated reprint of the original
edition, it comprises one image per page of the work, scanned in greyscale
at a resolution of 300x300 pixels per inch. From these image files, a partial
version of the text was derived by applying an optical character
recognition program ([@=OCR]). This version has a number of
limitations which prevented a comprehensive study of the text
by automatic means such as textometry.
[^LGE-V1]: [http://catalogue.bnf.fr/ark:/12148/cb377013071](http://catalogue.bnf.fr/ark:/12148/cb377013071)
First, the text is incomplete: although a large number of tomes have been
OCRised, some of them, for instance tomes 5 and 18, have not been processed and
no text is available for these volumes on the Gallica
website[^LGE-V1-liste-des-tomes]. This state of affairs makes an exhaustive
study impossible, but nothing suggests that a study based on the available tomes
would be subject to any particular bias, since the OCRised tomes do not seem to
have been chosen according to any specific logic: the missing ones are not, for
example, contiguous, nor are they at the beginning or at the end of the work.
Second, this "plain text" version (in fact an HTML page with minimal markup)
carries only very superficial annotation and, in particular, is not segmented
into articles. This is a major obstacle to the use we want to make of it, since
the unit of study of our corpus is the article, which makes it possible both to
carry out a contrastive study by grouping articles by domain of knowledge or by
author, and to observe the structure of the domains by comparing which articles
have been kept or not between two encyclopedias and, where they have, whether
the domain of knowledge associated with them is the same. Finally, errors in the
detection of the page layout ([@=OLR]) significantly obscure the text by
performing local permutations of its content. These sometimes mix pieces of
articles together, which makes the segmentation of the text into articles
considerably harder, and in all cases damage the structure of sentences, which
introduces errors in the later morphosyntactic and syntactic annotation phases
that we need to apply to the text in order to do textometry.
[^LGE-V1-liste-des-tomes]: [https://gallica.bnf.fr/ark:/12148/bpt6k246407#](https://gallica.bnf.fr/ark:/12148/bpt6k246407#)
In order to remedy these shortcomings, the CollEx Persée
DISCO-LGE[^DISCO-LGE] project set out to re-digitise this encyclopedia in
partnership with the BnF and to produce a version of it encoded in XML-TEI. This
new version was made from photographs of an original
copy[^LGE-V2] held at the Bibliothèque de l'Arsenal in Paris[^bib-arsenal].
[^DISCO-LGE]: [https://www.collexpersee.eu/projet/disco-lge/](https://www.collexpersee.eu/projet/disco-lge/)
[^LGE-V2]: [https://catalogue.bnf.fr/ark:/12148/cb41651490t](https://catalogue.bnf.fr/ark:/12148/cb41651490t)
[^bib-arsenal]: [https://www.bnf.fr/fr/arsenal](https://www.bnf.fr/fr/arsenal)
This project ended in August 2021 and resulted in the publication on Nakala[^nakala],
the data repository of the Très Grande Infrastructure de Recherche Huma-Num,
of a new version of the work in several formats.
[^nakala]: [https://nakala.fr/](https://nakala.fr/)
#### Encoding
##### Structure of the *dictionaries* module
**Definitions**
By iterating several times the operation of moving on that graph along one edge,
that is, by considering the transitive closure of the relation "be connected by
an edge" we define *inclusion paths* which allow us to explore which elements
may be nested under which other.
The nodes visited along the way represent the intermediate XML elements to
construct a valid XML tree according to the TEI schema. Given the top-down
semantics of those trees, we call the length of an inclusion path its *depth*.
The ability for an element to contain itself corresponds directly to loops on
the graph (that is an edge from a node to itself) as can be illustrated by the
`<abbr/>` element: an `<abbr/>` element (abbreviation) can directly contain
another one.
The generalisation of this to inclusion paths of any length greater than one is
usually called a cycle and we may be tempted in our context to refine this and
name them *inclusion cycles*. The `<address/>` element provides us with an
example for this configuration: although an `<address/>` element may not
directly contain another one, it may contain a `<geogName/>` which, in turn, may
contain a new `<address/>` element. From a graph theory perspective, we can say
that it admits an inclusion cycle of length two.
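
The two notions above, loops and inclusion cycles, can be sketched in a few
lines of Python. This is only an illustration under stated assumptions: the
`INCLUSIONS` dictionary is a hypothetical excerpt holding only the three edges
mentioned in the text (not the full TEI schema), and the function names
`has_loop` and `shortest_cycle_length` are ours.

```python
from collections import deque

# Hypothetical excerpt of the inclusion graph: an edge parent -> child means
# that `child` may appear directly inside `parent`. Only the edges mentioned
# in the text are listed here.
INCLUSIONS = {
    "abbr": {"abbr"},
    "address": {"geogName"},
    "geogName": {"address"},
}


def has_loop(graph, element):
    """True when `element` may directly contain itself."""
    return element in graph.get(element, set())


def shortest_cycle_length(graph, element):
    """Length of the shortest inclusion path from `element` back to itself.

    Returns None when no such path exists. A loop counts as a cycle of
    length 1. The search is a plain breadth-first traversal.
    """
    queue = deque((child, 1) for child in graph.get(element, ()))
    seen = set()
    while queue:
        node, depth = queue.popleft()
        if node == element:
            return depth
        if node in seen:
            continue
        seen.add(node)
        queue.extend((child, depth + 1) for child in graph.get(node, ()))
    return None
```

On this excerpt, `<abbr/>` has a loop while `<address/>` only reaches itself
again through `<geogName/>`, with a cycle of length two.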
**Applications**
Using classical, well-known methods such as Dijkstra's algorithm [@dijkstra59]
allows us to explore the shortest inclusion paths that exist between elements.
Particular caution is needed because there is no guarantee that
the shortest path is meaningful in general, but it at least provides us with an
efficient way to check whether a given element may or may not be nested at all
under another one and gives a lower bound on the length of the path to expect. Of
course the accuracy of this heuristic decreases as the length of the intended,
meaningful path between two nodes, the one that a human specialist of the TEI
framework would build, increases.
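
As a minimal sketch of such a shortest-path query, the Python function below
uses a breadth-first search rather than Dijkstra's algorithm itself: since all
edges of the inclusion graph have the same weight, both return paths of the same
length. The `INCLUSIONS` excerpt and the name `shortest_inclusion_path` are
assumptions made for the illustration; the edges are limited to a few
inclusions discussed in this section.

```python
from collections import deque

# Hypothetical excerpt of the inclusion graph (parent -> possible children),
# limited to a few edges discussed in this section.
INCLUSIONS = {
    "body": ["entry", "entryFree", "superEntry"],
    "superEntry": ["entry"],
    "entryFree": ["entry", "pos"],
    "entry": ["form", "gramGrp"],
    "form": ["pos"],
    "gramGrp": ["pos"],
}


def shortest_inclusion_path(graph, source, target):
    """Return one shortest inclusion path as a list of element names.

    Returns None when `target` can never be nested under `source`.
    """
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk back through the recorded parents to rebuild the path.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for child in graph.get(node, ()):
            if child not in parents:
                parents[child] = node
                queue.append(child)
    return None
```

A `None` result answers the nesting question negatively, and the number of
edges in a returned path (its length minus one) is the lower bound mentioned
above.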
This is still very useful when taking into account the fact that TEI modules are
merely "bags" to group the elements and provide hints to human encoders about
the tools they might need but have no implication on the inclusion paths between
elements which cross module boundaries freely. The general graph formalism
enables us to describe complex filtering patterns and to implement queries to
look for them among the elements exhaustively by algorithmic means even when the
shortest-path approach is not enough.
For instance, it lets one find that `<pos/>` may not be directly
included within `<entry/>` elements to carry information about the
part-of-speech of the word that an article defines: the correct way to do so is
through a `<form/>` or a `<gramGrp/>`.
On the other hand, trying to discover the shortest inclusion path to `<pos/>`
from the `<TEI/>` root of the document yields a path through `<standOff/>`, an
element dedicated to storing contextual data that accompanies but is not part of
the text, not unlike an annex, and largely unrelated to the context of encoding
an encyclopedia.
A last relevant example of the use of these methods can be given by querying the
shortest inclusion path of a `<pos/>` under the `<body/>` of the document: it
yields an inclusion directly through `<entryFree/>` (with an inclusion path of
length 2), which unlike `<entry/>` accepts it as a direct child node. This is
possibly not what we want, depending on the regularity of the articles we are
encoding and on whether other grammatical information such as `<case/>` or
`<gen/>` occurs and justifies the use of `<gramGrp/>`, but searching
exhaustively for paths up to length 3 returns as expected the paths through
`<entry/>`, among others. Overall, we get a good general idea: `<pos/>` does not
need to be nested very deep, it can appear quite near the "surface" of article
entries.
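
The exhaustive search for paths up to a given length can be sketched as a
bounded depth-first enumeration. As before this is an illustration only: the
function name `inclusion_paths` and the `INCLUSIONS` excerpt are hypothetical,
and the graph contains only the edges named in this section, so the counts it
produces say nothing about the real schema.

```python
# Hypothetical excerpt of the inclusion graph (parent -> possible children).
INCLUSIONS = {
    "body": ["entry", "entryFree", "superEntry"],
    "superEntry": ["entry"],
    "entryFree": ["entry", "pos"],
    "entry": ["form", "gramGrp"],
    "form": ["pos"],
    "gramGrp": ["pos"],
}


def inclusion_paths(graph, source, target, max_length):
    """Yield every inclusion path from `source` to `target`.

    Paths are tuples of element names with at most `max_length` edges.
    An element is never visited twice on the same path.
    """
    def walk(path):
        node = path[-1]
        if node == target:
            yield tuple(path)
            return
        if len(path) - 1 >= max_length:
            return
        for child in graph.get(node, ()):
            if child not in path:
                yield from walk(path + [child])

    yield from walk([source])


# Paths from <body/> to <pos/> with at most 3 edges, shortest first.
paths = sorted(inclusion_paths(INCLUSIONS, "body", "pos", 3), key=len)
```

On this excerpt, the path through `<entryFree/>` comes first, followed by the
two paths through `<entry/>` (via `<form/>` and via `<gramGrp/>`).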
##### Limitations
###### The `<entry/>` element
The central element of the *dictionaries* module is the `<entry/>` element meant
to encode one single entry in a dictionary, that is to say a head word
associated to its definition. It is the natural way in from the `<body/>`
element to the dictionary module: indeed, although `<body/>` may also contain
`<entryFree/>` or `<superEntry/>` elements, the former is a relaxed version of
`<entry/>` while the latter is a device to group several related entries
together. Both can contain an `<entry/>` directly while no obvious inclusion
exists the other way around: most (> 96.2%) of the inclusion paths of
"reasonable" depth in that direction (which we define as strictly less than 5,
that is twice the average shortest depth between any two nodes) include either
`<figure/>` or `<castList/>`, two very specific elements which should not need
to appear in an article in general, showing that the purpose of `<entry/>` is
not to contain an `<entryFree/>` or `<superEntry/>`. Hence, not only the
semantics conveyed by the documentation but also the structure of the elements
graph point to `<entry/>` as the natural top-most element for an article. This
somewhat contrived example is meant to further demonstrate the application of a
graph-centred approach to understand the inner workings of the XML-TEI schema.
###### Information about the headword itself
Once a block for an article is created, it may contain elements useful to
represent several of its features. Its written and spoken forms are usually
encoded by `<form/>` elements. Grammatical information like the `<case/>`,
`<gen/>` or `<number/>` and `<pers/>` can be contained within a `<gramGrp/>`,
along with information about the categories it belongs to like `<iType/>` for
its inflection class in languages with a declension system or `<pos/>` for its
part-of-speech. The `<etym/>` element is made to hold the etymology of an entry.
In the case when there are alternative spellings in varieties of the language or
if the spelling has changed over time, `<usg/>` can be used.
These examples are by no means an exhaustive list; the complete set provides
the encoder with a toolbox to describe all the information related to the form
the entry is found at and seems general enough to accommodate the structure of
any book indexing entries by words.
###### Cross-references
A common feature shared by dictionaries and encyclopedias is the ability to
connect entries together by using a word or short phrase as the link, referring
the reader to the related concept. These links are known as cross-references and
can appear either when the definition of a term is adjacent to another one or to
catch alternative spellings where some readers might expect to find the word and
redirect them to the form chosen as the reference. In XML-TEI, this is done with
the `<xr/>` element. It usually contains the whole phrase performing the
redirection, with an imperative locution like "please see […]".
The "active" part of the cross-reference, that is the very word within the
`<xr/>` that is considered to be the link or, to make a modern-day HTML
metaphor, the region that would be clickable, is represented by a `<ref/>`
element. Though it is not specific to the *dictionaries* module, we include it
in this description of the toolbox because it is particularly useful in the
context of dictionaries. This element may have a `target` attribute which points
to the other resource to be accessed by the interested reader.
###### Definitions
The remaining part of entries is also usually the largest and represents the
content associated to the headword by the entry. In a dictionary, that is its
meaning.
The `<sense/>` element is a valid child for `<entry/>` and groups together a
definition of the term with `<def/>`, usage examples with `<usg/>` (another use
of this versatile element) and other high-level information such as translations
in other languages. Both `<def/>` and `<usg/>` elements may appear directly
under the `<entry/>`.
###### Structural remarks
Before concluding this description of the *dictionaries* module from the
perspective of someone trying to concretely encode a particular dictionary or
encyclopedia, we make use of the graph approach again to highlight some of its
aspects in terms of inclusion structure.
First, it is remarkable that all elements in the *dictionaries* module have a
cyclic inclusion path, that is to say, there is an inclusion path from each
element of this module to itself. Although having such a cycle is a widespread
property in the remainder of XML-TEI elements shared by 73.8% of them (411 out
of the 557 elements in the other modules), all 33 elements of the *dictionaries*
module having one is far above this average. In addition, the cycles appear to
be rather short, with an average length of 2.00 versus 2.50 in the rest of the
population. This observation is all the more surprising considering the fact
that the *dictionaries* module contains short "leaf" elements like `<pos/>`
which should not obviously need to admit cycles since one rather expects them to
contain only one word, like `<pos>adj</pos>` in the example given in the
official documentation. Among those (shortest) cycles, 20 include the `<cit/>`
element made to group quotations with a bibliographic reference to their source
which should clearly be unnecessary to encode an article in the general case.
Secondly, although we have seen examples of connections from this module to the
rest of the XML-TEI, especially to the *core* module (see the case of the
`<ref/>` element above), the *dictionaries* module appears somewhat isolated
from important structural elements like `<head/>` or `<div/>`. Indeed, computing
all the paths of length less than or equal to 5 from either `<entry/>` or
`<sense/>` elements to the latter by a systematic traversal of the graph yields
exclusively paths (respectively 9042 and 39093 of them) containing either a
`<floatingText/>` or an `<app/>` element. The first one, as its name aptly
suggests, is used to encode text that does not quite fit the regular flow of the
document, as for example in the context of an embedded narrative. Both examples
displayed in the online documentation feature a `<body/>` as direct child of
`<floatingText/>`, neatly separating its content as independent. The purpose of
the second one, although its name — short for apparatus — is less clear, is to
wrap together several versions of the same excerpts, for instance when there are
several possible readings of an unclear group of words in a manuscript, or when
the encoder is trying to compile a single version of a piece of work from
several sources which disagree over some passage. In both cases, it appears
obvious that it is not something that is expected to occur naturally in the
course of an article in general.
Thus, despite a rather dense internal connectivity, the *dictionaries* module
fails to provide encoders with a device to represent recursively nesting
structures like `<div/>`.
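
The kind of check behind this observation (do all bounded paths between two
elements go through some given elements?) can be sketched as follows. The
`TOY_GRAPH` below is purely hypothetical: apart from `<floatingText/>`
containing `<body/>`, `<body/>` containing `<div/>` and `<entry/>` containing
`<sense/>`, its edges were chosen to mirror the situation described in the text
and have not been checked against the TEI schema. The function names are ours as
well.

```python
# Toy graph (parent -> possible children). Edges marked "assumed" are
# illustrative shortcuts, NOT verified against the TEI schema.
TOY_GRAPH = {
    "entry": ["sense", "cit"],        # entry -> cit: assumed
    "sense": ["cit"],                 # assumed
    "cit": ["floatingText", "app"],   # assumed
    "floatingText": ["body"],
    "app": ["rdg"],                   # assumed
    "rdg": ["div"],                   # assumed
    "body": ["div"],
}


def paths_up_to(graph, source, target, max_length):
    """Yield all paths (as tuples) from source to target with at most
    `max_length` edges, never visiting an element twice on a path."""
    stack = [(source,)]
    while stack:
        path = stack.pop()
        if path[-1] == target:
            yield path
            continue
        if len(path) - 1 < max_length:
            for child in graph.get(path[-1], ()):
                if child not in path:
                    stack.append(path + (child,))


def all_paths_cross(graph, source, target, max_length, markers):
    """True when at least one bounded path exists and every such path
    contains one of the `markers` elements."""
    paths = list(paths_up_to(graph, source, target, max_length))
    return bool(paths) and all(set(path) & set(markers) for path in paths)
```

With this toy graph,
`all_paths_cross(TOY_GRAPH, "entry", "div", 5, {"floatingText", "app"})` holds,
which is the shape of the result reported above for the real graph.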
The situation regarding subject indicators is hardly better outside of the
module. The `<domain/>` element despite its name belongs exclusively in the
header of a document and focuses on the social context of the text, not on the
knowledge area it covers. The name of the `<interp/>` element suggests that it
labels something as an interpretation to give to a context, which subject
indicators could be if one considers that, placed at the beginning of an
article, they direct the readers' frame of mind towards a particular subject.
However, the documentation clearly presents it as a tool for the annotators of a
document: its text content is not part of the original document but an
additional result of an analysis performed in the context of the encoding, used
only through references in XML attributes.
This point, although not the most concerning, still remains the hardest to
address but all things considered the `<usg/>` element stands out as the most
relevant.
###### The notion of meaning
Notwithstanding the correct way to represent domains of knowledge, their extent
itself raises concerns regarding the *dictionaries* module. Indeed, among the
vast collection of domains covered in encyclopedias in general and in *La Grande
Encyclopédie* in particular are historical articles and biographies. If the
notion of meaning can appear at least ill-fitting for a text describing a series
of historical events, one may still argue that it groups them into a concept and
associates it to the name of the event. But when it comes to relating the life
of a person, describing their relation to events and other persons comes out
even further from the notion of meaning. Entries such as the one about SANJO
Sanetomi (see Figure @fig:sanjo) do not constitute a *definition*.
![Beginning of the article relating the life of SANJO Sanetomi, in La Grande Encyclopédie, tome 29](figure/article/LGE/sanjo_t29.png){#fig:sanjo}
Moreover, encyclopedias, because of all that they have inherited from the
philosophical Enlightenment, are not only spaces designed to assert, they also
intrinsically include an interrogative component. Some articles lay down the
basis required to understand the complexity of an issue and invite the reader to
consider it without providing a definitive answer, going as far as to explicitly
use question marks as in the article "Action" displayed in Figure @fig:action.
![Excerpt from article "Action", in La Grande Encyclopédie, tome 1](figure/article/LGE/action_t1.png){#fig:action}
In this extract, the author devises a hypothetical situation to illustrate how
difficult it is to draw the line between two supposedly mutually exclusive
subcategories of legal actions. The whole point of the passage is to convey the
idea that the term eludes definition: wrapping it in a `<sense/>` or, worse, a
`<def/>` element would be an utter misnomer.
As a result, the use of `<sense/>` and `<def/>` is not appropriate for
encyclopedic content in general.
###### Nested structures
The final difficulty can be considered as a partial consequence of the previous
one on the structure of articles. The difficulty of defining complex concepts is
the very reason why authors approach their subjects from various angles,
circling around them as a best approximation. This strategy favours long,
structured developments with sections and subsections covering the multiple
aspects of the topic: from a historical, political, scientific point of view…
The longest articles, such as article "Europe" shown in Figure @fig:europe, can
thus span several dozens of pages. They can contain substructures with titles on
at least three levels (for instance, a `a)` under a `1)` under a `I.`), each of
which is in turn generally developed over several paragraphs.
![La Grande Encyclopédie, tome 16, article "Europe", spanning from p.782 to p.846, that is 64 pages, and ending after a bibliography longer than one column of text](figure/article/LGE/europe_t16.png){#fig:europe}
The nested structure that we have just highlighted of course demands a nesting
structure to accommodate it. More precisely, it guides our search for XML
elements by giving us several constraints. We are looking for a pair of
elements. The first, representing a (sub)section, must be able to include both
itself and the second element. The second has no special constraint except that
its semantics must be compatible with our purpose of using it to represent
section titles. In addition, the first element must be able to contain several
`<p/>` elements, `<p/>` being the reference element to encode paragraphs
according to the XML-TEI documentation.
We have seen that the *dictionaries* module was equipped with a questionable but
possible element for subject domains. However, it does not include any element
for section titles. In the rest of the TEI specification, the elements `<head/>`
and `<title/>` (the latter with the possibility to set its `type` attribute to
`sub`) stand out as the best candidates for the semantics condition on the
second element.
##### Choices
###### Candidates in the *dictionaries* module
Filtering the content of the module to keep only the elements which can at the
same time contain themselves, be included under `<entry/>` and include a `<p/>`
and either the `<head/>` or `<title/>` elements yields absolutely no candidates.
It is remarkable that even replacing the `<entry/>` element for the root of each
article with an `<entryFree/>`, an element supposed to relax some constraints to
accommodate more unusual structures in dictionaries, does not bring any
improvement.
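
The filter described above can be written as a simple predicate over the
inclusion graph. The sketch below is an assumption-laden illustration: the
`TOY_GRAPH` excerpt is hypothetical and deliberately tiny (it lists only a few
children per element and is not the real schema), and `section_candidates` is
our own name for the query.

```python
# Toy excerpt of the inclusion graph (parent -> set of direct children).
# It is NOT the real TEI schema: only a few children are listed per element.
TOY_GRAPH = {
    "body": {"div", "entry", "entryFree", "superEntry"},
    "entry": {"form", "gramGrp", "sense", "def", "usg"},
    "sense": {"sense", "def", "usg"},  # nested senses: assumed for the example
    "div": {"div", "head", "p"},
}


def section_candidates(graph, root, paragraph="p", titles=("head", "title")):
    """Elements usable as recursive sections directly under `root`.

    A candidate must contain itself, be a direct child of `root`, and
    directly contain both a paragraph and one of the title elements.
    """
    found = []
    for element, children in graph.items():
        if (
            element in children
            and element in graph.get(root, ())
            and paragraph in children
            and any(title in children for title in titles)
        ):
            found.append(element)
    return sorted(found)
```

On this toy graph the query returns nothing under `entry`, whereas asking the
same question under `body` returns `div`.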
The lack of results from these simple queries forces us to somewhat relax the
constraints on the encoding we are willing to use. We can for instance make the
assumption that an intermediate element could be needed between
the element wrapping the whole article and the recursing one used to encode each
section. This "section" element could also need a companion element to be able
to include itself, or, to formalise it in terms of graph theory, we could relax
the condition that this element admits a loop and accept instead cycles of a
given length (small, since this still needs to represent a fairly direct
inclusion). We simultaneously extend the maximum depth of the inclusion paths we
are looking for between `<entry/>`, the pair of elements and the `<p/>` element.
By setting this depth to 3, that is, by accepting one intermediate element to
occur in the middle of each one of the inclusion paths that define the structure
required to encode encyclopedic discourse, we find 21 elements but none of them
stand out as an obvious good solution: all paths to include the `<p/>` element
from any *dictionaries* element contain either a `<figure/>` (which we have
encountered earlier when we were practising our graph approach to search for
inclusions between `<entry/>` and `<entryFree/>` and dismissed as not useful in
general), a `<stage/>` (reserved to stage directions in dramatic works) or a
`<state/>` (used to describe a temporary quality in a person or place), again
not even close to what we want. The paths to either `<head/>` or `<title/>` are
similarly disappointing. Again, changing `<entry/>` for `<entryFree/>` returns
the exact same candidates. If that is not a thorough proof that none of these
elements could fulfil our purpose, it is a fact that no element in this module
appears as an obvious good solution and a serious hint to keep looking somewhere
else.
###### Widening the search
We hence widen our search to include elements outside the *dictionaries* module
which could be used to encode our sections and subsections, under the same
constraint as before to try and find a composite solution that would remain
under the `<entry/>` element even if resorting to subcomponents outside of the
dedicated module. Only three elements are returned: `<figure/>`, `<metamark/>`
and `<note/>`.
The first one as we have repeatedly underlined is meant for graphic information
and is not suitable for text content in general.
The purpose of `<metamark/>` is to transcribe the edition marks that may appear
on a particular primary source in order to alter the normal flow of the text and
suggest an alternative reading (deletion, insertion, reordering, this is about a
human editing the text from a given physical copy of it), but it is
unfortunately of no use to encode a section of an article.
The first element that might at least resemble what we are looking for is the
last one, `<note/>`. It is meant to contain text, is about explaining something
and seems general enough (not specific to a given genre, or to the occurrence of
a particular object on the page). Unfortunately, its semantics still seems a bit
off compared to our need. The documentation describes it as an "additional
comment" which appears "out of the main textual stream" whereas the long
developments in articles are the very matter of the text of encyclopedias, not
mere remarks in the margins or at the foot of pages.
##### Implementation
The above remarks explain why the *dictionaries* module is unable to represent
encyclopedias, where the notion of "meaning" is less central than in
dictionaries and where discourse with nested structures of arbitrary depth can
occur. Even composite encodings using elements outside of the *dictionaries*
module under an `<entry/>` element do not meet our requirements. Since the
*core* module of course accommodates these structures by means of the `<div/>`,
`<head/>` and `<p/>` elements, which have the additional advantage of carrying
less semantic payload than `<sense/>` or `<def/>`, we devise an encoding scheme
based on them, which we recommend to other projects aiming at representing
encyclopedias.
To remain consistent with the above remarks we will only concern ourselves with
what happens at the level of each article, right under the `<body/>` element.
Everything related to metadata happens as expected in the file's `<teiHeader/>`
which is well-enough equipped to handle them. In order to present our scheme
throughout the following section we will be progressively encoding a reference
article, "Cathète" from tome 9 reproduced in Figure @fig:cathete-photo.
![La Grande Encyclopédie, tome 9, article "Cathète"](figure/article/LGE/cathète_t9.png){#fig:cathete-photo}
###### The scheme
Remaining within the *core* module for the structure, almost all useful elements
are available and our encoding scheme merely quotes the official documentation.
Each article is represented by a `<div/>`. We suggest setting an `xml:id`
attribute on it, whose value is derived from the head word of the entry:
normalised to lowercase, with spaces stripped and all non-alphanumerical
characters replaced by a dash (`'-'`) to avoid issues with the XML encoding. If
the head word is not unique in the whole corpus, it is made so by suffixing a
number representing its rank among the various occurrences; for the sake of
regularity, this suffix is added even when there is only one occurrence. Figure
@fig:cathete-xml-0 illustrates this choice for the container element on the
article "Cathète" previously displayed.
![The container `div` element for article "Cathète"](figure/article/LGE/cathète_0.png){#fig:cathete-xml-0}
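The normalisation described above can be sketched in a few lines of Python. This is only an illustration under assumptions of ours: the function name `article_id`, the dash separating the head word from its rank, and the choice to keep accented letters (which Python's `str.isalnum` treats as alphanumerical) are not prescribed by the scheme.

```python
def article_id(headword: str, rank: int) -> str:
    """Build a candidate `xml:id` value from a head word and its rank."""
    # Normalise to lowercase and strip every whitespace character.
    compact = "".join(headword.lower().split())
    # Replace anything that is not a letter or a digit by a dash.
    dashed = "".join(c if c.isalnum() else "-" for c in compact)
    # Assumption: the rank is appended after a dash, even for unique head words.
    return f"{dashed}-{rank}"


print(article_id("Cathète", 0))  # cathète-0
```

Since an XML identifier cannot start with a digit, head words beginning with a number would need an additional prefix, which this sketch does not handle.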
Inside this element should be a `<head/>` enclosing the headword of the article.
The usual sub-`<hi/>` elements are available within `<head/>` if the headword is
highlighted by any special typographic means such as bold, small capitals, etc.
The one disappointment of the encoding scheme we are defining in this chapter is
the lack of support for a proper way to encode subject indicators.
The best candidate we have found so far was `<usg/>` from the *dictionaries*
module but it is not available directly under a `<head/>` element. All inclusion
paths from the latter to the former of length less than or equal to 3 contain
irrelevant elements (`<cit/>`, `<figure/>`, `<castList/>` and `<nym/>`) so it
must be discarded. The next best elements appear to be `<term/>` (not very
accurate) and `<rs/>` ("referring string", quite a general semantics but a
possible match — subject indicators refer to a given domain of knowledge —
although all the examples in the documentation refer to concrete persons,
places or objects, not to abstract objects such as mathematics or poetry).
For this reason, we do not recommend any special encoding of the subject
indicator but leave it open to each particular context. Subject indicators are
often abbreviated, so an `<abbr/>` may apply. In *La Grande Encyclopédie*,
biographies are not labeled by a knowledge domain but usually include the first
name of the person when it is known, so in that case an element like
`<persName/>` is appropriate. This choice applied to the same article "Cathète"
produces Figure @fig:cathete-xml-1.
![Encoding the head word of article "Cathète"](figure/article/LGE/cathète_1.png){#fig:cathete-xml-1}
We then propose to wrap each different meaning in a separate `<div/>` with the
`type` attribute set to `sense`, in reference to the `<sense/>` element that
would have been used within the *dictionaries* module. The `<div/>`s should be
numbered with the `n` attribute according to the order in which they appear,
starting from `0`, as shown in Figure @fig:cathete-xml-2.
![The empty structure for the only meaning of the word "Cathète"](figure/article/LGE/cathète_2.png){#fig:cathete-xml-2}
In addition, each line within the article must start with a `<lb/>` to mark its
beginning, including before the `<head/>` element, as demonstrated by Figure
@fig:cathete-xml-3. Although this setup may seem surprising, it underlines the
fact that in the dense layout of encyclopedias, the carriage return separating
two articles is meaningful. Stating each new line explicitly keeps enough
information to reconstruct a faithful facsimile, but it also has the advantage
of highlighting the fact that even though the definition is cut from the
headword by being in a separate XML element, they still occur on the same line,
which is a typographic choice usually made both in encyclopedias and
dictionaries where space is at a premium.
To complete the structure, the various sections and subsections occurring
within the article body may be nested as usual with `<div/>` and sub-`<div/>`s,
each of which can be titled with its own local `<head/>` element and filled
with `<p/>` elements for paragraphs.
![A complete encoding of article "Cathète"](figure/article/LGE/cathète_3.png){#fig:cathete-xml-3}
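To make the nesting described above concrete, the following sketch builds the skeleton of an article with Python's standard `xml.etree.ElementTree` module. The function name `article_skeleton`, its parameters and the placeholder text are ours, and the line breaks inside paragraphs are left out for brevity; it is an illustration of the scheme, not the output of any existing tool.

```python
import xml.etree.ElementTree as ET

# Namespace bound to the predefined `xml:` prefix, needed to set `xml:id`.
XML_NS = "http://www.w3.org/XML/1998/namespace"


def article_skeleton(article_id, headword, senses):
    """Return the <div/> of an article; `senses` is a list of lists of paragraphs."""
    article = ET.Element("div", {f"{{{XML_NS}}}id": article_id})
    ET.SubElement(article, "lb")  # the line on which the article starts
    ET.SubElement(article, "head").text = headword
    for n, paragraphs in enumerate(senses):  # numbering starts from 0
        sense = ET.SubElement(article, "div", {"type": "sense", "n": str(n)})
        for text in paragraphs:
            ET.SubElement(sense, "p").text = text
    return article


# Placeholder content, not the actual text of the article.
div = article_skeleton("cathète-0", "CATHÈTE", [["(text of the definition)"]])
print(ET.tostring(div, encoding="unicode"))
```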
Some articles such as "Boumerang" have figures with captions, as illustrated by
Figure @fig:boumerang-photo, which should be encoded the standard way by
`<figure/>` and `<figDesc/>` as in Figure @fig:boumerang-xml.
![La Grande Encyclopédie, tome 7, article "Boumerang"](figure/article/LGE/boumerang_t7.png){height=300px #fig:boumerang-photo}
![Encoding the figure in article "Boumerang" and its captions](figure/article/LGE/boumerang.png){#fig:boumerang-xml}
Another issue arising from giving up on `<entry/>` is the unavailability of the
`<xr/>` element, not allowed under any of the *core* elements we use but which
is useful to represent cross-references occurring in encyclopedias as well as in
dictionaries, for example in article "Gelocus" (see Figure @fig:gelocus-photo).
We prefer to use the `<ref/>` element instead which is available in the context
of a `<p/>`. Its `target` attribute should be set to the `xml:id` of the
article it points to, prefixed with a `'#'` as shown in Figure @fig:gelocus-xml.
Another solution would have been to introduce a `<dictScrap/>` element for the
sole purpose of placing an `<xr/>` but we advocate against it on account of the
verbosity it would add to the encoding and the fact that it implicitly suggests
that the previous context was not the one of a dictionary.
![La Grande Encyclopédie, tome 18, article "Gelocus"](figure/article/LGE/gelocus_t18.png){#fig:gelocus-photo}
![Encoding the cross-references in article "Gelocus"](figure/article/LGE/gelocus.png){#fig:gelocus-xml}
A typical page of an encyclopedia also features peritext elements, giving
information to the reader about the current page number along with the headwords
of the first and last articles appearing on the page. Those can be encoded by
`<fw/>` elements ("forme work") whose `place` and `type` attributes should be
set to position them on the page and identify their function if it has been
recognised (those short elements on the border of pages are the ones typically
prone to suffer damage or be misread by the OCR).
Finally there are other TEI elements useful to represent "events" in the flow of
the text, like the beginning of a new column of text or of a new page. Figure
@fig:alcala-photo shows the top left of the last page of the first tome of *La
Grande Encyclopédie* which features peritext elements while marking the
beginning of a new page. The usual appropriate elements (`<pb/>` for page
beginning, `<cb/>` for column beginning) may and should be used with our
encoding scheme as demonstrated by Figure @fig:alcala-xml.
![La Grande Encyclopédie, tome 1, article "Alcala-de-Hénarès"](figure/article/LGE/last_page_top_left_t1.png){width=350px #fig:alcala-photo}
![Encoding the beginning of a page in article "Alcala-de-Hénarès"](figure/article/LGE/alcala.png){#fig:alcala-xml}
###### Currently implemented
The reference implementation for this encoding scheme is the program
soprano[^soprano] developed within the scope of project DISCO-LGE to
automatically identify individual articles in the flow of raw text from the
columns and to encode them into XML-TEI files. Though this software has already
been used to produce the first TEI version of *La Grande Encyclopédie*, it does
not yet follow the above specification perfectly. Figure
@fig:cathete-xml-current shows the encoded version of article "Cathète" it
currently produces:
[^soprano]: [https://gitlab.huma-num.fr/disco-lge/soprano](https://gitlab.huma-num.fr/disco-lge/soprano)
![The current encoding of article "Cathète" produced by `soprano`](figure/article/LGE/cathète_current.png){#fig:cathete-xml-current}
The headword detection system is not able to capture the subject indicators yet
so they appear outside of the `<head/>` element. No work is performed either to
expand abbreviations and encode them as such, or to distinguish between domain
and people names.
Likewise, since the detection of titles at the beginning of each section is not
complete, no structure analysis can be performed at the moment on the textual
development inside the article and it is left unstructured, directly under the
entry's `<div/>` element instead of under a set of nested `<div/>` elements. The
paragraphs are not yet identified and for this reason not encoded.
However, the figures and their captions are already handled correctly when they
occur. The encoder also keeps track of the current lines, pages, and columns and
inserts the corresponding empty elements (`<lb/>`, `<pb/>` or `<cb/>`). It
numbers pages so that the numbering corresponding to the physical pages is
available, as opposed to the "high-level" page numbers inserted by the editors.
The latter start with an offset because the first, blank or almost empty pages
at the beginning of each book do not have a number, and they sometimes have gaps
when a full-page geographical map is inserted, since those are printed
separately on a different folio which remains outside of the textual numbering
system. The place at which these layout-related elements occur is determined by
the place where the OCR software detected them and by the reordering performed
by `soprano` when inferring the reading order before segmenting the articles.
###### The constraints of automated processing
Encyclopedias are particularly long books, spanning numerous tomes and
containing several tens of thousands of articles. The *Encyclopédie* comprises
over 74k articles and *La Grande Encyclopédie* certainly more than 100k (the latest
version produced by `soprano` created 160k articles, but their segmentation is
still not perfect: while some article beginnings remain undetected, all the very
long and deeply-structured articles are unduly split into many parts, resulting
globally in an overestimation of the total number).
XML-TEI is a very broad tool useful for very different applications. Some
elements like `<unclear/>` or `<factuality/>` can encode subtle semantic
information (for the second one, adjacent to a notion as elusive as truth)
which requires a very deep understanding of a text in its entirety and about
which even some human experts may disagree.
For these reasons, a central concern in the design of our encoding scheme was to
remain within the boundaries of information that can be described objectively
and extracted automatically by an algorithm. Most of the tags presented above
contain information about the positions of the elements or their relation to one
another. Those with an additional semantic implication like `<head/>` can be
inferred simply from their position and the frequent use of a special typography
like bold or upper-case characters.
The case of cross-references is particular and may appear as a counter-example
to the main principle on which our scheme is based. Actually, the process of
linking from an article to another one is so frequent (in dictionaries as well
as in encyclopedias) that it generally escapes the scope of regular discourse to
take a special and often fixed form, inside parentheses and after a special
token which invites the reader to perform the redirection. In *La Grande
Encyclopédie*, virtually all the redirections (to the extent of our knowledge,
all of them: special cases may of course exist, but they are rare enough that we
have not found any yet) appear within parentheses and start with the verb "voir"
abbreviated as a single, capital "V." as illustrated above in the article
"Gelocus" (see again Figure @fig:gelocus-photo).
Although this has not been implemented yet either, we hope to be able to detect
and exploit those patterns to correctly encode cross-references. Getting the
`target` attributes right is certainly more difficult to achieve and may require
processing the articles in several steps, to first discover all the existing
headwords — and hence article IDs — before trying to match the words following
"V." with them. Since our automated encoder handles tomes separately and since
references may cross the boundaries of tomes, it cannot wait for the target of a
cross-reference to be discovered by keeping the articles in memory before
outputting them.
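The two-step approach outlined above could look like the following sketch. Everything in it is hypothetical: the regular expression assumes redirections of the exact form "(V. Headword)", the sample sentence is made up rather than quoted from the encyclopedia, and `known_ids` stands for the result of a first pass that would have collected head words and their IDs.

```python
import re

# Assumption: a redirection looks like "(V. Headword)"; real articles may differ.
CROSS_REFERENCE = re.compile(r"\(V\.\s+([^)]+)\)")


def find_targets(text):
    """Return the words following "V." in each parenthesised redirection."""
    return [match.group(1).strip() for match in CROSS_REFERENCE.finditer(text)]


def resolve(targets, known_ids):
    """Map each target to a `target` attribute value, or None when unknown."""
    return {t: ("#" + known_ids[t.lower()] if t.lower() in known_ids else None)
            for t in targets}


# First pass (hypothetical result): head words discovered so far and their IDs.
known_ids = {"cerf": "cerf-0"}
# Second pass on a made-up sentence.
targets = find_targets("Genre de mammifères fossiles (V. Cerf) et (V. Inconnu).")
print(resolve(targets, known_ids))
```

Unresolved targets are kept with a `None` value so that they can be reported or retried once the other tomes have been processed.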
This is in line with the last important aspect of our encoder. Although many
lexicographers may deem our encoding too shallow, it has the advantage of not
requiring overly complex data structures to be kept in memory for a long time.
The algorithm implementing it in `soprano` outputs elements as soon as it can,
for instance the empty elements already discussed above. For articles, it pushes
lines onto a stack and flushes it each time it encounters the beginning of the
following article. This allows the amount of memory required to remain
reasonable and even lets several tomes be processed in parallel on most modern
machines. Thus, even though each tome takes over three minutes, the total
processing time for the whole of *La Grande Encyclopédie* can be lowered to
around forty minutes on a machine with 16 GB of RAM, instead of over one hour
and a half.
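The buffering strategy described above can be illustrated by a short Python generator. It is a sketch of the general idea, not the code of `soprano`: the names `segment_articles` and `looks_like_headword` are ours, and the upper-case heuristic used to detect the beginning of an article is a deliberately naive assumption.

```python
from typing import Callable, Iterable, Iterator, List


def segment_articles(lines: Iterable[str],
                     starts_article: Callable[[str], bool]) -> Iterator[List[str]]:
    """Yield articles one at a time, keeping only the current one in memory."""
    buffer: List[str] = []
    for line in lines:
        # A new article begins: flush the lines accumulated so far.
        if starts_article(line) and buffer:
            yield buffer
            buffer = []
        buffer.append(line)
    if buffer:  # flush the last article of the tome
        yield buffer


def looks_like_headword(line: str) -> bool:
    """Hypothetical heuristic: the line starts with a word in capitals."""
    first = line.split(" ", 1)[0].rstrip(".,")
    return len(first) > 1 and first.isupper()


# Made-up lines standing for the raw text of a column.
sample = ["CATHÈTE. (first line)", "(second line)", "CATHÉTER. (first line)"]
for article in segment_articles(sample, looks_like_headword):
    print(article)
```

Because each article is emitted as soon as the next one starts, the memory footprint depends on the longest article rather than on the size of the tome.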
## Domains
### Domain systems
......@@ -1499,19 +776,4 @@ TODO How to be smarter in the association?
TODO Grammar of articles
## Part-of-speech and syntactic annotation
### Tagset
We use the [tagset]() of the
[PRESTO](http://presto.ens-lyon.fr/) project
Actually, Stanza is also a good option, with the
[UPOS](https://universaldependencies.org/docs/u/pos/) tags
### Processing pipelines
- PRESTO
- Stanza
## Formats and states of the texts
### L'Encyclopédie
In common parlance, the terms "dictionaries" and "encyclopedias" are used as
near synonyms to refer to books compiling vast amounts of knowledge into lists
of definitions ordered alphabetically. Their similarity is even visible in the
way they are coordinated in the full title of the *Encyclopédie* which is
probably the most famous work of the genre and a symbol of the Age of
Enlightenment. If the word "encyclopedia" is nowadays part of our vocabulary, it
was much more unusual and in fact controversial when Diderot and d'Alembert
decided to use it in the title of their book.
The definition given by Furetière in his *Dictionnaire Universel* in 1690 is
still close to its Greek etymology: a "ring of all knowledges", from *κύκλος*,
"circle", and *παιδεία*, "knowledge". This meaning is the one used for instance
by Rabelais in *Pantagruel*, when he has Thaumaste declare that Panurge opened
to him "le vray puys et abisme de Encyclopedie" ("the true well and abyss of
Encyclopedia"). At the time the word still mostly refers to the abstract concept
of mastering all knowledges at once. Furetière adds that it is a quality one
is unlikely to possess, and even seems to condemn its search as a form of
hubris: "C'est une témérité à un homme de vouloir posséder l'Encyclopédie"
("it is a recklessness for a man to want to possess Encyclopedia").
Beyond this moral reproach, the concept that pleased Rabelais was somewhat dated
at the end of the 17^th^ century and attacked in the
*Dictionnaire Universel François et Latin*, commonly referred to as the
*Dictionnaire de Trevoux*, as utterly "burlesque" ("parodic"). The entry for
"Encyclopédie" remained unchanged in the four editions issued between 1721 and
1752, mocking the use of the word and discouraging its readers from pursuing it.
To that end, it quotes a poem from Pibrac encouraging people to specialise in
only one discipline lest they should not reach perfection, based on an
argumentation that resembles the saying "Jack of all trades, master of none". It
is all the more interesting that the definition remains unaltered until 1752,
one year after the publication of the first volume of the *Encyclopédie*. The
Jesuits who edited the *Dictionnaire de Trevoux* frowned upon the project of the
*Encyclopédie* which they managed to get banned the same year by the Council of
State on the charge of attempting to destroy the royal authority, inspiring
rebellion and corrupting morality in general. There is much more at stake than
words here, but the attempt to deprecate the word itself is part of their fight
against the philosophers of the Enlightenment.
The attacks do not remain ignored by Diderot who starts the very definition of
the word "Encyclopédie" in the *Encyclopédie* itself by a strong rebuttal. He
directly dismisses the concerns expressed in the *Dictionnaire de Trevoux* as
mere self-doubt that their authors should not generalise to anyone, then leaves
the main point to a Latin quote by Chancellor Bacon [@lojkine2013], who argues
that a collaborative work can achieve much more than any talented man could:
what could possibly not be within reach of a single man, within a single
lifetime may be achieved by a common effort throughout generations.
History hints that Diderot's opponents took his defence of the feasibility of
the project quite seriously, considering the fact that they got the
*Encyclopédie*'s privileges revoked again six years after its publication
was resumed [@moureau2001]. As a consequence, the remaining ten volumes
containing the text of the articles had to be published illegally until 1765,
thanks to the secret protection of Malesherbes who — despite being head of royal
censorship — saved the manuscripts from destruction. They were printed secretly
outside of Paris and the books were (falsely) labeled as coming from Neufchâtel.
Following the high demand from the booksellers who feared they would lose the
money they had invested in the project, a special privilege was issued for the
volumes containing the plates, which were released publicly from 1762 to 1772.
In any case, in their last edition in 1771 the authors of the *Dictionnaire de
Trevoux* had no choice but to acknowledge the success of the encyclopedic
projects of the 18^th^ century. In this version, the definition
was entirely reworked, mildly stating that good encyclopedias are difficult to
make because of the amount of knowledge necessary and work needed to keep up
with scientific progress instead of calling the effort a parody. It credits
Chambers's *Cyclopædia* for being a decent attempt before referring anonymously
though quite explicitly to Diderot and d'Alembert's project by naming the
collective "Une Société de gens de Lettres" and writing that it started in 1751.
Even more importantly, two new entries were added after it: one for the
adjective "encyclopédique" and another one for the noun "encyclopédiste",
silently admitting how the project had changed its time and the relation to
knowledge itself.
#### Context of the work
#### Available versions
The ARTFL[^ARTFL] offers a version of it.
[^ARTFL]: [http://artfl-project.uchicago.edu/](http://artfl-project.uchicago.edu/)
#### Processing
### La Grande Encyclopédie
#### Context of the work
*La Grande Encyclopédie, Inventaire raisonné des Sciences, des Lettres et des
Arts par une Société de savants et de gens de lettres* (henceforth *LGE*) was
published in France between 1885 and 1902 by a team of more than two hundred
specialists organised in eleven sections. The text comprises 31 tomes of about
1200 pages each and was, according to @jacquet_pfau2015, the last major French
encyclopedic enterprise to follow in the footsteps of its prestigious ancestor,
the *EDdA*, published about 130 years earlier.
The full title of the work already shows its intent to claim descent from the
*EDdA* and to update it [@jacquet_pfau_actualiser_2022].
#### Available versions
A digital version of this work was produced by the BnF and put
online[^LGE-V1] in 2007. Based on an undated reprint of the original edition, it
comprises one image per page of the work, digitised in greyscale at a resolution
of 300x300 pixels per inch. From these image files a partial version of the text
was derived by applying an optical character recognition program ([@=OCR]). This
version has a number of limitations which prevented a comprehensive study of the
text by automatic means such as textometry.
[^LGE-V1]: [http://catalogue.bnf.fr/ark:/12148/cb377013071](http://catalogue.bnf.fr/ark:/12148/cb377013071)
First, the text is incomplete: although a large number of tomes have been
OCRised, some, such as tomes 5 and 18, have not been processed and no text is
available for these volumes on the Gallica
website[^LGE-V1-liste-des-tomes]. This state makes an exhaustive study
impossible, but nothing suggests that a study based on the available tomes would
be subject to a particular bias, since the OCRised tomes do not seem to have
been chosen according to a precise logic: the missing ones are, for instance,
not contiguous, neither at the beginning nor at the end of the work. Second,
this "raw text" version (in reality an HTML page with minimal markup) carries
only a very superficial annotation and, in particular, is not segmented into
articles. This is a major obstacle to the use we want to make of it, since the
unit of study of our corpus is the article, which makes it possible both to
carry out a contrastive study by grouping articles by knowledge domain or by
author and to observe the structure of the domains by comparing, between two
encyclopedias, which articles have been kept or not and, where applicable,
whether the knowledge domain associated with them is the same. Finally, errors
in the detection of the page layout ([@=OLR]) significantly obscure the text by
performing local permutations of its content, which sometimes mix pieces of
articles together (making the segmentation of the text into articles
considerably harder) and in all cases damage the structure of sentences, which
introduces errors in the later morpho-syntactic and syntactic annotation phases
that we need to apply to the text in order to do textometry.
[^LGE-V1-liste-des-tomes]: [https://gallica.bnf.fr/ark:/12148/bpt6k246407#](https://gallica.bnf.fr/ark:/12148/bpt6k246407#)
In order to remedy these shortcomings, the CollEx Persée project
DISCO-LGE[^DISCO-LGE] undertook to redigitise this encyclopedia in partnership
with the BnF and to produce a version of it encoded in XML-TEI. This new version
was made from photographs of an original copy[^LGE-V2] held at the Bibliothèque
de l'Arsenal in Paris[^bib-arsenal].
[^DISCO-LGE]: [https://www.collexpersee.eu/projet/disco-lge/](https://www.collexpersee.eu/projet/disco-lge/)
[^LGE-V2]: [https://catalogue.bnf.fr/ark:/12148/cb41651490t](https://catalogue.bnf.fr/ark:/12148/cb41651490t)
[^bib-arsenal]: [https://www.bnf.fr/fr/arsenal](https://www.bnf.fr/fr/arsenal)
This project ended in August 2021 and resulted in the publication on
Nakala[^nakala], the data repository of the Huma-Num Very Large Research
Infrastructure (Très Grande Infrastructure de Recherche), of a new version of
the work in various formats.
[^nakala]: [https://nakala.fr/](https://nakala.fr/)
#### Encoding
##### Structure of the *dictionaries* module
**Definitions**
By iterating several times the operation of moving on that graph along one edge,
that is, by considering the transitive closure of the relation "be connected by
an edge", we define *inclusion paths* which allow us to explore which elements
may be nested under which others.
The nodes visited along the way represent the intermediate XML elements to
construct a valid XML tree according to the TEI schema. Given the top-down
semantics of those trees, we call the length of an inclusion path its *depth*.
The ability for an element to contain itself corresponds directly to loops on
the graph (that is an edge from a node to itself) as can be illustrated by the
`<abbr/>` element: an `<abbr/>` element (abbreviation) can directly contain
another one.
The generalisation of this to inclusion paths of any length greater than one is
usually called a cycle and we may be tempted in our context to refine this and
name them *inclusion cycles*. The `<address/>` element provides us with an
example for this configuration: although an `<address/>` element may not
directly contain another one, it may contain a `<geogName/>` which, in turn, may
contain a new `<address/>` element. From a graph theory perspective, we can say
that it admits an inclusion cycle of length two.
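These definitions translate directly into code. The sketch below represents a toy excerpt of the inclusion graph as an adjacency dictionary, restricted to the two examples just given plus one empty element, and computes the length of the shortest inclusion cycle of an element by a breadth-first traversal. The names `GRAPH` and `shortest_cycle` are ours, and the real graph would of course be derived from the TEI schema rather than written by hand.

```python
from collections import deque

# Toy excerpt of the inclusion graph, limited to the examples discussed above.
GRAPH = {
    "abbr": ["abbr"],          # a loop: <abbr/> may directly contain <abbr/>
    "address": ["geogName"],
    "geogName": ["address"],
    "lb": [],                  # an empty element contains nothing
}


def shortest_cycle(graph, element):
    """Length of the shortest inclusion path from `element` to itself, or None."""
    queue = deque((child, 1) for child in graph.get(element, []))
    seen = set()
    while queue:
        node, depth = queue.popleft()
        if node == element:
            return depth
        if node not in seen:
            seen.add(node)
            queue.extend((child, depth + 1) for child in graph.get(node, []))
    return None


print(shortest_cycle(GRAPH, "abbr"))     # 1
print(shortest_cycle(GRAPH, "address"))  # 2
print(shortest_cycle(GRAPH, "lb"))       # None
```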
**Applications**
Using classical, well-known methods such as Dijkstra's algorithm [@dijkstra59]
allows us to explore the shortest inclusion paths that exist between elements.
Though particular caution should be applied because there is no guarantee that
the shortest path is meaningful in general, it at least provides us with an
efficient way to check whether a given element may or may not be nested at all
under another one and gives a lower bound on the length of the path to expect.
Of course, the accuracy of this heuristic decreases as the intended, meaningful
path between two nodes, the one that a human specialist of the TEI framework
would build, gets longer.
This is still very useful when taking into account the fact that TEI modules are
merely "bags" to group the elements and provide hints to human encoders about
the tools they might need but have no implication on the inclusion paths between
elements which cross module boundaries freely. The general graph formalism
enables us to describe complex filtering patterns and to implement queries to
look for them among the elements exhaustively by algorithmic means even when the
shortest-path approach is not enough.
For instance, it lets one find that although `<pos/>` may not be directly
included within `<entry/>` elements to include information about the
part-of-speech of the word that an article defines, the correct way to do so is
through a `<form/>` or a `<gramGrp/>`.
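As an illustration, the query behind this example can be sketched as follows. Since all edges of the inclusion graph have the same weight, Dijkstra's algorithm behaves like a breadth-first search, which is what this sketch uses instead. The graph is a hand-written excerpt containing only the edges mentioned in this example, and the function name `shortest_inclusion_path` is ours.

```python
from collections import deque

# Hand-written excerpt: only the inclusions mentioned in the example above.
GRAPH = {
    "entry": ["form", "gramGrp"],
    "form": ["pos"],
    "gramGrp": ["pos"],
    "pos": [],
}


def shortest_inclusion_path(graph, source, target):
    """Return one shortest inclusion path as a list of elements, or None."""
    parents = {source: None}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk back through the parents to rebuild the path.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for child in graph.get(node, []):
            if child not in parents:
                parents[child] = node
                queue.append(child)
    return None


print(shortest_inclusion_path(GRAPH, "entry", "pos"))  # ['entry', 'form', 'pos']
print(shortest_inclusion_path(GRAPH, "pos", "entry"))  # None
```

The function returns a single shortest path; when several exist, as is the case here with `<form/>` and `<gramGrp/>`, which one is returned only depends on the order of the adjacency lists.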
On the other hand, trying to discover the shortest inclusion path to `<pos/>`
from the `<TEI/>` root of the document yields a `<standOff/>`, an element
dedicated to storing contextual data that accompanies but is not part of the
text, not unlike an annex, and largely unrelated to the context of encoding an
encyclopedia.
A last relevant example on the use of these methods can be given by querying the
shortest inclusion path of a `<pos/>` under the `<body/>` of the document: it
yields an inclusion directly through `<entryFree/>` (with an inclusion path of
length 2), which unlike `<entry/>` accepts it as a direct child node. This is
possibly not what we want, depending on the regularity of the articles we are
encoding and the occurrence of other grammatical information such as `<case/>`
or `<gen/>` to justify the use of the `<gramGrp/>`, but searching exhaustively
for paths up to length 3 returns as expected the path through `<entry/>`, among
others. Overall, we get a good general idea: `<pos/>` does not need to be nested
very deep, it can appear quite near the "surface" of article entries.
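The exhaustive search mentioned in this example can be sketched by a depth-limited traversal. Again, the graph below is a hand-written excerpt limited to inclusions mentioned in this section, not the full TEI graph, and the function name `paths_up_to` is ours.

```python
# Hand-written excerpt of the inclusion graph around <body/> and <pos/>.
GRAPH = {
    "body": ["entry", "entryFree", "superEntry"],
    "superEntry": ["entry"],
    "entryFree": ["entry", "pos"],
    "entry": ["form", "gramGrp"],
    "form": ["pos"],
    "gramGrp": ["pos"],
    "pos": [],
}


def paths_up_to(graph, source, target, max_length):
    """List every inclusion path from source to target of at most max_length edges."""
    found = []

    def walk(path):
        node = path[-1]
        if node == target and len(path) > 1:
            found.append(path)
            return
        if len(path) - 1 == max_length:
            return  # depth limit reached without meeting the target
        for child in graph.get(node, []):
            walk(path + [child])

    walk([source])
    return found


paths = paths_up_to(GRAPH, "body", "pos", 3)
for path in paths:
    print(" > ".join(path))
print(min(paths, key=len))  # ['body', 'entryFree', 'pos']
```

On this excerpt the search returns three paths: the shortest one through `<entryFree/>` and the two paths of length 3 through `<entry/>`. The depth limit is what keeps the enumeration finite on the real graph, which contains cycles.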
##### Limits
###### The `<entry/>` element
The central element of the *dictionaries* module is the `<entry/>` element meant
to encode one single entry in a dictionary, that is to say a head word
associated to its definition. It is the natural way in from the `<body/>`
element to the dictionary module: indeed, although `<body/>` may also contain
`<entryFree/>` or `<superEntry/>` elements, the former is a relaxed version of
`<entry/>` while the latter is a device to group several related entries
together. Both can contain an `<entry/>` directly while no obvious inclusion
exists the other way around: most (> 96.2%) of the inclusion paths of
"reasonable" depth (which we define as strictly less than 5, that is twice the
average shortest depth between any two nodes) include either `<figure/>` or
`<castList/>`, two very specific elements which should not need to appear in an
article in general, showing that the purpose of `<entry/>` is not to contain an
`<entryFree/>` or `<superEntry/>`. Hence, not only the semantics conveyed by the
documentation but also the structure of the elements graph point to `<entry/>`
as the natural top-most element for an article. This somewhat contrived example
is meant to further demonstrate the application of a graph-centred approach to
understanding the inner workings of the XML-TEI schema.
###### Information about the headword itself
Once a block for an article is created, it may contain elements useful to
represent various of its features. Its written and spoken forms are usually
encoded by `<form/>` elements. Grammatical information like the `<case/>`,
`<gen/>` or `<number/>` and `<pers/>` can be contained within a `<gramGrp/>`,
along with information about the categories it belongs to like `<iType/>` for
its inflection class in languages with a declension system or `<pos/>` for its
part-of-speech. The `<etym/>` element is made to hold the etymology of an entry.
In the case when there are alternative spellings in varieties of the language or
if the spelling has changed over time, `<usg/>` can be used.
All these examples are by no means an exhaustive list; the complete set provides
the encoder with a toolbox to describe all the information related to the form
the entry is found at and seems general enough to accommodate the structure of
any book indexing entries by words.
###### Cross-references
A common feature shared by dictionaries and encyclopedias is the ability to
connect entries together by using a word or short phrase as the link, referring
the reader to the related concept. These links are known as cross-references and can
appear either when the definition of a term is adjacent to another one or to
catch alternative spellings where some readers might expect to find the word and
redirect them to the form chosen as the reference. In XML-TEI, this is done with
the `<xr/>` element. It usually contains the whole phrase performing the
redirection, with an imperative locution like "please see […]".
The "active" part of the cross-reference, that is the very word within the
`<xr/>` that is considered to be the link or, to make a modern-day HTML
metaphor, the region that would be clickable, is represented by a `<ref/>`
element. Though it is not specific to the *dictionaries* module, we include it
in this description of the toolbox because it is particularly useful in the
context of dictionaries. This element may have a `target` attribute which points
to the other resource to be accessed by the interested reader.
###### Definitions
The remaining part of entries is also usually the largest and represents the
content associated to the headword by the entry. In a dictionary, that is its
meaning.
The `<sense/>` element is a valid child for `<entry/>` and groups together a
definition of the term with `<def/>`, usage examples with `<usg/>` (another use
of this versatile element) and other high-level information such as translations
in other languages. Both `<def/>` and `<usg/>` elements may appear directly
under the `<entry/>`.
###### Structural remarks
Before concluding this description of the *dictionaries* module from the
perspective of someone trying to concretely encode a particular dictionary or
encyclopedia, we make use of the graph approach again to highlight some of its
aspects in terms of inclusion structure.
First, it is remarkable that all elements in the *dictionaries* module have a
cyclic inclusion path, that is to say, there is an inclusion path from each
element of this module to itself. Although having such a cycle is a widespread
property in the remainder of XML-TEI elements shared by 73.8% of them (411 out
of the 557 elements in the other modules), all 33 elements of the *dictionaries*
module having one is far above this average. In addition, the cycles appear to
be rather short, with an average length of 2.00 versus 2.50 in the rest of the
population. This observation is all the more surprising considering the fact
that the *dictionaries* module contains short "leaf" elements like `<pos/>`
which should not obviously need to admit cycles since one rather expects them to
contain only one word, like `<pos>adj</pos>` in the example given in the
official documentation. Among those (shortest) cycles, 20 include the `<cit/>`
element made to group quotations with a bibliographic reference to their source
which should clearly be unnecessary to encode an article in the general case.
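The cycle measurements above can be sketched as a breadth-first search over the inclusion graph. This is only a minimal illustration: the `INCLUSION` dictionary is a hypothetical toy graph, not the real TEI content model, and `shortest_cycle_length` is a name introduced here for the example.

```python
from collections import deque

# Hypothetical toy inclusion graph: an edge A -> B means "A may contain B".
# It is NOT the real TEI content model, only an illustration.
INCLUSION = {
    "entry": ["sense", "form", "cit"],
    "sense": ["sense", "def", "cit"],
    "form": ["pos"],
    "def": ["cit"],
    "cit": ["def", "pos"],
    "pos": [],
}

def shortest_cycle_length(graph, start):
    """Return the length of the shortest inclusion path from `start` back to
    itself, or None when no such cycle exists."""
    queue = deque((child, 1) for child in graph.get(start, []))
    seen = set()
    while queue:
        node, depth = queue.popleft()
        if node == start:
            return depth  # breadth-first order guarantees this is the shortest
        if node in seen:
            continue
        seen.add(node)
        queue.extend((child, depth + 1) for child in graph.get(node, []))
    return None

cycle_lengths = {element: shortest_cycle_length(INCLUSION, element) for element in INCLUSION}
```

On the real graph, the proportion of elements admitting a cycle and the average cycle length would follow from the values of such a dictionary.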
Secondly, although we have seen examples of connections from this module to the
rest of the XML-TEI, especially to the *core* module (see the case of the
`<ref/>` element above), the *dictionaries* module appears somewhat isolated
from important structural elements like `<head/>` or `<div/>`. Indeed, computing
all the paths from either `<entry/>` or `<sense/>` elements to the latter of
length less than or equal to 5 by a systematic traversal of the graph yields
exclusively paths (respectively 9042 and 39093 of them) containing either a
`<floatingText/>` or an `<app/>` element. The first one, as its name aptly
suggests, is used to encode text that does not quite fit the regular flow of the
document, as for example in the context of an embedded narrative. Both examples
displayed in the online documentation feature a `<body/>` as direct child of
`<floatingText/>`, neatly separating its content as independent. The purpose of
the second one, although its name — short for apparatus — is less clear, is to
wrap together several versions of the same excerpts, for instance when there are
several possible readings of an unclear group of words in a manuscript, or when
the encoder is trying to compile a single version of a piece of work from
several sources which disagree over some passage. In both cases, it appears
obvious that it is not something that is expected to occur naturally in the
course of an article in general.
Thus, despite a rather dense internal connectivity, the *dictionaries* module
fails to provide encoders with a device to represent recursively nesting
structures like `<div/>`.
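A systematic traversal of this kind could look like the following sketch. It assumes that a path stops as soon as it reaches the target and that paths may revisit elements; the `TOY` graph is a made-up miniature (its edges are not taken from the TEI specification) and `bounded_paths` is a hypothetical helper name.

```python
def bounded_paths(graph, source, target, max_length):
    """Yield every inclusion path from `source` to `target` that uses at most
    `max_length` edges. A path stops the first time it reaches `target`."""
    stack = [(source,)]
    while stack:
        path = stack.pop()
        if len(path) > 1 and path[-1] == target:
            yield path
            continue
        if len(path) - 1 < max_length:
            for child in graph.get(path[-1], []):
                stack.append(path + (child,))

# Hypothetical toy graph (A -> B means "A may contain B"), for illustration only.
TOY = {
    "entry": ["sense", "cit"],
    "sense": ["cit", "def"],
    "def": ["cit"],
    "cit": ["floatingText", "app"],
    "floatingText": ["body"],
    "body": ["div"],
    "app": ["rdg"],
    "rdg": ["div"],
    "div": ["div", "head", "p"],
}

paths = list(bounded_paths(TOY, "entry", "div", 5))
# Check whether every path goes through one of the two suspicious elements.
all_through_detour = all("floatingText" in path or "app" in path for path in paths)
```

Counting the yielded paths and testing each one for the presence of given elements is enough to reproduce the kind of observation made above.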
The situation regarding subject indicators is hardly better outside of the
module. The `<domain/>` element despite its name belongs exclusively in the
header of a document and focuses on the social context of the text, not on the
knowledge area it covers. The `<interp/>` element might seem suited to
labeling a passage with the interpretation to give to its context (which subject
indicators could be if you consider that, placed at the beginning, they are used
to direct the frame of mind of the readers towards a particular subject). However,
the documentation clearly presents it as a tool for annotators of a
document, whose text content is not part of the original document but some
additional result of an analysis performed in the context of the encoding, and
which is used only through references in XML attributes.
This point, although not the most concerning, still remains the hardest to
address but all things considered the `<usg/>` element stands out as the most
relevant.
###### The notion of meaning
Notwithstanding the correct way to represent domains of knowledge, their extent
itself raises concerns regarding the *dictionaries* module. Indeed, among the
vast collection of domains covered in encyclopedias in general and in *La Grande
Encyclopédie* in particular are historical articles and biographies. If the
notion of meaning can appear at least ill-fitting for a text describing a series
of historical events, one may still argue that it groups them into a concept and
associates it with the name of the event. But when it comes to relating the life
of a person, describing their relation to events and other persons comes out
even further from the notion of meaning. Entries such as the one about SANJO
Sanetomi (see Figure @fig:sanjo) do not constitute a *definition*.
![Beginning of the article relating the life of SANJO Sanetomi, in La Grande Encyclopédie, tome 29](figure/article/LGE/sanjo_t29.png){#fig:sanjo}
Moreover, encyclopedias, because of all that they have inherited from the
philosophical Enlightenment, are not only spaces designed to assert, they also
intrinsically include an interrogative component. Some articles lay down the
basis required to understand the complexity of an issue and invite the reader to
consider it without providing a definitive answer, going as far as to explicitly
use question marks as in the article "Action" displayed in Figure @fig:action.
![Excerpt from article "Action", in La Grande Encyclopédie, tome 1](figure/article/LGE/action_t1.png){#fig:action}
In this extract, the author devises a hypothetical situation to illustrate how
difficult it is to draw the line between two supposedly mutually exclusive
subcategories of legal actions. The whole point of the passage is to convey the
idea that the term eludes definition: wrapping it in a `<sense/>` or, worse, a
`<def/>` element would be an utter misnomer.
As a result, the use of `<sense/>` and `<def/>` is not appropriate for
encyclopedic content in general.
###### Nested structures
The final difficulty can be considered as a partial consequence of the previous
one on the structure of articles. The difficulty of defining complex concepts is
the very reason why authors approach their subjects from various angles,
circumnavigating them as a best approximation. This strategy favours long,
structured developments with sections and subsections covering the multiple
aspects of the topic: from a historical, political, scientific point of view…
The longest articles, such as article "Europe" shown in Figure @fig:europe, can
thus span several dozens of pages. They can contain substructures with titles on
at least three levels (for instance, a `a)` under a `1)` under a `I.`), each of
which is in turn generally developed over several paragraphs.
![La Grande Encyclopédie, tome 16, article "Europe", spanning from p.782 to p.846, that is 64 pages, and ending after a bibliography longer than one column of text](figure/article/LGE/europe_t16.png){#fig:europe}
The nested structure that we have just highlighted of course demands a nesting
structure to accommodate it. More precisely, it guides our search for XML elements
by giving us several constraints: we are looking for a pair of elements, the
first of which represents a (sub)section and must be able to include both itself
and the second element. The second has no special constraint other than having
semantics compatible with our purpose of using it to represent section
titles. In addition, the first element must be able to contain several `<p/>`
elements, `<p/>` being the reference element to encode paragraphs according to
the XML-TEI documentation.
We have seen that the *dictionaries* module was equipped with a questionable but
possible element for subject domains. However, it does not include any element
for section titles. In the rest of the TEI specification, the elements `<head/>`
and `<title/>` — the latter with the possibility to set its `type` attribute to
`sub` — stand out as the best candidates for the semantics condition on the
second element.
##### Choice
###### Candidates in the *dictionaries* module
Filtering the content of the module to keep only the elements which can at the
same time contain themselves, be included under `<entry/>` and include a `<p/>`
and either the `<head/>` or `<title/>` elements yields absolutely no candidates.
It is remarkable that even replacing the `<entry/>` element for the root of each
article with an `<entryFree/>`, an element supposed to relax some constraints to
accommodate more unusual structures in dictionaries, does not bring any
improvement.
The lack of results from these simple queries forces us to somewhat release the
constraints on the encoding we are willing to use. We can for instance make the
assumption that the occurrence of an intermediate element could be needed between
the element wrapping the whole article and the recursing one used to encode each
section. This "section" element could also need a companion element to be able
to include itself, or, to formalise it in terms of graph theory, we could relax
the condition that this element admits a loop to consider instead cycles of a
given (small, this still needs to represent a fairly direct inclusion) length to
be enough. We simultaneously extend the maximum depth of the inclusion paths we
are looking for between `<entry/>`, the pair of elements and the `<p/>` element.
By setting this depth to 3, that is, by accepting one intermediate element to
occur in the middle of each one of the inclusion paths that define the structure
required to encode encyclopedic discourse, we find 21 elements but none of them
stand out as an obvious good solution: all paths to include the `<p/>` element
from any *dictionaries* element contain either a `<figure/>` (which we have
encountered earlier when we were practising our graph approach to search for
inclusions between `<entry/>` and `<entryFree/>` and dismissed as not useful in
general), a `<stage/>` (reserved to stage direction in dramatic works) or a
`<state/>` (used to describe a temporary quality in a person or place), again
not even close to what we want. The paths to either `<head/>` or `<title/>` are
similarly disappointing. Again, changing `<entry/>` for `<entryFree/>` returns
the exact same candidates. Although this is not a thorough proof that none of these
elements could fulfill our purpose, the fact that no element in this module
appears as an obvious good solution is a serious hint to keep looking somewhere
else.
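The strict and relaxed queries described in this section can be sketched with a single reachability test parameterised by the number of intermediate elements allowed. The graph below is a hypothetical miniature (its edges are invented for the example), and `reaches` and `section_candidates` are names chosen here, not functions from an existing tool.

```python
def reaches(graph, source, target, max_intermediate):
    """True if `target` can be included under `source` with at most
    `max_intermediate` elements in between."""
    frontier = {source}
    for _ in range(max_intermediate + 1):
        frontier = {child for node in frontier for child in graph.get(node, [])}
        if target in frontier:
            return True
    return False

def section_candidates(graph, module, root, max_intermediate):
    """Elements of `module` usable as a recursive section under `root`: they
    must include themselves, a <p/>, and a <head/> or a <title/>."""
    return sorted(
        element
        for element in module
        if reaches(graph, root, element, max_intermediate)
        and reaches(graph, element, element, max_intermediate)
        and reaches(graph, element, "p", max_intermediate)
        and (
            reaches(graph, element, "head", max_intermediate)
            or reaches(graph, element, "title", max_intermediate)
        )
    )

# Hypothetical toy graph (A -> B means "A may contain B"), for illustration only.
TOY = {
    "entry": ["sense", "note"],
    "sense": ["sense", "def", "note"],
    "def": [],
    "note": ["p", "head", "sense"],
}
MODULE = ["entry", "sense", "def"]

strict = section_candidates(TOY, MODULE, "entry", 0)   # direct inclusions only
relaxed = section_candidates(TOY, MODULE, "entry", 1)  # one intermediate element
```

In this toy setting the strict query returns nothing while the relaxed one returns a candidate that only works through a companion element, which mirrors the situation described above.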
###### Widening the search
We hence widen our search to include elements outside the *dictionaries* module
which could be used to encode our sections and subsections, under the same
constraint as before to try and find a composite solution that would remain
under the `<entry/>` element even if resorting to subcomponents outside of the
dedicated module. Only three elements are returned: `<figure/>`, `<metamark/>`
and `<note/>`.
The first one as we have repeatedly underlined is meant for graphic information
and is not suitable for text content in general.
The purpose of `<metamark/>` is to transcribe the editing marks that may appear
on a particular primary source in order to alter the normal flow of the text and
suggest an alternative reading (deletion, insertion, reordering: this is about a
human editing the text from a given physical copy of it), so it is
unfortunately of no use to encode a section of an article.
The first element that might at least resemble what we are looking for is the
last one, `<note/>`. It is meant to contain text, is about explaining something
and seems general enough (not specific to a given genre, or to the occurrence of
a particular object on the page). Unfortunately, its semantics still seems a bit
off compared to our need. The documentation describes it as an "additional
comment" which appears "out of the main textual stream" whereas the long
developments in articles are the very matter of the text of encyclopedias, not
mere remarks in the margins or at the foot of pages.
##### Implementation
The above remarks explain why the *dictionaries* module is unable to represent
encyclopedias, where the notion of "meaning" is less central than in
dictionaries and where discourse with nested structures of arbitrary depth can
occur. Even composite encodings using elements outside of the *dictionaries*
module under an `<entry/>` element do not meet our requirements. The *core*
module, on the other hand, accommodates these structures by means of the
`<div/>`, `<head/>` and `<p/>` elements, which have the additional advantage of
carrying less semantic payload than `<sense/>` or `<def/>`. We therefore devise
an encoding scheme based on them, which we recommend to other projects aiming at
representing encyclopedias.
To remain consistent with the above remarks we will only concern ourselves with
what happens at the level of each article, right under the `<body/>` element.
Everything related to metadata happens as expected in the file's `<teiHeader/>`
which is well-enough equipped to handle them. In order to present our scheme
throughout the following section we will be progressively encoding a reference
article, "Cathète" from tome 9 reproduced in Figure @fig:cathete-photo.
![La Grande Encyclopédie, tome 9, article "Cathète"](figure/article/LGE/cathète_t9.png){#fig:cathete-photo}
###### The scheme
Remaining within the *core* module for the structure, almost all useful elements
are available and our encoding scheme merely quotes the official documentation.
Each article is represented by a `<div/>`. We suggest setting an `xml:id`
attribute on it with the head word of the entry — unique in the whole corpus, or
made so by suffixing a number representing its rank among the various
occurrences, even when there's only one for the sake of regularity — as its
value, normalised to lowercase, stripping spaces and replacing all
non-alphanumerical characters by a dash (`'-'`) to avoid issues with the XML
encoding. Figure @fig:cathete-xml-0 illustrates this choice for the container
element on the article "Cathète" previously displayed.
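The normalisation rule can be sketched as follows. Several points are assumptions of this example rather than part of the scheme: accented letters are treated as alphanumerical and kept, "stripping spaces" is read as trimming the ends (inner spaces become dashes like any other non-alphanumerical character), the rank starts at `0`, and `article_id` is a hypothetical function name.

```python
import re
import unicodedata

def article_id(headword, rank=0):
    """Build an identifier from a headword: lowercase, trimmed, with every
    non-alphanumerical character replaced by a dash, then the rank suffix."""
    # Compose accents first so that "è" counts as a single letter (assumption).
    text = unicodedata.normalize("NFC", headword).strip().lower()
    # \w is Unicode-aware in Python 3; the underscore is excluded explicitly.
    dashed = re.sub(r"[^\w]|_", "-", text)
    return f"{dashed}-{rank}"

cathete = article_id("Cathète")
alcala = article_id(" Alcala-de-Hénarès ", rank=1)
sanjo = article_id("SANJO Sanetomi")
```

A stricter variant restricted to ASCII letters and digits would only need a different character class in the regular expression.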
![The container `div` element for article "Cathète"](figure/article/LGE/cathète_0.png){#fig:cathete-xml-0}
Inside this element should be a `<head/>` enclosing the headword of the article.
The usual sub-`<hi/>` elements are available within `<head/>` if the headword is
highlighted by any special typographic means such as bold, small capitals, etc.
The one disappointment of the encoding scheme we are defining in this chapter is
the lack of support for a proper way to encode subject indicators.
The best candidate we have found so far was `<usg/>` from the *dictionaries*
module but it is not available directly under a `<head/>` element. All inclusion
paths from the latter to the former of length less than or equal to 3 contain
irrelevant elements (`<cit/>`, `<figure/>`, `<castList/>` and `<nym/>`) so it
must be discarded. The next best elements appear to be `<term/>` (not very
accurate) and `<rs/>` ("referring string", quite a general semantics but a
possible match — subject indicators refer to a given domain of knowledge —
although all the examples in the documentation refer to concrete persons,
places or objects, not to the abstract objects that mathematics or poetry are).
For this reason, we do not recommend any special encoding of the subject
indicator but leave it open to each particular context. Subject indicators are
often abbreviated, so an `<abbr/>` may apply. In *La Grande Encyclopédie*,
biographies are not labeled by a knowledge domain but usually include the first
name of the person when it is known, so in that case an element like
`<persName/>` is still appropriate. This choice applied to the same article
"Cathète" produces Figure @fig:cathete-xml-1.
![Encoding the head word of article "Cathète"](figure/article/LGE/cathète_1.png){#fig:cathete-xml-1}
We then propose to wrap each different meaning in a separate `<div/>` with the
`type` attribute set to `sense` to refer to the `<sense/>` element that would
have been used within the *dictionaries* module. The `<div/>`s should be numbered
according to the order they appear in with the `n` attribute starting from `0`
as shown in Figure @fig:cathete-xml-2.
![The empty structure for the only meaning of the word "Cathète"](figure/article/LGE/cathète_2.png){#fig:cathete-xml-2}
In addition, each line within the article must start with a `<lb/>` to mark its
beginning, including before the `<head/>` element, as demonstrated by Figure
@fig:cathete-xml-3. Although surprising, this setup underlines the fact that
in the dense layout of encyclopedias, the carriage return separating two
articles is meaningful. Stating each new line explicitly keeps enough
information to reconstruct a faithful facsimile, but it also has the advantage of
highlighting the fact that even though the definition is cut from the headword
by being in a separate XML element, they still occur on the same line, which is
a typographic choice usually made both in encyclopedias and dictionaries where
space is at a premium.
To complete the structure, the various sections and subsections occurring
within the article body may be nested as usual with `<div/>` and sub-`<div/>`s,
filled with `<p/>` for paragraphs which can each be titled with `<head/>`
elements local to each `<div/>`.
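As a rough illustration of the resulting skeleton, the sketch below builds the container `<div/>`, the leading `<lb/>`, the `<head/>` and the numbered `sense` sub-`<div/>`s with Python's standard `xml.etree.ElementTree`. The function name `encode_article` and the placeholder paragraph are assumptions of the example; the text is not the actual content of the article, and the `<lb/>` elements inside paragraphs are left out for brevity.

```python
import xml.etree.ElementTree as ET

# Namespace bound to the reserved "xml" prefix, needed for the xml:id attribute.
XML_NS = "http://www.w3.org/XML/1998/namespace"

def encode_article(identifier, headword, senses):
    """Build the skeleton of one article. `senses` is a list of meanings,
    each given as a list of paragraph strings."""
    article = ET.Element("div", {f"{{{XML_NS}}}id": identifier})
    ET.SubElement(article, "lb")  # the line break precedes the headword
    ET.SubElement(article, "head").text = headword
    for number, paragraphs in enumerate(senses):  # numbering starts from 0
        sense = ET.SubElement(article, "div", {"type": "sense", "n": str(number)})
        for text in paragraphs:
            ET.SubElement(sense, "p").text = text
    return article

# Placeholder text only, not the real definition printed in the encyclopedia.
article = encode_article("cathète-0", "CATHÈTE", [["(text of the only meaning)"]])
serialised = ET.tostring(article, encoding="unicode")
```

Nested sections would simply be further `<div/>` elements created under a `sense` division, each with its own `<head/>` and `<p/>` children.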
![A complete encoding of article "Cathète"](figure/article/LGE/cathète_3.png){#fig:cathete-xml-3}
Some articles such as "Boumerang" have figures with captions, as illustrated by
Figure @fig:boumerang-photo, which should be encoded the standard way by
`<figure/>` and `<figDesc/>` as in Figure @fig:boumerang-xml.
![La Grande Encyclopédie, tome 7, article "Boumerang"](figure/article/LGE/boumerang_t7.png){height=300px #fig:boumerang-photo}
![Encoding the figure in article "Boumerang" and its captions](figure/article/LGE/boumerang.png){#fig:boumerang-xml}
Another issue arising from giving up on `<entry/>` is the unavailability of the
`<xr/>` element, not allowed under any of the *core* elements we use but which
is useful to represent cross-references occurring in encyclopedias as well as in
dictionaries, for example in article "Gelocus" (see Figure @fig:gelocus-photo).
We prefer to use the `<ref/>` element instead which is available in the context
of a `<p/>`. Its `target` attribute should be set to the `xml:id` of the
article it points to, prefixed with a `'#'` as shown in Figure @fig:gelocus-xml.
Another solution would have been to introduce a `<dictScrap/>` element for the
sole purpose of placing an `<xr/>` but we advocate against it on account of the
verbosity it would add to the encoding and the fact that it implicitly suggests
that the previous context was not the one of a dictionary.
![La Grande Encyclopédie, tome 18, article "Gelocus"](figure/article/LGE/gelocus_t18.png){#fig:gelocus-photo}
![Encoding the cross-references in article "Gelocus"](figure/article/LGE/gelocus.png){#fig:gelocus-xml}
A typical page of an encyclopedia also features peritext elements, giving
information to the reader about the current page number along with the headwords
of the first and last articles appearing on the page. Those can be encoded by
`<fw/>` elements ("forme work") whose `place` and `type` attributes should be
set to position them on the page and identify their function if it has been
recognised (those short elements on the border of pages are the ones typically
prone to suffer damage or be misread by the OCR).
Finally there are other TEI elements useful to represent "events" in the flow of
the text, like the beginning of a new column of text or of a new page. Figure
@fig:alcala-photo shows the top left of the last page of the first tome of *La
Grande Encyclopédie* which features peritext elements while marking the
beginning of a new page. The usual appropriate elements (`<pb/>` for page
beginning, `<cb/>` for column beginning) may and should be used with our
encoding scheme as demonstrated by Figure @fig:alcala-xml.
![La Grande Encyclopédie, tome 1, article "Alcala-de-Hénarès"](figure/article/LGE/last_page_top_left_t1.png){width=350px #fig:alcala-photo}
![Encoding the beginning of a page in article "Alcala-de-Hénarès"](figure/article/LGE/alcala.png){#fig:alcala-xml}
###### Currently implemented
The reference implementation for this encoding scheme is the program
soprano[^soprano] developed within the scope of project DISCO-LGE to
automatically identify individual articles in the flow of raw text from the
columns and to encode them into XML-TEI files. Though this software has already
been used to produce the first TEI version of *La Grande Encyclopédie*, it does
not yet follow the above specification perfectly. Figure
@fig:cathete-xml-current shows the encoded version of article "Cathète" it
currently produces:
[^soprano]: [https://gitlab.huma-num.fr/disco-lge/soprano](https://gitlab.huma-num.fr/disco-lge/soprano)
![The current encoding of article "Cathète" produced by `soprano`](figure/article/LGE/cathète_current.png){#fig:cathete-xml-current}
The headword detection system is not able to capture the subject indicators yet
so they appear outside of the `<head/>` element. No work is performed either to
expand abbreviations and encode them as such, or to distinguish between domain
and people names.
Likewise, since the detection of titles at the beginning of each section is not
complete, no structure analysis can be performed at the moment on the textual
development inside the article and it is left unstructured, directly under the
entry's `<div/>` element instead of under a set of nested `<div/>` elements. The
paragraphs are not yet identified and for this reason not encoded.
However, the figures and their captions are already handled correctly when they
occur. The encoder also keeps track of the current lines, pages, and columns and
inserts the corresponding empty elements (`<lb/>`, `<pb/>` or `<cb/>`). It
numbers pages so that the numbering corresponding to the physical pages is
available, as opposed to the "high-level" page numbers inserted by the
editors. The latter start with an offset because the first, blank or almost empty
pages at the beginning of each book do not have a number, and they sometimes have
gaps when a full-page geographical map is inserted, since those are printed
separately on a different folio which remains outside of the textual numbering
system. The place at which these layout-related elements occur is determined by
the place where the OCR software detected them and by the reordering performed
by `soprano` when inferring the reading order before segmenting the articles.
###### The constraints of automated processing
Encyclopedias are particularly long books, spanning numerous tomes and
containing several tens of thousands of articles. The *Encyclopédie* comprises
over 74k articles and *La Grande Encyclopédie* certainly more than 100k (the latest
version produced by `soprano` created 160k articles, but their segmentation is
still not perfect: although some article beginnings remain undetected, all the very
long and deeply-structured articles are unduly split into many parts, resulting
globally in an overestimation of the total number).
XML-TEI is a very broad tool useful for very different applications. Some
elements like `<unclear/>` or `<factuality/>` can encode subtle semantic
information (for the second one, adjacent to a notion as elusive as truth)
which requires a very deep understanding of a text in its entirety and about
which even some human experts may disagree.
For these reasons, a central concern in the design of our encoding scheme was to
remain within the boundaries of information that can be described objectively
and extracted automatically by an algorithm. Most of the tags presented above
contain information about the positions of the elements or their relation to one
another. Those with an additional semantic implication like `<head/>` can be
inferred simply from their position and the frequent use of a special typography
like bold or upper-case characters.
The case of cross-references is particular and may appear as a counter-example
to the main principle on which our scheme is based. Actually, the process of
linking from an article to another one is so frequent (in dictionaries as well
as in encyclopedias) that it generally escapes the scope of regular discourse to
take a special and often fixed form, inside parentheses and after a special
token which invites the reader to perform the redirection. In *La Grande
Encyclopédie*, virtually all the redirections (that is, to the extent of our
knowledge, absolutely all of them, though of course some special cases may exist,
but they are statistically rare enough that we have not found any yet) appear
within parentheses, and start with the verb "voir" abbreviated as a single,
capital "V." as illustrated above in the article "Gelocus" (see again Figure
@fig:gelocus-photo).
Although this has not been implemented yet either, we hope to be able to detect
and exploit those patterns to correctly encode cross-references. Getting the
`target` attributes right is certainly more difficult to achieve and may require
processing the articles in several steps, to first discover all the existing
headwords — and hence article IDs — before trying to match the words following
"V." with them. Since our automated encoder handles tomes separately and since
references may cross the boundaries of tomes, it cannot wait for the target of a
cross-reference to be discovered by keeping the articles in memory before
outputting them.
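A possible shape for this two-step process is sketched below. It is not what `soprano` does (the feature is not implemented yet): the regular expression only covers the simplest "(V. …)" form with a single target, and `find_cross_references`, `resolve_targets` and the `to_id` callback are hypothetical names introduced for the example.

```python
import re

# Assumed surface pattern: an opening parenthesis, "V.", then the target phrase.
CROSS_REFERENCE = re.compile(r"\(V\.\s+([^()]+)\)")

def find_cross_references(text):
    """Return the phrases that follow "V." inside parentheses, in reading order."""
    return [match.group(1).strip() for match in CROSS_REFERENCE.finditer(text)]

def resolve_targets(text, known_ids, to_id):
    """Second pass: map each phrase to a `target` value, or None when the
    identifier was not discovered during the first pass."""
    targets = {}
    for phrase in find_cross_references(text):
        candidate = to_id(phrase)
        targets[phrase] = f"#{candidate}" if candidate in known_ids else None
    return targets

# First pass (simulated): identifiers collected from every tome beforehand.
known_ids = {"cerf-0"}
# Made-up sentence, not a quotation from the encyclopedia.
sample = "A ruminant (V. Cerf) mentioned with another entry (V. Inconnu)."
links = resolve_targets(sample, known_ids, lambda phrase: phrase.lower() + "-0")
```

Unresolved phrases are kept with a `None` target so that they can be reviewed or retried once more identifiers are known.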
This is in line with the last important aspect of our encoder. Although many
lexicographers may deem our encoding too shallow, it has the advantage of not
requiring overly complex data structures to be kept in memory for a long time. The
algorithm implementing it in `soprano` outputs elements as soon as it can, for
instance the empty elements already discussed above. For articles, it pushes
lines onto a stack and flushes it each time it encounters the beginning of the
following article. This allows the amount of memory required to remain
reasonable and even lets several tomes be processed in parallel on most modern
machines. Thus, even taking over three minutes per tome, the total processing
time can be lowered to around forty minutes on a machine with 16 GB of RAM for the whole of
*La Grande Encyclopédie* instead of over one hour and a half.
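The flush-on-new-article strategy can be illustrated by a small generator. This is a sketch of the general technique, not the code of `soprano`: the headword test `looks_like_headword` is a deliberately naive assumption (a first word of at least two letters written in capitals) and the sample lines are invented.

```python
def looks_like_headword(line):
    """Naive, assumed heuristic: the line starts with a word in capitals."""
    first = line.split(maxsplit=1)[0].strip(".,") if line.strip() else ""
    return len(first) >= 2 and first.isupper()

def segment_articles(lines, starts_article):
    """Yield one list of lines per article, flushing the pending lines each
    time the beginning of the following article is encountered."""
    pending = []
    for line in lines:
        if starts_article(line) and pending:
            yield pending
            pending = []
        pending.append(line)
    if pending:  # flush the last article at the end of the input
        yield pending

# Invented lines standing for the OCR output of a column.
lines = ["CATHÈTE. first line", "second line", "CATHETER. other article", "more text"]
articles = list(segment_articles(lines, looks_like_headword))
```

Because only the lines of the current article are held at any time, memory use depends on the longest article rather than on the size of the tome, and independent tomes could be handed to separate worker processes.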
#!/bin/sh
source ./chapter.sh 'Préparation et enrichissement du corpus'
cat Corpus/Formats_et_états.md
cat Corpus/Domaines.md
cat Corpus/Annotation.md
OCR
: *Optical Character Recognition* is the process by which software extracts
text, that is, a sequence of characters that a machine can understand and then
process by automatic means, from an image.
# Glossary {-}
OCR
: *Optical Character Recognition* is the process by which software extracts
text, that is, a sequence of characters that a machine can understand and then
process by automatic means, from an image.
OLR
: *Optical Layout Recognition*, the optical recognition of the layout of the
......
#!/bin/sh
[ -n "${HEADER_INCLUDED}" ] || source ./header.sh 2
echo '# Glossaire {-}'
cat Glossaire/OCR.md
cat Glossaire/OLR.md
## Outlining the contours of geography
### Establishing a correspondence
Empirically:
+ with TXM, Lexicoscope, etc., do we see the same properties
+ machine learning
### The hidden biography
## Extended Named Entities
State-of-the-art intro: annotating is costly @ortiz_suarez_data-driven_2022
### Work on GNNs
What did we learn from it?
# Identifying and problematising geography
## Relation between the spatial and the geographical
-> questioning the boundary itself
(structuring of geography)
## Outlining the contours of geography
### Establishing a correspondence
Empirically:
+ with TXM, Lexicoscope, etc., do we see the same properties
+ machine learning
### The hidden biography
## Variety of discourse genres within articles
## Relations between knowledge domains
### Classification errors
......@@ -735,11 +714,4 @@ differences we have underlined show that size alone cannot explain their
distribution in detail. The model does seem to identify some classes
more easily because of distinctive lexical patterns.
## Extended Named Entities
State-of-the-art intro: annotating is costly @ortiz_suarez_data-driven_2022
### Work on GNNs
What did we learn from it?
## Relation between the spatial and the geographical
-> questioning the boundary itself
(structuring of geography)
## Variety of discourse genres within articles