Commit 34300a87 authored by Alice Brenon

Fixed the begining it seems

parent 68acc9d8
......@@ -57,8 +57,8 @@ arts et des métiers* (hence *EDdA*) by @dalembert_dictionnaire_2022 [article
DICTIONNAIRE, volume 4] who opposes three kinds of dictionaries: one to define
*words*, the second to define *facts* and the last one to define *things*,
corresponding respectively to language, history, and science and arts
dictionaries. The first type corresponds to our modern dictionaries while the
two others are similar to what one expects to find in an encyclopedia.
dictionaries. The first type corresponds to modern dictionaries while the two
others are similar to what one expects to find in an encyclopedia.
However, d'Alembert himself doesn't think of these boundaries as absolute and he
hints at the extreme difficulty in merely defining words without going into
......@@ -67,8 +67,8 @@ semantics and philosophical considerations:
> un dictionnaire de langues, qui paroît n'être qu'un dictionnaire de mots, doit
> être souvent un dictionnaire de choses quand il est bien fait
(*a language dictionary, which appears to be only a word dictionary, must often
be a thing dictionary when it is made properly*). A similar criticism is made by
("a language dictionary, which appears to be only a word dictionary, must often
be a thing dictionary when it is made properly"). A similar criticism is made by
@haiman_dictionaries_1980 [p. 331] who attacks no less than six criteria on
which dictionaries and encyclopedias are generally opposed to reach the
conclusion that there is no distinction between them because "dictionaries *are*
......@@ -107,6 +107,11 @@ the possibilities and shortcomings of the TEI *dictionaries* module.
# Context of the study
To give a better understanding of this research, this section describes
the aims of the project from which it stems before giving a short history of the
term *encyclopedia* and underlining the known differences between dictionaries
and encyclopedias which constitute the starting point of this investigation.
## CollEx-Persée Project DISCO-LGE
The project
......@@ -116,13 +121,13 @@ Lettres et des Arts par une Société de savants et de gens de lettres* (hence
*LGE*), an encyclopedia published in France between 1885 and 1902 by an
organised team of over two hundred specialists divided into eleven sections.
This text comprises 31 tomes of about 1200 pages each and according to
@jacquet-pfau2015 [, pp. 88 et seq.] was the last major French encyclopedic
@jacquet-pfau2015 [pp. 88 et seq.] was the last major French encyclopedic
endeavour directly inheriting from the prestigious ancestor that was the *EDdA*
published by Diderot and d'Alembert 130 years earlier, between 1751 and 1772.
The aim of the project was to digitise and make *La Grande Encyclopédie*
available to the scientific community as well as the general public. A previous
version of this encyclopedia was partially available on Gallica
The aim of the project was to digitise and make *LGE* available to the
scientific community as well as the general public. A previous version of this
encyclopedia was partially available on Gallica
([https://gallica.bnf.fr/services/engine/search/sru?operation=searchRetrieve&collapsing=disabled&query=dc.relation%20all%20%22cb377013071%22](https://gallica.bnf.fr/services/engine/search/sru?operation=searchRetrieve&collapsing=disabled&query=dc.relation%20all%20%22cb377013071%22))
but lacked in quality and its text had not been fully extracted from the
pictures with an Optical Character Recognition (OCR) system. This prevented an
......@@ -130,20 +135,18 @@ exhaustive study of the text with textometry tools such as TXM [@heiden2010]. As
a prelude to project GEODE
([https://geode-project.github.io/](https://geode-project.github.io/)), the goal
of CollEx-Persée was to produce a digital version of *LGE* with a quality
comparable to the one of l'*Encyclopédie* provided by the ARTFL
comparable to the one of l'*EDdA* provided by the ARTFL
([http://artfl-project.uchicago.edu/](http://artfl-project.uchicago.edu/))
project in order to conduct a diachronic study of both encyclopedias.
## *Encyclopedia*
In common parlance, the terms "dictionaries" and "encyclopedias" are used as
near synonyms to refer to books compiling vast amounts of knowledge into lists
of definitions ordered alphabetically. Their similarity is even visible in the
way they are coordinated in the full title of the *Encyclopédie* which is
probably the most famous work of the genre and a symbol of the Age of
Enlightenment. If the word "encyclopedia" is nowadays part of everyday
vocabulary, it was much more unusual and in fact controversial when Diderot and
d'Alembert decided to use it in the title of their book.
If the word "encyclopedia" is now part of everyday vocabulary and has a slightly
different meaning from dictionary, it was much more unusual and in fact
controversial when Diderot and d'Alembert decided to use it in the title of
their book, while having to coordinate them both in the full title of the *EDdA*
which is probably the most famous work of the genre and a symbol of the Age of
Enlightenment.
The definition given by Furetière in his *Dictionnaire Universel* in 1690 is
still close to its Greek etymology: a "ring of all knowledges", from *κύκλος*,
......@@ -166,16 +169,16 @@ that intent, he quotes a poem from Pibrac encouraging people to specialise in
only one discipline lest they should not reach perfection, based on an
argumentation that resembles the saying "Jack of all trades, master of none". It
is all the more interesting that the definition remains unaltered until 1752,
one year after the publication of the first volume of the *Encyclopédie*. The
one year after the publication of the first volume of the *EDdA*. The
Jesuits who edited *Dictionnaire de Trevoux* frowned upon the project of the
*Encyclopédie* which they managed to get banned the same year by the Council of
*EDdA* which they managed to get banned the same year by the Council of
State on the charge of attempting to destroy the royal authority, inspiring
rebellion and corrupting morality in general. There is much more at stake than
words here, but the attempt to deprecate the word itself is part of their fight
against the philosophers of the Enlightenment.
The attacks do not remain ignored by Diderot who starts the very definition of
the word "Encyclopédie" in the *Encyclopédie* itself by a strong rebuttal. He
the word "Encyclopédie" in the *EDdA* itself by a strong rebuttal. He
directly dismisses the concerns expressed in the *Dictionnaire de Trevoux* as
mere self-doubt that their authors should not generalise to anyone, then leaves
the main point to a Latin quote by Chancellor Bacon [@lojkine2013, p. 5], who argues
......@@ -185,7 +188,7 @@ lifetime may be achieved by a common effort throughout generations.
History hints that Diderot's opponents took his defence of the feasibility of
the project quite seriously, considering the fact that they got the
*Encyclopédie*'s privileges to be revoked again six years after its publication
*EDdA*'s privileges to be revoked again six years after its publication
was resumed [@moureau2001]. As a consequence, the remaining ten volumes
containing the text of the articles had to be published illegally until 1765,
thanks to the secret protection of Malesherbes who — despite being head of royal
......@@ -215,29 +218,20 @@ If encyclopedias are thus historically more recent than dictionaries they also
depart from the latter on their approach. The purpose of dictionaries from their
origin is to collect words, to make an exhaustive inventory of the terms used in
a domain or in a language in order to associate a *definition* to them, be it a
translation in another language for a foreign language dictionary or a phrase
explaining it for other dictionaries. As such, they are collections of *signs*
and remain within the linguistic level of things. Entries in a dictionary often
feature information such as the part of speech, the pronunciation or the
etymology of the word they define.
# <FIXME
The entry for "Dictionnaire" in the *Encyclopédie* distinguishes between three
types of dictionaries: one to define *words*, the second to define *facts* and
the last one to define *things*, corresponding to the distinction between
language, history, and science and arts dictionaries although according to its
author, d'Alembert, each has to be of more than just one kind to be really good.
In the full title of the *Encyclopédie*, the concept is more or less equated by
means of the coordinating conjunction "ou" to a *Dictionnaire raisonné*,
"reasoned dictionary", introducing the idea of encyclopedias as dictionaries
with additional structure and a philosophical dimension.
# FIXME>
phrase explaining it or a translation in another language for a foreign language
dictionary. As such, they are collections of *signs* and are more concerned with
the linguistic level of things. Entries in a dictionary often feature
information such as the part of speech, the pronunciation or the etymology of
the word they define.
In the full title of the *EDdA*, the concept of encyclopedia is more or less
equated by means of the coordinating conjunction "ou" to a *Dictionnaire
raisonné*, "reasoned dictionary", introducing the idea that encyclopedias are
dictionaries with some additional structure and a philosophical dimension.
Back to the "Encyclopédie" article one can read that a dictionary remaining
strictly at the language level, a vocabulary, can be seen as the empty frame
required for an encyclopedic dictionary that will fill it with additional depth.
required for an encyclopedic dictionary which will fill it with additional depth.
Given how d'Alembert insists on the importance of brevity for a clear definition
in the "Dictionnaire de Langues" entry, it is clear that the *encyclopédistes*
did not consider encyclopedias superior to dictionaries but really as a new
......@@ -245,33 +239,20 @@ subgenre departing from them in terms of purpose.
# The *dictionaries* TEI module {#sec:dictionaries-module}
# <FIXME
The XML-TEI toolbox has a modular structure consisting of optional parts each
covering specific needs such as the physical features of a source document, the
transcription of oral corpora or particular requirements for textual domains
like poetry, or, in the case at hand, dictionaries. After describing why the dedicated
module was a natural candidate to consider, I formalise tools from graph
theory to browse the specifications of this guideline in a rational way and
explore this module in detail.
# FIXME>
One of the main motivations behind project DISCO-LGE was to produce data useful
to future scientific projects, which in particular requires it to be
*interoperable* and *reusable*. These are the last two key aspects of the FAIR
([https://www.go-fair.org/fair-principles/](https://www.go-fair.org/fair-principles/))
principles (*findability*, *accessibility*, *interoperability* and
*reusability*) which are important guidelines for efficient, high-quality
research. The XML-TEI guidelines provide tools to achieve this goal. This
section therefore starts by describing the existing toolset it provides, before
introducing some notations and tools from graph theory which will be used to
browse the guidelines in a systematic and thorough way in section
@sec:new-standard.
## A good starting point {#sec:starting-point}
Data produced in the context of a project such as DISCO-LGE cannot be useful to
future scientific projects unless it is *interoperable* and *reusable*. These
are the two last key aspects of the FAIR
([https://www.go-fair.org/fair-principles/](https://www.go-fair.org/fair-principles/)) principles (*findability*,
*accessibility*, *interoperability* and *reusability*) which I strive to follow
as a guideline for efficient and quality research.
# <FIXME
It entails using standard
formats and a standard for encoding historical texts in the context of digital
humanities is XML-TEI, collectively developed by the *Text Encoding Initiative*
consortium which publishes a set of technical specifications in the form of
XML schemas, along with a range of tools to handle them and training resources.
# FIXME>
The *dictionaries* module has been leveraged to encode dictionaries in projects
NENUFAR
([https://cahier.hypotheses.org/nenufar](https://cahier.hypotheses.org/nenufar))
......@@ -291,10 +272,10 @@ reasons, the encoding schemes used in these projects could not be reused
directly, prompting for a systematic exploration of the XML-TEI schema to devise
a new one.
This chapter discusses XML elements in depth and hence needs to name and
manipulate them. They will be represented in a monospace font, in the standard
XML autoclosing form within angle brackets and with a slash following the
element name like `<div/>` for a `div` element
This chapter discusses XML elements and hence needs to name and manipulate them.
They will be represented in a monospace font, in the standard XML autoclosing
form within angle brackets and with a slash following the element name like
`<div/>` for a `div` element
([https://tei-c.org/release/doc/tei-p5-doc/en/html/ref-div.html](https://tei-c.org/release/doc/tei-p5-doc/en/html/ref-div.html)).
This notation does not mean to imply that they cannot contain raw text or other
XML elements, it merely denotes such an element, without any additional
......@@ -312,6 +293,11 @@ browsing such an massive network can prove quite difficult as the number of
combinations sharply increases with each step.
The problem can be advantageously transformed by representing this network as a
graph to benefit from the results of graph theory. Classical, well-known methods
such as Dijkstra's algorithm [@dijkstra59] which computes the shortest path
between two nodes in a graph can then be applied
directed graph, using elements of XML-TEI as nodes and placing edges if the
destination node may be contained within the source node according to the
schema. Please note that the word "element" is here used with the same meaning
......@@ -492,9 +478,9 @@ Thus, despite a rather dense internal connectivity, the *dictionaries* module
fails to provide encoders with a device to represent recursively nesting
structures like `<div/>`.
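
The graph view used in this section can be illustrated with a minimal sketch. It assumes a tiny hand-written containment table, which is an illustrative toy subset and not the graph extracted from the actual XML-TEI schema; the names `CONTAINS`, `shortest_path` and `can_nest_in_itself` are hypothetical. Because every edge has the same weight here, a breadth-first search is used in place of Dijkstra's algorithm, which would return the same shortest paths in this unweighted setting.

```python
from collections import deque

# Toy containment graph: an edge u -> v means "v may appear directly inside u".
# Hand-written, deliberately incomplete subset for illustration only; it is NOT
# derived from the real TEI schema and omits most of its edges.
CONTAINS = {
    "body": ["div", "entry"],
    "div": ["div", "p"],
    "entry": ["form", "sense"],
    "form": ["orth"],
    "sense": ["def"],
    "p": [],
    "orth": [],
    "def": [],
}


def shortest_path(graph, source, target):
    """Return the shortest inclusion path from source to target, or None."""
    previous = {source: None}  # remembers how each node was first reached
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            path = []
            while node is not None:
                path.append(node)
                node = previous[node]
            return path[::-1]
        for child in graph.get(node, []):
            if child not in previous:
                previous[child] = node
                queue.append(child)
    return None


def can_nest_in_itself(graph, element):
    """True if the element can be reached again starting from its own children."""
    return any(
        child == element or shortest_path(graph, child, element) is not None
        for child in graph.get(element, [])
    )
```

On this toy table, `shortest_path(CONTAINS, "body", "def")` walks through `entry` and `sense`, and `can_nest_in_itself(CONTAINS, "div")` holds because of the self-loop on `div`. These results only describe the toy data; conclusions about the real module require the full graph built from the schema.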
# A new standard ?
# A new standard ? {#sec:new-standard}
Studying the content of *La Grande Encyclopédie* and considering several
Studying the content of *LGE* and considering several
articles in particular, one can identify structures which are specific to
encyclopedias and not compatible with the *dictionaries* module presented in the
previous section. It follows that this module is not able to encode arbitrary
......@@ -512,11 +498,11 @@ of the great variety in terms of editorial choices the most obvious can be
discussed.
The first immediately visible feature that sets encyclopedias apart from
dictionaries and can be found in the *Encyclopédie* as well as in *La Grande
Encyclopédie* is the presence of subject indicators at the beginning of articles
right after the headword which organise them into a domain classification
system. Those generally cover a broad range of subjects from scientific
disciplines to literature, and extending to political subjects and law.
dictionaries and can be found in the *EDdA* as well as in *LGE* is the presence
of subject indicators at the beginning of articles right after the headword
which organise them into a domain classification system. Those generally cover a
broad range of subjects from scientific disciplines to literature, and
extending to political subjects and law.
No element in the *dictionaries* module is explicitly designed for the purpose
of encoding these indicators. As section @sec:dictionaries-module illustrates,
......@@ -530,7 +516,7 @@ context in which the word can be used, as expected from the element's name,
"usage". In encyclopedias, if the domain indicator does in certain cases help to
distinguish between several entries sharing the same headword, the concept
itself has evolved beyond this mere distinction. Looking back at the
*Encyclopédie*, the adjective *raisonné* in the rest of the title directly
*EDdA*, the adjective *raisonné* in the rest of the title directly
introduces a notion of structure that links back to the "Systême figuré des
connoissances humaines" [@blanchard2002, p. 1] whose schematic structure is
shown in Figure @fig:systeme-figure. The authors have devised a branching system
......@@ -558,14 +544,14 @@ relevant.
Notwithstanding the correct way to represent domains of knowledge, their extent
itself raises concerns regarding the *dictionaries* module. Indeed, among the
vast collection of domains covered in encyclopedias in general and in *La Grande
Encyclopédie* in particular are historical articles and biographies. If the
notion of meaning can appear at least ill-fitting for a text describing a series
of historical events, one may still argue that it groups them into a concept and
associates it to the name of the event. But when it comes to relating the life
of a person, describing their relation to events and other persons comes out
even further from the notion of meaning. Entries such as the one about SANJO
Sanetomi (see Figure @fig:sanjo) do not constitute a *definition*.
vast collection of domains covered in encyclopedias in general and in *LGE* in
particular are historical articles and biographies. If the notion of meaning can
appear at least ill-fitting for a text describing a series of historical events,
one may still argue that it groups them into a concept and associates it to the
name of the event. But when it comes to relating the life of a person,
describing their relation to events and other persons comes out even further
from the notion of meaning. Entries such as the one about SANJO Sanetomi (see
Figure @fig:sanjo) do not constitute a *definition*.
![Beginning of the article relating the life of SANJO Sanetomi, in La Grande Encyclopédie, tome 29 ([BnF - Gallica](http://ark.bnf.fr/ark:/12148/cb41651490t))](ressources/sanjo_t29.png){#fig:sanjo}
......@@ -745,7 +731,7 @@ to the abstract objects that mathematics or poetry are).
For this reason, no particular encoding of the subject indicator is recommended
and it is left open to each particular context: they are often abbreviated so an
`<abbr/>` may apply; in *La Grande Encyclopédie*, biographies are not labeled by
`<abbr/>` may apply; in *LGE*, biographies are not labeled by
a knowledge domain but usually include the first name of the person when it is
known so in that case an element like `<persName/>` is still appropriate. This
choice applied to the same article "Cathète" produces Figure @fig:cathete-xml-1.
......@@ -812,8 +798,7 @@ prone to suffer damages or be misread by the OCR).
Finally there are other TEI elements useful to represent "events" in the flow of
the text, like the beginning of a new column of text or of a new page. Figure
@fig:alcala-photo shows the top left of the last page of the first tome of *La
Grande Encyclopédie* which features peritext elements while marking the
@fig:alcala-photo shows the top left of the last page of the first tome of *LGE* which features peritext elements while marking the
beginning of a new page. The usual appropriate elements (`<pb/>` for page
beginning, `<cb/>` for column beginning) may and should be used with this
encoding scheme as demonstrated by Figure @fig:alcala-xml.
......@@ -827,7 +812,7 @@ The reference implementation for this encoding scheme is the program soprano
developed within the scope of project DISCO-LGE to automatically identify
individual articles in the flow of raw text from the columns and to encode them
into XML-TEI files. Though this software has already been used to produce the
first TEI version of *La Grande Encyclopédie*, it does not yet perfectly follow
first TEI version of *LGE*, it does not yet perfectly follow
the specification described in this chapter. Figure @fig:cathete-xml-current
shows the encoded version of article "Cathète" it currently produces:
......@@ -860,8 +845,8 @@ by `soprano` when inferring the reading order before segmenting the articles.
## The constraints of automated processing
Encyclopedias are particularly long books, spanning numerous tomes and
containing several tens of thousands of articles. The *Encyclopédie* comprises
over 74k articles and *La Grande Encyclopédie* certainly more than 100k (the latest
containing several tens of thousands of articles. The *EDdA* comprises
over 74k articles and *LGE* certainly more than 100k (the latest
version produced by `soprano` created 160k articles, but their segmentation is
still not perfect and if some article beginnings remain undetected, all the very
long and deeply-structured articles are unduly split into many parts, resulting
......@@ -886,11 +871,11 @@ to the main principle on which this scheme is based. Actually, the process of
linking from an article to another one is so frequent (in dictionaries as well
as in encyclopedias) that it generally escapes the scope of regular discourse to
take a special and often fixed form, inside parentheses and after a special
token which invites the reader to perform the redirection. In *La Grande
Encyclopédie*, virtually all the redirections appear within parentheses (at
least no counter-example has been found within the scope of the project), and
start with the verb "voir" abbreviated as a single, capital "V." as illustrated
in the article "Gelocus" (see again Figure @fig:gelocus-photo).
token which invites the reader to perform the redirection. In *LGE*, virtually
all the redirections appear within parentheses (at least no counter-example has
been found within the scope of the project), and start with the verb "voir"
abbreviated as a single, capital "V." as illustrated in the article "Gelocus"
(see again Figure @fig:gelocus-photo).
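
The fixed form of these redirections lends itself to pattern matching. The following is a minimal sketch, assuming that a redirection is exactly a parenthesised group starting with a capital "V." followed by the target; the pattern, the function name `find_redirections` and the sample sentence are hypothetical and are not taken from `soprano` or from the text of *LGE*.

```python
import re

# Assumed shape of a redirection: "(", optional spaces, a capital "V.",
# at least one space, then the target up to the closing parenthesis.
REDIRECTION = re.compile(r"\(\s*V\.\s+([^()]+?)\s*\)")


def find_redirections(text):
    """Return the targets of all '(V. ...)' redirections found in text."""
    return [match.group(1) for match in REDIRECTION.finditer(text)]


# Invented sample sentence, for illustration only.
sample = "Texte d'un article (V. Exemple). Autre remarque (voir plus haut)."
targets = find_redirections(sample)
```

With the invented sample, only the first parenthesised group is reported, since the second one does not start with the abbreviated "V.". Real pages would additionally require handling OCR noise, which this sketch ignores.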
Although this has not been implemented yet either, being able to detect and
exploit those patterns to correctly encode cross-references does not pose any
......@@ -913,7 +898,7 @@ of the following article. This allows the amount of memory required to remain
reasonable and even lets them be parallelised on most modern machines. Thus,
even taking over three minutes per tome, the total processing time can be
lowered to around forty minutes on a machine with 16 GB of RAM for the whole of
*La Grande Encyclopédie* instead of over one hour and a half.
*LGE* instead of over one hour and a half.
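
These orders of magnitude can be checked with a rough computation. The sketch below assumes exactly three minutes per tome, which is a lower bound since the text says "over three minutes", and reuses the 31 tomes mentioned in the description of the project; the constant and function names are illustrative and not part of `soprano`.

```python
TOMES = 31            # number of tomes, as stated in the project description
MINUTES_PER_TOME = 3  # lower bound: the text says "over three minutes"
PARALLEL_TOTAL = 40   # approximate total in minutes when runs are parallelised


def sequential_minutes(tomes=TOMES, per_tome=MINUTES_PER_TOME):
    """Total time in minutes if tomes are processed one after the other."""
    return tomes * per_tome


def effective_speedup(parallel_total=PARALLEL_TOTAL):
    """Ratio between the sequential estimate and the parallel total."""
    return sequential_minutes() / parallel_total
```

Under these assumptions the sequential estimate is 93 minutes, consistent with "over one hour and a half", and the effective speed-up is a little above 2.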
## Comparison to other approaches
......