Fixed the beginning it seems

Commit 34300a87 authored 1 year ago by Alice Brenon
--- a/ICHLL_Brenon.md
+++ b/ICHLL_Brenon.md
@@ -57,8 +57,8 @@ arts et des métiers* (hence *EDdA*) by @dalembert_dictionnaire_2022 [article
 DICTIONNAIRE, volume 4] who opposes three kinds of dictionaries: one to define
 *words*, the second to define *facts* and the last one to define *things*,
 corresponding respectively to language, history, and science and arts
-dictionaries. The first type corresponds to our modern dictionaries while the
-two others are similar to what one expects to find in an encyclopedia.
+dictionaries. The first type corresponds to modern dictionaries while the two
+others are similar to what one expects to find in an encyclopedia.

 However, d'Alembert himself doesn't think of these boundaries as absolute and he
 hints at the extreme difficulty in merely defining words without going into
@@ -67,8 +67,8 @@ semantics and philosophical considerations:
 > un dictionnaire de langues, qui paroît n'être qu'un dictionnaire de mots, doit
 > être souvent un dictionnaire de choses quand il est bien fait

-(*a language dictionary, which appears to be only a word dictionary, must often
-be a thing dictionary when it is made properly*). A similar criticism is made by
+("a language dictionary, which appears to be only a word dictionary, must often
+be a thing dictionary when it is made properly"). A similar criticism is made by
 @haiman_dictionaries_1980 [p. 331] who attacks no less than six criteria on
 which dictionaries and encyclopedias are generally opposed to reach the
 conclusion that there is no distinction between them because "dictionaries *are*
@@ -107,6 +107,11 @@ the possibilities and shortcomings of the TEI *dictionaries* module.

 # Context of the study

+To give a better understanding of this research, this section describes
+the aims of the project from which it stems before giving a short history of the
+term *encyclopedia* and underlining the known differences between dictionaries
+and encyclopedias which constitute the starting point of this investigation.
+
 ## CollEx-Persée Project DISCO-LGE

 The project
@@ -116,13 +121,13 @@ Lettres et des Arts par une Société de savants et de gens de lettres* (hence
 *LGE*), an encyclopedia published in France between 1885 and 1902 by an
 organised team of over two hundred specialists divided into eleven sections.
 This text comprises 31 tomes of about 1200 pages each and according to
-@jacquet-pfau2015 [, pp. 88 et seq.] was the last major french encyclopedic
+@jacquet-pfau2015 [pp. 88 et seq.] was the last major French encyclopedic
 endeavour directly inheriting from the prestigious ancestor that was the *EDdA*
 published by Diderot and d'Alembert 130 years earlier, between 1751 and 1772.

-The aim of the project was to digitise and make *La Grande Encyclopédie*
-available to the scientific community as well as the general public. A previous
-version of this encyclopedia was partially available on Gallica
+The aim of the project was to digitise and make *LGE* available to the
+scientific community as well as the general public. A previous version of this
+encyclopedia was partially available on Gallica
 ([https://gallica.bnf.fr/services/engine/search/sru?operation=searchRetrieve&collapsing=disabled&query=dc.relation%20all%20%22cb377013071%22](https://gallica.bnf.fr/services/engine/search/sru?operation=searchRetrieve&collapsing=disabled&query=dc.relation%20all%20%22cb377013071%22))
 but lacked in quality and its text had not been fully extracted from the
 pictures with an Optical Character Recognition (OCR) system. This prevented an
@@ -130,20 +135,18 @@ exhaustive study of the text with textometry tools such as TXM [@heiden2010]. As
 a prelude to project GEODE
 ([https://geode-project.github.io/](https://geode-project.github.io/)), the goal
 of CollEx-Persée was to produce a digital version of *LGE* with a quality
-comparable to the one of l'*Encyclopédie* provided by the ARTFL
+comparable to that of the *EDdA* provided by the ARTFL
 ([http://artfl-project.uchicago.edu/](http://artfl-project.uchicago.edu/))
 project in order to conduct a diachronic study of both encyclopedias.

 ## *Encyclopedia*

-In common parlance, the terms "dictionaries" and "encyclopedias" are used as
-near synonyms to refer to books compiling vast amounts of knowledge into lists
-of definitions ordered alphabetically. Their similarity is even visible in the
-way they are coordinated in the full title of the *Encyclopédie* which is
-probably the most famous work of the genre and a symbol of the Age of
-Enlightenment. If the word "encyclopedia" is nowadays part of everyday
-vocabulary, it was much more unusual and in fact controversial when Diderot and
-d'Alembert decided to use it in the title of their book.
+Although the word "encyclopedia" is now part of everyday vocabulary and has a
+slightly different meaning from "dictionary", it was much more unusual and in
+fact controversial when Diderot and d'Alembert decided to use it in the title of
+their book, whose full title coordinates both terms. The *EDdA* is probably
+the most famous work of the genre and a symbol of the Age of
+Enlightenment.

 The definition given by Furetière in his *Dictionnaire Universel* in 1690 is
 still close to its Greek etymology: a "ring of all knowledges", from *κύκλος*,
@@ -166,16 +169,16 @@ that intent, he quotes a poem from Pibrac encouraging people to specialise in
 only one discipline lest they should not reach perfection, based on an
 argumentation that resembles the saying "Jack of all trades, master of none". It
 is all the more interesting that the definition remains unaltered until 1752,
-one year after the publication of the first volume of the *Encyclopédie*. The
+one year after the publication of the first volume of the *EDdA*. The
 Jesuits who edited *Dictionnaire de Trevoux* frowned upon the project of the
-*Encyclopédie* which they managed to get banned the same year by the Council of
+*EDdA* which they managed to get banned the same year by the Council of
 State on the charge of attempting to destroy the royal authority, inspiring
 rebellion and corrupting morality in general. There is much more at stake than
 words here, but the attempt to deprecate the word itself is part of their fight
 against the philosophers of the Enlightenment.

 The attacks do not remain ignored by Diderot who starts the very definition of
-the word "Encyclopédie" in the *Encyclopédie* itself by a strong rebuttal. He
+the word "Encyclopédie" in the *EDdA* itself with a strong rebuttal. He
 directly dismisses the concerns expressed in the *Dictionnaire de Trevoux* as
 mere self-doubt that their authors should not generalise to anyone, then leaves
 the main point to a Latin quote by Chancellor Bacon [@lojkine2013, p. 5], who argues
@@ -185,7 +188,7 @@ lifetime may be achieved by a common effort throughout generations.

 History hints that Diderot's opponents took his defence of the feasibility of
 the project quite seriously, considering the fact that they got the
-*Encyclopédie*'s privileges to be revoked again six years after its publication
+*EDdA*'s privileges to be revoked again six years after its publication
 was resumed [@moureau2001]. As a consequence, the remaining ten volumes
 containing the text of the articles had to be published illegally until 1765,
 thanks to the secret protection of Malesherbes who — despite being head of royal
@@ -215,29 +218,20 @@ If encyclopedias are thus historically more recent than dictionaries they also
 depart from the latter on their approach. The purpose of dictionaries from their
 origin is to collect words, to make an exhaustive inventory of the terms used in
 a domain or in a language in order to associate a *definition* to them, be it a
-translation in another language for a foreign language dictionary or a phrase
-explaining it for other dictionaries. As such, they are collections of *signs*
-and remain within the linguistic level of things. Entries in a dictionary often
-feature information such as the part of speech, the pronunciation or the
-etymology of the word they define.
-
-# <FIXME
-
-The entry for "Dictionnaire" in the *Encyclopédie* distinguishes between three
-types of dictionaries: one to define *words*, the second to define *facts* and
-the last one to define *things*, corresponding to the distinction between
-language, history, and science and arts dictionaries although according to its
-author, d'Alembert, each has to be of more than just one kind to be really good.
-In the full title of the *Encyclopédie*, the concept is more or less equated by
-means of the coordinating conjunction "ou" to a *Dictionnaire raisonné*,
-"reasoned dictionary", introducing the idea of encyclopedias as dictionaries
-with additional structure and a philosophical dimension.
-
-# FIXME>
+phrase explaining it or a translation in another language for a foreign language
+dictionary. As such, they are collections of *signs* and are more concerned with
+the linguistic level of things. Entries in a dictionary often feature
+information such as the part of speech, the pronunciation or the etymology of
+the word they define.
+
+In the full title of the *EDdA*, the concept of encyclopedia is more or less
+equated by means of the coordinating conjunction "ou" to a *Dictionnaire
+raisonné*, "reasoned dictionary", introducing the idea that encyclopedias are
+dictionaries with some additional structure and a philosophical dimension.

 Back to the "Encyclopédie" article one can read that a dictionary remaining
 strictly at the language level, a vocabulary, can be seen as the empty frame
-required for an encyclopedic dictionary that will fill it with additional depth.
+required for an encyclopedic dictionary which will fill it with additional depth.
 Given how d'Alembert insists on the importance of brevity for a clear definition
 in the "Dictionnaire de Langues" entry, it is clear that the *encyclopédistes*
 did not consider encyclopedias superior to dictionaries but really as a new
@@ -245,33 +239,20 @@ subgenre departing from them in terms of purpose.

 # The *dictionaries* TEI module {#sec:dictionaries-module}

-# <FIXME
-The XML-TEI toolbox has a modular structure consisting of optional parts each
-covering specific needs such as the physical features of a source document, the
-transcription of oral corpora or particular requirements for textual domains
-like poetry, or, in the case at hand, dictionaries. After describing why the dedicated
-module was a natural candidate to consider, I formalise tools from graph
-theory to browse the specifications of this guideline in a rational way and
-explore this module in detail.
-# FIXME>
+One of the main motivations behind project DISCO-LGE was to produce data useful
+to future scientific projects, which in particular requires it to be
+*interoperable* and *reusable*. These are the last two key aspects of the FAIR
+([https://www.go-fair.org/fair-principles/](https://www.go-fair.org/fair-principles/))
+principles (*findability*, *accessibility*, *interoperability* and
+*reusability*) which are important guidelines for efficient, high-quality
+research. The XML-TEI guidelines provide tools to achieve this goal. This
+section therefore starts by describing the existing toolset they provide, before
+introducing some notations and tools from graph theory which will be used to
+browse the guidelines in a systematic and thorough way in section
+@sec:new-standard.

 ## A good starting point {#sec:starting-point}

-Data produced in the context of a project such as DISCO-LGE cannot be useful to
-future scientific projects unless it is *interoperable* and *reusable*. These
-are the two last key aspects of the FAIR
-([https://www.go-fair.org/fair-principles/](https://www.go-fair.org/fair-principles/)) principles (*findability*,
-*accessibility*, *interoperability* and *reusability*) which I strive to follow
-as a guideline for efficient and quality research.
-
-# <FIXME
-It entails using standard
-formats and a standard for encoding historical texts in the context of digital
-humanities is XML-TEI, collectively developped by the *Text Encoding Initiative*
-consortium which publishes a set of technical specifications under the form of
-XML schemas, along with a range of tools to handle them and training resources.
-# FIXME>
-
 The *dictionaries* module has been leveraged to encode dictionaries in projects
 NENUFAR
 ([https://cahier.hypotheses.org/nenufar](https://cahier.hypotheses.org/nenufar))
@@ -291,10 +272,10 @@ reasons, the encoding schemes used in these projects could not be reused
 directly, prompting for a systematic exploration of the XML-TEI schema to devise
 a new one.

-This chapter discusses XML elements in depth and hence needs to name and
-manipulate them. They will be represented in a monospace font, in the standard
-XML autoclosing form within angle brackets and with a slash following the
-element name like `<div/>` for a `div` element
+This chapter discusses XML elements and hence needs to name and manipulate them.
+They will be represented in a monospace font, in the standard XML autoclosing
+form within angle brackets and with a slash following the element name like
+`<div/>` for a `div` element
 ([https://tei-c.org/release/doc/tei-p5-doc/en/html/ref-div.html](https://tei-c.org/release/doc/tei-p5-doc/en/html/ref-div.html)).
 This notation does not mean to imply that they cannot contain raw text or other
 XML elements, it merely denotes such an element, without any additional
@@ -312,6 +293,11 @@ browsing such an massive network can prove quite difficult as the number of
 combinations sharply increases with each step.

 The problem can be advantageously transformed by representing this network as a
+graph to benefit from the results of graph theory. Classical, well-known methods
+such as Dijkstra's algorithm [@dĳkstra59] which computes the shortest path
+between two nodes in a graph can then be applied 
+
+
 directed graph, using elements of XML-TEI as nodes and placing edges if the
 destination node may be contained within the source node according to the
 schema. Please note that the word "element" is here used with the same meaning
@@ -492,9 +478,9 @@ Thus, despite a rather dense internal connectivity, the *dictionaries* module
 fails to provide encoders with a device to represent recursively nesting
 structures like `<div/>`.

-# A new standard ?
+# A new standard? {#sec:new-standard}

-Studying the content of *La Grande Encyclopédie* and considering several
+Studying the content of *LGE* and considering several
 articles in particular, one can identify structures which are specific to
 encyclopedias and not compatible with the *dictionaries* module presented in the
 previous section. It follows that this module is not able to encode arbitrary
@@ -512,11 +498,11 @@ of the great variety in terms of editorial choices the most obvious can be
 discussed.

 The first immediately visible feature that sets encyclopedias apart from
-dictionaries and can be found in the *Encyclopédie* as well as in *La Grande
-Encyclopédie* is the presence of subject indicators at the beginning of articles
-right after the headword which organise them into a domain classification
-system. Those generally cover a broad range of subjects from scientific
-disciplines to litterature, and extending to political subjects and law.
+dictionaries and can be found in the *EDdA* as well as in *LGE* is the presence
+of subject indicators at the beginning of articles right after the headword
+which organise them into a domain classification system. Those generally cover a
+broad range of subjects from scientific disciplines to literature, and
+extending to political subjects and law.

 No element in the *dictionaries* module is explicitly designed for the purpose
 of encoding these indicators. As section @sec:dictionaries-module illustrates,
@@ -530,7 +516,7 @@ context in which the word can be used, as expected from the element's name,
 "usage". In encyclopedias, if the domain indicator does in certain cases help to
 distinguish between several entries sharing the same headword, the concept
 itself has evolved beyond this mere distinction. Looking back at the
-*Encyclopédie*, the adjective *raisonné* in the rest of the title directly
+*EDdA*, the adjective *raisonné* in the rest of the title directly
 introduces a notion of structure that links back to the "Systême figuré des
 connoissances humaines" [@blanchard2002, p. 1] whose schematic structure is
 shown in Figure @fig:systeme-figure. The authors have devised a branching system
@@ -558,14 +544,14 @@ relevant.

 Notwithstanding the correct way to represent domains of knowledge, their extent
 itself raises concerns regarding the *dictionaries* module. Indeed, among the
-vast collection of domains covered in encyclopedias in general and in *La Grande
-Encyclopédie* in particular are historical articles and biographies. If the
-notion of meaning can appear at least ill-fitting for a text describing a series
-of historical events, one may still argue that it groups them into a concept and
-associates it to the name of the event. But when it comes to relating the life
-of a person, describing their relation to events and other persons comes out
-even further from the notion of meaning. Entries such as the one about SANJO
-Sanetomi (see Figure @fig:sanjo) do not constitute a *definition*.
+vast collection of domains covered in encyclopedias in general and in *LGE* in
+particular are historical articles and biographies. If the notion of meaning can
+appear at least ill-fitting for a text describing a series of historical events,
+one may still argue that it groups them into a concept and associates it with the
+name of the event. But when it comes to relating the life of a person,
+describing their relation to events and other persons strays even further
+from the notion of meaning. Entries such as the one about SANJO Sanetomi (see
+Figure @fig:sanjo) do not constitute a *definition*.

 ![Beginning of the article relating the life of SANJO Sanetomi, in La Grande Encyclopédie, tome 29 ([BnF - Gallica](http://ark.bnf.fr/ark:/12148/cb41651490t))](ressources/sanjo_t29.png){#fig:sanjo}

@@ -745,7 +731,7 @@ to the abstract objects that mathematics or poetry are).

 For this reason, no particular encoding of the subject indicator is recommended
 and it is left open to each particular context: they are often abbreviated so an
-`<abbr/>` may apply, in *La Grande Encyclopédie*, biographies are not labeled by
+`<abbr/>` may apply; in *LGE*, biographies are not labeled by
 a knowledge domain but usually include the first name of the person when it is
 known so in that case an element like `<persName/>` is still appropriate. This
 choice applied to the same article "Cathète" produces Figure @fig:cathete-xml-1.
@@ -812,8 +798,7 @@ prone to suffer damages or be misread by the OCR).

 Finally there are other TEI elements useful to represent "events" in the flow of
 the text, like the beginning of a new column of text or of a new page. Figure
-@fig:alcala-photo shows the top left of the last page of the first tome of *La
-Grande Encyclopédie* which features peritext elements while marking the
+@fig:alcala-photo shows the top left of the last page of the first tome of *LGE* which features peritext elements while marking the
 beginning of a new page. The usual appropriate elements (`<pb/>` for page
 beginning, `<cb/>` for column beginning) may and should be used with this
 encoding scheme as demonstrated by Figure @fig:alcala-xml.
@@ -827,7 +812,7 @@ The reference implementation for this encoding scheme is the program soprano
 developed within the scope of project DISCO-LGE to automatically identify
 individual articles in the flow of raw text from the columns and to encode them
 into XML-TEI files. Though this software has already been used to produce the
-first TEI version of *La Grande Encyclopédie*, it does not follow perfectly yet
+first TEI version of *LGE*, it does not yet perfectly follow
 the specification described in this chapter. Figure @fig:cathete-xml-current
 shows the encoded version of article "Cathète" it currently produces:

@@ -860,8 +845,8 @@ by `soprano` when inferring the reading order before segmenting the articles.
 ## The constraints of automated processing

 Encyclopedias are particularly long books, spanning numerous tomes and
-containing several tenths of thousands of articles. The *Encyclopédie* comprises
-over 74k articles and *La Grande Encyclopédie* certainly more than 100k (the latest
+containing several tens of thousands of articles. The *EDdA* comprises
+over 74k articles and *LGE* certainly more than 100k (the latest
 version produced by `soprano` created 160k articles, but their segmentation is
 still not perfect and if some article beginnings remain undetected, all the very
 long and deeply-structured articles are unduly split into many parts, resulting
@@ -886,11 +871,11 @@ to the main principle on which this scheme is based. Actually, the process of
 linking from an article to another one is so frequent (in dictionaries as well
 as in encyclopedias) that it generally escapes the scope of regular discourse to
 take a special and often fixed form, inside parentheses and after a special
-token which invites the reader to perform the redirection. In *La Grande
-Encyclopédie*, virtually all the redirections appear within parenthesis (at
-least no counter-example has been found within the scope of the project), and
-start with the verb "voir" abbreviated as a single, capital "V." as illustrated
-in the article "Gelocus" (see again Figure @fig:gelocus-photo).
+token which invites the reader to perform the redirection. In *LGE*, virtually
+all the redirections appear within parentheses (at least no counter-example has
+been found within the scope of the project), and start with the verb "voir"
+abbreviated as a single, capital "V." as illustrated in the article "Gelocus"
+(see again Figure @fig:gelocus-photo).

 Although this has not been implemented yet either, being able to detect and
 exploit those patterns to correctly encode cross-references does not pose any
@@ -913,7 +898,7 @@ of the following article. This allows the amount of memory required to remain
 reasonable and even lets them be parallelised on most modern machines. Thus,
 even taking over three minutes per tome, the total processing time can be
 lowered to around forty minutes on a machine with 16 GB of RAM for the whole of
-*La Grande Encyclopédie* instead of over one hour and a half.
+*LGE* instead of over one hour and a half.

 ## Comparison to other approaches