Fix typos and other mistakes

Commit 8c91b0a6 authored 3 years ago by Alice Brenon
--- a/ICHLL_Brenon.md
+++ b/ICHLL_Brenon.md
@@ -109,7 +109,7 @@ against the philosophers of the Enlightenment.
 The attacks do not remain ignored by Diderot who starts the very definition of
 the word "Encyclopédie" in the *Encyclopédie* itself by a strong rebuttal. He
 directly dismisses the concerns expressed in the *Dictionnaire de Trevoux* as
-mere self-doubt that their authors shouldn't generalise to mankind, then leaves
+mere self-doubt that their authors should not generalise to mankind, then leaves
 the main point to a latin quote by chancelor Bacon, who argues that a
 collaborative work can achieve much more than any talented man could: what could
 possibly not be within reach of a single man, within a single lifetime may be
@@ -117,14 +117,14 @@ achieved by a common effort throughout generations.

 History hints that Diderot's opponents took his defense of the feasability of
 the project quite seriously, considering the fact that they got the
-*Encyclopédie*'s priviledges to be revoked again six years after its publication
+*Encyclopédie*'s privileges to be revoked again six years after its publication
 was resumed. As a consequence, the remaining ten volumes containing the text of
 the articles had to be published illegally until 1765, thanks to the secret
 protection of Malesherbes who — despite being head of royal censorship — saved
 the manuscripts from destruction. They were printed secretly outside of Paris
 and the books were (falsely) labeled as coming from Neufchâtel. Following the
 high demand from the booksellers who feared they would lose the money they had
-invested in the project, a special priviledge was issued for the volumes
+invested in the project, a special privilege was issued for the volumes
 containing the plates, which were released publicly from 1762 to 1772.

 In any case, in their last edition in 1771 the authors of the *Dictionnaire de
@@ -143,14 +143,14 @@ knowledge itself.

 ## A different approach

-If encyclopedia are thus historically more recent than dictionaries they also
+If encyclopedias are thus historically more recent than dictionaries they also
 depart from the latter on their approach. The purpose of dictionaries from their
-origin is to collect words, to make an exhaustive inventory of the terms
-used in a domain or in a language in order to associate a *definition* to them,
-be it a translation in another language for a foreign language dictionary or a
-phrase explaining it for other dictionaries. As such, they are collections of
-*signs* and remain within the linguistic level of things. Entries in a dictionary
-often feature information such as the part of speech, the pronunciation or the
+origin is to collect words, to make an exhaustive inventory of the terms used in
+a domain or in a language in order to associate a *definition* to them, be it a
+translation in another language for a foreign language dictionary or a phrase
+explaining it for other dictionaries. As such, they are collections of *signs*
+and remain within the linguistic level of things. Entries in a dictionary often
+feature information such as the part of speech, the pronunciation or the
 etymology of the word they define.

 The entry for "Dictionnaire" in the *Encyclopédie* distinguishes between three
@@ -180,12 +180,12 @@ These are the two last key aspects of the FAIR[^FAIR] principles (*findability*,
 as a guideline for efficient and quality research. It entails using standard
 formats and a standard for encoding historical texts in the context of digital
 humanities is XML-TEI, collectively developped by the *Text Encoding Initiative*
-consortium.  It consists in a set of technical specifications under the form of
+consortium. It publishes a set of technical specifications in the form of
 XML schemas, along with a range of tools to handle them and training resources.

 [^FAIR]: [https://www.go-fair.org/fair-principles/](https://www.go-fair.org/fair-principles/)

-The XML-TEI standard has a modular structure consisting in optional parts each
+The XML-TEI standard has a modular structure consisting of optional parts each
 covering specific needs such as the physical features of a source document, the
 transcription of oral corpora or particular requirements for textual domains
 like poetry, or, in our case, dictionaries.
@@ -239,8 +239,8 @@ the graph (that is an edge from a node to itself) as can be illustrated by the
 another one.

 The generalisation of this to inclusion paths of any length greater than one is
-usually called a cycle in and we may be tempted in our context to refine this
-name them to *inclusion cycles*.  The `<address/>` element provides us with an
+usually called a cycle and we may be tempted in our context to refine this and
+name them *inclusion cycles*. The `<address/>` element provides us with an
 example for this configuration: although an `<address/>` element may not
 directly contain another one, it may contain a `<geogName/>` which, in turn, may
 contain a new `<address/>` element.  From a graph theory perspective, we can say
@@ -261,7 +261,7 @@ between two nodes that a human specialist of the TEI framework could build.
 This is still very useful when taking into account the fact that TEI modules are
 merely "bags" to group the elements and provide hints to human encoders about
 the tools they might need but have no implication on the inclusion paths between
-element which cross module boundaries freely. The general graph formalism
+elements which cross module boundaries freely. The general graph formalism
 enables us to describe complex filtering patterns and to implement queries to
 look for them among the elements exhaustively by algorithmic means even when the
 shortest-path approach is not enough.
@@ -315,10 +315,10 @@ represent features such as

 - its written and spoken forms: `<form/>`
 - a group of grammatical information: `<gramGrp/>`, that may itself contain as
-  we've seen above `<case/>`, `<gen/>`, `<number/>` or `<pers/>` to describe the
-  form itself for instance, but also information about the categories it belongs
-  to like `<iType/>` for its inflection class in languages with a declension
-  system or `<pos/>` for its part-of-speech
+  previously demonstrated `<case/>`, `<gen/>`, `<number/>` or `<pers/>` to
+  describe the form itself for instance, but also information about the
+  categories it belongs to like `<iType/>` for its inflection class in languages
+  with a declension system or `<pos/>` for its part-of-speech
 - its etymology: `<etym/>`
 - its variants if there is a different spelling in a variety of the language or
  if it has changed through time: `<usg/>` (though it is not its only purpose)
@@ -389,7 +389,7 @@ all the paths from either `<entry/>` or `<sense/>` elements to the latter of
 length shorter or equal to 5 by a systematic traversal of the graph yields
 exclusively paths (respectively 9042 and 39093 of them) containing either a
 `<floatingText/>` or an `<app/>` element. The first one, as its name aptly
-suggests, is used to encode text that doesn't quite fit the regular flow of the
+suggests, is used to encode text that does not quite fit the regular flow of the
 document, as for example in the context of an embedded narrative. Both examples
 displayed in the online documentation feature a `<body/>` as direct child of
 `<floatingText/>`, neatly separating its content as independent. The purpose of
@@ -424,8 +424,8 @@ the most obvious.
 ### Organised knowledge

 The first immediately visible feature that sets encyclopedias apart from
-dictionaris can be found in the *Encyclopédie* as well in *La Grande
-Encyclopédie* is the presence of subject indicators at the begining of articles
+dictionaries and can be found in the *Encyclopédie* as well as in *La Grande
+Encyclopédie* is the presence of subject indicators at the beginning of articles
 right after the headword which organise them into a domain classification
 system. Those generally cover a broad range of subjects from scientific
 disciplines to litterature, and extending to political subjects and law.
@@ -438,14 +438,14 @@ tool for what we need is found in the `<usg/>` element used with a specific
 documentation encode subject indicators very similar to the ones found in
 encyclopedias within this element, but the match is not perfect either: all
 appear within one of multiple senses, as if to clarify each context in which the
-word can be used, as expected from the element's name, "usage". In encyclopedia,
+word can be used, as expected from the element's name, "usage". In encyclopedias,
 if the domain indicator does in certain cases help to distinguish between
 several entries sharing the same headword, the concept itself has evolved beyond
 this mere distinction. Looking back at the *Encyclopédie*, the adjective
 *raisonné* in the rest of the title directly introduces a notion of structure
 that links back to the "Systême figuré des connoissances humaines". The authors
 have devised a branching system to classify all knowledge, and the occurrence at
-the begining of articles, more than a tool to clear up possible ambiguities also
+the beginning of articles, more than a tool to clear up possible ambiguities also
 points the reader to the correct place in this mind map.

 !["Systême figuré des connoissances humaines", the taxonomy at the heart of the Encyclopédie](ressources/arbre.png){width=200px}
@@ -455,8 +455,8 @@ module. The `<domain/>` element despite its name belongs exclusively in the
 header of a document and focuses on the social context of the text, not on the
 knowledge area it covers. The `<interp/>` despite its name is not so much about
 labeling something as an interpretation to give to a context (which subject
-indicators could be if you consider that, placed at the begining, they are used
-to orient the mind frame of the readers towards a particular subject). However,
+indicators could be if you consider that, placed at the beginning, they are used
+to direct the frame of mind of the readers towards a particular subject). However,
 the documentation clearly demonstrates it as a tool for annotators of a
 document, which text content is not part of the original document but some
 additional result of an analysis performed in the context of the encoding, used
@@ -518,7 +518,7 @@ The nested structure that we have just evidenced demands of course a nesting
 structure to accomodate it. More precisely it guides our search of XML elements
 by giving us several constraints: we are looking for a pair of elements, the
 first representing a (sub)section must be able to include both itself and the
-second element, which doesn't have any special constraint in addition to the one
+second element, which does not have any special constraint in addition to the one
 it shares with the first, which is to have a semantics compatible with our
 purpose. In addition, the first element must be able to contain several `<p/>`
 elements, `<p/>` being the reference element to encode paragraphs according to
@@ -647,20 +647,20 @@ For this reason, we do not recommend any special encoding of the subject
 indicator but leave it open to each particular context: they are often
 abbreviated so an `<abbr/>` may apply, in *La Grande Encyclopédie*, biographies
 are not labeled by a knowledge domain but usually include the first name of the
-person when it is known so in that case a element like `<persName/>` is still
+person when it is known so in that case an element like `<persName/>` is still
 appropriate.

 ![](snippets/cathète_1.png)

-We propose to then wrap each different meaning in a separate `<div/>` with the
-`type` attribute set to `sense` to refer to the `<sense/>` element that would've
-been used within the *core* module. Each sense should be numbered with the `n`
-attribute.
+We then propose to wrap each different meaning in a separate `<div/>` with the
+`type` attribute set to `sense` to refer to the `<sense/>` element that would
+have been used within the *core* module. Each sense should be numbered with the
+`n` attribute.

 ![](snippets/cathète_2.png)

 In addition, each line within the article must start with a `<lb/>` to mark its
-begining including before the `<head/>` element, which, although a surprising
+beginning including before the `<head/>` element, which, although a surprising
 setup, underlines the fact that in the dense layout of encyclopedias, the
 carriage return separating two articles is meaningful. Stating each new line
 explicitly keeps enough information to reconstruct a faithful facsimile but it
@@ -709,8 +709,8 @@ recognised (those short elements on the border of pages are the ones typically
 prone to suffer damages or be misread by the OCR).

 Finally there are other TEI elements useful to represent "events" in the flow of
-the text, like the begining of a new column of text or of a new page. The usual
-appropriate elements (`<pb/>` for page begining, `<cb/>` for column begining)
+the text, like the beginning of a new column of text or of a new page. The usual
+appropriate elements (`<pb/>` for page beginning, `<cb/>` for column beginning)
 may and should be used with our encoding scheme.

 ![La Grande Encyclopédie, tome 1, article "Alcala-de-Hénarès"](ressources/last_page_top_left_t1.png){width=350px}
@@ -724,8 +724,8 @@ soprano[^soprano] developed within the scope of project DISCO-LGE to
 automatically identify individual articles in the flow of raw text from the
 columns and to encode them into XML-TEI files. Though this software has already
 been used to produce the first TEI version of *La Grande Encyclopédie*, it
-doesn't yet follow the above specification perfectly. Here is for instance the
-encoded version of article "Cathète" currently it produces:
+does not yet follow the above specification perfectly. Here is for instance the
+encoded version of article "Cathète" it currently produces:

 [^soprano]: [https://gitlab.huma-num.fr/disco-lge/soprano](https://gitlab.huma-num.fr/disco-lge/soprano)

@@ -736,12 +736,11 @@ so it appears outside of the `<head/>` element. No work is performed either to
 expand abbreviations and encode them as such, or to distinguish between domain
 and people names.

-Likewise, since the detection of titles at the begining of each section isn't
-complete and so no structure analysis is performed on the content of the article
-which is placed directly under the article's `<div/>` element at the moment
-instead of under a set of nested `<div/>` elements, the topmost having a `type`
-attribute of `sense`. The paragraphs are not yet identified and hence not
-encoded.
+Likewise, since the detection of titles at the beginning of each section is not
+complete, no structure analysis can be performed at the moment on the textual
+development inside the article and it is left unstructured, directly under the
+entry's `<div/>` element instead of under a set of nested `<div/>` elements. The
+paragraphs are not yet identified and for this reason not encoded.

 However, the figures and their captions are already handled correctly when they
 occur. The encoder also keeps track of the current lines, pages, and columns and
@@ -749,7 +748,7 @@ inserts the corresponding empty elements (`<lb/>`, `<pb/>` or `<cb/>`) and
 numbers pages so that the numbering corresponding to the physical pages are
 available, as compared to the "high-level" pages numbers inserted by the
 editors, which start with an offset because the first, blank or almost empty
-pages at the begining of each book do not have a number and which sometimes have
+pages at the beginning of each book do not have a number and which sometimes have
 gaps when a full-page geographical map is inserted since those are printed
 separately on a different folio which remains outside of the textual numbering
 system. The place at which these layout-related elements occur is determined by
@@ -760,17 +759,17 @@ by `soprano` when inferring the reading order before segmenting the articles.

 Encyclopedias are particularly long books, spanning numerous tomes and
 containing several tenths of thousands of articles. The *Encyclopédie* comprises
-over 74k articles and *La Grande Encyclopédie* certainly more 100k (the latest
+over 74k articles and *La Grande Encyclopédie* certainly more than 100k (the latest
 version produced by `soprano` produced 160k articles, but their segmentation is
-still not perfect and if some article begining remain undetected, all the very
+still not perfect and if some article beginnings remain undetected, all the very
 long and deeply-structured articles are unduly split into many parts, resulting
-globally in an over-estimation of the total number). In any case, it consists of
+globally in an overestimation of the total number). In any case, it consists of
 31 tomes of 1200 pages each.

 XML-TEI is a very broad tool useful for very different applications. Some
 elements like `<unclear/>` or `<factuality/>` can encode subtle semantics
 information (for the second case, adjacent to a notion as elusive as truth)
-which require a very deep understanding of a text in its entirety and about
+which requires a very deep understanding of a text in its entirety and about
 which even some human experts may disagree.

 For these reasons, a central concern in the design of our encoding scheme was to
@@ -796,7 +795,7 @@ capital "V." as illustrated above in the article "Gelocus".
 Although this has not been implemented yet either, we hope to be able to detect
 and exploit those patterns to correctly encode cross-references. Getting the
 `target` attributes right is certainly more difficult to achieve and may require
-processing the articles in several steps, to firsrt discover all the existing
+processing the articles in several steps, to first discover all the existing
 headwords — and hence article IDs — before trying to match the words following
 "V." with them. Since our automated encoder handles tomes separately and since
 references may cross the boundaries of tomes, it cannot wait for the target of a
@@ -808,11 +807,11 @@ lexicographers may deem our encoding too shallow, it has the advantage of not
 requiring to keep too complex datastructures in memory for a long time. The
 algorithm implementing it in `soprano` outputs elements as soon as it can, for
 instance the empty elements already discussed above. For articles, it pushes
-lines onto a stack and flushes it each time it encounters the begining of the
+lines onto a stack and flushes it each time it encounters the beginning of the
 following article. This allows the amount of memory required to remain
 reasonable and even lets them be parallelised on most modern machines. Thus,
-even taking over 3 mn per tome, the total processing time can be lowered to
-around 40 mn for the whole of *La Grande Encyclopédie* instead of over one hour
+even taking over three minutes per tome, the total processing time can be lowered to
+around forty minutes for the whole of *La Grande Encyclopédie* instead of over one hour
 and a half.

 ## Comparison to other approaches
@@ -850,9 +849,12 @@ between TEI elements and pushed us to look for different combinations. Another
 valid approach would have consisted in changing the structure of the inclusion
 graph itself, that is to say modify the rules. If `<entry/>` is the perfect
 element to encode article themselves, all that is really missing is the ability
-to accomodate nested structures with the `<div/>` element. Generating customized TEI
-schemas is made really easy with tools like ROMA[^ROMA], which we used to
-preview our change and suggest it to the TEI community.
+to accommodate nested structures with the `<div/>` element. This would also have
+the advantage of recovering the `<usg/>` and `<xr/>` elements which we have
+recognized as useful and which we lose as part of the tradeoff to get nested
+sections. Generating customized TEI schemas is made really easy with tools like
+ROMA[^ROMA], which we used to preview our change and suggest it to the TEI
+community.

 [^ROMA]: [https://roma.tei-c.org/](https://roma.tei-c.org/)

@@ -860,11 +862,15 @@ Despite it not getting a wide adhesion, some suggested it could be used locally
 within the scope of project DISCO-LGE. However we chose not to do so, partially
 for the same reasons of interoperability as the previous scenario, but also for
 reasons of sturdiness in front of future evolutions. Making sure the alternative
-schema would remain useful entails to maintain it regenerating it should the
-schema format evolve, with the possibility that the tools to edit it changes or
+schema would remain useful entails maintaining it, regenerating it should the
+schema format evolve, with the risk that the tools to edit it might change or
 stop being maintained.

-# Conclusion
+# Conclusion {-}
+
+- Dictionaries and encyclopedias are different
+- The *dictionaries* module is inadequate
+- We provide an encoding

 Despite long discussions and interesting proposals each with strong arguments both in
 favour of and against them, no consensus could be reached. For one part, each
@@ -875,3 +881,5 @@ Beyond the technical need for encodings generic enough to share the corpora
 within the community and compare the results accross various projects, the above
 results highlights one aspect of a well-known fact within the community of
 lexicography: encyclopedias and dictionaries differ on several key aspects
+
+# Bibliography {-}