diff --git a/ICHLL_Brenon.md b/ICHLL_Brenon.md index c82afd04ba297995d89152dc939dacf38976d2ba..8774acb6fddb7e4a321d32000f3e0a0b9412b5de 100644 --- a/ICHLL_Brenon.md +++ b/ICHLL_Brenon.md @@ -109,7 +109,7 @@ against the philosophers of the Enlightenment. The attacks do not remain ignored by Diderot who starts the very definition of the word "Encyclopédie" in the *Encyclopédie* itself by a strong rebuttal. He directly dismisses the concerns expressed in the *Dictionnaire de Trevoux* as -mere self-doubt that their authors shouldn't generalise to mankind, then leaves +mere self-doubt that their authors should not generalise to mankind, then leaves the main point to a latin quote by chancelor Bacon, who argues that a collaborative work can achieve much more than any talented man could: what could possibly not be within reach of a single man, within a single lifetime may be @@ -117,14 +117,14 @@ achieved by a common effort throughout generations. History hints that Diderot's opponents took his defense of the feasability of the project quite seriously, considering the fact that they got the -*Encyclopédie*'s priviledges to be revoked again six years after its publication +*Encyclopédie*'s privileges to be revoked again six years after its publication was resumed. As a consequence, the remaining ten volumes containing the text of the articles had to be published illegally until 1765, thanks to the secret protection of Malesherbes who — despite being head of royal censorship — saved the manuscripts from destruction. They were printed secretly outside of Paris and the books were (falsely) labeled as coming from Neufchâtel. Following the high demand from the booksellers who feared they would lose the money they had -invested in the project, a special priviledge was issued for the volumes +invested in the project, a special privilege was issued for the volumes containing the plates, which were released publicly from 1762 to 1772. In any case, in their last edition in 1771 the authors of the *Dictionnaire de @@ -143,14 +143,14 @@ knowledge itself. ## A different approach -If encyclopedia are thus historically more recent than dictionaries they also +If encyclopedias are thus historically more recent than dictionaries they also depart from the latter on their approach. The purpose of dictionaries from their -origin is to collect words, to make an exhaustive inventory of the terms -used in a domain or in a language in order to associate a *definition* to them, -be it a translation in another language for a foreign language dictionary or a -phrase explaining it for other dictionaries. As such, they are collections of -*signs* and remain within the linguistic level of things. Entries in a dictionary -often feature information such as the part of speech, the pronunciation or the +origin is to collect words, to make an exhaustive inventory of the terms used in +a domain or in a language in order to associate a *definition* to them, be it a +translation in another language for a foreign language dictionary or a phrase +explaining it for other dictionaries. As such, they are collections of *signs* +and remain within the linguistic level of things. Entries in a dictionary often +feature information such as the part of speech, the pronunciation or the etymology of the word they define. 
The entry for "Dictionnaire" in the *Encyclopédie* distinguishes between three @@ -180,12 +180,12 @@ These are the two last key aspects of the FAIR[^FAIR] principles (*findability*, as a guideline for efficient and quality research. It entails using standard formats and a standard for encoding historical texts in the context of digital humanities is XML-TEI, collectively developped by the *Text Encoding Initiative* -consortium. It consists in a set of technical specifications under the form of +consortium. It publishes a set of technical specifications under the form of XML schemas, along with a range of tools to handle them and training resources. [^FAIR]: [https://www.go-fair.org/fair-principles/](https://www.go-fair.org/fair-principles/) -The XML-TEI standard has a modular structure consisting in optional parts each +The XML-TEI standard has a modular structure consisting of optional parts each covering specific needs such as the physical features of a source document, the transcription of oral corpora or particular requirements for textual domains like poetry, or, in our case, dictionaries. @@ -239,8 +239,8 @@ the graph (that is an edge from a node to itself) as can be illustrated by the another one. The generalisation of this to inclusion paths of any length greater than one is -usually called a cycle in and we may be tempted in our context to refine this -name them to *inclusion cycles*. The `<address/>` element provides us with an +usually called a cycle and we may be tempted in our context to refine this and +name them *inclusion cycles*. The `<address/>` element provides us with an example for this configuration: although an `<address/>` element may not directly contain another one, it may contain a `<geogName/>` which, in turn, may contain a new `<address/>` element. From a graph theory perspective, we can say @@ -261,7 +261,7 @@ between two nodes that a human specialist of the TEI framework could build. This is still very useful when taking into account the fact that TEI modules are merely "bags" to group the elements and provide hints to human encoders about the tools they might need but have no implication on the inclusion paths between -element which cross module boundaries freely. The general graph formalism +elements which cross module boundaries freely. The general graph formalism enables us to describe complex filtering patterns and to implement queries to look for them among the elements exhaustively by algorithmic means even when the shortest-path approach is not enough. 
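To make the configuration concrete, the following fragment (whose content is entirely invented) shows the inclusion cycle described above: an `<address/>` containing a `<geogName/>` which in turn contains another `<address/>`. It is only a sketch of the pattern the traversal has to detect, not an example drawn from an actual corpus.

```xml
<address>
  <street>12 rue des Écoles</street>
  <geogName>Paris
    <!-- the cycle: a <geogName/> may in turn contain an <address/> -->
    <address>
      <country>France</country>
    </address>
  </geogName>
</address>
```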
@@ -315,10 +315,10 @@ represent features such as - its written and spoken forms: `<form/>` - a group of grammatical information: `<gramGrp/>`, that may itself contain as - we've seen above `<case/>`, `<gen/>`, `<number/>` or `<pers/>` to describe the - form itself for instance, but also information about the categories it belongs - to like `<iType/>` for its inflection class in languages with a declension - system or `<pos/>` for its part-of-speech + previously demonstrated `<case/>`, `<gen/>`, `<number/>` or `<pers/>` to + describe the form itself for instance, but also information about the + categories it belongs to like `<iType/>` for its inflection class in languages + with a declension system or `<pos/>` for its part-of-speech - its etymology: `<etym/>` - its variants if there is a different spelling in a variety of the language or if it has changed through time: `<usg/>` (though it is not its only purpose) @@ -389,7 +389,7 @@ all the paths from either `<entry/>` or `<sense/>` elements to the latter of length shorter or equal to 5 by a systematic traversal of the graph yields exclusively paths (respectively 9042 and 39093 of them) containing either a `<floatingText/>` or an `<app/>` element. The first one, as its name aptly -suggests, is used to encode text that doesn't quite fit the regular flow of the +suggests, is used to encode text that does not quite fit the regular flow of the document, as for example in the context of an embedded narrative. Both examples displayed in the online documentation feature a `<body/>` as direct child of `<floatingText/>`, neatly separating its content as independent. The purpose of @@ -424,8 +424,8 @@ the most obvious. ### Organised knowledge The first immediately visible feature that sets encyclopedias apart from -dictionaris can be found in the *Encyclopédie* as well in *La Grande -Encyclopédie* is the presence of subject indicators at the begining of articles +dictionaries and can be found in the *Encyclopédie* as well as in *La Grande +Encyclopédie* is the presence of subject indicators at the beginning of articles right after the headword which organise them into a domain classification system. Those generally cover a broad range of subjects from scientific disciplines to litterature, and extending to political subjects and law. @@ -438,14 +438,14 @@ tool for what we need is found in the `<usg/>` element used with a specific documentation encode subject indicators very similar to the ones found in encyclopedias within this element, but the match is not perfect either: all appear within one of multiple senses, as if to clarify each context in which the -word can be used, as expected from the element's name, "usage". In encyclopedia, +word can be used, as expected from the element's name, "usage". In encyclopedias, if the domain indicator does in certain cases help to distinguish between several entries sharing the same headword, the concept itself has evolved beyond this mere distinction. Looking back at the *Encyclopédie*, the adjective *raisonné* in the rest of the title directly introduces a notion of structure that links back to the "Systême figuré des connoissances humaines". The authors have devised a branching system to classify all knowledge, and the occurrence at -the begining of articles, more than a tool to clear up possible ambiguities also +the beginning of articles, more than a tool to clear up possible ambiguities also points the reader to the correct place in this mind map. {width=200px} @@ -455,8 +455,8 @@ module. 
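For comparison, here is a minimal sketch of how the *dictionaries* module typically carries such a subject indicator; all values are invented, and the `type` value `dom`, while common TEI practice for domain labels, is only one possible choice.

```xml
<entry>
  <form type="lemma"><orth>CATHÈTE</orth></form>
  <gramGrp><pos>noun</pos><gen>feminine</gen></gramGrp>
  <sense n="1">
    <!-- the subject indicator sits inside one particular sense -->
    <usg type="dom">Géométrie</usg>
    <def>…</def>
  </sense>
</entry>
```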
The `<domain/>` element despite its name belongs exclusively in the header of a document and focuses on the social context of the text, not on the knowledge area it covers. The `<interp/>` despite its name is not so much about labeling something as an interpretation to give to a context (which subject -indicators could be if you consider that, placed at the begining, they are used -to orient the mind frame of the readers towards a particular subject). However, +indicators could be if you consider that, placed at the beginning, they are used +to direct the mind frame of the readers towards a particular subject). However, the documentation clearly demonstrates it as a tool for annotators of a document, which text content is not part of the original document but some additional result of an analysis performed in the context of the encoding, used @@ -518,7 +518,7 @@ The nested structure that we have just evidenced demands of course a nesting structure to accomodate it. More precisely it guides our search of XML elements by giving us several constraints: we are looking for a pair of elements, the first representing a (sub)section must be able to include both itself and the -second element, which doesn't have any special constraint in addition to the one +second element, which does not have any special constraint in addition to the one it shares with the first, which is to have a semantics compatible with our purpose. In addition, the first element must be able to contain several `<p/>` elements, `<p/>` being the reference element to encode paragraphs according to @@ -647,20 +647,20 @@ For this reason, we do not recommend any special encoding of the subject indicator but leave it open to each particular context: they are often abbreviated so an `<abbr/>` may apply, in *La Grande Encyclopédie*, biographies are not labeled by a knowledge domain but usually include the first name of the -person when it is known so in that case a element like `<persName/>` is still +person when it is known so in that case an element like `<persName/>` is still appropriate.  -We propose to then wrap each different meaning in a separate `<div/>` with the -`type` attribute set to `sense` to refer to the `<sense/>` element that would've -been used within the *core* module. Each sense should be numbered with the `n` -attribute. +We then propose to wrap each different meaning in a separate `<div/>` with the +`type` attribute set to `sense` to refer to the `<sense/>` element that would +have been used within the *core* module. Each sense should be numbered with the +`n` attribute.  In addition, each line within the article must start with a `<lb/>` to mark its -begining including before the `<head/>` element, which, although a surprising +beginning including before the `<head/>` element, which, although a surprising setup, underlines the fact that in the dense layout of encyclopedias, the carriage return separating two articles is meaningful. Stating each new line explicitly keeps enough information to reconstruct a faithful facsimile but it @@ -709,8 +709,8 @@ recognised (those short elements on the border of pages are the ones typically prone to suffer damages or be misread by the OCR). Finally there are other TEI elements useful to represent "events" in the flow of -the text, like the begining of a new column of text or of a new page. The usual -appropriate elements (`<pb/>` for page begining, `<cb/>` for column begining) +the text, like the beginning of a new column of text or of a new page. 
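To fix ideas, here is a deliberately small sketch of an article encoded along the lines proposed above, loosely modelled on the article "Cathète" discussed below. The text, the line breaks, the `article` value on the outer `<div/>` and the placement of the abbreviated subject indicator inside the `<head/>` are illustrative choices rather than prescriptions; the sketch also shows where the column break discussed just after would sit.

```xml
<div type="article">
  <lb/><head>CATHÈTE <abbr>Géom.</abbr></head>
  <div type="sense" n="1">
    <p><lb/>First sense of the article, fitting on a single line here…</p>
  </div>
  <div type="sense" n="2">
    <p><lb/>Second sense, whose text runs on
       <cb/><lb/>at the top of the following column…</p>
  </div>
</div>
```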
The usual +appropriate elements (`<pb/>` for page beginning, `<cb/>` for column beginning) may and should be used with our encoding scheme. {width=350px} @@ -724,8 +724,8 @@ soprano[^soprano] developed within the scope of project DISCO-LGE to automatically identify individual articles in the flow of raw text from the columns and to encode them into XML-TEI files. Though this software has already been used to produce the first TEI version of *La Grande Encyclopédie*, it -doesn't yet follow the above specification perfectly. Here is for instance the -encoded version of article "Cathète" currently it produces: +does not yet follow the above specification perfectly. Here is for instance the +encoded version of article "Cathète" it currently produces: [^soprano]: [https://gitlab.huma-num.fr/disco-lge/soprano](https://gitlab.huma-num.fr/disco-lge/soprano) @@ -736,12 +736,11 @@ so it appears outside of the `<head/>` element. No work is performed either to expand abbreviations and encode them as such, or to distinguish between domain and people names. -Likewise, since the detection of titles at the begining of each section isn't -complete and so no structure analysis is performed on the content of the article -which is placed directly under the article's `<div/>` element at the moment -instead of under a set of nested `<div/>` elements, the topmost having a `type` -attribute of `sense`. The paragraphs are not yet identified and hence not -encoded. +Likewise, since the detection of titles at the beginning of each section is not +complete, no structure analysis can be performed at the moment on the textual +development inside the article and it is left unstructured, directly under the +entry's `<div/>` element instead of under a set of nested `<div/>` elements. The +paragraphs are not yet identified and for this reason not encoded. However, the figures and their captions are already handled correctly when they occur. The encoder also keeps track of the current lines, pages, and columns and @@ -749,7 +748,7 @@ inserts the corresponding empty elements (`<lb/>`, `<pb/>` or `<cb/>`) and numbers pages so that the numbering corresponding to the physical pages are available, as compared to the "high-level" pages numbers inserted by the editors, which start with an offset because the first, blank or almost empty -pages at the begining of each book do not have a number and which sometimes have +pages at the beginning of each book do not have a number and which sometimes have gaps when a full-page geographical map is inserted since those are printed separately on a different folio which remains outside of the textual numbering system. The place at which these layout-related elements occur is determined by @@ -760,17 +759,17 @@ by `soprano` when inferring the reading order before segmenting the articles. Encyclopedias are particularly long books, spanning numerous tomes and containing several tenths of thousands of articles. The *Encyclopédie* comprises -over 74k articles and *La Grande Encyclopédie* certainly more 100k (the latest +over 74k articles and *La Grande Encyclopédie* certainly more than 100k (the latest version produced by `soprano` produced 160k articles, but their segmentation is -still not perfect and if some article begining remain undetected, all the very +still not perfect and if some article beginning remain undetected, all the very long and deeply-structured articles are unduly split into many parts, resulting -globally in an over-estimation of the total number). 
In any case, it consists of +globally in an overestimation of the total number). In any case, it consists of 31 tomes of 1200 pages each. XML-TEI is a very broad tool useful for very different applications. Some elements like `<unclear/>` or `<factuality/>` can encode subtle semantics information (for the second case, adjacent to a notion as elusive as truth) -which require a very deep understanding of a text in its entirety and about +which requires a very deep understanding of a text in its entirety and about which even some human experts may disagree. For these reasons, a central concern in the design of our encoding scheme was to @@ -796,7 +795,7 @@ capital "V." as illustrated above in the article "Gelocus". Although this has not been implemented yet either, we hope to be able to detect and exploit those patterns to correctly encode cross-references. Getting the `target` attributes right is certainly more difficult to achieve and may require -processing the articles in several steps, to firsrt discover all the existing +processing the articles in several steps, to first discover all the existing headwords — and hence article IDs — before trying to match the words following "V." with them. Since our automated encoder handles tomes separately and since references may cross the boundaries of tomes, it cannot wait for the target of a @@ -808,11 +807,11 @@ lexicographers may deem our encoding too shallow, it has the advantage of not requiring to keep too complex datastructures in memory for a long time. The algorithm implementing it in `soprano` outputs elements as soon as it can, for instance the empty elements already discussed above. For articles, it pushes -lines onto a stack and flushes it each time it encounters the begining of the +lines onto a stack and flushes it each time it encounters the beginning of the following article. This allows the amount of memory required to remain reasonable and even lets them be parallelised on most modern machines. Thus, -even taking over 3 mn per tome, the total processing time can be lowered to -around 40 mn for the whole of *La Grande Encyclopédie* instead of over one hour +even taking over three minutes per tome, the total processing time can be lowered to +around forty minutes for the whole of *La Grande Encyclopédie* instead of over one hour and a half. ## Comparison to other approaches @@ -850,9 +849,12 @@ between TEI elements and pushed us to look for different combinations. Another valid approach would have consisted in changing the structure of the inclusion graph itself, that is to say modify the rules. If `<entry/>` is the perfect element to encode article themselves, all that is really missing is the ability -to accomodate nested structures with the `<div/>` element. Generating customized TEI -schemas is made really easy with tools like ROMA[^ROMA], which we used to -preview our change and suggest it to the TEI community. +to accomodate nested structures with the `<div/>` element. This would also have +the advantage of recovering the `<usg/>` and `<xr/>` elements which we have +recognized as useful and which we lose as part of the tradeoff to get nested +sections. Generating customized TEI schemas is made really easy with tools like +ROMA[^ROMA], which we used to preview our change and suggest it to the TEI +community. [^ROMA]: [https://roma.tei-c.org/](https://roma.tei-c.org/) @@ -860,11 +862,15 @@ Despite it not getting a wide adhesion, some suggested it could be used locally within the scope of project DISCO-LGE. 
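By way of illustration only, such a customisation could take roughly the following form; this is a schematic sketch rather than the ODD that was actually discussed, and the content model given for `<entry/>` is deliberately reduced to the classes relevant here.

```xml
<schemaSpec ident="tei_encyclopedia" start="TEI">
  <moduleRef key="tei"/>
  <moduleRef key="header"/>
  <moduleRef key="core"/>
  <moduleRef key="textstructure"/>
  <moduleRef key="dictionaries"/>
  <!-- hypothetical change: allow <div/> inside <entry/>; the real content
       model of <entry/> is richer and is reduced here for readability -->
  <elementSpec ident="entry" module="dictionaries" mode="change">
    <content>
      <alternate minOccurs="1" maxOccurs="unbounded">
        <classRef key="model.entryPart.top"/>
        <classRef key="model.global"/>
        <elementRef key="div"/>
      </alternate>
    </content>
  </elementSpec>
</schemaSpec>
```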
However we chose not to do so, partially
for the same reasons of interoperability as the previous scenario, but also for
reasons of sturdiness in front of future evolutions. Making sure the alternative
-schema would remain useful entails to maintain it regenerating it should the
-schema format evolve, with the possibility that the tools to edit it changes or
+schema would remain useful entails maintaining it, regenerating it should the
+schema format evolve, with the risk that the tools to edit it might change or
 stop being maintained.
 
-# Conclusion
+# Conclusion {-}
+
+- Dictionaries and encyclopedias differ in approach, not only in age
+- The TEI *dictionaries* module cannot accommodate encyclopedia articles
+- We propose an encoding based on nested `<div/>` elements
 
 Despite long discussions and interesting proposals each with strong arguments both
 in favour of and against them, no consensus could be reached. For one part, each
@@ -875,3 +881,5 @@ Beyond the technical need for encodings generic enough to share the corpora
 within the community and compare the results accross various projects, the above
 results highlights one aspect of a well-known fact within the community of
 lexicography: encyclopedias and dictionaries differ on several key aspects
+
+# Bibliography {-}