Commit 8c91b0a6 authored by Alice Brenon

Fix typos and other mistakes

parent dadd4cb5
@@ -109,7 +109,7 @@ against the philosophers of the Enlightenment.
The attacks do not remain ignored by Diderot who starts the very definition of
the word "Encyclopédie" in the *Encyclopédie* itself by a strong rebuttal. He
directly dismisses the concerns expressed in the *Dictionnaire de Trevoux* as
mere self-doubt that their authors shouldn't generalise to mankind, then leaves
mere self-doubt that their authors should not generalise to mankind, then leaves
the main point to a Latin quote by Chancellor Bacon, who argues that a
collaborative work can achieve much more than any talented man could: what could
possibly not be within reach of a single man, within a single lifetime may be
@@ -117,14 +117,14 @@ achieved by a common effort throughout generations.
History hints that Diderot's opponents took his defense of the feasibility of
the project quite seriously, considering the fact that they got the
*Encyclopédie*'s priviledges to be revoked again six years after its publication
*Encyclopédie*'s privileges to be revoked again six years after its publication
was resumed. As a consequence, the remaining ten volumes containing the text of
the articles had to be published illegally until 1765, thanks to the secret
protection of Malesherbes who — despite being head of royal censorship — saved
the manuscripts from destruction. They were printed secretly outside of Paris
and the books were (falsely) labeled as coming from Neufchâtel. Following the
high demand from the booksellers who feared they would lose the money they had
invested in the project, a special priviledge was issued for the volumes
invested in the project, a special privilege was issued for the volumes
containing the plates, which were released publicly from 1762 to 1772.
In any case, in their last edition in 1771 the authors of the *Dictionnaire de
@@ -143,14 +143,14 @@ knowledge itself.
## A different approach
If encyclopedia are thus historically more recent than dictionaries they also
If encyclopedias are thus historically more recent than dictionaries they also
depart from the latter on their approach. The purpose of dictionaries from their
origin is to collect words, to make an exhaustive inventory of the terms
used in a domain or in a language in order to associate a *definition* to them,
be it a translation in another language for a foreign language dictionary or a
phrase explaining it for other dictionaries. As such, they are collections of
*signs* and remain within the linguistic level of things. Entries in a dictionary
often feature information such as the part of speech, the pronunciation or the
origin is to collect words, to make an exhaustive inventory of the terms used in
a domain or in a language in order to associate a *definition* to them, be it a
translation in another language for a foreign language dictionary or a phrase
explaining it for other dictionaries. As such, they are collections of *signs*
and remain within the linguistic level of things. Entries in a dictionary often
feature information such as the part of speech, the pronunciation or the
etymology of the word they define.
The entry for "Dictionnaire" in the *Encyclopédie* distinguishes between three
@@ -180,12 +180,12 @@ These are the two last key aspects of the FAIR[^FAIR] principles (*findability*,
as a guideline for efficient and quality research. It entails using standard
formats and a standard for encoding historical texts in the context of digital
humanities is XML-TEI, collectively developed by the *Text Encoding Initiative*
consortium. It consists in a set of technical specifications under the form of
consortium. It publishes a set of technical specifications under the form of
XML schemas, along with a range of tools to handle them and training resources.
[^FAIR]: [https://www.go-fair.org/fair-principles/](https://www.go-fair.org/fair-principles/)
The XML-TEI standard has a modular structure consisting in optional parts each
The XML-TEI standard has a modular structure consisting of optional parts each
covering specific needs such as the physical features of a source document, the
transcription of oral corpora or particular requirements for textual domains
like poetry, or, in our case, dictionaries.
@@ -239,8 +239,8 @@ the graph (that is an edge from a node to itself) as can be illustrated by the
another one.
The generalisation of this to inclusion paths of any length greater than one is
usually called a cycle in and we may be tempted in our context to refine this
name them to *inclusion cycles*. The `<address/>` element provides us with an
usually called a cycle and we may be tempted in our context to refine this and
name them *inclusion cycles*. The `<address/>` element provides us with an
example for this configuration: although an `<address/>` element may not
directly contain another one, it may contain a `<geogName/>` which, in turn, may
contain a new `<address/>` element. From a graph theory perspective, we can say
@@ -261,7 +261,7 @@ between two nodes that a human specialist of the TEI framework could build.
This is still very useful when taking into account the fact that TEI modules are
merely "bags" to group the elements and provide hints to human encoders about
the tools they might need but have no implication on the inclusion paths between
element which cross module boundaries freely. The general graph formalism
elements which cross module boundaries freely. The general graph formalism
enables us to describe complex filtering patterns and to implement queries to
look for them among the elements exhaustively by algorithmic means even when the
shortest-path approach is not enough.
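
The kind of query the graph formalism allows can be illustrated with a minimal sketch. It is not the implementation used for this work: the `INCLUSIONS` mapping is a toy graph in which only the `address`/`geogName` edges come from the text, and the function name `inclusion_paths` is hypothetical. The sketch also assumes that a path may visit the same element several times, which is why the search is bounded by a maximum length.

```python
# Toy inclusion graph: an edge A -> B means "element A may directly contain B".
# Only the address <-> geogName edges are taken from the text above; the
# other entries are illustrative and do not reflect the full TEI schema.
INCLUSIONS = {
    "address": ["geogName", "street"],
    "geogName": ["address", "name"],
    "street": [],
    "name": [],
}


def inclusion_paths(graph, source, target, max_length):
    """Yield every inclusion path from source to target using at most
    max_length edges. When source == target the results are inclusion
    cycles (a result of length 1 would be a loop)."""
    stack = [[source]]
    while stack:
        path = stack.pop()
        if len(path) - 1 >= max_length:
            continue  # path already uses max_length edges, do not extend it
        for child in graph.get(path[-1], []):
            extended = path + [child]
            if child == target:
                yield extended
            stack.append(extended)


# No loop on <address/>, but one inclusion cycle of length 2.
print(list(inclusion_paths(INCLUSIONS, "address", "address", 1)))
print(list(inclusion_paths(INCLUSIONS, "address", "address", 2)))
```

Because the traversal is exhaustive up to the chosen bound, it can answer questions that a shortest-path search alone cannot, such as listing every way of reaching an element within a given depth.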
@@ -315,10 +315,10 @@ represent features such as
- its written and spoken forms: `<form/>`
- a group of grammatical information: `<gramGrp/>`, that may itself contain as
we've seen above `<case/>`, `<gen/>`, `<number/>` or `<pers/>` to describe the
form itself for instance, but also information about the categories it belongs
to like `<iType/>` for its inflection class in languages with a declension
system or `<pos/>` for its part-of-speech
previously demonstrated `<case/>`, `<gen/>`, `<number/>` or `<pers/>` to
describe the form itself for instance, but also information about the
categories it belongs to like `<iType/>` for its inflection class in languages
with a declension system or `<pos/>` for its part-of-speech
- its etymology: `<etym/>`
- its variants if there is a different spelling in a variety of the language or
if it has changed through time: `<usg/>` (though it is not its only purpose)
@@ -389,7 +389,7 @@ all the paths from either `<entry/>` or `<sense/>` elements to the latter of
length shorter than or equal to 5 by a systematic traversal of the graph yields
exclusively paths (respectively 9042 and 39093 of them) containing either a
`<floatingText/>` or an `<app/>` element. The first one, as its name aptly
suggests, is used to encode text that doesn't quite fit the regular flow of the
suggests, is used to encode text that does not quite fit the regular flow of the
document, as for example in the context of an embedded narrative. Both examples
displayed in the online documentation feature a `<body/>` as direct child of
`<floatingText/>`, neatly separating its content as independent. The purpose of
@@ -424,8 +424,8 @@ the most obvious.
### Organised knowledge
The first immediately visible feature that sets encyclopedias apart from
dictionaris can be found in the *Encyclopédie* as well in *La Grande
Encyclopédie* is the presence of subject indicators at the begining of articles
dictionaries and can be found in the *Encyclopédie* as well as in *La Grande
Encyclopédie* is the presence of subject indicators at the beginning of articles
right after the headword which organise them into a domain classification
system. Those generally cover a broad range of subjects from scientific
disciplines to literature, and extending to political subjects and law.
@@ -438,14 +438,14 @@ tool for what we need is found in the `<usg/>` element used with a specific
documentation encode subject indicators very similar to the ones found in
encyclopedias within this element, but the match is not perfect either: all
appear within one of multiple senses, as if to clarify each context in which the
word can be used, as expected from the element's name, "usage". In encyclopedia,
word can be used, as expected from the element's name, "usage". In encyclopedias,
if the domain indicator does in certain cases help to distinguish between
several entries sharing the same headword, the concept itself has evolved beyond
this mere distinction. Looking back at the *Encyclopédie*, the adjective
*raisonné* in the rest of the title directly introduces a notion of structure
that links back to the "Systême figuré des connoissances humaines". The authors
have devised a branching system to classify all knowledge, and the occurrence at
the begining of articles, more than a tool to clear up possible ambiguities also
the beginning of articles, more than a tool to clear up possible ambiguities also
points the reader to the correct place in this mind map.
!["Systême figuré des connoissances humaines", the taxonomy at the heart of the Encyclopédie](ressources/arbre.png){width=200px}
@@ -455,8 +455,8 @@ module. The `<domain/>` element despite its name belongs exclusively in the
header of a document and focuses on the social context of the text, not on the
knowledge area it covers. The `<interp/>` despite its name is not so much about
labeling something as an interpretation to give to a context (which subject
indicators could be if you consider that, placed at the begining, they are used
to orient the mind frame of the readers towards a particular subject). However,
indicators could be if you consider that, placed at the beginning, they are used
to direct the mind frame of the readers towards a particular subject). However,
the documentation clearly demonstrates it as a tool for annotators of a
document, whose text content is not part of the original document but some
additional result of an analysis performed in the context of the encoding, used
@@ -518,7 +518,7 @@ The nested structure that we have just evidenced demands of course a nesting
structure to accommodate it. More precisely it guides our search of XML elements
by giving us several constraints: we are looking for a pair of elements, the
first representing a (sub)section must be able to include both itself and the
second element, which doesn't have any special constraint in addition to the one
second element, which does not have any special constraint in addition to the one
it shares with the first, which is to have a semantics compatible with our
purpose. In addition, the first element must be able to contain several `<p/>`
elements, `<p/>` being the reference element to encode paragraphs according to
@@ -647,20 +647,20 @@ For this reason, we do not recommend any special encoding of the subject
indicator but leave it open to each particular context: they are often
abbreviated so an `<abbr/>` may apply, in *La Grande Encyclopédie*, biographies
are not labeled by a knowledge domain but usually include the first name of the
person when it is known so in that case a element like `<persName/>` is still
person when it is known so in that case an element like `<persName/>` is still
appropriate.
![](snippets/cathète_1.png)
We propose to then wrap each different meaning in a separate `<div/>` with the
`type` attribute set to `sense` to refer to the `<sense/>` element that would've
been used within the *core* module. Each sense should be numbered with the `n`
attribute.
We then propose to wrap each different meaning in a separate `<div/>` with the
`type` attribute set to `sense` to refer to the `<sense/>` element that would
have been used within the *core* module. Each sense should be numbered with the
`n` attribute.
![](snippets/cathète_2.png)
In addition, each line within the article must start with a `<lb/>` to mark its
begining including before the `<head/>` element, which, although a surprising
beginning including before the `<head/>` element, which, although a surprising
setup, underlines the fact that in the dense layout of encyclopedias, the
carriage return separating two articles is meaningful. Stating each new line
explicitly keeps enough information to reconstruct a faithful facsimile but it
@@ -709,8 +709,8 @@ recognised (those short elements on the border of pages are the ones typically
prone to suffer damages or be misread by the OCR).
Finally there are other TEI elements useful to represent "events" in the flow of
the text, like the begining of a new column of text or of a new page. The usual
appropriate elements (`<pb/>` for page begining, `<cb/>` for column begining)
the text, like the beginning of a new column of text or of a new page. The usual
appropriate elements (`<pb/>` for page beginning, `<cb/>` for column beginning)
may and should be used with our encoding scheme.
![La Grande Encyclopédie, tome 1, article "Alcala-de-Hénarès"](ressources/last_page_top_left_t1.png){width=350px}
@@ -724,8 +724,8 @@ soprano[^soprano] developed within the scope of project DISCO-LGE to
automatically identify individual articles in the flow of raw text from the
columns and to encode them into XML-TEI files. Though this software has already
been used to produce the first TEI version of *La Grande Encyclopédie*, it
doesn't yet follow the above specification perfectly. Here is for instance the
encoded version of article "Cathète" currently it produces:
does not yet follow the above specification perfectly. Here is for instance the
encoded version of article "Cathète" it currently produces:
[^soprano]: [https://gitlab.huma-num.fr/disco-lge/soprano](https://gitlab.huma-num.fr/disco-lge/soprano)
@@ -736,12 +736,11 @@ so it appears outside of the `<head/>` element. No work is performed either to
expand abbreviations and encode them as such, or to distinguish between domain
and people names.
Likewise, since the detection of titles at the begining of each section isn't
complete and so no structure analysis is performed on the content of the article
which is placed directly under the article's `<div/>` element at the moment
instead of under a set of nested `<div/>` elements, the topmost having a `type`
attribute of `sense`. The paragraphs are not yet identified and hence not
encoded.
Likewise, since the detection of titles at the beginning of each section is not
complete, no structure analysis can be performed at the moment on the textual
development inside the article and it is left unstructured, directly under the
entry's `<div/>` element instead of under a set of nested `<div/>` elements. The
paragraphs are not yet identified and for this reason not encoded.
However, the figures and their captions are already handled correctly when they
occur. The encoder also keeps track of the current lines, pages, and columns and
@@ -749,7 +748,7 @@ inserts the corresponding empty elements (`<lb/>`, `<pb/>` or `<cb/>`) and
numbers pages so that the numbering corresponding to the physical pages is
available, as compared to the "high-level" page numbers inserted by the
editors, which start with an offset because the first, blank or almost empty
pages at the begining of each book do not have a number and which sometimes have
pages at the beginning of each book do not have a number and which sometimes have
gaps when a full-page geographical map is inserted since those are printed
separately on a different folio which remains outside of the textual numbering
system. The place at which these layout-related elements occur is determined by
@@ -760,17 +759,17 @@ by `soprano` when inferring the reading order before segmenting the articles.
Encyclopedias are particularly long books, spanning numerous tomes and
containing several tens of thousands of articles. The *Encyclopédie* comprises
over 74k articles and *La Grande Encyclopédie* certainly more 100k (the latest
over 74k articles and *La Grande Encyclopédie* certainly more than 100k (the latest
version produced by `soprano` produced 160k articles, but their segmentation is
still not perfect and if some article begining remain undetected, all the very
still not perfect and if some article beginnings remain undetected, all the very
long and deeply-structured articles are unduly split into many parts, resulting
globally in an over-estimation of the total number). In any case, it consists of
globally in an overestimation of the total number). In any case, it consists of
31 tomes of 1200 pages each.
XML-TEI is a very broad tool useful for very different applications. Some
elements like `<unclear/>` or `<factuality/>` can encode subtle semantic
information (for the second case, adjacent to a notion as elusive as truth)
which require a very deep understanding of a text in its entirety and about
which requires a very deep understanding of a text in its entirety and about
which even some human experts may disagree.
For these reasons, a central concern in the design of our encoding scheme was to
@@ -796,7 +795,7 @@ capital "V." as illustrated above in the article "Gelocus".
Although this has not been implemented yet either, we hope to be able to detect
and exploit those patterns to correctly encode cross-references. Getting the
`target` attributes right is certainly more difficult to achieve and may require
processing the articles in several steps, to firsrt discover all the existing
processing the articles in several steps, to first discover all the existing
headwords — and hence article IDs — before trying to match the words following
"V." with them. Since our automated encoder handles tomes separately and since
references may cross the boundaries of tomes, it cannot wait for the target of a
@@ -808,11 +807,11 @@ lexicographers may deem our encoding too shallow, it has the advantage of not
requiring overly complex data structures to be kept in memory for a long time. The
algorithm implementing it in `soprano` outputs elements as soon as it can, for
instance the empty elements already discussed above. For articles, it pushes
lines onto a stack and flushes it each time it encounters the begining of the
lines onto a stack and flushes it each time it encounters the beginning of the
following article. This allows the amount of memory required to remain
reasonable and even lets them be parallelised on most modern machines. Thus,
even taking over 3 mn per tome, the total processing time can be lowered to
around 40 mn for the whole of *La Grande Encyclopédie* instead of over one hour
even taking over three minutes per tome, the total processing time can be lowered to
around forty minutes for the whole of *La Grande Encyclopédie* instead of over one hour
and a half.
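
The buffering strategy described above can be sketched as follows. This is only an illustration under stated assumptions, not the code of `soprano`: the names `segment_articles` and `looks_like_headword` are hypothetical, and the rule used to spot the beginning of an article (a line whose first word is in capitals) is a deliberately naive stand-in for the real detection.

```python
def looks_like_headword(line):
    """Naive, hypothetical test for the beginning of an article:
    the first word of the line is written in capitals."""
    first_word = line.split(" ", 1)[0]
    return first_word.isupper()


def segment_articles(lines, starts_article=looks_like_headword):
    """Group a stream of lines into articles, keeping only the lines of
    the article currently being read in memory."""
    buffer = []
    for line in lines:
        # The beginning of the following article flushes the buffered lines.
        if starts_article(line) and buffer:
            yield buffer
            buffer = []
        buffer.append(line)
    if buffer:
        yield buffer  # last article of the stream


sample = [
    "CATHETE. s. m.",
    "ligne perpendiculaire",
    "CATHETER. s. m.",
    "instrument",
]
for article in segment_articles(sample):
    print(article)
```

Since each article is emitted as soon as the next one begins, memory use depends on the length of the longest article rather than on the size of a tome. Lines occurring before the first detected headword are emitted as a first group of their own in this sketch.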
## Comparison to other approaches
@@ -850,9 +849,12 @@ between TEI elements and pushed us to look for different combinations. Another
valid approach would have consisted in changing the structure of the inclusion
graph itself, that is to say modify the rules. If `<entry/>` is the perfect
element to encode articles themselves, all that is really missing is the ability
to accommodate nested structures with the `<div/>` element. Generating customized TEI
schemas is made really easy with tools like ROMA[^ROMA], which we used to
preview our change and suggest it to the TEI community.
to accommodate nested structures with the `<div/>` element. This would also have
the advantage of recovering the `<usg/>` and `<xr/>` elements which we have
recognized as useful and which we lose as part of the tradeoff to get nested
sections. Generating customized TEI schemas is made really easy with tools like
ROMA[^ROMA], which we used to preview our change and suggest it to the TEI
community.
[^ROMA]: [https://roma.tei-c.org/](https://roma.tei-c.org/)
@@ -860,11 +862,15 @@ Despite it not getting a wide adhesion, some suggested it could be used locally
within the scope of project DISCO-LGE. However we chose not to do so, partially
for the same reasons of interoperability as the previous scenario, but also for
reasons of sturdiness in front of future evolutions. Making sure the alternative
schema would remain useful entails to maintain it regenerating it should the
schema format evolve, with the possibility that the tools to edit it changes or
schema would remain useful entails to maintain it, regenerating it should the
schema format evolve, with the risk that the tools to edit it might change or
stop being maintained.
# Conclusion
# Conclusion {-}
- Dictionaries and encyclopedias are different
- The *dictionaries* module is inadequate
- We provide an encoding
Despite long discussions and interesting proposals each with strong arguments both in
favour of and against them, no consensus could be reached. For one part, each
@@ -875,3 +881,5 @@ Beyond the technical need for encodings generic enough to share the corpora
within the community and compare the results across various projects, the above
results highlight one aspect of a well-known fact within the community of
lexicography: encyclopedias and dictionaries differ on several key aspects
# Bibliography {-}