Commit d0c9e718 authored by Alice Brenon

Rework the cross-section references to get rid of 'the above remarks'

parent ae33fd14
......@@ -189,7 +189,7 @@ d'Alembert insists on the importance of brevity for a clear definition in the
consider encyclopedias superior to dictionaries but really as a new subgenre
departing from them in terms of purpose.
# The *dictionaries* TEI module
# The *dictionaries* TEI module {#sec:dictionaries-module}
The XML-TEI standard has a modular structure consisting of optional parts each
covering specific needs such as the physical features of a source document, the
......@@ -406,13 +406,13 @@ element made to group quotations with a bibliographic reference to their source
which should clearly be unnecessary to encode an article in the general case.
Secondly, although we have seen examples of connections from this module to the
rest of the XML-TEI, especially to the *core* module (see the case of the
`<ref/>` element above), the *dictionaries* module appears somewhat isolated
from important structural elements like `<head/>` or `<div/>`. Indeed, computing
all the paths from either `<entry/>` or `<sense/>` elements to the latter of
length shorter or equal to 5 by a systematic traversal of the graph yields
exclusively paths (respectively 9042 and 39093 of them) containing either a
`<floatingText/>` or an `<app/>` element. The first one, as its name aptly
rest of the XML-TEI, especially to the *core* module (to which belongs for
example the `<ref/>` element), the *dictionaries* module appears somewhat
isolated from important structural elements like `<head/>` or `<div/>`. Indeed,
computing all the paths from either `<entry/>` or `<sense/>` elements to the
latter of length shorter or equal to 5 by a systematic traversal of the graph
yields exclusively paths (respectively 9042 and 39093 of them) containing either
a `<floatingText/>` or an `<app/>` element. The first one, as its name aptly
suggests, is used to encode text that does not quite fit the regular flow of the
document, as for example in the context of an embedded narrative. Both examples
displayed in the online documentation feature a `<body/>` as direct child of
......@@ -433,12 +433,13 @@ structures like `<div/>`.
Studying the content of *La Grande Encyclopédie* and considering several
articles in particular, we identify structures which are specific to
encyclopedias and not compatible with the *dictionaries* module presented above.
We hence conclude that this module is not able to encode arbitrary encyclopedic
content and propose a new fully TEI-compliant encoding scheme remaining outside
of it. We proceed with remarks about the needs of automated encoding processes
and compare our proposal with other strategies to overcome the issues previously
identified with the dedicated module for dictionaries.
encyclopedias and not compatible with the *dictionaries* module presented in the
previous section. We hence conclude that this module is not able to encode
arbitrary encyclopedic content and propose a new fully TEI-compliant encoding
scheme remaining outside of it. We proceed with remarks about the needs of
automated encoding processes and compare our proposal with other strategies to
overcome the issues previously identified with the dedicated module for
dictionaries.
## Idiosyncrasies of encyclopedias
......@@ -455,7 +456,7 @@ system. Those generally cover a broad range of subjects from scientific
disciplines to literature, and extending to political subjects and law.
No element in the *dictionaries* module is explicitly designed for the purpose
of encoding these indicators. As we have seen above, the elements set is geared
of encoding these indicators. As we have seen, the elements set is geared
towards the words themselves instead of the concept they represent. The closest
tool for what we need is found in the `<usg/>` element used with a specific
`type` attribute set to `dom` for "domain". Indeed several examples from the
......@@ -553,8 +554,6 @@ and `<title/>` — the latter with the possibility to set its `type` attribute t
`sub` — stand out as the best candidates for the semantics condition on the
second element.
#### Candidates in the *dictionaries* module {-}
Filtering the content of the module to keep only the elements which can at the
same time contain themselves, be included under `<entry/>` and include a `<p/>`
and either the `<head/>` or `<title/>` elements yields absolutely no candidates.
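The filter described here can be sketched as a few set-membership tests over a content model. The sketch below is only an illustration: the `CONTENT_MODEL` table, the `DICTIONARIES` set and the `candidates` function are hypothetical names, and the parent/child relations are toy data that do not reflect the actual TEI schema.

```python
# Toy content model: element name -> set of allowed direct children.
# Illustrative data only; it does NOT reproduce the real TEI schema.
CONTENT_MODEL = {
    "entry": {"sense", "form", "def", "note", "usg"},
    "sense": {"sense", "def", "usg", "note"},
    "def": {"title"},
    "note": {"p", "head", "note"},
    "form": set(),
    "usg": set(),
}

# Hypothetical list of the elements belonging to the module under study.
DICTIONARIES = {"entry", "sense", "def", "form", "usg"}


def candidates(model, module, parent="entry"):
    """Return the elements of `module` that can contain themselves, appear
    under `parent`, and contain <p/> plus either <head/> or <title/>."""
    found = []
    for name in sorted(module):
        children = model.get(name, set())
        if (
            name in children                      # can contain itself
            and name in model.get(parent, set())  # allowed under <entry/>
            and "p" in children                   # can hold paragraphs
            and {"head", "title"} & children      # can hold a heading
        ):
            found.append(name)
    return found
```

With this toy data, `candidates(CONTENT_MODEL, DICTIONARIES)` returns an empty list, while passing every element of the table as `module` lets elements from outside the module through the same filter.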
......@@ -590,8 +589,6 @@ elements could fulfill our purpose, it is a fact that no element in this module
appears as an obvious good solution and a serious hint to keep looking somewhere
else.
#### Widening the search {-}
We hence widen our search to include elements outside the *dictionaries* module
which could be used to encode our sections and subsections, under the same
constraint as before to try and find a composite solution that would remain
......@@ -617,25 +614,26 @@ comment" which appears "out of the main textual stream" whereas the long
developments in articles are the very matter of the text of encyclopedias, not
mere remarks in the margins or at the foot of pages.
## Encoding within the *core* module
The above remarks explain why the *dictionaries* module is unable to represent
encyclopedias, where the notion of "meaning" is less central than in
dictionaries and where discourse with nested structures of arbitrary depth can
occur. Even composite encodings using elements outside of the *dictionaries*
module under an `<entry/>` element do not meet our requirements. Since the
*core* module of course accommodates these structures by means of the `<div/>`,
`<head/>` and `<p/>` elements which have the additional advantage of carrying
less semantical payload than `<sense/>` or `<def/>` we devise an encoding scheme
using them which we recommend using for other projects aiming at representing
encyclopedias.
To remain consistent with the above remarks we will only concern ourselves with
what happens at the level of each article, right under the `<body/>` element.
Everything related to metadata happens as expected in the file's `<teiHeader/>`
which is well enough equipped to handle them. In order to present our scheme
throughout the following section we will be progressively encoding a reference
article, "Cathète" from tome 9 reproduced in Figure @fig:cathete-photo.
## Encoding within the *core* module {#sec:core-module}
The remarks made in section @sec:dictionaries-module explain why the
*dictionaries* module is unable to represent encyclopedias, where the notion of
"meaning" is less central than in dictionaries and where discourse with nested
structures of arbitrary depth can occur. Even composite encodings using elements
outside of the *dictionaries* module under an `<entry/>` element do not meet our
requirements. Since the *core* module obviously accommodates these structures by
means of the `<div/>`, `<head/>` and `<p/>` elements which have the additional
advantage of carrying less semantical payload than `<sense/>` or `<def/>` we
devise an encoding scheme using them which we recommend using for other projects
aiming at representing encyclopedias.
To remain consistent with the way we studied the *dictionaries* module we will
only concern ourselves with what happens at the level of each article, right
under the `<body/>` element. Everything related to metadata happens as expected
in the file's `<teiHeader/>` which is well enough equipped to handle them. In
order to present our scheme throughout the following section we will be
progressively encoding a reference article, "Cathète" from tome 9 reproduced in
Figure @fig:cathete-photo.
![La Grande Encyclopédie, tome 9, article "Cathète" ([BnF - Gallica](http://ark.bnf.fr/ark:/12148/cb41651490t))](ressources/cathète_t9.png){#fig:cathete-photo}
......@@ -748,15 +746,14 @@ encoding scheme as demonstrated by Figure @fig:alcala-xml.
![Encoding the beginning of a page in article "Alcala-de-Hénarès"](snippets/alcala.png){#fig:alcala-xml}
The reference implementation for this encoding scheme is the program
soprano
([https://gitlab.huma-num.fr/disco-lge/soprano](https://gitlab.huma-num.fr/disco-lge/soprano)) developed within the scope of project DISCO-LGE to
automatically identify individual articles in the flow of raw text from the
columns and to encode them into XML-TEI files. Though this software has already
been used to produce the first TEI version of *La Grande Encyclopédie*, it does
not yet follow the above specification perfectly. Figure
@fig:cathete-xml-current shows the encoded version of article "Cathète" it
currently produces:
The reference implementation for this encoding scheme is the program soprano
([https://gitlab.huma-num.fr/disco-lge/soprano](https://gitlab.huma-num.fr/disco-lge/soprano))
developed within the scope of project DISCO-LGE to automatically identify
individual articles in the flow of raw text from the columns and to encode them
into XML-TEI files. Though this software has already been used to produce the
first TEI version of *La Grande Encyclopédie*, it does not yet perfectly follow
the specification we have just described. Figure @fig:cathete-xml-current shows
the encoded version of article "Cathète" it currently produces:
![The current encoding of article "Cathète" produced by `soprano`](snippets/cathète_current.png){#fig:cathete-xml-current}
......@@ -802,11 +799,11 @@ which even some human experts may disagree.
For these reasons, a central concern in the design of our encoding scheme was to
remain within the boundaries of information that can be described objectively
and extracted automatically by an algorithm. Most of the tags presented above
contain information about the positions of the elements or their relation to one
another. Those with an additional semantic implication like `<head/>` can be
inferred simply from their position and the frequent use of a special typography
like bold or upper-case characters.
and extracted automatically by an algorithm. Most of the tags presented in
section @sec:core-module contain information about the positions of the elements
or their relation to one another. Those with an additional semantic implication
like `<head/>` can be inferred simply from their position and the frequent use
of a special typography like bold or upper-case characters.
The case of cross-references is particular and may appear as a counter-example
to the main principle on which our scheme is based. Actually, the process of
......@@ -818,7 +815,7 @@ Encyclopédie*, virtually all the redirections (that is, to the extent of our
knowledge, absolutely all of them though of course some special case may exist,
but they are statistically rare enough that we have not found any yet) appear
within parentheses, and start with the verb "voir" abbreviated as a single,
capital "V." as illustrated above in the article "Gelocus" (see again Figure
capital "V." as illustrated in the article "Gelocus" (see again Figure
@fig:gelocus-photo).
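Because the pattern is so regular, it lends itself to a simple regular expression. The sketch below is not taken from `soprano`, which does not implement this detection yet; the `XREF` pattern and the `find_redirections` helper are hypothetical, and they assume that the target of a redirection is everything between "V." and the closing parenthesis.

```python
import re

# Assumed shape of a redirection: "(" + "V." + target + ")".
# The target is captured lazily and may not contain nested parentheses.
XREF = re.compile(r"\(\s*V\.\s+([^()]+?)\s*\)")


def find_redirections(text):
    """Return the targets of all "(V. ...)" redirections found in `text`."""
    return [match.group(1) for match in XREF.finditer(text)]


# Invented sample line, not a quotation from the encyclopedia.
sample = "EXEMPLE (V. Autre article)."
targets = find_redirections(sample)
```

A real detector would probably also need to handle redirections listing several targets and line breaks inside the parentheses, which this sketch leaves out.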
Although this has not been implemented yet either, we hope to be able to detect
......@@ -834,10 +831,10 @@ outputting them.
This is in line with the last important aspect of our encoder. If many
lexicographers may deem our encoding too shallow, it has the advantage of not
requiring overly complex data structures to be kept in memory for a long time. The
algorithm implementing it in `soprano` outputs elements as soon as it can, for
instance the empty elements already discussed above. For articles, it pushes
lines onto a stack and flushes it each time it encounters the beginning of the
following article. This allows the amount of memory required to remain
algorithm implementing it in `soprano` outputs elements as soon as it can. This
is immediate for simple elements such as `<pb/>` or `<fw/>`; for articles, it
pushes lines onto a stack and flushes it each time it encounters the beginning
of the following article. This allows the amount of memory required to remain
reasonable and even lets them be parallelised on most modern machines. Thus,
even taking over three minutes per tome, the total processing time can be
lowered to around forty minutes on a machine with 16 GB of RAM for the whole of
......@@ -886,7 +883,7 @@ schema would remain useful entails to maintain it, regenerating it should the
schema format evolve, with the risk that the tools to edit it might change or
stop being maintained.
# Conclusion {-}
# Conclusion
Though they are very close genres and share a common history, we have evidenced
key aspects on which dictionaries and encyclopedias differ. Not only do entries
......