Commit d0c9e718 authored by Alice Brenon

Rework the cross-section references to get rid of 'the above remarks'

parent ae33fd14
@@ -189,7 +189,7 @@ d'Alembert insists on the importance of brevity for a clear definition in the
 consider encyclopedias superior to dictionaries but really as a new subgenre
 departing from them in terms of purpose.
-# The *dictionaries* TEI module
+# The *dictionaries* TEI module {#sec:dictionaries-module}
 The XML-TEI standard has a modular structure consisting of optional parts each
 covering specific needs such as the physical features of a source document, the
@@ -406,13 +406,13 @@ element made to group quotations with a bibliographic reference to their source
 which should clearly be unnecessary to encode an article in the general case.
 Secondly, although we have seen examples of connections from this module to the
-rest of the XML-TEI, especially to the *core* module (see the case of the
-`<ref/>` element above), the *dictionaries* module appears somewhat isolated
-from important structural elements like `<head/>` or `<div/>`. Indeed, computing
-all the paths from either `<entry/>` or `<sense/>` elements to the latter of
-length shorter or equal to 5 by a systematic traversal of the graph yields
-exclusively paths (respectively 9042 and 39093 of them) containing either a
-`<floatingText/>` or an `<app/>` element. The first one, as its name aptly
+rest of the XML-TEI, especially to the *core* module (to which belongs for
+example the `<ref/>` element), the *dictionaries* module appears somewhat
+isolated from important structural elements like `<head/>` or `<div/>`. Indeed,
+computing all the paths from either `<entry/>` or `<sense/>` elements to the
+latter of length shorter or equal to 5 by a systematic traversal of the graph
+yields exclusively paths (respectively 9042 and 39093 of them) containing either
+a `<floatingText/>` or an `<app/>` element. The first one, as its name aptly
 suggests, is used to encode text that does not quite fit the regular flow of the
 document, as for example in the context of an embedded narrative. Both examples
 displayed in the online documentation feature a `<body/>` as direct child of
@@ -433,12 +433,13 @@ structures like `<div/>`.
 Studying the content of *La Grande Encyclopédie* and considering several
 articles in particular, we identify structures which are specific to
-encyclopedias and not compatible with the *dictionaries* module presented above.
-We hence conclude that this module is not able to encode arbitrary encyclopedic
-content and propose a new fully TEI-compliant encoding scheme remaining outside
-of it. We proceed with remarks about the needs of automated encoding processes
-and compare our proposal with other strategies to overcome the issues previously
-identified with the dedicated module for dictionaries.
+encyclopedias and not compatible with the *dictionaries* module presented in the
+previous section. We hence conclude that this module is not able to encode
+arbitrary encyclopedic content and propose a new fully TEI-compliant encoding
+scheme remaining outside of it. We proceed with remarks about the needs of
+automated encoding processes and compare our proposal with other strategies to
+overcome the issues previously identified with the dedicated module for
+dictionaries.
 ## Idiosynchrasies of encyclopedias
@@ -455,7 +456,7 @@ system. Those generally cover a broad range of subjects from scientific
 disciplines to litterature, and extending to political subjects and law.
 No element in the *dictionaries* module is explicitely designed for the purpose
-of encoding these indicators. As we have seen above, the elements set is geared
+of encoding these indicators. As we have seen, the elements set is geared
 towards the words themselves instead of the concept they represent. The closest
 tool for what we need is found in the `<usg/>` element used with a specific
 `type` attribute set to `dom` for "domain". Indeed several examples from the
@@ -553,8 +554,6 @@ and `<title/>` — the latter with the possibility to set its `type` attribute t
 `sub` — stand out as the best candidates for the semantics condition on the
 second element.
-#### Candidates in the *dictionaries* module {-}
 Filtering the content of the module to keep only the elements which can at the
 same time contain themselves, be included under `<entry/>` and include a `<p/>`
 and either the `<head/>` or `<title/>` elements yields absolutely no candidates.
@@ -590,8 +589,6 @@ elements could fulfill our purpose, it is a fact than no element in this module
 appears as an obvious good solution and a serious hint to keep looking somewhere
 else.
-#### Widening the search {-}
 We hence widen our search to include elements outside the *dictionaries* module
 which could be used to encode our sections and subsections, under the same
 constraint as before to try and find a composite solution that would remain
@@ -617,25 +614,26 @@ comment" which appears "out of the main textual stream" whereas the long
 developments in articles are the very matter of the text of encyclopedias, not
 mere remarks in the margins or at the foot of pages.
-## Encoding within the *core* module
+## Encoding within the *core* module {#sec:core-module}
-The above remarks explain why the *dictionary* module is unable to represent
-encyclopedias, where the notion of "meaning" is less central that in
-dictionaries and where discourse with nested structures of arbitrary depth can
-occur. Even composite encodings using elements outside of the *dictionaries*
-module under an `<entry/>` element do not meet our requirements. Since the
-*core* module of course accomodates these structures by means of the `<div/>`,
-`<head/>` and `<p/>` elements which have the additional advantage of carrying
-less semantical payload than `<sense/>` or `<def/>` we devise an encoding scheme
-using them which we recommend using for other projects aiming at representing
-encyclopedias.
+The remarks made in section @sec:dictionaries-module explain why the
+*dictionary* module is unable to represent encyclopedias, where the notion of
+"meaning" is less central that in dictionaries and where discourse with nested
+structures of arbitrary depth can occur. Even composite encodings using elements
+outside of the *dictionaries* module under an `<entry/>` element do not meet our
+requirements. Since the *core* module obviously accomodates these structures by
+means of the `<div/>`, `<head/>` and `<p/>` elements which have the additional
+advantage of carrying less semantical payload than `<sense/>` or `<def/>` we
+devise an encoding scheme using them which we recommend using for other projects
+aiming at representing encyclopedias.
-To remain consistent with the above remarks we will only concern ourselves with
-what happens at the level of each article, right under the `<body/>` element.
-Everything related to metadata happens as expected in the file's `<teiHeader/>`
-which is well-enough equiped to handle them. In order to present our scheme
-throughout the following section we will be progressively encoding a reference
-article, "Cathète" from tome 9 reproduced in Figure @fig:cathete-photo.
+To remain consistent with the way we studied the *dictionaries* module we will
+only concern ourselves with what happens at the level of each article, right
+under the `<body/>` element. Everything related to metadata happens as expected
+in the file's `<teiHeader/>` which is well-enough equiped to handle them. In
+order to present our scheme throughout the following section we will be
+progressively encoding a reference article, "Cathète" from tome 9 reproduced in
+Figure @fig:cathete-photo.
 ![La Grande Encyclopédie, tome 9, article "Cathète" ([BnF - Gallica](http://ark.bnf.fr/ark:/12148/cb41651490t))](ressources/cathète_t9.png){#fig:cathete-photo}
@@ -748,15 +746,14 @@ encoding scheme as demonstrated by Figure @fig:alcala-xml.
 ![Encoding the beginning of a page in article "Alcala-de-Hénarès"](snippets/alcala.png){#fig:alcala-xml}
-The reference implementation for this encoding scheme is the program
-soprano
-([https://gitlab.huma-num.fr/disco-lge/soprano](https://gitlab.huma-num.fr/disco-lge/soprano)) developed within the scope of project DISCO-LGE to
-automatically identify individual articles in the flow of raw text from the
-columns and to encode them into XML-TEI files. Though this software has already
-been used to produce the first TEI version of *La Grande Encyclopédie*, it does
-not yet follow the above specification perfectly. Figure
-@fig:cathete-xml-current shows the encoded version of article "Cathète" it
-currently produces:
+The reference implementation for this encoding scheme is the program soprano
+([https://gitlab.huma-num.fr/disco-lge/soprano](https://gitlab.huma-num.fr/disco-lge/soprano))
+developed within the scope of project DISCO-LGE to automatically identify
+individual articles in the flow of raw text from the columns and to encode them
+into XML-TEI files. Though this software has already been used to produce the
+first TEI version of *La Grande Encyclopédie*, it does not follow perfectly yet
+the specification we have just described. Figure @fig:cathete-xml-current shows
+the encoded version of article "Cathète" it currently produces:
 ![The current encoding of article "Cathète" produced by `soprano`](snippets/cathète_current.png){#fig:cathete-xml-current}
@@ -802,11 +799,11 @@ which even some human experts may disagree.
 For these reasons, a central concern in the design of our encoding scheme was to
 remain within the boundaries of information that can be described objectively
-and extracted automatically by an algorithm. Most of the tags presented above
-contain information about the positions of the elements or their relation to one
-another. Those with an additional semantics implication like `<head/>` can be
-inferred simply from their position and the frequent use of a special typography
-like bold or upper-case characters.
+and extracted automatically by an algorithm. Most of the tags presented in
+section @sec:core-module contain information about the positions of the elements
+or their relation to one another. Those with an additional semantics implication
+like `<head/>` can be inferred simply from their position and the frequent use
+of a special typography like bold or upper-case characters.
 The case of cross-references is particular and may appear as a counter-example
 to the main principle on which our scheme is based. Actually, the process of
@@ -818,7 +815,7 @@ Encyclopédie*, virtually all the redirections (that is, to the extent of our
 knowledge, absolutely all of them though of course some special case may exist,
 but they are statistically rare enough that we have not found any yet) appear
 within parenthesis, and start with the verb "voir" abbreviated as a single,
-capital "V." as illustrated above in the article "Gelocus" (see again Figure
+capital "V." as illustrated in the article "Gelocus" (see again Figure
 @fig:gelocus-photo).
 Although this has not been implemented yet either, we hope to be able to detect
@@ -834,10 +831,10 @@ outputting them.
 This is in line with the last important aspect of our encoder. If many
 lexicographers may deem our encoding too shallow, it has the advantage of not
 requiring to keep too complex datastructures in memory for a long time. The
-algorithm implementing it in `soprano` outputs elements as soon as it can, for
-instance the empty elements already discussed above. For articles, it pushes
-lines onto a stack and flushes it each time it encounters the beginning of the
-following article. This allows the amount of memory required to remain
+algorithm implementing it in `soprano` outputs elements as soon as it can. This
+is immediate for simple elements such as `<pb/>` or `<fw/>`; for articles, it
+pushes lines onto a stack and flushes it each time it encounters the beginning
+of the following article. This allows the amount of memory required to remain
 reasonable and even lets them be parallelised on most modern machines. Thus,
 even taking over three minutes per tome, the total processing time can be
 lowered to around forty minutes on a machine with 16Go of RAM for the whole of
@@ -886,7 +883,7 @@ schema would remain useful entails to maintain it, regenerating it should the
 schema format evolve, with the risk that the tools to edit it might change or
 stop being maintained.
-# Conclusion {-}
+# Conclusion
 Though they are very close genres and share a common history, we have evidenced
 key aspects on which dictionaries and encyclopedias differ. Not only do entries