This notation is not meant to imply that they cannot contain raw text or other
XML elements; it merely denotes such an element, without any additional
...
...
@@ -312,6 +293,11 @@ browsing such an massive network can prove quite difficult as the number of
combinations sharply increases with each step.
The problem can be advantageously transformed by representing this network as a
graph to benefit from the results of graph theory. Classical, well-known methods
such as Dijkstra's algorithm [@dijkstra59], which computes the shortest path
between two nodes in a graph, can then be applied
directed graph, using elements of XML-TEI as nodes and placing edges if the
destination node may be contained within the source node according to the
schema. Please note that the word "element" is here used with the same meaning
...
...
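
The graph representation described above can be sketched as follows. This is only a minimal illustration: the `CONTAINS` table is a hypothetical, heavily simplified set of containment rules rather than the actual TEI schema, and `shortest_path` is an illustrative helper name. An edge from A to B means that B may be contained within A, and Dijkstra's algorithm is run with every edge given the same unit weight.

```python
import heapq

# Hypothetical, heavily simplified containment rules (NOT the real TEI schema):
# an edge A -> B means "element B may be contained within element A".
CONTAINS = {
    "entry": {"form", "sense"},
    "form": {"orth"},
    "sense": {"def", "usg", "sense"},  # <sense/> may nest within itself
    "orth": set(),
    "def": set(),
    "usg": set(),
}


def shortest_path(graph, source, target):
    """Dijkstra's algorithm with unit edge weights.

    Returns the list of nodes from source to target, or None if the
    target cannot be reached.
    """
    queue = [(0, source, [source])]
    visited = set()
    while queue:
        dist, node, path = heapq.heappop(queue)
        if node == target:
            return path
        if node in visited:
            continue
        visited.add(node)
        for child in sorted(graph.get(node, ())):
            if child not in visited:
                heapq.heappush(queue, (dist + 1, child, path + [child]))
    return None


print(shortest_path(CONTAINS, "entry", "def"))
```

With these toy rules, the shortest way to reach a `<def/>` from an `<entry/>` goes through `<sense/>`, while no path leads from `<def/>` back to `<entry/>` since edges are directed.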
@@ -492,9 +478,9 @@ Thus, despite a rather dense internal connectivity, the *dictionaries* module
fails to provide encoders with a device to represent recursively nesting
structures like `<div/>`.
# A new standard ?
# A new standard? {#sec:new-standard}
Studying the content of *La Grande Encyclopédie* and considering several
Studying the content of *LGE* and considering several
articles in particular, one can identify structures which are specific to
encyclopedias and not compatible with the *dictionaries* module presented in the
previous section. It follows that this module is not able to encode arbitrary
...
...
@@ -512,11 +498,11 @@ of the great variety in terms of editorial choices the most obvious can be
discussed.
The first immediately visible feature that sets encyclopedias apart from
dictionaries and can be found in the *Encyclopédie* as well as in *La Grande
Encyclopédie* is the presence of subject indicators at the beginning of articles
right after the headword which organise them into a domain classification
system. Those generally cover a broad range of subjects from scientific
disciplines to litterature, and extending to political subjects and law.
dictionaries and can be found in the *EDdA* as well as in *LGE* is the presence
of subject indicators at the beginning of articles, right after the headword,
which organise them into a domain classification system. Those generally cover a
broad range of subjects from scientific disciplines to literature, extending to
political subjects and law.
No element in the *dictionaries* module is explicitly designed for the purpose
of encoding these indicators. As section @sec:dictionaries-module illustrates,
...
...
@@ -530,7 +516,7 @@ context in which the word can be used, as expected from the element's name,
"usage". In encyclopedias, although the domain indicator does in certain cases
help to distinguish between several entries sharing the same headword, the concept
itself has evolved beyond this mere distinction. Looking back at the
*Encyclopédie*, the adjective *raisonné* in the rest of the title directly
*EDdA*, the adjective *raisonné* in the rest of the title directly
introduces a notion of structure that links back to the "Systême figuré des
connoissances humaines" [@blanchard2002, p. 1], whose schematic structure is
shown in Figure @fig:systeme-figure. The authors have devised a branching system
...
...
@@ -558,14 +544,14 @@ relevant.
Notwithstanding the correct way to represent domains of knowledge, their extent
itself raises concerns regarding the *dictionaries* module. Indeed, among the
vast collection of domains covered in encyclopedias in general and in *La Grande
Encyclopédie* in particular are historical articles and biographies. If the
notion of meaning can appear at least ill-fitting for a text describing a series
of historical events, one may still argue that it groups them into a concept and
associates it to the name of the event. But when it comes to relating the life
of a person, describing their relation to events and other persons comes out
even further from the notion of meaning. Entries such as the one about SANJO
Sanetomi (see Figure @fig:sanjo) do not constitute a *definition*.
vast collection of domains covered in encyclopedias in general and in *LGE* in
particular are historical articles and biographies. If the notion of meaning can
appear at least ill-fitting for a text describing a series of historical events,
one may still argue that it groups them into a concept and associates it with the
name of the event. But when it comes to relating the life of a person,
describing their relation to events and other persons strays even further
from the notion of meaning. Entries such as the one about SANJO Sanetomi (see
Figure @fig:sanjo) do not constitute a *definition*.
)](ressources/sanjo_t29.png){#fig:sanjo}
...
...
@@ -745,7 +731,7 @@ to the abstract objects that mathematics or poetry are).
For this reason, no particular encoding of the subject indicator is recommended
and it is left open to each particular context: they are often abbreviated so an
`<abbr/>` may apply, in *La Grande Encyclopédie*, biographies are not labeled by
`<abbr/>` may apply; in *LGE*, biographies are not labeled by
a knowledge domain but usually include the first name of the person when it is
known, so in that case an element like `<persName/>` is still appropriate. This
choice applied to the same article "Cathète" produces Figure @fig:cathete-xml-1.
...
...
@@ -812,8 +798,7 @@ prone to suffer damages or be misread by the OCR).
Finally, there are other TEI elements useful to represent "events" in the flow of
the text, like the beginning of a new column of text or of a new page. Figure
@fig:alcala-photo shows the top left of the last page of the first tome of *La
Grande Encyclopédie* which features peritext elements while marking the
@fig:alcala-photo shows the top left of the last page of the first tome of *LGE* which features peritext elements while marking the
beginning of a new page. The usual appropriate elements (`<pb/>` for page
beginning, `<cb/>` for column beginning) may and should be used with this
encoding scheme as demonstrated by Figure @fig:alcala-xml.
...
...
@@ -827,7 +812,7 @@ The reference implementation for this encoding scheme is the program soprano
developed within the scope of project DISCO-LGE to automatically identify
individual articles in the flow of raw text from the columns and to encode them
into XML-TEI files. Though this software has already been used to produce the
first TEI version of *La Grande Encyclopédie*, it does not follow perfectly yet
first TEI version of *LGE*, it does not yet perfectly follow
the specification described in this chapter. Figure @fig:cathete-xml-current
shows the encoded version of article "Cathète" it currently produces:
...
...
@@ -860,8 +845,8 @@ by `soprano` when inferring the reading order before segmenting the articles.
## The constraints of automated processing
Encyclopedias are particularly long books, spanning numerous tomes and
containing several tenths of thousands of articles. The *Encyclopédie* comprises
over 74k articles and *La Grande Encyclopédie* certainly more than 100k (the latest
containing several tens of thousands of articles. The *EDdA* comprises
over 74k articles and *LGE* certainly more than 100k (the latest
version produced by `soprano` created 160k articles, but their segmentation is
still not perfect: while some article beginnings remain undetected, all the very
long and deeply-structured articles are unduly split into many parts, resulting
...
...
@@ -886,11 +871,11 @@ to the main principle on which this scheme is based. Actually, the process of
linking from an article to another one is so frequent (in dictionaries as well
as in encyclopedias) that it generally escapes the scope of regular discourse to
take a special and often fixed form, inside parentheses and after a special
token which invites the reader to perform the redirection. In *La Grande
Encyclopédie*, virtually all the redirections appear within parenthesis (at
least no counter-example has been found within the scope of the project), and
start with the verb "voir" abbreviated as a single, capital "V." as illustrated
in the article "Gelocus" (see again Figure @fig:gelocus-photo).
token which invites the reader to perform the redirection. In *LGE*, virtually
all the redirections appear within parentheses (at least no counter-example has
been found within the scope of the project), and start with the verb "voir"
abbreviated as a single, capital "V." as illustrated in the article "Gelocus"
(see again Figure @fig:gelocus-photo).
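
The fixed form described above lends itself to a simple pattern search. The following is a minimal sketch and not the way `soprano` works: the regular expression, the function name `find_cross_references` and the sample string are all assumptions made for the sake of illustration (the sample is invented text, not an excerpt from *LGE*). It only looks for a parenthesis opening with the token "V." and captures what follows up to the closing parenthesis.

```python
import re

# Assumed pattern: "(" then the token "V." then the target of the
# redirection, up to the closing ")". Nested parentheses are not handled.
XREF = re.compile(r"\(\s*V\.\s+([^()]+?)\s*\)")


def find_cross_references(text):
    """Return the targets of all "(V. ...)" redirections found in text."""
    return [match.group(1) for match in XREF.finditer(text)]


# Invented sample text, only meant to exercise the pattern.
sample = "Lorem ipsum (V. Exemple) dolor (Géogr.) sit amet (V. Autre exemple, t. II)."
print(find_cross_references(sample))
```

Parenthesised groups that do not start with "V.", such as the made-up subject indicator "(Géogr.)" in the sample, are ignored. A real implementation would still have to cope with OCR noise and with targets spanning several headwords.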
Although this has not been implemented yet either, being able to detect and
exploit those patterns to correctly encode cross-references does not pose any
...
...
@@ -913,7 +898,7 @@ of the following article. This allows the amount of memory required to remain
reasonable and even lets them be parallelised on most modern machines. Thus,
even taking over three minutes per tome, the total processing time can be
lowered to around forty minutes on a machine with 16 GB of RAM for the whole of
*La Grande Encyclopédie* instead of over one hour and a half.