## The constraints of automated processing
Encyclopedias are particularly long books, spanning numerous tomes and
containing several tens of thousands of articles. The *Encyclopédie* comprises
over 74k articles and *La Grande Encyclopédie* certainly more than 100k (the
latest version produced by `soprano` contains 160k articles, but their
segmentation is still not perfect: while some article beginnings remain
undetected, all the very long and deeply-structured articles are unduly split
into many parts, which globally results in an over-estimation of the total
number). In any case, it consists of 31 tomes of 1200 pages each.
XML-TEI is a very broad tool useful for very different applications. Some
elements like `<unclear/>` or `<factuality/>` can encode subtle semantic
information (in the latter case, adjacent to a notion as elusive as truth)
which requires a very deep understanding of a text in its entirety and about
which even human experts may disagree.
For these reasons, a central concern in the design of our encoding scheme was to
remain within the boundaries of information that can be described objectively
and extracted automatically by an algorithm. Most of the tags presented above
contain information about the positions of the elements or their relation to one
another. Those with an additional semantic implication like `<head/>` can be
inferred simply from their position and the frequent use of a distinctive
typography such as bold or upper-case characters.
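
The kind of heuristic involved could look like the following minimal sketch,
assuming a line representation that exposes positional and typographic hints
(the function and field names are hypothetical, not part of `soprano`'s actual
API):

```python
# Hypothetical line record: the text of a line plus typographic hints that an
# OCR/ALTO parser could provide (the parameter names are illustrative only).
def looks_like_head(text: str, is_first_line: bool, is_bold: bool) -> bool:
    """Guess whether a line opens an article: headwords sit on the first line
    of a block and are usually bold or written in upper-case characters."""
    token = text.strip().split(" ")[0] if text.strip() else ""
    is_upper = bool(token) and token == token.upper() and any(c.isalpha() for c in token)
    return is_first_line and bool(token) and (is_bold or is_upper)

looks_like_head("GELOCUS. …", is_first_line=True, is_bold=False)  # True
```
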
The case of cross-references is particular and may appear as a counter-example
to the main principle on which our scheme is based. In fact, linking from one
article to another is so frequent (in dictionaries as well as in encyclopedias)
that it generally escapes the scope of regular discourse and takes a special,
often fixed form, inside parentheses and after a special token which invites
the reader to perform the redirection. In *La Grande Encyclopédie*, virtually
all the redirections (that is, to the best of our knowledge, absolutely all of
them; some special cases may of course exist, but they are statistically rare
enough that we have not found any yet) appear within parentheses and start with
the verb "voir" abbreviated as a single capital "V.", as illustrated above in
the article "Gelocus".
Although this has not been implemented yet either, we hope to be able to detect
and exploit those patterns to correctly encode cross-references. Getting the
`target` attributes right is certainly more difficult to achieve and may require
processing the articles in several steps, to first discover all the existing
headwords — and hence article IDs — before trying to match the words following
"V." with them. Since our automated encoder handles tomes separately and since
references may cross the boundaries of tomes, it cannot hold articles in memory,
waiting for the target of a cross-reference to be discovered, before outputting
them.
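
One possible way to organise such a multi-step resolution is sketched below,
under the assumption that a first pass can list every headword (and derive its
article ID) before a second pass tries to match the words following "V."; the
helper names and the ID scheme are hypothetical:

```python
def build_id_index(headwords):
    """First pass: map every headword seen across all tomes to its article ID
    (here a simple ID is derived from the headword itself, as an assumption)."""
    return {word.lower(): f"article-{word.lower()}" for word in headwords}

def resolve_target(candidate, id_index):
    """Second pass: return the ID of the article a 'V. …' candidate points to,
    or None when no such headword exists."""
    return id_index.get(candidate.strip(" .,;").lower())

index = build_id_index(["Gelocus", "Gentiane"])  # invented headword list
resolve_target("Gelocus", index)   # 'article-gelocus'
resolve_target("Inconnu", index)   # None
```
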
This is in line with the last important aspect of our encoder. While many
lexicographers may deem our encoding too shallow, it has the advantage of not
requiring overly complex data structures to be kept in memory for a long time.
The algorithm implementing it in `soprano` outputs elements as soon as it can,
for instance the empty elements already discussed above. For articles, it pushes
lines onto a stack and flushes it each time it encounters the beginning of the
following article. This keeps the amount of memory required reasonable and even
lets tomes be processed in parallel on most modern machines. Thus, even at over
3 minutes per tome, the total processing time can be lowered to around
40 minutes for the whole of *La Grande Encyclopédie* instead of over an hour
and a half.
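
The overall strategy can be summed up by the following sketch, which streams
each tome with a bounded buffer and processes tomes in parallel; the
article-beginning test and the file layout are simplifying assumptions, not
`soprano`'s actual implementation:

```python
import re
from multiprocessing import Pool

# Hypothetical article-beginning test: a line opening with an upper-case
# headword (a stand-in for soprano's real segmentation logic).
HEAD_PATTERN = re.compile(r"^[A-ZÀ-Ý][A-ZÀ-Ý'-]+\b")

def encode_tome(tome_path: str) -> int:
    """Stream one tome: buffer lines and flush an article as soon as the
    beginning of the next one is detected, so memory usage stays bounded."""
    articles, buffer = 0, []
    with open(tome_path, encoding="utf-8") as tome:
        for line in tome:
            if HEAD_PATTERN.match(line) and buffer:
                articles += 1  # this is where the finished article would be emitted
                buffer = []
            buffer.append(line)
    return articles + (1 if buffer else 0)

# Tomes are independent, so they can be processed in parallel: 31 tomes at
# over 3 minutes each take more than an hour and a half serially, but around
# 40 minutes with several worker processes.
if __name__ == "__main__":
    tome_paths = [f"tome_{n:02d}.txt" for n in range(1, 32)]  # assumed layout
    with Pool() as pool:
        counts = pool.map(encode_tome, tome_paths)
```
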
## Comparison to other approaches
Before deciding to give up on the *dictionaries* module and attempting to devise
our own encoding scheme, several scenarios were considered and compared to
find the one most compatible with our constraints.
### Bend the semantics
### Custom schema