diff --git a/ICHLL_Brenon.md b/ICHLL_Brenon.md
index d61cbde86af179b555267f27c1b18523ceeabc5b..89c0a76fb8cf8455ad89082a4a6d60e60aff0777 100644
--- a/ICHLL_Brenon.md
+++ b/ICHLL_Brenon.md
@@ -728,8 +728,69 @@ by `soprano` when inferring the reading order before segmenting the articles.
 
 ## The constraints of automated processing
 
+Encyclopedias are particularly long books, spanning numerous tomes and
+containing several tens of thousands of articles. The *Encyclopédie* comprises
+over 74k articles and *La Grande Encyclopédie* certainly more than 100k (the
+latest version produced by `soprano` contains 160k articles, but the
+segmentation is still not perfect: while some article beginnings remain
+undetected, very long and deeply-structured articles are unduly split into
+many parts, resulting globally in an over-estimation of the total number). In
+any case, *La Grande Encyclopédie* consists of 31 tomes of about 1,200 pages
+each.
+
+XML-TEI is a very broad tool useful for very different applications. Some
+elements like `<unclear/>` or `<factuality/>` can encode subtle semantic
+information (the latter touching on a notion as elusive as truth) which
+requires a very deep understanding of a text in its entirety and about which
+even human experts may disagree.
+
+For these reasons, a central concern in the design of our encoding scheme was
+to remain within the boundaries of information that can be described
+objectively and extracted automatically by an algorithm. Most of the tags
+presented above contain information about the positions of the elements or
+their relation to one another. Those with an additional semantic implication,
+like `<head/>`, can be inferred simply from their position and the frequent
+use of a special typography such as bold or upper-case characters.
+
+The case of cross-references is special and may appear as a counter-example
+to the main principle on which our scheme is based. In fact, linking from one
+article to another is so frequent (in dictionaries as well as in
+encyclopedias) that it generally escapes the scope of regular discourse to
+take a special, often fixed form, inside parentheses and after a special
+token which invites the reader to perform the redirection. In *La Grande
+Encyclopédie*, virtually all the redirections (to the best of our knowledge,
+all of them: special cases may of course exist, but they are statistically
+rare enough that we have not found any yet) appear within parentheses and
+start with the verb "voir" abbreviated as a single capital "V.", as
+illustrated above in the article "Gelocus".
+
+Although this has not been implemented yet either, we hope to be able to
+detect and exploit those patterns to correctly encode cross-references.
+Getting the `target` attributes right is certainly more difficult to achieve
+and may require processing the articles in several steps, to first discover
+all the existing headwords — and hence article IDs — before trying to match
+the words following "V." against them. Since our automated encoder handles
+tomes separately and since references may cross the boundaries of tomes, it
+cannot wait for the target of a cross-reference to be discovered by keeping
+the articles in memory before outputting them.
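+
+As an illustration, here is a minimal sketch of such a two-pass approach,
+written in Python purely for exposition: this is not `soprano`'s actual
+code, and the article objects exposing `headword` and `identifier` are
+hypothetical.
+
+```python
+import re
+
+# Redirections like "(V. Gelocus)": the abbreviation "V." followed by the
+# target headword, inside parentheses. The exact pattern is an assumption.
+REDIRECTION = re.compile(r'\(\s*V\.\s+([^)]+?)\s*\)')
+
+def collect_headwords(articles):
+    """First pass: map each headword to the ID of its article."""
+    return {article.headword: article.identifier for article in articles}
+
+def encode_cross_references(text, headword_to_id):
+    """Second pass: wrap each detected redirection in a TEI <ref/>."""
+    def encode(match):
+        headword = match.group(1)
+        target = headword_to_id.get(headword)
+        if target is None:
+            # Unresolved target: keep the reference but omit @target.
+            return f'(<ref>V. {headword}</ref>)'
+        return f'(<ref target="#{target}">V. {headword}</ref>)'
+    return REDIRECTION.sub(encode, text)
+```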
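+
+The next paragraph explains how `soprano` keeps its memory footprint low by
+streaming articles one at a time and processing tomes in parallel. The
+following sketch illustrates that strategy under the same caveats: the
+helpers `starts_new_article` and `encode_article` and the list `tome_paths`
+are placeholders, not actual `soprano` functions.
+
+```python
+from multiprocessing import Pool
+
+def segment_articles(lines, starts_new_article):
+    """Buffer lines and flush them as a complete article whenever the
+    beginning of the next article is detected, so that at most one
+    article is held in memory at a time."""
+    buffer = []
+    for line in lines:
+        if starts_new_article(line) and buffer:
+            yield buffer
+            buffer = []
+        buffer.append(line)
+    if buffer:
+        yield buffer
+
+def process_tome(path):
+    """Encode a single tome, article by article."""
+    with open(path) as tome:
+        return [encode_article(article)
+                for article in segment_articles(tome, starts_new_article)]
+
+if __name__ == '__main__':
+    # Tomes are independent, so they can be encoded in parallel.
+    with Pool() as pool:
+        results = pool.map(process_tome, tome_paths)
+```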
+
+This is in line with the last important aspect of our encoder. Though many
+lexicographers may deem our encoding too shallow, it has the advantage of
+not requiring complex data structures to be kept in memory for a long time.
+The algorithm implementing it in `soprano` outputs elements as soon as it
+can, for instance the empty elements already discussed above. For articles,
+it pushes lines onto a stack and flushes it each time it encounters the
+beginning of the following article, as in the sketch above. This keeps the
+amount of memory required reasonable and even lets tomes be processed in
+parallel on most modern machines. Thus, even at over 3 minutes per tome, the
+total processing time for the whole of *La Grande Encyclopédie* can be
+lowered to around 40 minutes instead of over an hour and a half.
+
 ## Comparison to other approaches
 
+Before deciding to give up on the *dictionaries* module and attempting to
+devise our own encoding scheme, several scenarios were considered and
+compared to find the one most compatible with our needs.
+
 ### Bend the semantics
 
 ### Custom schema