diff --git a/ICHLL_Brenon.md b/ICHLL_Brenon.md index e9cf27664982c0052c0e0dcc71f1c67136f0fec8..356d2503d1555a2ca46242e61bd5f8401611b55a 100644 --- a/ICHLL_Brenon.md +++ b/ICHLL_Brenon.md @@ -77,7 +77,7 @@ was entirely reworked, mildly stating that good encyclopedias are difficult to make because of the amount of knowledge necessary and work needed to keep up with scientific progress instead of calling the effort a parody. It credits Chamber's *Cyclopædia* for being a decent attempt before referring anonymously -though quite explicitely to Diderot and d'Alembert's project by naming the +though quite explicitly to Diderot and d'Alembert's project by naming the collective "Une Société de gens de Lettres" and writing that it started in 1751. Even more importantly, two new entries were added after it: one for the adjective "encyclopédique" and another one for the noun "encyclopédiste", silently admitting @@ -127,15 +127,16 @@ was to digitize and make *La Grande Encyclopédie* available to the scientific community as well as the general public. A previous version was partially available on [Gallica](https://gallica.bnf.fr/services/engine/search/sru?operation=searchRetrieve&version=1.2&collapsing=disabled&query=%28dc.title%20all%20%22La%20Grande%20encyclop%C3%A9die%22%29%20and%20dc.relation%20all%20%22cb377013071%22&rk=42918;4#) -but lacked in quality and had not been fully OCRized. +but lacked in quality and its text had not been fully extracted from the +pictures with an Optical Character Recognition (OCR) system. # The *dictionaries* TEI module Producing *interoperable* and *reusable* data is paramount for them to be useful -for future other scientific projects. These are the two last key aspects of the +in other future scientific projects. 
These are the last two key aspects of the [FAIR](https://www.go-fair.org/fair-principles/) principles (*findability*, *accessibility*, *interoperability* and *reusability*) which we strive to -enforce as a guideline for efficient and quality research. It entails using +follow as a guideline for efficient and quality research. It entails using standard formats and a standard for encoding historical texts in the context of digital humanities is XML-TEI, collectively developped by the *Text Encoding Initiative* consortium. It consists in a set of technical specifications under @@ -152,24 +153,83 @@ represent them in a monospace font, in the standard XML autoclosing form within angle brackets and with a slash following the element name like `<div/>` for a [`div` element](https://tei-c.org/release/doc/tei-p5-doc/en/html/ref-div.html). We do not mean by this notation that they cannot contain raw text or other XML -elements. +elements, merely that we are referring to such an element, with the whole subtree +that spans from it in the context of a concrete document instance or as an empty +structure when we are considering the abstract element and the rules that govern +its use in relation to other elements or its attributes. ## Content +## A graph problem + +The XML-TEI specification contains 590 elements, which are each documented on +the consortium's website in the online reference pages. With an average of +almost 80 possible child elements (79.91) within any given element, manually +browsing such a massive network can prove quite difficult as the number of +combinations sharply increases with each step. We transform the problem by +representing this network as a directed graph, using elements of XML-TEI as +nodes and placing an edge whenever the destination node may be contained within +the source node according to the schema. 
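
As a minimal sketch of this representation, the Python snippet below stores a handful of elements as a directed graph (an adjacency mapping) and searches it breadth-first, which on an unweighted graph returns a shortest path. The containment lists are a hand-picked toy subset chosen for illustration, not the full TEI content model, and the names `CONTAINS` and `shortest_inclusion_path` are hypothetical.

```python
from collections import deque

# Toy subset of the containment relation: an edge parent -> child means the
# child element may appear directly inside the parent. Hand-picked for
# illustration; the real schema has far more elements and edges.
CONTAINS = {
    "body": ["entry", "entryFree", "superEntry"],
    "superEntry": ["entry"],
    "entryFree": ["entry", "form", "gramGrp", "pos"],
    "entry": ["form", "gramGrp", "sense"],
    "gramGrp": ["pos", "case", "gen", "number"],
}


def shortest_inclusion_path(graph, source, target):
    """Return a shortest list of nested elements from source to target, or None."""
    parents = {source: None}  # also serves as the set of visited nodes
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            # Walk back through the recorded parents to rebuild the path.
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for child in graph.get(node, []):
            if child not in parents:
                parents[child] = node
                queue.append(child)
    return None  # target cannot be nested under source in this graph


print(shortest_inclusion_path(CONTAINS, "body", "pos"))
print(shortest_inclusion_path(CONTAINS, "pos", "entry"))
```

A plain mapping from element names to lists of children is enough here because every edge has the same cost; a weighted search would only be needed if some inclusions were to be preferred over others.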
+ +By iterating several times the operation of moving on that graph along one edge, +that is, by considering the transitive closure of the relation "be connected by +an edge", we define *inclusion paths* which allow us to explore which elements +may be nested under one another. The nodes visited along the way represent the +intermediate XML elements needed to construct a valid XML tree according to the +TEI schema. Given the top-down semantics of those trees, we call the length of an +inclusion path its *depth*. + +Using classical, well-known methods such as Dijkstra's algorithm (Dijkstra, 1959) +allows us to explore the shortest inclusion paths that exist between elements. +Though particular caution should be exercised because there is no guarantee that +the shortest path is meaningful in general, it at least provides us with an +efficient way to check whether a given element may or may not be nested under +another one at all and gives an order of magnitude of the length of the path to +expect. Of course the accuracy of this heuristic decreases as the length of the +intended, meaningful path between two nodes increases, but this formalism lets +us consider element combinations rationally and exhaustively by algorithmic +means. + +For instance, it lets one find that although `<pos/>` may not be directly +included within `<entry/>` elements to include information about the +part-of-speech of the word that an article defines, the correct way to do so is +through a `<gramGrp/>`. On the other hand, trying to discover the shortest +inclusion path to `<pos/>` from the `<TEI/>` root of the document yields a +`<standOff/>`, an element dedicated to storing contextual data that accompanies +but is not part of the text, not unlike an annex, and probably not what we want +in the context of encoding an encyclopedia. 
A last relevant example on the use +of this approach can be given by querying the shortest inclusion path of a +`<pos/>` under the `<body/>` of the document: it yields an inclusion directly +through `<entryFree/>` (with an inclusion path of length 2), which, unlike +`<entry/>`, allows it as a direct child node. Possibly not what we want depending +on the regularity of the articles we are encoding and the existence of other +grammatical information such as `<case/>` or `<gen/>` in languages with an +inflexion system to justify the use of the `<gramGrp/>`, but it gives a good +general idea: `<pos/>` does not need to be nested very deep, it can appear quite +near the "surface" of article entries. + ### The `<entry/>` element The central element of the *dictionaries* module is the `<entry/>` element meant to encode one single entry in a dictionary, that is to say a head word -associated to its definition. Although it may be contained by `<entryFree/>` or -`<superEntry/>` elements which are respectively tools to relax some constraints -on `<entry/>` elements or to group several of them together, it is the +associated to its definition. It is the natural entry point from the `<body/>` +element to the dictionary module: indeed, although `<body/>` may also contain +`<entryFree/>` or `<superEntry/>` elements, the former is a relaxed version of +`<entry/>` while the latter is a device to group several related entries +together. Both can contain an `<entry/>` directly while no obvious inclusion +exists the other way around. 
Most of the inclusion paths of "reasonable" depth +(which we define as strictly less than 5, that is twice the average shortest +depth between any two nodes) seem to either include `<figure/>` + +Once a block for an article is created + +It contains elements useful to represent the features occurring at the beginning +of an article such as its written and spoken forms (`<form/>`), a group of +grammatical information (`<gramGrp/>`), that may itself contain as we've seen +above `<case/>`, `<gen/>`, `<number/>` or `<pos/>` to describe the form itself for instance, or ` -- hom -- model.entryPart.top -- model.global -- model.ptrLike -- pc -- sense +All these are quite exhaustive and seem general enough to accommodate any book +structure indexing entries by words. A more # A new standard ? @@ -181,44 +241,101 @@ propose a new encoding scheme. ## Nested structures -### The `<entryFree/>` option +### Example + +### Candidates in the *dictionaries* module + +- `<sense/>` +- `<entryFree/>` +- `<note/>` + +## Encoding within the *core* module + +The above remark explains why the *dictionaries* module by itself is unable to +represent encyclopedias, where discourse with nested structures of arbitrary +depth can occur. Since the *core* module of course accommodates these structures +by means of the `<div/>`, `<head/>` and `<p/>` elements, we devise an encoding +scheme using them which we recommend using for other projects aiming at +representing encyclopedias. + +To remain consistent with the above remarks we will only concern ourselves with +what happens at the level of each article, right under the `<body/>` element. +Everything related to metadata happens as expected in the file's `<teiHeader/>` +which is well enough equipped to handle them. In order to present our scheme +throughout the following section we will be progressively encoding a reference +article, "Cathète" from tome 9. + + + +### The scheme + +Each article is represented by a `<div/>`. 
We suggest setting an `xml:id` +attribute on it whose value is the head word of the entry, normalized to +lowercase, stripping spaces and replacing all non-alphanumerical characters by a +dash `'-'` to avoid issues with the XML encoding. This value must be unique, or +made so by suffixing a number representing its rank among the various +occurrences (even when there is only one, for the sake of regularity). + + + +Inside this element should be a `<head/>` enclosing the headword of the article. +The usual sub-`<hi/>` elements are available within `<head/>` if the headword is +highlighted by any special typographic means such as bold, small capitals, etc. +This element should also contain the optional subject indicator within +parentheses that sometimes accompanies the headword, with the appropriate standard +elements like `<persName/>` occurring in biographical articles or `<interp/>` +with a `theme` attribute if the article is given a specific domain in a +taxonomy. + + + +We propose to then wrap each different meaning in a separate `<div/>` with the +`type` attribute set to `sense` to refer to the `<sense/>` element that would've +been used within the *dictionaries* module. Each sense should be numbered with +the `n` attribute. + + + +In addition, each line within the article must start with a `<lb/>` to mark its +beginning including before the `<head/>` element, which, although a surprising +setup, underlines the fact that in the dense layout of encyclopedias, the +carriage return separating two articles is meaningful. Stating each new line +explicitly also keeps enough information to reconstruct a faithful facsimile and +has the advantage of highlighting the fact that even though the +definition is cut from the headword by being in a separate XML element, they +still occur on the same line, which is a typographic choice usually made both in +encyclopedias and dictionaries where space is at a premium. 
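
The scheme described above can be sketched in Python with the standard `xml.etree.ElementTree` module. This is an illustration under stated assumptions, not the reference implementation: the helper names `article_id` and `encode_article` are hypothetical, the rank is assumed to start at 0 and to be joined with a dash, "stripping spaces" is read as trimming surrounding whitespace, each sense is assumed to hold a single `<p/>`, and the sense texts are placeholders rather than the actual content of the article.

```python
import xml.etree.ElementTree as ET

# Namespace bound to the reserved "xml" prefix, needed to write xml:id.
XML_NS = "http://www.w3.org/XML/1998/namespace"


def article_id(headword: str, rank: int) -> str:
    """Lowercase the headword, replace non-alphanumerics by '-', append the rank."""
    normalized = headword.strip().lower()  # assumption: only trim outer spaces
    dashed = "".join(c if c.isalnum() else "-" for c in normalized)
    return f"{dashed}-{rank}"  # assumption: dash separator, rank always present


def encode_article(headword: str, rank: int, senses: list[str]) -> ET.Element:
    """Build <div xml:id=...><lb/><head>...</head><div type="sense" n=.../>...</div>."""
    div = ET.Element("div", {f"{{{XML_NS}}}id": article_id(headword, rank)})
    ET.SubElement(div, "lb")  # line beginning placed before the <head/>
    head = ET.SubElement(div, "head")
    head.text = headword
    for n, text in enumerate(senses, start=1):
        sense = ET.SubElement(div, "div", {"type": "sense", "n": str(n)})
        paragraph = ET.SubElement(sense, "p")  # assumption: one <p/> per sense
        paragraph.text = text
    return div


# Placeholder sense texts, not the real content of the article.
article = encode_article("Cathète", 0, ["(first sense)", "(second sense)"])
print(ET.tostring(article, encoding="unicode"))
```

Only the first `<lb/>` is modelled here; marking every subsequent line would require the line breaks of the source text, which this sketch does not have.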
-https://tei-c.org/release/doc/tei-p5-doc/en/html/ref-entryFree.html +Finally, the various sections and sub-sections occurring within the article body +may be nested as usual with `<div/>` and sub-`<div/>`s, filled with `<p/>` for +paragraphs which can each be titled with `<head/>` elements local to each +`<div/>`. -- gLike -- model.entryPart -- model.inter -- model.global -- model.morphLike -- model.phrase + -diff: +But a typical page of an encyclopedia also features peritext elements, giving +information to the reader about the current page number along with the headwords +of the first and last articles appearing on the page. -- hom -+ gLike -- model.entryPart.top -+ model.entryPart -+ model.inter -- model.ptrLike -+ model.morphLike -+ model.phrase -- pc -- sense +Depending -gLike: glyphs… -model.entryPart: non-morphological elements like usage (<usg/>), collocation -(<colloc/>) or <sense/> -model.inter: inter § -model.morphLike: morphological elements -model.phrase: individual words or phrases (within § so still no) +Moreover, the layout is +often -=> + free text, but still no structure ! (<div/>, <p/>…) +### Currently implemented -## The *core* module +The reference implementation for this encoding scheme is the program `soprano` +developed within the scope of project DISCO-LGE. Though this software is already +useful to segment the text of the encyclopedia into articles and encode them +into XML-TEI, it doesn't yet follow the above specification perfectly. Here is +for instance the encoded version of article "Cathète" that it currently produces: -### Implemented + -### Left-overs +The headword detection system is not able to capture the subject indicators yet +so they appear outside of the `<head/>` element. 
Likewise, since the detection of +titles at the beginning of each section isn't complete, no structure analysis is +performed on the content of the article. ## The constraints of automated processing diff --git a/Makefile b/Makefile index 47ad2b8fd5fa75b2cf03bf4a950655377f32c8bb..4d7cc1b6f898c1a55fc30d9510365f5d7d86c7af 100644 --- a/Makefile +++ b/Makefile @@ -9,11 +9,11 @@ ICHLL_Brenon.pdf: $(DEPEDENCIES) ICHLL_Brenon.docx: $(DEPEDENCIES) %.pdf: %.md - pandoc $< -o $@ + LANG=fr_FR.UTF-8 pandoc $< -o $@ %.png: %.pdf pdftocairo -png -singlefile -r 400 $^ $(basename $@) %.docx: %.md - pandoc $< -o $@ + LANG=fr_FR.UTF-8 pandoc $< -o $@ diff --git "a/ressources/cath\303\250te_t9.png" "b/ressources/cath\303\250te_t9.png" new file mode 100644 index 0000000000000000000000000000000000000000..1b911ea0baaf67fc76e9ce11827c5ab68059ebe0 Binary files /dev/null and "b/ressources/cath\303\250te_t9.png" differ