diff --git a/ICHLL_Brenon.md b/ICHLL_Brenon.md index 3c5f877381a4133db08501ebc020b713c52b5b8b..7359ac5401580edb49bf1983a422f67cfe30afc3 100644 --- a/ICHLL_Brenon.md +++ b/ICHLL_Brenon.md @@ -1,6 +1,7 @@ --- title: The specificities of encoding encyclopedias: towards a new standard ? author: Alice BRENON +numbersections: True header-includes: \usepackage{textalpha} \usepackage{hyperref} @@ -221,15 +222,18 @@ element to the dictionary module: indeed, although `<body/>` may also contain `<entryFree/>` or `<superEntry/>` elements, the former is a relaxed version of `<entry/>` while the latter is a device to group several related entries together. Both can contain an `<entry/` directly while no obvious inclusion -exists the other way around. Most (> 96.2%) of the inclusion paths of +exists the other way around: most (> 96.2%) of the inclusion paths of "reasonable" depth (which we define as strictly inferior to 5, that is twice the -average shortest depth between any two nodes) seem to either include `<figure/>` -or `<castList/>`, two elements unrelated to encyclopedia articles in the general -case. Hence, not only the semantics conveyed by the documentation but also the -structure of the elements graph evidence `<entry/>` as the natural top-most -element for an article. +average shortest depth between any two nodes) either include `<figure/>` or +`<castList/>`, two very specific elements which should not need to appear in an +article in general, showing that the purpose of `<entry/>` is not to contain an +`<entryFree/>` or `<superEntry/>`. Hence, not only the semantics conveyed by the +documentation but also the structure of the elements graph evidence `<entry/>` +as the natural top-most element for an article. This somewhat contrived example +hopes to further demonstrate the application of a graph-centered approach to +understand the inner workings of the XML-TEI schema. 
-### Information about the word itself +### Information about the headword itself Once a block for an article is created, it may contain elements useful to represent features such as @@ -240,9 +244,9 @@ represent features such as form itself for instance, but also information about the categories it belongs to like `<iType/>` for its inflection class in languages with a declension system or `<pos/>` for its part-of-speech -- its etymology +- its etymology: `<etym/>` - its variants if there is a different spelling in a variety of the language or - if it has changed through time + if it has changed through time: `<usg/>` (though it is not its only purpose) All these are examples and by no means an exhaustive list; the complete set provides the encoder with a toolbox to describe all the information related to @@ -275,9 +279,10 @@ content associated to the headword by the entry. In a dictionary, that is its meaning. The `<sense/>` element is a valid child for `<entry/>` and groups together a -definition of the term with `<def/>`, usage examples with `<usg/>` and other -high-level information such as translations in other languages. Both `<def/>` -and `<usg/>` elements may appear directly under the `<entry/>`. +definition of the term with `<def/>`, usage examples with `<usg/>` (another use +of this versatile element) and other high-level information such as translations +in other languages. Both `<def/>` and `<usg/>` elements may appear directly +under the `<entry/>`. ### Structural remarks @@ -298,7 +303,8 @@ that the *dictionaries* module contains short "leaf" elements like `<pos/>` which should not obviously need to admit cycles since one rather expects them to contain only one word, like `<pos>adj</pos>` in the example given in the official documentation. Among those (shortest) cycles, 20 include the `<cit/>` -element made to group quotations with a bibliographic reference to their source. 
+element made to group quotations with a bibliographic reference to their source +which should clearly be unnecessary to encode an article in the general case. Secondly, although we have seen examples of connections from this module to the rest of the XML-TEI, especially to the *core* module (see the case of the @@ -420,11 +426,16 @@ often ### Currently implemented -The reference implementation for this encoding scheme is the program `soprano` -developed within the scope of project DISCO-LGE. Though this software is already -useful to segment the text of the encyclopedia into articles and encode them -into XML-TEI, it doesn't yet follow the above specification perfectly. Here is -for instance the encoded version of article "Cathète" currently it produces: +The reference implementation for this encoding scheme is the program +soprano[^soprano] developed within the scope of project DISCO-LGE to +automatically identify individual articles in the flow of raw text from the +column and to encode them into XML-TEI files. Though this software has already +been used to produce the first TEI version of *La Grande Encyclopédie*, it +doesn't yet follow the above specification perfectly. Here is for instance the +encoded version of article "Cathète" it currently produces: + +[^soprano]: + [https://gitlab.huma-num.fr/disco-lge/soprano](https://gitlab.huma-num.fr/disco-lge/soprano) 