Commit 4cdcc821 authored by Alice Brenon

First batch of fixes with feedback from Ludo (thanks !!)

parent 77ed690b
 ---
 title: The specificities of encoding encyclopedias: towards a new standard ?
 author: Alice BRENON
+numbersections: True
 header-includes:
 \usepackage{textalpha}
 \usepackage{hyperref}
@@ -221,15 +222,18 @@ element to the dictionary module: indeed, although `<body/>` may also contain
 `<entryFree/>` or `<superEntry/>` elements, the former is a relaxed version of
 `<entry/>` while the latter is a device to group several related entries
 together. Both can contain an `<entry/` directly while no obvious inclusion
-exists the other way around. Most (> 96.2%) of the inclusion paths of
+exists the other way around: most (> 96.2%) of the inclusion paths of
 "reasonable" depth (which we define as strictly inferior to 5, that is twice the
-average shortest depth between any two nodes) seem to either include `<figure/>`
-or `<castList/>`, two elements unrelated to encyclopedia articles in the general
-case. Hence, not only the semantics conveyed by the documentation but also the
-structure of the elements graph evidence `<entry/>` as the natural top-most
-element for an article.
+average shortest depth between any two nodes) either include `<figure/>` or
+`<castList/>`, two very specific elements which should not need to appear in an
+article in general, showing that the purpose of `<entry/>` is not to contain an
+`<entryFree/>` or `<superEntry/>`. Hence, not only the semantics conveyed by the
+documentation but also the structure of the elements graph evidence `<entry/>`
+as the natural top-most element for an article. This somewhat contrived example
+hopes to further demonstrate the application of a graph-centered approach to
+understand the inner workings of the XML-TEI schema.

-### Information about the word itself
+### Information about the headword itself

 Once a block for an article is created, it may contain elements useful to
 represent features such as
@@ -240,9 +244,9 @@ represent features such as
 form itself for instance, but also information about the categories it belongs
 to like `<iType/>` for its inflection class in languages with a declension
 system or `<pos/>` for its part-of-speech
-- its etymology
+- its etymology: `<etym/>
 - its variants if there is a different spelling in a variety of the language or
-if it has changed through time
+if it has changed through time: `<usg/>` (though it is not its only purpose)

 All these are examples and by no means an exhaustive list; the complete set
 provides the encoder with a toolbox to describe all the information related to
@@ -275,9 +279,10 @@ content associated to the headword by the entry. In a dictionary, that is its
 meaning.

 The `<sense/>` element is a valid child for `<entry/>` and groups together a
-definition of the term with `<def/>`, usage examples with `<usg/>` and other
-high-level information such as translations in other languages. Both `<def/>`
-and `<usg/>` elements may appear directly under the `<entry/>`.
+definition of the term with `<def/>`, usage examples with `<usg/>` (another use
+of this versatile element) and other high-level information such as translations
+in other languages. Both `<def/>` and `<usg/>` elements may appear directly
+under the `<entry/>`.

 ### Structural remarks
@@ -298,7 +303,8 @@ that the *dictionaries* module contains short "leaf" elements like `<pos/>`
 which should not obviously need to admit cycles since one rather expects them to
 contain only one word, like `<pos>adj</pos>` in the example given in the
 official documentation. Among those (shortest) cycles, 20 include the `<cit/>`
-element made to group quotations with a bibliographic reference to their source.
+element made to group quotations with a bibliographic reference to their source
+which should clearly be unnecessary to encode an article in the general case.

 Secondly, although we have seen examples of connections from this module to the
 rest of the XML-TEI, especially to the *core* module (see the case of the
@@ -420,11 +426,16 @@ often
 ### Currently implemented

-The reference implementation for this encoding scheme is the program `soprano`
-developed within the scope of project DISCO-LGE. Though this software is already
-useful to segment the text of the encyclopedia into articles and encode them
-into XML-TEI, it doesn't yet follow the above specification perfectly. Here is
-for instance the encoded version of article "Cathète" currently it produces:
+The reference implementation for this encoding scheme is the program
+soprano[^soprano] developed within the scope of project DISCO-LGE to
+automatically identify individual articles in the flow of raw text from the
+column and to encode them into XML-TEI files. Though this software has already
+been used to produce the first TEI version of *La Grande Encyclopédie*, it
+doesn't yet follow the above specification perfectly. Here is for instance the
+encoded version of article "Cathète" currently it produces:
+
+[^soprano]:
+[https://gitlab.huma-num.fr/disco-lge/soprano](https://gitlab.huma-num.fr/disco-lge/soprano)

 ![](snippets/cathète_current.png)