## The constraints of automated processing
Encyclopedias are particularly long books, spanning numerous tomes and containing several tens of thousands of articles. The *Encyclopédie* comprises over 74k articles and *La Grande Encyclopédie* certainly more than 100k (the latest version produced by `soprano` contains 160k articles, but their segmentation is still not perfect: although some article beginnings remain undetected, all the very long and deeply-structured articles are unduly split into many parts, resulting overall in an over-estimation of the total number). In any case, it consists of 31 tomes of 1200 pages each.
XML-TEI is a very broad tool useful for very different applications. Some elements like `<unclear/>` or `<factuality/>` can encode subtle semantic information (in the latter case, adjacent to a notion as elusive as truth) which requires a very deep understanding of a text in its entirety and about which even human experts may disagree.
For these reasons, a central concern in the design of our encoding scheme was to remain within the boundaries of information that can be described objectively and extracted automatically by an algorithm. Most of the tags presented above contain information about the positions of the elements or their relation to one another. Those with an additional semantic implication like `<head/>` can be inferred simply from their position and the frequent use of special typography such as bold or upper-case characters.
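
As a rough illustration of this kind of inference, the sketch below guesses a `<head/>` from position and typography alone. The `Line` record, its `bold` flag and the sample text are hypothetical stand-ins for whatever the OCR layer actually provides, not `soprano`'s data model.

```python
from dataclasses import dataclass

@dataclass
class Line:
    text: str
    bold: bool  # typography flag, assumed to be supplied by the OCR layer

def is_head(line: Line, index: int) -> bool:
    """Guess whether a line opens an article: it sits at the start of a
    block and is emphasised by bold or upper-case characters."""
    words = line.text.split()
    first_word = words[0] if words else ""
    return index == 0 and (line.bold or first_word.isupper())

# Constructed example: the first, upper-case line would be tagged as <head/>.
block = [Line("HEADWORD. Body of the article begins here", bold=False),
         Line("and continues on the following lines.", bold=False)]
print([is_head(l, i) for i, l in enumerate(block)])  # [True, False]
```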
The case of cross-references is a particular one and may appear as a counter-example to the main principle on which our scheme is based. In fact, the process of linking from one article to another is so frequent (in dictionaries as well as in encyclopedias) that it generally escapes the scope of regular discourse to take a special and often fixed form, inside parentheses and after a special token which invites the reader to perform the redirection. In *La Grande Encyclopédie*, virtually all the redirections (that is, to the extent of our knowledge, absolutely all of them; some special cases may of course exist, but they are statistically rare enough that we have not found any yet) appear within parentheses and start with the verb "voir" abbreviated as a single capital "V.", as illustrated above in the article "Gelocus".
Although this has not been implemented yet either, we hope to be able to detect and exploit those patterns to correctly encode cross-references. Getting the `target` attributes right is certainly more difficult to achieve and may require processing the articles in several steps, to first discover all the existing headwords (and hence article IDs) before trying to match the words following "V." against them. Since our automated encoder handles tomes separately and since references may cross the boundaries of tomes, it cannot keep articles in memory until the target of a cross-reference has been discovered before outputting them.
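
A minimal sketch of what such a multi-step resolution could look like, assuming a first pass that indexes headwords across all tomes and a second pass that looks up the word following "V."; every function name, field and identifier here is hypothetical, not `soprano`'s API.

```python
from typing import Dict, Iterable, Optional

def build_headword_index(articles: Iterable[dict]) -> Dict[str, str]:
    """First pass: map each normalised headword to its article ID."""
    return {a["headword"].lower(): a["id"] for a in articles}

def resolve_target(reference: str, index: Dict[str, str]) -> Optional[str]:
    """Second pass: look the referenced headword up in the index."""
    return index.get(reference.strip().lower())

# Made-up identifiers, for illustration only.
index = build_headword_index([{"headword": "GELOCUS", "id": "T18-gelocus"}])
print(resolve_target("Gelocus", index))  # "T18-gelocus"
```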
This is in line with the last important aspect of our encoder. While many lexicographers may deem our encoding too shallow, it has the advantage of not requiring complex data structures to be kept in memory for a long time. The algorithm implementing it in `soprano` outputs elements as soon as it can, for instance the empty elements already discussed above. For articles, it pushes lines onto a stack and flushes it each time it encounters the beginning of the following article. This keeps the amount of memory required reasonable and even lets tomes be processed in parallel on most modern machines. Thus, even at over 3 minutes per tome, the total processing time can be lowered to around 40 minutes for the whole of *La Grande Encyclopédie* instead of over an hour and a half.
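
The sketch below illustrates this flush-as-you-go strategy, assuming some predicate can recognise article beginnings; it is a simplification of the idea, not `soprano`'s actual code.

```python
from typing import Callable, Iterable, Iterator, List

def segment(lines: Iterable[str],
            starts_new_article: Callable[[str], bool]) -> Iterator[List[str]]:
    """Accumulate lines for the current article and flush them as soon as the
    beginning of the next article is detected, so only one article at a time
    lives in memory."""
    buffer: List[str] = []
    for line in lines:
        if starts_new_article(line) and buffer:
            yield buffer          # flush the finished article
            buffer = []
        buffer.append(line)
    if buffer:
        yield buffer              # flush the last article of the tome

# Tomes being independent, each call to segment() can run in its own process
# (e.g. with multiprocessing.Pool), which is what brings the total time down.
```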
## Comparison to other approaches
Before deciding to give up on the *dictionaries* module and attempting to devise our own encoding scheme, we considered and compared several scenarios to find the one most compatible with our needs.
### Bend the semantics
### Custom schema