live-query-wiktextract
This project provides a lightweight wrapper around the wiktextract project. Where wiktextract aims to parse whole snapshots of the Wiktionary projects (dump files) into machine-readable JSON, this project lets you efficiently query single pages of different Wiktionary editions.
The Flask app accepts GET requests at the following URLs:
localhost:5000/simplesearch/<lang>/<word>
localhost:5000/search/<wiktlang>/<wordlang>/<word>/<format>
- simplesearch returns a non-ASCII wikstraktor JSON-formatted entry
  - <lang>: the language of both the Wiktionary edition and the word,
  - <word>: the word form to be queried.
- search returns a JSON-formatted entry
  - <wiktlang>: the language of the desired Wiktionary edition,
  - <wordlang>: the language of the word,
  - <word>: the word itself to be queried,
  - <format>: the format of the output
    - wiktextract or xtr: wiktextract native format
    - wikstraktor or strkt: conversion to wikstraktor format
    - the prefix a_ can be used to ensure ASCII output
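As an illustration of the two URL patterns above, here is a minimal client sketch using only the Python standard library. It assumes the server is reachable at http://localhost:5000; the helper names (simplesearch_url, search_url, fetch_json) are made up for this example and are not part of the project.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: the app runs locally on Flask's default port.
BASE_URL = "http://localhost:5000"


def simplesearch_url(lang: str, word: str) -> str:
    """Build the URL for /simplesearch/<lang>/<word>."""
    return f"{BASE_URL}/simplesearch/{quote(lang, safe='')}/{quote(word, safe='')}"


def search_url(wiktlang: str, wordlang: str, word: str, fmt: str = "wiktextract") -> str:
    """Build the URL for /search/<wiktlang>/<wordlang>/<word>/<format>."""
    parts = (wiktlang, wordlang, word, fmt)
    # Percent-encode each path segment so non-ASCII words are transmitted safely.
    return f"{BASE_URL}/search/" + "/".join(quote(part, safe="") for part in parts)


def fetch_json(url: str):
    """Send a GET request and decode the JSON body (needs a running server)."""
    with urlopen(url) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the server running, something like fetch_json(search_url("en", "fr", "café", "a_wikstraktor")) would request the French word "café" from the English Wiktionary in ASCII-only wikstraktor format.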
Local installation
1. Download dump files
Download the most recent Wiktionary dump files for each supported Wiktionary edition (see supported_wiktlangs in src/config.py) from https://dumps.wikimedia.org/backup-index.html and place them in the dumps/ directory. The dump files should follow the pattern <wiktlang>wiktionary-<date>-pages-articles-multistream.xml.bz2.
If multiple timestamped dump files per edition are present in the dumps/ directory, the most recent one will be selected automatically.
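To make the naming pattern and the "most recent wins" rule concrete, here is a small sketch of how such a selection could be done. It is not the project's actual code: the function name latest_dump is hypothetical, and it assumes <date> is the eight-digit YYYYMMDD stamp used in Wikimedia dump file names.

```python
import re
from pathlib import Path


def latest_dump(dumps_dir, wiktlang):
    """Return the newest dump file for one Wiktionary edition, or None."""
    # Assumption: <date> is an eight-digit YYYYMMDD stamp.
    pattern = re.compile(
        re.escape(wiktlang)
        + r"wiktionary-(\d{8})-pages-articles-multistream\.xml\.bz2"
    )
    dated = []
    for path in Path(dumps_dir).iterdir():
        match = pattern.fullmatch(path.name)
        if match:
            dated.append((match.group(1), path))
    if not dated:
        return None
    # YYYYMMDD strings sort chronologically, so the largest one is the newest.
    return max(dated, key=lambda item: item[0])[1]
```

Calling latest_dump("dumps", "en") would then return the path of the newest English dump, or None if no matching file exists.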
2. Create a virtual environment
Create and activate a virtual Python environment with an environment manager of your choice. For example:
python3 -m venv lq-w-extr
source lq-w-extr/bin/activate
3. Install dependencies
pip install -r requirements.txt
Since wiktextract and its dependency wikitextprocessor are not regularly published as Python packages, it is difficult to pin them to a specific version. Installing from requirements.txt always pulls the latest version. Attention: this means that after reinstalling, the output schema of wiktextract may have changed slightly.
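Because the installed version can drift between reinstalls, it may help to record what was actually installed. The sketch below assumes the packages are installed by pip from a Git URL, in which case pip writes a direct_url.json metadata file (PEP 610) containing the commit. The helper name installed_commit is hypothetical, and it returns None when that information is not available.

```python
import json
from importlib.metadata import PackageNotFoundError, distribution


def installed_commit(package):
    """Return the VCS commit pip recorded for a package, or None if unknown."""
    try:
        dist = distribution(package)
    except PackageNotFoundError:
        return None
    # pip writes direct_url.json for direct-URL installs such as Git URLs (PEP 610).
    raw = dist.read_text("direct_url.json")
    if raw is None:
        return None
    return json.loads(raw).get("vcs_info", {}).get("commit_id")


# Assumption: these are the distribution names used by pip.
for name in ("wiktextract", "wikitextprocessor"):
    print(name, installed_commit(name) or "commit unknown")
```

Logging these values next to the generated output makes it easier to tell whether a schema change coincides with a new upstream commit.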
4. Configure server
config.py contains:
- server settings (host, port and debug (boolean))
- supported Wiktionary languages
- working directory (this can be useful if the server is launched by another server using absolute paths to handle the virtual environment)
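For orientation, a configuration along these lines might look like the sketch below. Only the names host, port, debug and supported_wiktlangs appear in this README. The name working_dir and all values are illustrative assumptions, so check the real src/config.py before relying on them.

```python
# Hypothetical sketch of src/config.py; all values are examples only.

# Server settings
host = "127.0.0.1"  # assumption: listen on the local interface only
port = 5000         # assumption: matches the localhost:5000 URLs used in this README
debug = False       # boolean: enable Flask debug mode

# Wiktionary editions the server should support (language codes)
supported_wiktlangs = ["en", "fr"]  # example values

# Working directory; the name "working_dir" is an assumption.
# Useful when another server launches this one using absolute paths.
working_dir = "/var/www/live-query-wiktextract"
```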
5. Load templates from dump files
Run the script src/load_dumps.py to load the most recent dump file (for each supported Wiktionary language) into an SQLite database that will be used by wiktextract.
python src/load_dumps.py
6. Start Flask app
flask --app src/app.py run
To ensure Flask runs in your currently active virtual environment, you might want to invoke it through the Python interpreter instead:
python -m flask --app src/app.py run
You can also run the app directly with the virtual environment's interpreter using absolute paths, which is useful when another server needs to launch this one with a single command. An example of such a command:
sh -c 'nohup /var/www/live-query-wiktextract/lq-w-extr/bin/python3 /var/www/live-query-wiktextract/src/app.py'
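If the launching server is itself written in Python, the same idea can be sketched with subprocess instead of a shell command. The helper names below are hypothetical, the paths follow the example command above, and start_new_session is used here as a rough stand-in for nohup rather than an exact equivalent.

```python
import subprocess
from pathlib import PurePosixPath


def build_launch_command(project_root, venv_name="lq-w-extr"):
    """Return the absolute-path command that starts the app inside its venv."""
    root = PurePosixPath(project_root)
    python_bin = root / venv_name / "bin" / "python3"
    app_script = root / "src" / "app.py"
    return [str(python_bin), str(app_script)]


def launch(project_root):
    """Start the app as a detached background process (POSIX only)."""
    return subprocess.Popen(
        build_launch_command(project_root),
        cwd=str(project_root),
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
        start_new_session=True,  # detach from the caller's session
    )
```

Calling launch("/var/www/live-query-wiktextract") would start the server and return immediately, leaving the process running after the caller exits.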
Using Docker
Alternatively, the app can be containerized using Docker. You still have to provide the dump files in dumps/.
Then perform the following two steps:
1. Build image
docker build -t live-query-wiktextract .
2. Run image
docker run -p 5000:80 live-query-wiktextract