live-query-wiktextract
This project provides a lightweight wrapper around the wiktextract project. Where wiktextract aims to parse whole snapshots of the Wiktionary projects (dump files) into machine-readable JSON, this project allows single pages of different Wiktionary editions to be queried efficiently.
The Flask app accepts GET requests at the following URLs:
localhost:5000/simplesearch/<lang>/<word>
localhost:5000/search/<wiktlang>/<wordlang>/<word>/<format>
- simplesearch returns a non-ASCII wikstraktor JSON-formatted entry
  - lang: the language of both the Wiktionary edition and the word
  - word: the word form to be queried
- search returns a JSON-formatted entry
  - <wiktlang>: the language of the desired Wiktionary edition
  - <wordlang>: the language of the word
  - <word>: the word to be queried
  - <format>: the format of the output
    - wiktextract or xtr: wiktextract native format
    - wikstraktor or strkt: conversion to wikstraktor format
    - the prefix a_ can be used to ensure ASCII output
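As an illustration of the two endpoints, here is a minimal client sketch using only the Python standard library. The helper names (`simplesearch_url`, `search_url`, `fetch_json`) are hypothetical and not part of this project. The language codes `en` and `fr` and the combined format value `a_strkt` are assumptions about what the app accepts; check `supported_wiktlangs` in `src/config.py` for the real list.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: the app runs locally on the default Flask development port.
BASE_URL = "http://localhost:5000"


def simplesearch_url(lang: str, word: str, base: str = BASE_URL) -> str:
    """Build the URL for /simplesearch/<lang>/<word>."""
    return f"{base}/simplesearch/{quote(lang, safe='')}/{quote(word, safe='')}"


def search_url(wiktlang: str, wordlang: str, word: str,
               fmt: str = "wiktextract", base: str = BASE_URL) -> str:
    """Build the URL for /search/<wiktlang>/<wordlang>/<word>/<format>."""
    parts = (wiktlang, wordlang, word, fmt)
    # Percent-encode each path segment so non-ASCII words survive transport.
    return "/".join([base, "search", *(quote(p, safe="") for p in parts)])


def fetch_json(url: str, timeout: float = 10.0):
    """Send a GET request and decode the JSON body (requires a running app)."""
    with urlopen(url, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))


if __name__ == "__main__":
    # Look up the French word "chat" in the English Wiktionary edition.
    print(fetch_json(search_url("en", "fr", "chat", "xtr")))
```

The URL builders can be used without a running server. Only `fetch_json` needs the app to be reachable.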
Local installation
1. Download dump files
Download the most recent Wiktionary dump files for each supported Wiktionary edition (see supported_wiktlangs in src/config.py) from https://dumps.wikimedia.org/backup-index.html and place them in the dumps/ directory. The dump files should follow the pattern <wiktlang>wiktionary-<date>-pages-articles-multistream.xml.bz2.
If multiple timestamped dump files per edition are present in the dumps/ directory, the most recent one will be selected automatically.
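The selection of the most recent dump per edition could look roughly like the sketch below. This is not the project's actual implementation. The function name `latest_dumps` is hypothetical, and the sketch assumes that `<date>` is an eight-digit `YYYYMMDD` stamp, so that comparing the date strings orders them chronologically.

```python
import re
from typing import Dict, Iterable

# Pattern from the README: <wiktlang>wiktionary-<date>-pages-articles-multistream.xml.bz2
# Assumption: <date> is an eight-digit YYYYMMDD stamp.
DUMP_PATTERN = re.compile(
    r"^(?P<wiktlang>.+?)wiktionary-(?P<date>\d{8})"
    r"-pages-articles-multistream\.xml\.bz2$"
)


def latest_dumps(filenames: Iterable[str]) -> Dict[str, str]:
    """Map each Wiktionary edition to its most recent dump file name."""
    latest: Dict[str, tuple] = {}
    for name in filenames:
        match = DUMP_PATTERN.match(name)
        if match is None:
            continue  # ignore files that do not follow the dump pattern
        lang, date = match["wiktlang"], match["date"]
        # YYYYMMDD strings sort chronologically, so string comparison suffices.
        if lang not in latest or date > latest[lang][0]:
            latest[lang] = (date, name)
    return {lang: name for lang, (_, name) in latest.items()}
```

For a real directory, the file names could be supplied with `latest_dumps(p.name for p in pathlib.Path("dumps").iterdir())`.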
2. Create a virtual environment
Create and activate a virtual Python environment with an environment manager of your choice. For example:
python3 -m venv lq-w-extr
source lq-w-extr/bin/activate
3. Install dependencies
pip install -r requirements.txt
Since wiktextract and its dependency wikitextprocessor are not regularly published as Python packages, it is difficult to pin them to a specific version. Installing from requirements.txt always pulls in the latest version. Attention: this means that the output schema of wiktextract might change slightly after a reinstall.
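Because the installed versions can differ between installations, it may help to record them alongside any stored output. The following sketch uses the standard library's `importlib.metadata`. The function name `installed_versions` is hypothetical, and the sketch assumes the distributions are installed under the names `wiktextract` and `wikitextprocessor`.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Iterable, Optional


def installed_versions(
    packages: Iterable[str] = ("wiktextract", "wikitextprocessor"),
) -> Dict[str, Optional[str]]:
    """Return the installed version of each package, or None if it is missing."""
    result: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            result[name] = version(name)
        except PackageNotFoundError:
            # Assumed distribution name not found in this environment.
            result[name] = None
    return result


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver or 'not installed'}")
```

Running this after each reinstall gives a quick way to notice that a schema change may have come from a new upstream version.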
4. Load templates from dump files
Run the script src/load_dumps.py to load the most recent dump file (for each supported language) into an SQLite database that will be used by wiktextract.
python src/load_dumps.py
5. Start Flask app
flask --app src/app.py run
To ensure Flask runs in your currently active virtual environment, you might want to use the following command instead:
python -m flask --app src/app.py run
Using Docker
Alternatively, the app can be containerized using Docker. You still have to provide the dump files in dumps/ (step 1 above). Then perform the following two steps:
2. Build image
docker build -t live-query-wiktextract .
3. Run image
docker run -p 5000:80 live-query-wiktextract