Forked from Léo Schneider / pseudo_image

live-query-wiktextract

This project provides a lightweight wrapper around the wiktextract project. Where wiktextract aims to parse whole snapshots of the Wiktionary projects (dump files) into machine-readable JSON, this project lets you efficiently query single pages of different Wiktionary editions.

The Flask app accepts GET requests at the following URLs:

localhost:5000/simplesearch/<lang>/<word>
localhost:5000/search/<wiktlang>/<wordlang>/<word>/<format>
  • simplesearch returns a wikstraktor-formatted JSON entry that may contain non-ASCII characters.
    • <lang>: the language of both the Wiktionary edition and the word,
    • <word>: the word form to be queried.
  • search returns a JSON-formatted entry.
    • <wiktlang>: the language of the desired Wiktionary edition,
    • <wordlang>: the language of the word,
    • <word>: the word itself to be queried,
    • <format>: the format of the output:
      • wiktextract or xtr: wiktextract native format,
      • wikstraktor or strkt: conversion to wikstraktor format,
      • the prefix a_ can be added to ensure ASCII output.
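
As an illustration of the two routes above, the following is a minimal client sketch using only the Python standard library. The helper names (simplesearch_url, search_url, fetch_json) are hypothetical and not part of this project, and the base URL assumes the app is running locally on port 5000 as in the URLs shown above. Path segments are percent-encoded so that words with non-ASCII characters can be queried safely.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: the app is reachable locally on port 5000, as in the README URLs.
BASE_URL = "http://localhost:5000"


def simplesearch_url(lang, word, base_url=BASE_URL):
    """Build the /simplesearch/<lang>/<word> URL (hypothetical helper)."""
    return f"{base_url}/simplesearch/{quote(lang, safe='')}/{quote(word, safe='')}"


def search_url(wiktlang, wordlang, word, fmt="wiktextract", base_url=BASE_URL):
    """Build the /search/<wiktlang>/<wordlang>/<word>/<format> URL (hypothetical helper)."""
    parts = (wiktlang, wordlang, word, fmt)
    # Percent-encode every path segment, including any slash inside a word.
    return f"{base_url}/search/" + "/".join(quote(p, safe="") for p in parts)


def fetch_json(url, timeout=30.0):
    """Send a GET request and decode the JSON body of the response."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)
```

With the app running, a call such as fetch_json(search_url("en", "fr", "chat", "xtr")) would request the entry for a French word from the English Wiktionary in wiktextract's native format. The language values "en" and "fr" are assumptions here; the accepted values are whatever the app's configuration supports.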

Local installation

1. Download dump files

Download the most recent Wiktionary dump file for each supported Wiktionary edition (see supported_wiktlangs in src/config.py) from https://dumps.wikimedia.org/backup-index.html and place it in the dumps/ directory. The dump files should follow the pattern <wiktlang>wiktionary-<date>-pages-articles-multistream.xml.bz2.

If multiple timestamped dump files per edition are present in the dumps/ directory, the most recent one will be selected automatically.
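
To make the selection rule concrete, here is a minimal sketch of how the most recent dump per edition could be picked from file names that follow the pattern above. It is not the project's actual implementation: the function latest_dumps and the regular expression are hypothetical, and the sketch assumes that <date> is an eight-digit YYYYMMDD stamp, so that comparing the date strings orders them chronologically.

```python
import re

# Assumption: <date> is an eight-digit YYYYMMDD stamp.
DUMP_PATTERN = re.compile(
    r"^(?P<wiktlang>[a-z_]+)wiktionary-(?P<date>\d{8})"
    r"-pages-articles-multistream\.xml\.bz2$"
)


def latest_dumps(filenames):
    """Map each Wiktionary edition to its most recent dump file name."""
    latest = {}  # wiktlang -> (date, filename)
    for name in filenames:
        match = DUMP_PATTERN.match(name)
        if match is None:
            continue  # skip files that do not follow the dump naming pattern
        lang, date = match["wiktlang"], match["date"]
        # YYYYMMDD strings sort chronologically, so a string comparison suffices.
        if lang not in latest or date > latest[lang][0]:
            latest[lang] = (date, name)
    return {lang: name for lang, (_, name) in latest.items()}
```

Given the contents of the dumps/ directory, for example latest_dumps(p.name for p in pathlib.Path("dumps").iterdir()), the result would hold one file name per edition.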

2. Create a virtual environment

Create and activate a virtual Python environment with an environment manager of your choice. For example:

python3 -m venv lq-w-extr
source lq-w-extr/bin/activate

3. Install dependencies

pip install -r requirements.txt

Since wiktextract and its dependency wikitextprocessor are not regularly published as Python packages, it is difficult to pin them to a specific version. requirements.txt therefore always installs the latest version. Note: this means that after reinstalling, the output schema of wiktextract might have changed slightly.
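
Because the installed versions can drift between installations, it can help to record what is currently installed before comparing outputs. The following is a small sketch using the standard library's importlib.metadata; the function installed_versions is hypothetical, and it assumes the distributions are registered under the names wiktextract and wikitextprocessor. For packages installed straight from a repository, the reported version string may not identify the exact commit.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("wiktextract", "wikitextprocessor")):
    """Return the installed version of each package, or None if it is missing."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None  # package is not installed in this environment
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver or 'not installed'}")
```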

4. Load templates from dump files

Run the script src/load_dumps.py to load the most recent dump file (for each supported language) into an SQLite database that will be used by wiktextract.

python src/load_dumps.py

5. Start the Flask app

flask --app src/app.py run

To make sure Flask runs in your currently active virtual environment, you might want to invoke it as a Python module instead:

python -m flask --app src/app.py run

Using Docker

Alternatively, the app can be containerized using Docker. You still have to provide the dump files in dumps/.

Then perform the following two steps:

2. Build image

docker build -t live-query-wiktextract .

3. Run image

docker run -p 5000:80 live-query-wiktextract