Mathieu Loiseau authored
e6b7f367

live-query-wiktextract

This project provides a lightweight wrapper around the wiktextract project. Where wiktextract aims to parse whole snapshots of the Wiktionary projects (dump files) into machine-readable JSON, this project lets you efficiently query single pages of different Wiktionary editions.

The Flask app accepts GET requests at the URL

localhost:5000/search/<wiktlang>/<wordlang>/<word>

where <wiktlang> specifies the language of the desired Wiktionary edition, <wordlang> the language of the word, and <word> the word itself to be queried. The route returns the extracted JSON object for the given query.
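As an illustration, a client could call this route as sketched below using only the Python standard library. The helper names `build_search_url` and `query_word` are hypothetical and not part of this project, and the example values (`en`, `fr`) assume that language codes are used as in the dump file names. Each path segment is percent-encoded so that words with non-ASCII characters are sent correctly.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: the app runs on the default Flask development address.
BASE_URL = "http://localhost:5000"


def build_search_url(wiktlang: str, wordlang: str, word: str, base_url: str = BASE_URL) -> str:
    """Build the /search URL, percent-encoding each path segment."""
    segments = [quote(part, safe="") for part in (wiktlang, wordlang, word)]
    return f"{base_url}/search/" + "/".join(segments)


def query_word(wiktlang: str, wordlang: str, word: str) -> object:
    """Send the GET request and decode the JSON body (the app must be running)."""
    with urlopen(build_search_url(wiktlang, wordlang, word)) as response:
        return json.load(response)
```

For example, `build_search_url("en", "fr", "café")` yields `http://localhost:5000/search/en/fr/caf%C3%A9`; passing the same arguments to `query_word` would return the extracted JSON once the app is up.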

Local installation

1. Download dump files

Download the most recent Wiktionary dump file for each supported Wiktionary edition (see supported_wiktlangs in src/config.py) from https://dumps.wikimedia.org/backup-index.html and place the files in the dumps/ directory. The dump file names should follow the pattern <wiktlang>wiktionary-<date>-pages-articles-multistream.xml.bz2.

If multiple timestamped dump files per edition are present in the dumps/ directory, the most recent one will be selected automatically.
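The selection logic might look roughly like the sketch below. It is not the project's actual implementation; `latest_dump` is a hypothetical helper, and it assumes that <date> is an 8-digit YYYYMMDD stamp, so that comparing the date strings also orders them chronologically.

```python
import re
from typing import Iterable, Optional


def latest_dump(filenames: Iterable[str], wiktlang: str) -> Optional[str]:
    """Return the newest dump file name for one edition, or None if none matches."""
    # Pattern from the README: <wiktlang>wiktionary-<date>-pages-articles-multistream.xml.bz2
    # Assumption: <date> is YYYYMMDD, so string comparison equals date comparison.
    pattern = re.compile(
        rf"^{re.escape(wiktlang)}wiktionary-(\d{{8}})-pages-articles-multistream\.xml\.bz2$"
    )
    dated = []
    for name in filenames:
        match = pattern.match(name)
        if match:
            dated.append((match.group(1), name))
    # max() compares the date stamp first, so the latest dump wins.
    return max(dated)[1] if dated else None
```

In practice the file names would come from listing the dumps/ directory, for instance with `os.listdir("dumps")`.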

2. Create a virtual environment

Create and activate a virtual Python environment with an environment manager of your choice. For example:

virtualenv live-query-wiktextract
source live-query-wiktextract/bin/activate

3. Install dependencies

pip install -r requirements.txt

Since wiktextract and its dependency wikitextprocessor are not regularly published as Python packages, it is difficult to pin them to a specific version. Installing from requirements.txt always pulls the latest version. Note: after reinstalling, the output schema of wiktextract may therefore have changed slightly.
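One way to notice such a change is to record which versions ended up in the environment after each install. The sketch below uses the standard `importlib.metadata` module; `installed_versions` is a hypothetical helper, and the distribution names `wiktextract` and `wikitextprocessor` are assumed to match the installed packages. Because the packages are not released regularly, the reported version string may stay the same across different commits, so this is only a coarse check.

```python
from importlib.metadata import PackageNotFoundError, version
from typing import Dict, Optional, Sequence


def installed_versions(
    packages: Sequence[str] = ("wiktextract", "wikitextprocessor"),
) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version, or None if absent."""
    found: Dict[str, Optional[str]] = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The distribution is not installed in the active environment.
            found[name] = None
    return found
```

Calling `installed_versions()` inside the activated virtual environment and saving the result next to any stored query output makes it easier to trace a schema difference back to a reinstall.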

4. Load templates from dump files

Run the script src/load_dumps.py to load the most recent dump file (for each supported language) into an SQLite database that will be used by wiktextract.

python src/load_dumps.py

5. Start the Flask app

flask --app src/app.py run

You might want to use the following command instead to ensure Flask runs in your currently active virtual environment:

python -m flask --app src/app.py run

Using Docker

Alternatively, the app can be containerized using Docker. You still have to provide the dump files in dumps/ (step 1 above).

Then perform the following two steps:

2. Build image

docker build -t live-query-wiktextract .

3. Run image

docker run -p 5000:80 live-query-wiktextract