live-query-wiktextract

This project provides a lightweight wrapper around the wiktextract project. Where wiktextract aims to parse whole snapshots of the Wiktionary projects (dump files) into machine-readable JSON, this project lets you efficiently query single pages of different Wiktionary editions.

The Flask app accepts GET requests at the URL

localhost:5000/search/<wiktlang>/<wordlang>/<word>

where <wiktlang> specifies the language of the desired Wiktionary edition, <wordlang> the language of the word, and <word> the word itself to be queried. The route returns the extracted JSON object for the given query.
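As a minimal sketch of how a client might call this route, the following uses only the Python standard library. The helper names `build_search_url` and `search` are hypothetical and not part of this project, and the base URL assumes the app is running locally on port 5000 as shown above. Each path segment is percent-encoded so that words with non-ASCII characters form a valid URL.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: the app is reachable locally on port 5000, as in the URL above.
BASE_URL = "http://localhost:5000"


def build_search_url(wiktlang, wordlang, word, base_url=BASE_URL):
    """Build the /search/<wiktlang>/<wordlang>/<word> URL (hypothetical helper)."""
    # Percent-encode every segment so non-ASCII words yield a valid URL.
    segments = [quote(part, safe="") for part in (wiktlang, wordlang, word)]
    return f"{base_url}/search/" + "/".join(segments)


def search(wiktlang, wordlang, word, base_url=BASE_URL):
    """Send the GET request and decode the JSON body (needs a running app)."""
    with urlopen(build_search_url(wiktlang, wordlang, word, base_url)) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the app running, `search("en", "de", "Haus")` would request `http://localhost:5000/search/en/de/Haus`. The language values here are purely illustrative; which identifiers the app accepts is determined by the project's configuration.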

Local installation

1. Download dump files

Download the most recent Wiktionary dump file for each supported Wiktionary edition (see supported_wiktlangs in src/config.py) from https://dumps.wikimedia.org/backup-index.html and place the files in the dumps/ directory. The file names should follow the pattern <wiktlang>wiktionary-<date>-pages-articles-multistream.xml.bz2.

If multiple timestamped dump files per edition are present in the dumps/ directory, the most recent one will be selected automatically.
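To illustrate what selecting the most recent dump per edition could look like, here is a small sketch. It is not the project's actual implementation: the function names are hypothetical, and it assumes that <date> is the eight-digit YYYYMMDD stamp used in Wikimedia dump file names, so that dates can be compared as plain strings.

```python
import re
from pathlib import Path

# Assumption: <date> is an eight-digit YYYYMMDD stamp.
DUMP_PATTERN = re.compile(
    r"^(?P<wiktlang>[a-z_]+)wiktionary-(?P<date>\d{8})"
    r"-pages-articles-multistream\.xml\.bz2$"
)


def latest_dumps(filenames):
    """Map each edition to its newest dump file name; other names are ignored."""
    newest = {}
    for name in filenames:
        match = DUMP_PATTERN.match(name)
        if match is None:
            continue  # not a dump file following the expected pattern
        lang, date = match["wiktlang"], match["date"]
        # YYYYMMDD strings sort chronologically, so string comparison suffices.
        if lang not in newest or date > newest[lang][0]:
            newest[lang] = (date, name)
    return {lang: name for lang, (_, name) in newest.items()}


def latest_dumps_in(directory="dumps"):
    """Apply latest_dumps to the files in a directory (hypothetical helper)."""
    return latest_dumps(p.name for p in Path(directory).iterdir() if p.is_file())
```

Calling `latest_dumps_in()` from the repository root would return a dictionary mapping each edition found in dumps/ to the file name of its newest dump.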

2. Create a virtual environment

Create and activate a virtual Python environment with an environment manager of your choice. For example:

virtualenv live-query-wiktextract
source live-query-wiktextract/bin/activate

3. Install dependencies

pip install -r requirements.txt

Since wiktextract is not regularly published as a Python package, the dependency is pinned to a specific commit. The commit indicated in requirements.txt was used and tested during development.

4. Load templates from dump files

Run the script src/load_templates.py to extract module and template pages from the dump file into an SQLite database that will be used by wiktextract.

python src/load_templates.py

5. Start the Flask app

flask --app src/app.py run

Using Docker

Alternatively, the app can be containerized using Docker. You still have to provide the dump files in dumps/ (step 1 above).

Then perform the following two steps:

2. Build image

docker build -t live-query-wiktextract .

3. Run image

docker run -p 5000:80 live-query-wiktextract