
Mathieu Loiseau authored 1 year ago


live-query-wiktextract

This project provides a lightweight wrapper around the wiktextract project. Where wiktextract aims to parse whole snapshots of the Wiktionary projects (dump files) into machine-readable JSON, this project lets you efficiently query single pages of different Wiktionary editions.

The Flask app accepts GET requests at the following URLs:

localhost:5000/simplesearch/<lang>/<word>
localhost:5000/search/<wiktlang>/<wordlang>/<word>/<format>

simplesearch returns a non-ASCII, wikstraktor-formatted JSON entry:
- <lang>: the language of both the Wiktionary edition and the word,
- <word>: the word form to be queried.

search returns a JSON-formatted entry:
- <wiktlang>: the language of the desired Wiktionary edition,
- <wordlang>: the language of the word,
- <word>: the word itself to be queried,
- <format>: the format of the output:
  - wiktextract or xtr: wiktextract native format,
  - wikstraktor or strkt: conversion to wikstraktor format,
  - the prefix a_ can be added to ensure ASCII output.
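
As a minimal client-side sketch, the two routes above can be queried from Python with the standard library only. The helper names (simplesearch_url, search_url, fetch_json) and the base URL are illustrative assumptions, not part of this project; the reading of the a_ prefix as "a_" followed by one of the four format names is also an assumption based on the list above.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: the app is reachable at the address used in the examples above.
BASE_URL = "http://localhost:5000"

# Format names listed in this README; "a_" may be prepended for ASCII output.
FORMATS = {"wiktextract", "xtr", "wikstraktor", "strkt"}


def simplesearch_url(lang, word, base=BASE_URL):
    """Build the URL for the simplesearch route (hypothetical helper)."""
    # Percent-encode each path segment so non-ASCII words are safe in a URL.
    return f"{base}/simplesearch/{quote(lang, safe='')}/{quote(word, safe='')}"


def search_url(wiktlang, wordlang, word, fmt, base=BASE_URL):
    """Build the URL for the search route (hypothetical helper)."""
    bare = fmt[2:] if fmt.startswith("a_") else fmt
    if bare not in FORMATS:
        raise ValueError(f"unknown format: {fmt!r}")
    segments = [quote(s, safe="") for s in (wiktlang, wordlang, word, fmt)]
    return f"{base}/search/" + "/".join(segments)


def fetch_json(url):
    """Send a GET request and decode the JSON body (requires a running app)."""
    with urlopen(url) as response:
        return json.load(response)
```

With the app running, a call such as fetch_json(search_url("en", "fr", "chat", "xtr")) would request the French word "chat" from the English Wiktionary edition in wiktextract format.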

Local installation

1. Download dump files

Download the most recent Wiktionary dump files for each supported Wiktionary edition (See supported_wiktlangs in src/config.py) from https://dumps.wikimedia.org/backup-index.html and place them in the dumps/ directory. The dump files should follow the pattern <wiktlang>wiktionary-<date>-pages-articles-multistream.xml.bz2.

If multiple timestamped dump files per edition are present in the dumps/ directory, the most recent one will be selected automatically.
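
To illustrate the naming pattern and the "most recent wins" rule, here is a small sketch that picks the newest dump for one edition from a list of file names. It is not the project's actual selection code; the function name latest_dump is hypothetical, and the sketch assumes the date part is an eight-digit YYYYMMDD stamp, as used by Wikimedia dump file names.

```python
import re

# <wiktlang>wiktionary-<date>-pages-articles-multistream.xml.bz2
DUMP_NAME = re.compile(
    r"^(?P<lang>[a-z_]+)wiktionary-(?P<date>\d{8})"
    r"-pages-articles-multistream\.xml\.bz2$"
)


def latest_dump(filenames, wiktlang):
    """Return the newest dump file name for wiktlang, or None if there is none."""
    best = None  # (date, name) of the newest match seen so far
    for name in filenames:
        match = DUMP_NAME.match(name)
        if match is None or match.group("lang") != wiktlang:
            continue
        # YYYYMMDD strings sort chronologically when compared as text.
        if best is None or match.group("date") > best[0]:
            best = (match.group("date"), name)
    return best[1] if best else None
```

For example, the function could be fed the names found in the dumps/ directory via [p.name for p in pathlib.Path("dumps").iterdir()].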

2. Create a virtual environment

Create and activate a virtual Python environment with an environment manager of your choice. For example:

python3 -m venv lq-w-extr
source lq-w-extr/bin/activate

3. Install dependencies

pip install -r requirements.txt

Since wiktextract and its dependency wikitextprocessor are not regularly published as Python packages, it is difficult to pin them to a specific version. Installing from requirements.txt therefore always installs the latest version. Note: this means that after reinstalling, the output schema of wiktextract may have changed slightly.
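
Because the installed versions can change between installations, it may help to record what is currently installed before comparing outputs. The following is a sketch using only the standard library; the function name installed_versions is hypothetical, and it assumes the distributions are registered under the names wiktextract and wikitextprocessor.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages=("wiktextract", "wikitextprocessor")):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            # The distribution is not installed in the active environment.
            found[name] = None
    return found


if __name__ == "__main__":
    for name, ver in installed_versions().items():
        print(f"{name}: {ver or 'not installed'}")
```

Saving this output alongside query results makes it easier to tell whether a schema difference comes from a newer wiktextract installation.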

4. Load templates from dump files

Run the script src/load_dumps.py to load the most recent dump file (for each supported language) into an SQLite database that will be used by wiktextract.

python src/load_dumps.py

5. Start flask app

flask --app src/app.py run

To ensure Flask runs in your currently active virtual environment, you might want to use the following command instead:

python -m flask --app src/app.py run

Using Docker

Alternatively, the app can be containerized using Docker. You still have to provide the dump files in dumps/.

Then perform the following two steps:

2. Build image

docker build -t live-query-wiktextract .

3. Run image

docker run -p 5000:80 live-query-wiktextract