live-query-wiktextract
This project provides a lightweight wrapper around the wiktextract project. Where wiktextract aims to parse whole snapshots of the Wiktionary projects (dump files) into machine-readable JSON, this project lets you efficiently query single pages of different Wiktionary editions.
The Flask app accepts GET requests at the URL
localhost:5000/search/<wiktlang>/<wordlang>/<word>
where <wiktlang> specifies the language of the desired Wiktionary edition, <wordlang> the language of the word, and <word> the word to query. The route returns the extracted JSON object for the given query.
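As a rough illustration of how a client might call this route, the sketch below builds the URL and decodes the JSON response using only the Python standard library. The helper names build_search_url and query_word are hypothetical, and the example values assume that <wiktlang> and <wordlang> are short language codes such as en or de, which this README does not state explicitly.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

# Assumption: the app is reachable at the address shown in this README.
BASE_URL = "http://localhost:5000"


def build_search_url(wiktlang, wordlang, word, base_url=BASE_URL):
    """Build the /search URL, percent-encoding each path segment."""
    segments = [quote(part, safe="") for part in (wiktlang, wordlang, word)]
    return f"{base_url}/search/" + "/".join(segments)


def query_word(wiktlang, wordlang, word):
    """Send a GET request to the running app and decode the JSON body."""
    with urlopen(build_search_url(wiktlang, wordlang, word)) as response:
        return json.loads(response.read().decode("utf-8"))
```

With the app running, a call such as query_word("en", "de", "Haus") should return the extracted JSON object for that page. Percent-encoding each segment keeps words with non-ASCII characters or slashes from breaking the path.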
Local installation
1. Download dump files
Download the most recent Wiktionary dump files for each supported Wiktionary edition (see supported_wiktlangs in src/config.py) from https://dumps.wikimedia.org/backup-index.html and place them in the dumps/ directory. The dump files should follow the pattern <wiktlang>wiktionary-<date>-pages-articles-multistream.xml.bz2.
If multiple timestamped dump files per edition are present in the dumps/ directory, the most recent one will be selected automatically.
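The selection of the most recent dump could work along the following lines. This is only a sketch and not necessarily how the project implements it: the function latest_dump is hypothetical, and the code assumes that <date> is an 8-digit YYYYMMDD stamp, as used in Wikimedia dump file names.

```python
import re

# File name pattern from this README; the 8-digit date is an assumption.
DUMP_PATTERN = re.compile(
    r"^(?P<wiktlang>.+?)wiktionary-(?P<date>\d{8})"
    r"-pages-articles-multistream\.xml\.bz2$"
)


def latest_dump(filenames, wiktlang):
    """Return the newest dump file name for one edition, or None."""
    candidates = []
    for name in filenames:
        match = DUMP_PATTERN.match(name)
        if match and match.group("wiktlang") == wiktlang:
            candidates.append((match.group("date"), name))
    # YYYYMMDD strings sort chronologically when compared as text.
    return max(candidates)[1] if candidates else None
```

In practice the file names would come from a directory listing such as os.listdir("dumps"). Files that do not match the pattern are ignored.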
2. Create a virtual environment
Create and activate a virtual Python environment with an environment manager of your choice. For example:
virtualenv live-query-wiktextract
source live-query-wiktextract/bin/activate
3. Install dependencies
pip install -r requirements.txt
Since wiktextract is not regularly published as a Python package, we pin the dependency to a specific commit. The commit indicated in requirements.txt was used and tested during development.
4. Load templates from dump files
Run the script src/load_templates.py to extract module and template pages from the dump file into an SQLite database that will be used by wiktextract.
python src/load_templates.py
5. Start the Flask app
flask --app src/app.py run
Using Docker
Alternatively, the app can be containerized using Docker. You still have to provide the dump files in dumps/.
Then perform the following two steps:
2. Build image
docker build -t live-query-wiktextract .
3. Run image
docker run -p 5000:80 live-query-wiktextract