diff --git a/README.md b/README.md
index 0d480fd5f18dd4d5598f983098259879304195e3..caa04c77e1e087b736aa7480c84c9ad77ae6b15b 100644
--- a/README.md
+++ b/README.md
@@ -1,44 +1,64 @@
 # live-query-wiktextract
 
-## Installation
+This project provides a lightweight wrapper around the [wiktextract](https://github.com/tatuylonen/wiktextract) project. Where _wiktextract_ aims to parse whole snapshots of the Wiktionary projects (dump files) into machine-readable JSON, this project allows efficient querying of single pages from different Wiktionary editions.
 
-Wiktionary dump files (`<wiktlang>wiktionary-<date>-pages-articles.xml.bz2`) need to be downloaded manually.
+The Flask app accepts GET requests at the URL
 
-0. Download Wiktionary dumpfiles from https://dumps.wikimedia.org/ and place at ./dumps/
+```
+localhost:5000/search/<wiktlang>/<wordlang>/<word>
+```
+
+where `<wiktlang>` specifies the language of the desired Wiktionary edition, `<wordlang>` the language of the word, and `<word>` the word to be queried. The route returns the extracted JSON object for the given query.
+
+## Local installation
 
-### Local python environment
+### 1. Download dump files
 
-1. Create a virtual environment
+Download the most recent Wiktionary dump file for each supported Wiktionary edition (see `supported_wiktlangs` in `src/config.py`) from `https://dumps.wikimedia.org/backup-index.html` and place them in the `dumps/` directory. The dump files should follow the pattern `<wiktlang>wiktionary-<date>-pages-articles-multistream.xml.bz2`.
+
+If multiple timestamped dump files per edition are present in the `dumps/` directory, the most recent one will be selected automatically.
+
+### 2. Create a virtual environment
+
+Create and activate a virtual Python environment with an environment manager of your choice. For example:
 
 ```
 virtualenv live-query-wiktextract
 source live-query-wiktextract/bin/activate
 ```
 
-2. Install requirements.txt
+### 3. Install dependencies
 
 ```
 pip install -r requirements.txt
 ```
 
-_Since wiktextract is not regularly published as a Python package, we fix version control to a specific commit. That commit was used and tested during development._
+_Since wiktextract is not regularly published as a Python package, we pin the dependency to a specific commit. The commit indicated in requirements.txt was used and tested during development._
+
+### 4. Load templates from dump files
 
-### Using Docker
+Run the script `src/load_templates.py` to extract module and template pages from the dump file into an SQLite database that will be used by `wiktextract`.
 
 ```
-docker build -t live-query-wiktextract .
+python src/load_templates.py
 ```
 
-## Usage
+### 5. Start Flask app
 
-### With local environment
+```
+flask --app src/app.py run
+```
+## Using Docker
+Alternatively, the app can be containerized using Docker. You still have to provide the dump files in `dumps/`.
+
+Then perform the following two steps:
+### 2. Build image
 
 ```
-python src/load_templates.py
-flask --app src/app.py run --debug
+docker build -t live-query-wiktextract .
 ```
 
-### Using Docker
+### 3. Run image
 
 ```
 docker run -p 5000:80 live-query-wiktextract