wikstraktor
A python tool to query the wiktionary and extract structured lexical data.
Dependencies
This project depends on the following python packages:
- pywikibot: gives access to the mediawiki API
- wikitextparser: parses mediawiki pages and extracts sections, templates and links
- importlib: used to import parser modules
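As an illustration of how importlib can load a parser module by name, here is a minimal sketch. The function `load_module_for` and the `parsers.<language>` naming scheme are assumptions made for this example and are not taken from the wikstraktor source; only `importlib.import_module` is the real standard-library call.

```python
import importlib


def load_module_for(wiki_language, package="parsers"):
    """Import the module `<package>.<wiki_language>` and return it.

    The package name and naming scheme are hypothetical; adapt them to
    the way the parser modules are actually laid out.
    """
    module_name = f"{package}.{wiki_language}"
    try:
        return importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        # Turn the import failure into a more explicit error message.
        raise ValueError(
            f"no parser module found for wiki language {wiki_language!r}"
        ) from exc


# Demonstration with a standard-library package standing in for the parsers:
decoder = load_module_for("decoder", package="json")
print(decoder.__name__)
```

Loading modules by name in this way means a new wiki language only needs a new module, without changing the calling code.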
Installation
(This may later be replaced by some automation. Using a virtual environment might be better; see the server version below.)
pip install pywikibot
pip install wikitextparser
pip install gitpython
- pip install importlib: optional (for python 2.*, not tested)
- run ./setup.py (used to store the wikstraktor version in wiktionary extracts)
Wikstraktor Server
If you want to run wikstraktor as a server, you need to install flask and flask-cors (to allow other domains to query it); best practice is to do so in a virtual environment.
The following commands are taken from the documentation of those modules; it is probably safer to follow the modules' own documentation:
python3 -m venv wikstraktorenv #create wikstraktorenv environment
. wikstraktorenv/bin/activate #activate environment
pip install Flask #install Flask
pip install -U flask-cors #install Flask cors
Use
Wikstraktor
Python
from wikstraktor import Wikstraktor
f = Wikstraktor.get_instance('fr', 'en') #create a wikstraktor,
# first parameter is the language of the wiki
# second parameter is the language of the word sought for
f.fetch("blue") #fetch an article
str(f) #convert content to json
Bash
usage: wikstraktor.py [-h] [-l LANGUAGE] [-w WIKI_LANGUAGE] [-m MOT]
                      [-f DESTINATION_FILE] [-A] [-C]

Query a wiktionary
    e.g.:
    ‣./wikstraktor.py -m blue
    ‣./wikstraktor.py -m blue -f blue.json -A -C
    ‣./wikstraktor.py -l en -w fr -m blue -f blue.json -A -C

options:
  -h, --help            show this help message and exit
  -l LANGUAGE, --language LANGUAGE
                        the language of the word
  -w WIKI_LANGUAGE, --wiki_language WIKI_LANGUAGE
                        the language of the wiki
  -m MOT, --mot MOT     the word to look up
  -f DESTINATION_FILE, --destination_file DESTINATION_FILE
                        the file in which to store the result
  -A, --force_ascii     json with ASCII characters only
  -C, --compact         json without indentation
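To make the effect of -A and -C concrete, the sketch below shows how such options are commonly mapped onto the standard json module. This is an assumption about the behaviour, not the actual wikstraktor code: `dump_entries` is a hypothetical helper, the indentation width is arbitrary, and the sample data is a placeholder rather than real wikstraktor output.

```python
import json


def dump_entries(entries, force_ascii=False, compact=False):
    """Serialise `entries` to a json string.

    force_ascii mirrors -A (non-ASCII characters are escaped),
    compact mirrors -C (no indentation, everything on one line).
    """
    indent = None if compact else 4  # indentation width is an assumption
    return json.dumps(entries, ensure_ascii=force_ascii, indent=indent)


# Placeholder data, not real wikstraktor output.
sample = [{"word": "café"}]

print(dump_entries(sample))                                  # indented, keeps "é"
print(dump_entries(sample, force_ascii=True, compact=True))  # one line, "\u00e9"
```

ASCII-only output is convenient when the file has to pass through tools that mishandle UTF-8, while the compact form keeps large extracts smaller.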
Wikstraktor Server
The server runs on port 5000 by default; you can change that in the wikstraktor_server_config.py file.
./wikstraktor_server.py
Then there is a very simple API:
- GET server_url/search/<word>: searches the word in the default wiktionary
- GET server_url/search/<wiktlang>/<wordlang>/<word>: searches the word in wordlang in the wiktlang wiktionary
Both API calls return a json object.
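The two routes above can be queried from Python using only the standard library. This is a minimal client sketch: it assumes the server runs locally on the default port 5000, and `SERVER_URL`, `search_url` and `search` are names invented for this example.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen

SERVER_URL = "http://localhost:5000"  # assumption: local server, default port


def search_url(word, wiktlang=None, wordlang=None, server_url=SERVER_URL):
    """Build the URL for one of the two documented search routes."""
    parts = ["search"]
    if wiktlang is not None and wordlang is not None:
        parts += [wiktlang, wordlang]
    elif wiktlang is not None or wordlang is not None:
        raise ValueError("give both wiktlang and wordlang, or neither")
    parts.append(word)
    # Percent-encode each path segment so accented words stay valid in a URL.
    path = "/".join(quote(part, safe="") for part in parts)
    return server_url.rstrip("/") + "/" + path


def search(word, wiktlang=None, wordlang=None, server_url=SERVER_URL):
    """Send the GET request and decode the json object returned by the server."""
    with urlopen(search_url(word, wiktlang, wordlang, server_url)) as response:
        return json.load(response)


print(search_url("blue"))
print(search_url("blue", wiktlang="fr", wordlang="en"))
```

With a running server, `search("blue", "fr", "en")` would return the decoded json object for the English word "blue" in the French wiktionary.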
Licence
TODO but will be open source