wikstraktor
A Python tool to query Wiktionary and extract structured lexical data.
It is experimental: it attempts to identify every piece of structured information and to merge information from different sources.
Dependencies
This project depends on the following Python packages:
- pywikibot: gives access to the MediaWiki API
- wikitextparser: parses MediaWiki pages and extracts sections, templates and links
- importlib: to import parser modules
- sqlite3: for logs
- gitpython: for logs
- json: for JSON use
- re
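To check which of the packages listed above are available in the current environment, something along these lines could be used. This is only a sketch and not part of wikstraktor: the names `REQUIRED_MODULES` and `missing_modules` are made up for the illustration, and the list uses import names, which is why gitpython appears as `git`.

```python
from importlib.util import find_spec

# Import names of the dependencies listed above.
# Note: the gitpython package is imported as "git".
REQUIRED_MODULES = [
    "pywikibot",
    "wikitextparser",
    "importlib",
    "sqlite3",
    "git",
    "json",
    "re",
]


def missing_modules(names):
    """Return the top-level module names that cannot be found (hypothetical helper)."""
    # find_spec only locates the module, it does not import it
    return [name for name in names if find_spec(name) is None]


missing = missing_modules(REQUIRED_MODULES)
if missing:
    print("Missing modules:", ", ".join(missing))
else:
    print("All listed modules were found.")
```

Anything reported as missing would normally be installed by the `pip install -r requirements.txt` step described in the Installation section.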
Installation
(This may eventually be replaced by some form of automation. Using a virtual environment is probably the better option; see the server version.)
Basic version
python3 -m venv wikstraktorenv #optional for basic version
. wikstraktorenv/bin/activate #activate environment (optional)
pip install -r requirements.txt
Wikstraktor Server
If you want to run wikstraktor as a server, you need to install flask and flask-cors (which allows other domains to query the server); best practice is to do so in a virtual environment.
The following commands are taken from the documentation of these modules; it is probably safer to consult each module's own documentation:
python3 -m venv wikstraktorenv #create wikstraktorenv environment
. wikstraktorenv/bin/activate #activate environment
pip install -r server_requirements.txt
Specific user
You can install it for a specific user, without a virtual environment (although this is not recommended):
pip install -r requirements.txt
If you are an administrator, you can install it for another user:
sudo -H -u otherUser pip install -r requirements.txt
NB: it is better if that user also cloned the repo; otherwise Git may report dubious ownership, and that user should have the following lines in their .gitconfig:
[safe]
directory = /path/to/wikstraktor
Use
Wikstraktor
Python
from wikstraktor import Wikstraktor
f = Wikstraktor.get_instance('fr', 'en') #create a wikstraktor
# first parameter is the language of the wiki
# second parameter is the language of the word being looked up
f.fetch("blue") #fetch an article
str(f) #convert content to json
Bash
usage: wikstraktor.py [-h] [-l LANGUAGE] [-w WIKI_LANGUAGE] [-m MOT]
[-f DESTINATION_FILE] [-A] [-C] [-n] [-r] [-L LOG_FILE]
Query a Wiktionary
e.g.:
‣./wikstraktor.py -m blue
‣./wikstraktor.py -m blue -f blue.json -AC
‣./wikstraktor.py -l en -w fr -m yellow -L /var/log/wikstraktor.sqlite
‣./wikstraktor.py -l en -w fr -m blue -f blue.json -n -ACr
options:
-h, --help show this help message and exit
-l LANGUAGE, --language LANGUAGE
the language of the word
-w WIKI_LANGUAGE, --wiki_language WIKI_LANGUAGE
the language of the wiki
-m MOT, --mot MOT the word to look up
-f DESTINATION_FILE, --destination_file DESTINATION_FILE
the file in which to store the result
-A, --force_ascii JSON with ASCII characters only
-C, --compact JSON without indentation
-n, --no_id JSON without ids
-r, --follow_redirections
follow redirections (e.g. did → do)
-L LOG_FILE, --log_file LOG_FILE
the SQLite file in which to store the logs
(make sure that the user running the script has
write access to this file
and to the directory that contains it)
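The effect of the -A and -C flags can be illustrated with the standard json module. The sketch below assumes that they map to the `ensure_ascii` and `indent` arguments of `json.dumps`; that mapping is a guess, not something taken from the wikstraktor source. The `render` helper, the 4-space indentation and the sample payload are all hypothetical, and the payload is not the real output format.

```python
import json


def render(data, force_ascii=False, compact=False):
    """Serialize data in the spirit of -A / -C (assumed mapping, hypothetical helper)."""
    return json.dumps(
        data,
        ensure_ascii=force_ascii,       # -A: escape every non-ASCII character
        indent=None if compact else 4,  # -C: no indentation (4 spaces is an assumption)
    )


# Hypothetical payload, NOT the real wikstraktor output format
sample = {"word": "bleu", "note": "café"}

print(render(sample))                                  # indented, keeps "é" as-is
print(render(sample, force_ascii=True, compact=True))  # one line, "é" escaped
```

With both options enabled, the accented character is written as a `\u00e9` escape and the whole object fits on a single line, which is convenient when the result is piped to other tools.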
Wikstraktor Server
By default the server runs on port 5000; you can change that in the wikstraktor_server_config.py file.
./wikstraktor_server.py
The server then exposes a very simple API:
- GET server_url/search/<word>: searches for the word in the default wiktionary
- GET server_url/search/<wiktlang>/<wordlang>/<word>: searches for the word in wordlang in the wiktlang wiktionary
Both API calls return a JSON object.
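A small client for the two routes above could look like the following. This is a minimal sketch using only the standard library: the helper names `build_search_url` and `search` are hypothetical, and the server address `http://localhost:5000` in the usage comment assumes a local server on the default port. The structure of the returned JSON object is not described here, so the sketch simply returns the decoded object.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def build_search_url(server_url, word, wiktlang=None, wordlang=None):
    """Build the URL for one of the two search routes (hypothetical helper)."""
    base = server_url.rstrip("/")
    word_part = quote(word, safe="")  # percent-encode spaces, accents, slashes
    if wiktlang is None and wordlang is None:
        # GET server_url/search/<word>
        return f"{base}/search/{word_part}"
    if wiktlang is None or wordlang is None:
        raise ValueError("wiktlang and wordlang must be given together")
    # GET server_url/search/<wiktlang>/<wordlang>/<word>
    return f"{base}/search/{quote(wiktlang, safe='')}/{quote(wordlang, safe='')}/{word_part}"


def search(server_url, word, wiktlang=None, wordlang=None, timeout=10):
    """Query a running server and return the decoded JSON object."""
    url = build_search_url(server_url, word, wiktlang, wordlang)
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


# Usage, assuming a server is running locally on the default port:
# result = search("http://localhost:5000", "blue", wiktlang="fr", wordlang="en")
```

Keeping the URL construction separate from the network call makes it easy to check the encoding of unusual words without a running server.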
Licence
GPL v3, see LICENSE.md