"Then we load the training set into a new data structure called a `Source`, which contains a `pandas` `Dataframe` and a hash computed from the list of exact articles \"coordinates\" (volume and article number, and their order matters) contained in the original tsv file."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "5ad65685",
"metadata": {},
"outputs": [],
"source": [
"source = data.load('training_set')"
]
},
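{
"cell_type": "markdown",
"id": "3c8f21ab",
"metadata": {},
"source": [
"As a quick sanity check, we can peek at the loaded `Source`. Note that the attribute names `hash` and `data` below are assumptions about how the structure exposes its two components, not a documented API."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "9d47e5c2",
"metadata": {},
"outputs": [],
"source": [
"# Hypothetical attribute names for the two components of a Source:\n",
"# its fingerprint hash and the underlying pandas DataFrame.\n",
"print(source.hash)\n",
"source.data.head()"
]
},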
{
"cell_type": "markdown",
"id": "4e958e04",
"metadata": {},
"source": [
"This function rationalises the name of the files containing the confusion matrices to produce."
]
},
{
"cell_type": "code",
"execution_count": 10,
"id": "545bdb4f",
"metadata": {},
"outputs": [],
"source": [
"def preparePath(root, source, n, ranks, metricName):\n",
"Then we only have to loop on the n-gram size (`n`), the number of `ranks` to keep when computing the most frequent ones and the comparison method (the metrics' `name`)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b39c5be0",
"metadata": {},
"outputs": [],
"source": [
"for n in range(1,4):\n",
" for ranks in [10, 50, 100]:\n",
" vectorizer = topNGrams(source, n, ranks)\n",
" for name in ['colinearity', 'keysIntersection']:\n",
" imagePath = preparePath('.', source, n, ranks, name)\n",