"Then we only have to loop on the n-gram size (`n`), the number of `ranks` to keep when computing the most frequent ones and the comparison method (the metrics' `name`)."
"We loop on the n-gram size (`n`), the number of `ranks` to keep when computing the most frequent ones and the comparison method (the metrics' `name`)."
]
},
{
...
...
@@ -86,7 +60,7 @@
" for ranks in [10, 50, 100]:\n",
" vectorizer = topNGrams(source, n, ranks)\n",
" for name in ['colinearity', 'keysIntersection']:\n",
" imagePath = preparePath('.', source, n, ranks, name)\n",
from EDdA.classification import confusionMatrix, metrics, toPNG, topNGrams
import os
```
%% Cell type:markdown id:4c3064ea tags:
Then we load the training set into a new data structure called a `Source`, which contains a `pandas` `DataFrame` and a hash computed from the list of exact article "coordinates" (volume and article number; their order matters) contained in the original tsv file.
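
To make the idea concrete, here is a minimal sketch of what such a structure holds. This is purely illustrative: the actual `Source` is defined in `EDdA.classification`, and the column names `volume` and `article` as well as the choice of `md5` are assumptions made only for this example.

``` python
import hashlib
import pandas

def loadSource(path):
    # Read the tsv file into a DataFrame.
    dataframe = pandas.read_csv(path, sep='\t')
    # Hash the ordered list of article "coordinates" (volume, article number)
    # so that two files containing exactly the same articles in the same
    # order get the same identifier.  Column names and hash algorithm are
    # assumptions for this sketch.
    coordinates = list(zip(dataframe['volume'], dataframe['article']))
    digest = hashlib.md5(str(coordinates).encode('utf-8')).hexdigest()
    return dataframe, digest
```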
We then loop over the n-gram size (`n`), the number of `ranks` to keep when computing the most frequent ones, and the comparison method (the metric's `name`), as sketched below.
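
A rough sketch of that loop follows. The range of `n` values is an assumption, `preparePath` is taken to be a helper that builds the output image path for each combination, and the loop body is reduced to the calls that matter here; `topNGrams` comes from `EDdA.classification` (imported above) and `source` is the `Source` loaded earlier.

``` python
for n in range(1, 4):  # assumed range of n-gram sizes
    for ranks in [10, 50, 100]:
        vectorizer = topNGrams(source, n, ranks)
        for name in ['colinearity', 'keysIntersection']:
            imagePath = preparePath('.', source, n, ranks, name)
            # ... compute the confusion matrix for this metric and write it
            # to imagePath (this is where confusionMatrix, metrics and toPNG
            # come into play).
```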