# Lexical similarities

We want to study the n-gram similarities between the domains. A visual way to do this is to represent these similarities as confusion matrices, the same format we used to visualize the errors of our models, which gives us a common basis for comparison.

We start by importing the EDdA modules from the [project's GitLab](https://gitlab.liris.cnrs.fr/geode/EDdA-Classification).

In [None]:
from EDdA import data
from EDdA.store import preparePath
from EDdA.classification import confusionMatrix, heatmap, metrics, topNGrams
import os

Then we load the training set into a data structure called a `Source`, which contains a `pandas` `DataFrame` and a hash computed from the exact list of article "coordinates" (volume and article number; their order matters) contained in the original TSV file.

In [None]:
source = data.load('training_set')
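
To make the role of the hash concrete, here is a minimal sketch of an order-sensitive hash over article coordinates. It is not the EDdA implementation: the class `SourceSketch`, the field names `volume` and `article`, and the choice of SHA-256 are all assumptions made for illustration, and a plain list of rows stands in for the `DataFrame`.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class SourceSketch:
    """Hypothetical stand-in for a `Source`: rows plus an order-sensitive hash."""
    rows: list  # each row is a dict with (assumed) 'volume' and 'article' keys
    hash: str = field(init=False)

    def __post_init__(self):
        digest = hashlib.sha256()
        # Feed the coordinates in their original order, so that the same
        # articles listed in a different order yield a different hash.
        for row in self.rows:
            digest.update(f"{row['volume']}\t{row['article']}\n".encode("utf-8"))
        self.hash = digest.hexdigest()


# Toy coordinates, not taken from the real training set.
rows = [{"volume": 1, "article": 5}, {"volume": 2, "article": 3}]
original = SourceSketch(rows)
reordered = SourceSketch(list(reversed(rows)))
print(original.hash == reordered.hash)
```

With such a hash, two runs on the same articles in the same order share one identifier, which is what lets it serve as a directory name for the generated images.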

We loop over the n-gram size (`n`), the number of `ranks` to keep when computing the most frequent n-grams, and the comparison method (the metric's `name`), generating a PNG confusion matrix for each combination.

In [None]:
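for n in range(1, 4):  # n-gram sizes: 1, 2 and 3
    for ranks in [10, 50, 100]:  # number of most frequent n-grams to keep
        vectorizer = topNGrams(source, n, ranks)
        for name in ['colinearity', 'keysIntersection']:  # comparison methods
            # One image per combination, stored under the source's hash
            imagePath = preparePath(f"confusionMatrix/{source.hash}/{n}grams_top{ranks}_{name}.png")
            heatmap(confusionMatrix(vectorizer, metrics[name]), imagePath)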
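
The computation behind each matrix can be sketched independently of the EDdA modules. The code below is an illustration under assumptions, not the library's code: `top_ngrams`, `colinearity`, `keys_intersection` and `similarity_matrix` are hypothetical helpers, we assume that "colinearity" means the cosine of the angle between two frequency vectors and that "keysIntersection" measures the share of top n-grams two domains have in common, and the corpus is a toy example rather than the training set.

```python
from collections import Counter
from math import sqrt


def top_ngrams(texts, n, ranks):
    """Count word n-grams over all texts and keep the `ranks` most frequent."""
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()
        counts.update(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return dict(counts.most_common(ranks))


def colinearity(a, b):
    """Assumed definition: cosine similarity between two frequency vectors."""
    dot = sum(count * b[gram] for gram, count in a.items() if gram in b)
    norm = sqrt(sum(c * c for c in a.values())) * sqrt(sum(c * c for c in b.values()))
    return dot / norm if norm else 0.0


def keys_intersection(a, b):
    """Assumed definition: shared n-grams divided by the larger vocabulary."""
    if not a or not b:
        return 0.0
    return len(a.keys() & b.keys()) / max(len(a), len(b))


def similarity_matrix(vectors, metric):
    """Compare every domain with every other one, in sorted domain order."""
    domains = sorted(vectors)
    return domains, [[metric(vectors[r], vectors[c]) for c in domains] for r in domains]


# Toy corpus with two made-up domains.
corpus = {
    "Geography": ["the river flows into the sea", "the town lies on the river"],
    "Law": ["the court ruled on the case", "the law of the land"],
}
vectors = {domain: top_ngrams(texts, 1, 10) for domain, texts in corpus.items()}
domains, matrix = similarity_matrix(vectors, colinearity)
for domain, row in zip(domains, matrix):
    print(domain, [round(value, 3) for value in row])
```

Under these definitions both metrics are symmetric and a domain compared with itself scores 1, so the diagonal of the resulting matrix is uniform and the informative part is off-diagonal.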