Commit 72957cfc authored by Alice Brenon

Adding the notebook used to generate the similarity graphs

parent 030d12c2
%% Cell type:markdown id:45d83498 tags:
# Similarity graphs
Looking at the confusion matrices on the [lexical features](Lexical_similarities.ipynb) provides some interesting insight, but it is hard to follow the flows across the whole graph given the quadratic number of edges involved.
We keep the same nodes but draw the subgraph containing only the strongest edge(s) going from each node to a different node. We do not consider edges from a node to itself: they carry no information, since their value is always 100% (there is no point comparing the n-grams of a domain with themselves). Hence the graph we obtain has an edge between two nodes if the top-ranking n-grams of the domain corresponding to the source node are most similar to those of the domain corresponding to the target node.
%% Cell type:code id:fc7a6e69 tags:
``` /gnu/store/2rpsj69fzmcnafz4rml0blrynfayxqzr-python-wrapper-3.9.9/bin/python
``` /gnu/store/fby6l226w8kh2mwkzpjpajmgy0q1kxli-python-wrapper-3.9.9/bin/python
from EDdA import data
from EDdA.store import preparePath
from EDdA.classification import confusionMatrix, metrics, toPNG, topNGrams
from EDdA.classification import confusionMatrix, metrics, showGraph, topNGrams
from IPython.display import Image
import graphviz
import os
```
%% Cell type:markdown id:f3bf7fe2 tags:
We first load the articles.
%% Cell type:code id:f49c39b5 tags:
``` /gnu/store/2rpsj69fzmcnafz4rml0blrynfayxqzr-python-wrapper-3.9.9/bin/python
``` /gnu/store/fby6l226w8kh2mwkzpjpajmgy0q1kxli-python-wrapper-3.9.9/bin/python
source = data.load('training_set')
```
%% Cell type:markdown id:8e0b222a tags:
We implement the relation described above as the function `keepOnlyNearest`. Note in particular that when a node's n-grams are equally similar to the n-grams of several other nodes, all the corresponding edges are drawn, so, contrary to the first intuition conveyed by the description above, several edges may leave one node. Conversely, a node whose strongest similarity to any other node is 0 gets no outgoing edge at all.
%% Cell type:code id:3a37bfa1 tags:
``` /gnu/store/2rpsj69fzmcnafz4rml0blrynfayxqzr-python-wrapper-3.9.9/bin/python
def nearestAdjacency(matrix):
``` /gnu/store/fby6l226w8kh2mwkzpjpajmgy0q1kxli-python-wrapper-3.9.9/bin/python
def keepOnlyNearest(matrix):
m = []
dimension = len(matrix)
for i in range(0, dimension):
link = max([matrix[i][j] for j in range(0, dimension) if j != i])
if link == 0:
m.append([])
if link > 0:
m.append([link if i != j and matrix[i][j] == link else None for j in range(0, dimension)])
else:
m.append([j for j in range(0, dimension) if j != i and matrix[i][j] == link])
m.append([None] * dimension)
return m
```
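To make the tie behaviour concrete, here is a minimal, self-contained sketch of the same filter with its indentation restored, run on a small hand-made matrix. The values in `toy` are invented for illustration and do not come from the dataset.

```python
def keepOnlyNearest(matrix):
    """For each row, keep only the strongest off-diagonal value(s)."""
    m = []
    dimension = len(matrix)
    for i in range(dimension):
        # Strongest similarity from node i to any *other* node.
        link = max(matrix[i][j] for j in range(dimension) if j != i)
        if link > 0:
            # Keep every cell that ties for the maximum, drop the rest.
            m.append([link if i != j and matrix[i][j] == link else None
                      for j in range(dimension)])
        else:
            # No similarity with any other node: no outgoing edge.
            m.append([None] * dimension)
    return m


# Illustrative similarity matrix (assumed values, diagonal = 100%).
toy = [
    [1.0, 0.4, 0.4],  # node 0 is equally close to nodes 1 and 2
    [0.2, 1.0, 0.7],  # node 1 is closest to node 2
    [0.0, 0.0, 1.0],  # node 2 shares nothing with the others
]
filtered = keepOnlyNearest(toy)
```

In this sketch the first row keeps two edges because of the tie, the second keeps one, and the third keeps none.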
%% Cell type:code id:b9c92861 tags:
%% Cell type:markdown id:8611d521 tags:
To assess the relevance of the previous function by comparison, we define another adjacency matrix filter for our graphs: this time we keep every edge between two different nodes as long as there is at least some similarity between them, no matter how small (we keep the "no-loops" constraint though).
``` /gnu/store/2rpsj69fzmcnafz4rml0blrynfayxqzr-python-wrapper-3.9.9/bin/python
def listToMatrix(adjacencyList):
%% Cell type:code id:ebc4db15 tags:
``` /gnu/store/fby6l226w8kh2mwkzpjpajmgy0q1kxli-python-wrapper-3.9.9/bin/python
def allNotNull(matrix):
dimension = len(matrix)
m = []
dimension = len(adjacencyList)
for i in range(0, dimension):
m.append(dimension * [0])
for j in adjacencyList[i]:
m[i][j] = 1
m.append([])
for j in range(0, len(matrix[i])):
link = matrix[i][j]
m[i].append(link if j != i and link > 0 else None)
return m
```
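As a point of comparison, the following sketch applies the permissive filter to a small hand-made matrix. The function body follows the new version of the cell above with indentation restored; the values in `toy` are assumptions chosen only to show the effect.

```python
def allNotNull(matrix):
    """Keep every strictly positive off-diagonal value, drop the rest."""
    m = []
    for i in range(len(matrix)):
        m.append([])
        for j in range(len(matrix[i])):
            link = matrix[i][j]
            # Loops (j == i) and null similarities are replaced by None.
            m[i].append(link if j != i and link > 0 else None)
    return m


# Illustrative similarity matrix (assumed values, diagonal = 100%).
toy = [
    [1.0, 0.4, 0.4],
    [0.2, 1.0, 0.7],
    [0.0, 0.0, 1.0],
]
kept = allNotNull(toy)
```

Unlike the "nearest only" filter, the second row here keeps its weaker edge (0.2) alongside the stronger one (0.7).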
%% Cell type:code id:69d494ab tags:
%% Cell type:markdown id:f7cc5565 tags:
``` /gnu/store/2rpsj69fzmcnafz4rml0blrynfayxqzr-python-wrapper-3.9.9/bin/python
def showGraph(n, ranks, metricsName):
adjacencyList = nearestAdjacency(confusionMatrix(topNGrams(source, n, ranks), metrics[metricsName]))
g = graphviz.Digraph()
g.graph_attr['rankdir'] = 'LR'
dimension = len(adjacencyList)
for i in range(0, dimension):
g.node(data.domains[i])
for i in range(0, dimension):
for j in adjacencyList[i]:
g.edge(data.domains[i], data.domains[j])
return Image(filename=g.render(
preparePath(f'../graph/{source.hash}/{n}grams_top{ranks}_{metricsName}.gv'),
format='png')
)
The two functions above provide us with filters to transform the adjacency matrices of graphs, which we can apply before rendering them with our graphviz-based primitive `showGraph`.
%% Cell type:code id:ebe2245e tags:
``` /gnu/store/fby6l226w8kh2mwkzpjpajmgy0q1kxli-python-wrapper-3.9.9/bin/python
def topNGramsGraph(source, n, ranks, metricsName):
    return Image(showGraph(
        keepOnlyNearest(confusionMatrix(topNGrams(source, n, ranks), metrics[metricsName])),
        f'./graph/{source.hash}/{n}grams_top{ranks}_{metricsName}.gv'
    ))
```
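The internals of `showGraph` are not shown in this commit, so the following is only a rough sketch of how a filtered matrix (with `None` marking absent edges) could be turned into Graphviz DOT text. The helper `matrix_to_dot`, the domain labels and the choice of printing weights as edge labels are all hypothetical; the real primitive may differ.

```python
def matrix_to_dot(matrix, labels):
    """Hypothetical helper: serialise a filtered matrix as a DOT digraph."""
    lines = ['digraph {', '  rankdir=LR;']
    # Declare every node, so isolated ones still show up.
    for label in labels:
        lines.append(f'  "{label}";')
    # One directed edge per non-None cell.
    for i, row in enumerate(matrix):
        for j, weight in enumerate(row):
            if weight is not None:
                lines.append(
                    f'  "{labels[i]}" -> "{labels[j]}" [label="{weight}"];')
    lines.append('}')
    return '\n'.join(lines)


# Assumed inputs: a matrix shaped like the output of keepOnlyNearest,
# and made-up domain names.
filtered = [
    [None, 0.4, 0.4],
    [None, None, 0.7],
    [None, None, None],
]
labels = ['Physics', 'Chemistry', 'Law']
dot = matrix_to_dot(filtered, labels)
```

The resulting text can be saved to a `.gv` file and rendered with any Graphviz front end.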
%% Cell type:markdown id:78b98423 tags:
We can now iterate over our parameters as we previously did for the lexical similarities and generate the corresponding graphs.
%% Cell type:code id:3d0f3709 tags:
``` /gnu/store/2rpsj69fzmcnafz4rml0blrynfayxqzr-python-wrapper-3.9.9/bin/python
``` /gnu/store/fby6l226w8kh2mwkzpjpajmgy0q1kxli-python-wrapper-3.9.9/bin/python
for n in range(1, 4):
for ranks in [10, 50, 100]:
for name in metrics:
showGraph(n, ranks, name)
topNGramsGraph(source, n, ranks, name)
```
......