{
"cells": [
{
"cell_type": "markdown",
"id": "45d83498",
"metadata": {},
"source": [
"# Similarity graphs\n",
"\n",
"The confusion matrices computed on the [lexical features](Lexical_similarities.ipynb) provide some interesting insight, but it is hard to grasp the flows across the whole graph given the quadratic number of edges involved.\n",
"\n",
"We keep the same nodes but draw only the subgraph made of the strongest edge(s) going from each node to a different node. Edges from a node to itself are ignored because they carry no information: their value is always 100%, since there is no point in comparing the n-grams of a domain with themselves. Hence, in the resulting graph, there is an edge between two nodes if the top-ranking n-grams of the domain corresponding to the source node are most similar to those of the domain corresponding to the target node."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fc7a6e69",
"metadata": {},
"outputs": [],
"source": [
"from EDdA import data\n",
"from EDdA.classification import confusionMatrix, metrics, showGraph, topNGrams\n",
"from IPython.display import Image"
]
},
{
"cell_type": "markdown",
"id": "f3bf7fe2",
"metadata": {},
"source": [
"We first load the articles."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "f49c39b5",
"metadata": {},
"outputs": [],
"source": [
"source = data.load('training_set')"
]
},
{
"cell_type": "markdown",
"id": "8e0b222a",
"metadata": {},
"source": [
"We implement the relation described above as the function `keepOnlyNearest`. Note in particular that when a node's n-grams are exactly as similar to the n-grams of several other nodes, all the corresponding edges are drawn. So, contrary to what the description above may first suggest, several edges may leave a single node. Conversely, a node with no similarity to any other node has no outgoing edge."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3a37bfa1",
"metadata": {},
"outputs": [],
"source": [
"def keepOnlyNearest(matrix):\n",
"    # For each row, keep only the strongest off-diagonal value(s); ties are all kept.\n",
"    m = []\n",
"    dimension = len(matrix)\n",
"    for i in range(dimension):\n",
"        # default=0 covers a 1x1 matrix, which has no off-diagonal entry\n",
"        link = max((matrix[i][j] for j in range(dimension) if j != i), default=0)\n",
"        if link > 0:\n",
"            m.append([link if i != j and matrix[i][j] == link else None for j in range(dimension)])\n",
"        else:\n",
"            # no similarity with any other node: no outgoing edge\n",
"            m.append([None] * dimension)\n",
"    return m"
]
},
{
"cell_type": "markdown",
"id": "8611d521",
"metadata": {},
"source": [
"To assess the relevance of the previous function, we define another adjacency matrix filter for comparison. This one keeps every edge between two different nodes as long as there is some similarity between them, no matter how small (we keep the \"no-loops\" constraint though)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ebc4db15",
"metadata": {},
"outputs": [],
"source": [
"def allNotNull(matrix):\n",
"    # Keep every strictly positive off-diagonal value; drop loops and null similarities.\n",
"    dimension = len(matrix)\n",
"    m = []\n",
"    for i in range(0, dimension):\n",
"        m.append([])\n",
"        for j in range(0, len(matrix[i])):\n",
"            link = matrix[i][j]\n",
"            m[i].append(link if j != i and link > 0 else None)\n",
"    return m"
]
},
{
"cell_type": "markdown",
"id": "f7cc5565",
"metadata": {},
"source": [
"The two functions above provide us with filters on the adjacency matrices of graphs, which we can apply before rendering the graphs with our graphviz-based primitive `showGraph`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ebe2245e",
"metadata": {},
"outputs": [],
"source": [
"def topNGramsGraph(source, n, ranks, metricsName):\n",
"    return Image(showGraph(\n",
"        keepOnlyNearest(confusionMatrix(topNGrams(source, n, ranks), metrics[metricsName])),\n",
"        f'./graph/{source.hash}/{n}grams_top{ranks}_{metricsName}.gv'\n",
"    ))"
]
},
{
"cell_type": "markdown",
"id": "78b98423",
"metadata": {},
"source": [
"We can now iterate over our parameters, as we did previously for the lexical similarities, and generate the corresponding graphs."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3d0f3709",
"metadata": {},
"outputs": [],
"source": [
"for n in range(1, 4):\n",
"    for ranks in [10, 50, 100]:\n",
"        for name in metrics:\n",
"            topNGramsGraph(source, n, ranks, name)"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "/gnu/store/fby6l226w8kh2mwkzpjpajmgy0q1kxli-python-wrapper-3.9.9/bin/python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.9"
}
},
"nbformat": 4,
"nbformat_minor": 5
}