"Looking at the confusion matrices on the [lexical features](Lexical_similarities.ipynb) provides some interesting insight, but the flows across the whole graph are hard to follow given the quadratic number of edges involved.\n",
"\n",
"We keep the same nodes but draw the subgraph containing only the strongest edge(s) going from a node to a different node (that is, we ignore edges from a node to itself, which carry no information since their value is always 100%: there is no point comparing the n-grams of a domain with themselves). Hence the resulting graph has an edge between two nodes if the top-ranking n-grams of the source node's domain are most similar to those of the target node's domain."
"We implement the relation described above as the function `keepOnlyNearest`. Note in particular that when a node's n-grams are exactly as similar to the n-grams of several other nodes, all the corresponding edges are drawn, so, contrary to the first intuition conveyed by the description above, several edges may leave one node."
]
},
{
"cell_type": "code",
"execution_count": null,
...
...
@@ -32,56 +57,72 @@
"metadata": {},
"outputs": [],
"source": [
"def nearestAdjacency(matrix):\n",
"def keepOnlyNearest(matrix):\n",
" m = []\n",
" dimension = len(matrix)\n",
" for i in range(0, dimension):\n",
" link = max([matrix[i][j] for j in range(0, dimension) if j != i])\n",
" if link == 0:\n",
" m.append([])\n",
" if link > 0:\n",
" m.append([link if i != j and matrix[i][j] == link else None for j in range(0, dimension)])\n",
" else:\n",
" m.append([j for j in range(0, dimension) if j != i and matrix[i][j] == link])\n",
" m.append([None] * dimension)\n",
" return m"
]
},
{
"cell_type": "markdown",
"id": "8611d521",
"metadata": {},
"source": [
"To assess the relevance of the previous function by comparison, we define another adjacency matrix filter for our graphs, which this time keeps every edge between two different nodes as long as there is some similarity, no matter how small (the \"no-loops\" constraint still applies)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b9c92861",
"id": "ebc4db15",
"metadata": {},
"outputs": [],
"source": [
"def listToMatrix(adjacencyList):\n",
"def allNotNull(matrix):\n",
" dimension = len(matrix)\n",
" m = []\n",
" dimension = len(adjacencyList)\n",
" for i in range(0, dimension):\n",
" m.append(dimension * [0])\n",
" for j in adjacencyList[i]:\n",
" m[i][j] = 1\n",
" m.append([])\n",
" for j in range(0, len(matrix[i])):\n",
" link = matrix[i][j]\n",
" m[i].append(link if j != i and link > 0 else None)\n",
" return m"
]
},
{
"cell_type": "markdown",
"id": "f7cc5565",
"metadata": {},
"source": [
"The two functions above provide us with filters that transform the adjacency matrices of graphs and that we can apply before rendering them with our graphviz-based primitive `showGraph`."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "69d494ab",
"id": "ebe2245e",
"metadata": {},
"outputs": [],
"source": [
"def showGraph(n, ranks, metricsName):\n",
" adjacencyList = nearestAdjacency(confusionMatrix(topNGrams(source, n, ranks), metrics[metricsName]))\n",
Looking at the confusion matrices on the [lexical features](Lexical_similarities.ipynb) provides some interesting insight, but the flows across the whole graph are hard to follow given the quadratic number of edges involved.
We keep the same nodes but draw the subgraph containing only the strongest edge(s) going from a node to a different node (that is, we ignore edges from a node to itself, which carry no information since their value is always 100%: there is no point comparing the n-grams of a domain with themselves). Hence the resulting graph has an edge between two nodes if the top-ranking n-grams of the source node's domain are most similar to those of the target node's domain.
We implement the relation described above as the function `keepOnlyNearest`. Note in particular that when a node's n-grams are exactly as similar to the n-grams of several other nodes, all the corresponding edges are drawn, so, contrary to the first intuition conveyed by the description above, several edges may leave one node.
link = max([matrix[i][j] for j in range(0, dimension) if j != i])
if link == 0:
m.append([])
if link > 0:
m.append([link if i != j and matrix[i][j] == link else None for j in range(0, dimension)])
else:
m.append([j for j in range(0, dimension) if j != i and matrix[i][j] == link])
m.append([None] * dimension)
return m
```
%% Cell type:code id:b9c92861 tags:
%% Cell type:markdown id:8611d521 tags:
To assess the relevance of the previous function by comparison, we define another adjacency matrix filter for our graphs, which this time keeps every edge between two different nodes as long as there is some similarity, no matter how small (the "no-loops" constraint still applies).
The two functions above provide us with filters that transform the adjacency matrices of graphs and that we can apply before rendering them with our graphviz-based primitive `showGraph`.
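To make the difference between the two filters concrete, here is a minimal, self-contained sketch applied to a toy matrix. It is an illustration under stated assumptions, not the notebook's code: the snake_case functions `keep_only_nearest` and `all_not_null` are hypothetical restatements of `keepOnlyNearest` and `allNotNull`, the input is assumed to be a square list of lists of similarity percentages, and the values in `similarities` are made up.

```python
def keep_only_nearest(matrix):
    """Hypothetical restatement: keep, on each row, only the strongest
    off-diagonal value(s); every other cell becomes None."""
    dimension = len(matrix)
    filtered = []
    for i in range(dimension):
        # Strongest similarity from node i to any *other* node.
        link = max((matrix[i][j] for j in range(dimension) if j != i), default=0)
        if link > 0:
            # Ties are all kept, so several edges may leave node i.
            filtered.append([link if j != i and matrix[i][j] == link else None
                             for j in range(dimension)])
        else:
            # No similarity with any other node: no outgoing edge.
            filtered.append([None] * dimension)
    return filtered


def all_not_null(matrix):
    """Hypothetical restatement: keep every strictly positive
    off-diagonal value (no self-loops)."""
    return [[link if j != i and link > 0 else None
             for j, link in enumerate(row)]
            for i, row in enumerate(matrix)]


def count_edges(matrix):
    """Number of cells that would be drawn as edges."""
    return sum(cell is not None for row in matrix for cell in row)


# Made-up similarity percentages between four domains.
similarities = [
    [100, 40, 40, 0],   # node 0 is equally close to nodes 1 and 2 (a tie)
    [40, 100, 10, 0],
    [5, 30, 100, 0],
    [0, 0, 0, 100],     # node 3 shares nothing with the others
]

nearest = keep_only_nearest(similarities)
everything = all_not_null(similarities)

print(nearest[0])                # [None, 40, 40, None]
print(count_edges(nearest))      # 4
print(count_edges(everything))   # 6
```

Node 0 keeps two outgoing edges because of the tie, and node 3 keeps none under either filter. On this toy input the "nearest" filter already drops a third of the edges that the "all non-null" filter keeps; the gap is expected to widen as the number of nodes grows.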