Adding the notebook used to generate the similarity graphs

Commit 72957cfc authored 3 years ago by Alice Brenon
--- a/notebooks/Domains Graphs.ipynb
+++ b/notebooks/Domains Graphs.ipynb
 {
 "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "45d83498",
+   "metadata": {},
+   "source": [
+    "# Similarity graphs\n",
+    "\n",
+    "Looking at the confusion matrices on the [lexical features](Lexical_similarities.ipynb) provides some interesting insight, but it is hard to follow the flow across the whole graph given the (quadratic) number of edges involved.\n",
+    "\n",
+    "We keep the same nodes but draw the subgraph containing only the strongest edge(s) going from a node to a different node (that is, we do not consider edges from a node to itself, which carry no information since their value is always 100%: there is no point comparing the n-grams of a domain with themselves). Hence, in the graph we obtain, there is an edge between two nodes if the top-ranking n-grams of the domain corresponding to the source node are most similar to those of the domain corresponding to the target node."
+   ]
+  },
  {
   "cell_type": "code",
   "execution_count": null,
@@ -8,11 +20,16 @@
   "outputs": [],
   "source": [
    "from EDdA import data\n",
-    "from EDdA.store import preparePath\n",
-    "from EDdA.classification import confusionMatrix, metrics, toPNG, topNGrams\n",
-    "from IPython.display import Image\n",
-    "import graphviz\n",
-    "import os"
+    "from EDdA.classification import confusionMatrix, metrics, showGraph, topNGrams\n",
+    "from IPython.display import Image"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "f3bf7fe2",
+   "metadata": {},
+   "source": [
+    "We first load the articles."
   ]
  },
  {
@@ -25,6 +42,14 @@
    "source = data.load('training_set')"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "id": "8e0b222a",
+   "metadata": {},
+   "source": [
+    "We implement the relation described above as the function `keepOnlyNearest`. Note in particular that when a node's n-grams are equally similar to the n-grams of several other nodes, all the corresponding edges are drawn, so, contrary to the first intuition conveyed by the description above, several edges may leave one node."
+   ]
+  },
  {
   "cell_type": "code",
   "execution_count": null,
@@ -32,56 +57,72 @@
   "metadata": {},
   "outputs": [],
   "source": [
-    "def nearestAdjacency(matrix):\n",
+    "def keepOnlyNearest(matrix):\n",
    "    m = []\n",
    "    dimension = len(matrix)\n",
    "    for i in range(0, dimension):\n",
    "        link = max([matrix[i][j] for j in range(0, dimension) if j != i])\n",
-    "        if link == 0:\n",
-    "            m.append([])\n",
+    "        if link > 0:\n",
+    "            m.append([link if i != j and matrix[i][j] == link else None for j in range(0, dimension)])\n",
    "        else:\n",
-    "            m.append([j for j in range(0, dimension) if j != i and matrix[i][j] == link])\n",
+    "            m.append([None] * dimension)\n",
    "    return m"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "id": "8611d521",
+   "metadata": {},
+   "source": [
+    "To assess the relevance of the previous function by comparison, we define another adjacency matrix filter for our graphs: this time we keep all edges as long as there is at least some similarity, no matter how small, between two different nodes (we still keep the \"no-loops\" constraint)."
+   ]
+  },
  {
   "cell_type": "code",
   "execution_count": null,
-   "id": "b9c92861",
+   "id": "ebc4db15",
   "metadata": {},
   "outputs": [],
   "source": [
-    "def listToMatrix(adjacencyList):\n",
+    "def allNotNull(matrix):\n",
+    "    dimension = len(matrix)\n",
    "    m = []\n",
-    "    dimension = len(adjacencyList)\n",
    "    for i in range(0, dimension):\n",
-    "        m.append(dimension * [0])\n",
-    "        for j in adjacencyList[i]:\n",
-    "            m[i][j] = 1\n",
+    "        m.append([])\n",
+    "        for j in range(0, len(matrix[i])):\n",
+    "            link = matrix[i][j]\n",
+    "            m[i].append(link if j != i and link > 0 else None)\n",
    "    return m"
   ]
  },
+  {
+   "cell_type": "markdown",
+   "id": "f7cc5565",
+   "metadata": {},
+   "source": [
+    "The two functions above provide us with filters on the adjacency matrices of graphs, which we can apply before rendering the graphs with our graphviz-based primitive `showGraph`."
+   ]
+  },
  {
   "cell_type": "code",
   "execution_count": null,
-   "id": "69d494ab",
+   "id": "ebe2245e",
   "metadata": {},
   "outputs": [],
   "source": [
-    "def showGraph(n, ranks, metricsName):\n",
-    "    adjacencyList = nearestAdjacency(confusionMatrix(topNGrams(source, n, ranks), metrics[metricsName]))\n",
-    "    g = graphviz.Digraph()\n",
-    "    g.graph_attr['rankdir'] = 'LR'\n",
-    "    dimension = len(adjacencyList)\n",
-    "    for i in range(0, dimension):\n",
-    "        g.node(data.domains[i])\n",
-    "    for i in range(0, dimension):\n",
-    "        for j in adjacencyList[i]:\n",
-    "            g.edge(data.domains[i], data.domains[j])\n",
-    "    return Image(filename=g.render(\n",
-    "                    preparePath(f'../graph/{source.hash}/{n}grams_top{ranks}_{metricsName}.gv'),\n",
-    "                    format='png')\n",
-    "                )"
+    "def topNGramsGraph(source, n, ranks, metricsName):\n",
+    "    return Image(showGraph(\n",
+    "            keepOnlyNearest(confusionMatrix(topNGrams(source, n, ranks), metrics[metricsName])),\n",
+    "            f'./graph/{source.hash}/{n}grams_top{ranks}_{metricsName}.gv'\n",
+    "        ))"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "78b98423",
+   "metadata": {},
+   "source": [
+    "We can now iterate over our parameters, as we previously did for the lexical similarities, and generate the corresponding graphs."
   ]
  },
  {
@@ -94,14 +135,14 @@
    "for n in range(1, 4):\n",
    "    for ranks in [10, 50, 100]:\n",
    "        for name in metrics:\n",
-    "            showGraph(n, ranks, name)"
+    "            topNGramsGraph(source, n, ranks, name)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
-   "language": "/gnu/store/2rpsj69fzmcnafz4rml0blrynfayxqzr-python-wrapper-3.9.9/bin/python",
+   "language": "/gnu/store/fby6l226w8kh2mwkzpjpajmgy0q1kxli-python-wrapper-3.9.9/bin/python",
   "name": "python3"
  },
  "language_info": {

+%% Cell type:markdown id:45d83498 tags:
+
+# Similarity graphs
+
+Looking at the confusion matrices on the [lexical features](Lexical_similarities.ipynb) provides some interesting insight, but it is hard to follow the flow across the whole graph given the (quadratic) number of edges involved.
+
+We keep the same nodes but draw the subgraph containing only the strongest edge(s) going from a node to a different node (that is, we do not consider edges from a node to itself, which carry no information since their value is always 100%: there is no point comparing the n-grams of a domain with themselves). Hence, in the graph we obtain, there is an edge between two nodes if the top-ranking n-grams of the domain corresponding to the source node are most similar to those of the domain corresponding to the target node.
+
 %% Cell type:code id:fc7a6e69 tags:

-``` /gnu/store/2rpsj69fzmcnafz4rml0blrynfayxqzr-python-wrapper-3.9.9/bin/python
+``` /gnu/store/fby6l226w8kh2mwkzpjpajmgy0q1kxli-python-wrapper-3.9.9/bin/python
 from EDdA import data
-from EDdA.store import preparePath
-from EDdA.classification import confusionMatrix, metrics, toPNG, topNGrams
+from EDdA.classification import confusionMatrix, metrics, showGraph, topNGrams
 from IPython.display import Image
-import graphviz
-import os
 ```

+%% Cell type:markdown id:f3bf7fe2 tags:
+
+We first load the articles.
+
 %% Cell type:code id:f49c39b5 tags:

-``` /gnu/store/2rpsj69fzmcnafz4rml0blrynfayxqzr-python-wrapper-3.9.9/bin/python
+``` /gnu/store/fby6l226w8kh2mwkzpjpajmgy0q1kxli-python-wrapper-3.9.9/bin/python
 source = data.load('training_set')
 ```

+%% Cell type:markdown id:8e0b222a tags:
+
+We implement the relation described above as the function `keepOnlyNearest`. Note in particular that when a node's n-grams are equally similar to the n-grams of several other nodes, all the corresponding edges are drawn, so, contrary to the first intuition conveyed by the description above, several edges may leave one node.
+
 %% Cell type:code id:3a37bfa1 tags:

-``` /gnu/store/2rpsj69fzmcnafz4rml0blrynfayxqzr-python-wrapper-3.9.9/bin/python
-def nearestAdjacency(matrix):
+``` /gnu/store/fby6l226w8kh2mwkzpjpajmgy0q1kxli-python-wrapper-3.9.9/bin/python
+def keepOnlyNearest(matrix):
    m = []
    dimension = len(matrix)
    for i in range(0, dimension):
        link = max([matrix[i][j] for j in range(0, dimension) if j != i])
-        if link == 0:
-            m.append([])
+        if link > 0:
+            m.append([link if i != j and matrix[i][j] == link else None for j in range(0, dimension)])
        else:
-            m.append([j for j in range(0, dimension) if j != i and matrix[i][j] == link])
+            m.append([None] * dimension)
    return m
 ```
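
To make the tie behaviour concrete, here is a small worked example on a made-up 3×3 similarity matrix. The values and the name `toy` are illustrative assumptions, not taken from the EDdA data; the function body is copied from the cell above so that the snippet runs on its own.

```python
def keepOnlyNearest(matrix):
    m = []
    dimension = len(matrix)
    for i in range(0, dimension):
        # strongest similarity from node i to any *other* node
        link = max([matrix[i][j] for j in range(0, dimension) if j != i])
        if link > 0:
            # keep every off-diagonal entry that reaches that maximum (ties included)
            m.append([link if i != j and matrix[i][j] == link else None for j in range(0, dimension)])
        else:
            # no positive similarity at all: node i gets no outgoing edge
            m.append([None] * dimension)
    return m

# Hypothetical similarities between three domains (diagonal = self-similarity)
toy = [
    [1.0, 0.5, 0.5],  # node 0 is equally similar to nodes 1 and 2
    [0.2, 1.0, 0.1],  # node 1 is closest to node 0
    [0.0, 0.0, 1.0],  # node 2 shares nothing with the others
]

nearest = keepOnlyNearest(toy)
for row in nearest:
    print(row)
# [None, 0.5, 0.5]
# [0.2, None, None]
# [None, None, None]
```

Node 0 keeps two outgoing edges because of the tie, node 1 keeps exactly one, and node 2 keeps none since its best similarity to another node is zero.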

-%% Cell type:code id:b9c92861 tags:
+%% Cell type:markdown id:8611d521 tags:
+
+To assess the relevance of the previous function by comparison, we define another adjacency matrix filter for our graphs: this time we keep all edges as long as there is at least some similarity, no matter how small, between two different nodes (we still keep the "no-loops" constraint).

-``` /gnu/store/2rpsj69fzmcnafz4rml0blrynfayxqzr-python-wrapper-3.9.9/bin/python
-def listToMatrix(adjacencyList):
+%% Cell type:code id:ebc4db15 tags:
+
+``` /gnu/store/fby6l226w8kh2mwkzpjpajmgy0q1kxli-python-wrapper-3.9.9/bin/python
+def allNotNull(matrix):
+    dimension = len(matrix)
    m = []
-    dimension = len(adjacencyList)
    for i in range(0, dimension):
-        m.append(dimension * [0])
-        for j in adjacencyList[i]:
-            m[i][j] = 1
+        m.append([])
+        for j in range(0, len(matrix[i])):
+            link = matrix[i][j]
+            m[i].append(link if j != i and link > 0 else None)
    return m
 ```
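
As a point of comparison, the sketch below applies `allNotNull` to a made-up 3×3 matrix and counts the edges it keeps. The matrix `toy` and the variable `edges` are assumptions introduced for illustration only; the function body is copied from the cell above.

```python
def allNotNull(matrix):
    dimension = len(matrix)
    m = []
    for i in range(0, dimension):
        m.append([])
        for j in range(0, len(matrix[i])):
            link = matrix[i][j]
            # keep any strictly positive similarity, except on the diagonal
            m[i].append(link if j != i and link > 0 else None)
    return m

# Hypothetical similarities between three domains (diagonal = self-similarity)
toy = [
    [1.0, 0.5, 0.5],
    [0.2, 1.0, 0.1],
    [0.0, 0.0, 1.0],
]

kept = allNotNull(toy)
# number of edges that would be drawn
edges = sum(1 for row in kept for link in row if link is not None)
print(kept)
print(edges)
```

On this input the weak edge from node 1 to node 2 (similarity 0.1) survives, whereas a filter keeping only the strongest edge per node would drop it; this is the difference the comparison is meant to expose.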

-%% Cell type:code id:69d494ab tags:
+%% Cell type:markdown id:f7cc5565 tags:

-``` /gnu/store/2rpsj69fzmcnafz4rml0blrynfayxqzr-python-wrapper-3.9.9/bin/python
-def showGraph(n, ranks, metricsName):
-    adjacencyList = nearestAdjacency(confusionMatrix(topNGrams(source, n, ranks), metrics[metricsName]))
-    g = graphviz.Digraph()
-    g.graph_attr['rankdir'] = 'LR'
-    dimension = len(adjacencyList)
-    for i in range(0, dimension):
-        g.node(data.domains[i])
-    for i in range(0, dimension):
-        for j in adjacencyList[i]:
-            g.edge(data.domains[i], data.domains[j])
-    return Image(filename=g.render(
-                    preparePath(f'../graph/{source.hash}/{n}grams_top{ranks}_{metricsName}.gv'),
-                    format='png')
-                )
+The two functions above provide us with filters on the adjacency matrices of graphs, which we can apply before rendering the graphs with our graphviz-based primitive `showGraph`.
+
+%% Cell type:code id:ebe2245e tags:
+
+``` /gnu/store/fby6l226w8kh2mwkzpjpajmgy0q1kxli-python-wrapper-3.9.9/bin/python
+def topNGramsGraph(source, n, ranks, metricsName):
+    return Image(showGraph(
+            keepOnlyNearest(confusionMatrix(topNGrams(source, n, ranks), metrics[metricsName])),
+            f'./graph/{source.hash}/{n}grams_top{ranks}_{metricsName}.gv'
+        ))
 ```
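
The implementation of `showGraph` is not part of this diff, so the following is only a rough sketch of how a filtered matrix (with `None` for absent edges) could be turned into a left-to-right directed graph. The helper `toDot`, its `labels` parameter and the choice to print weights as edge labels are all assumptions; it emits plain DOT text rather than calling the graphviz library, so it runs without any dependency.

```python
def toDot(matrix, labels):
    """Hypothetical helper: serialise a filtered adjacency matrix as DOT text."""
    lines = ['digraph {', '  rankdir=LR;']
    # declare every node, even those left without any edge
    for label in labels:
        lines.append(f'  "{label}";')
    # one directed edge per non-None entry, from row index to column index
    for i, row in enumerate(matrix):
        for j, weight in enumerate(row):
            if weight is not None:
                lines.append(f'  "{labels[i]}" -> "{labels[j]}" [label="{weight}"];')
    lines.append('}')
    return '\n'.join(lines)

# A filtered matrix in the same shape as the output of the filters above
filtered = [
    [None, 0.5, 0.5],
    [0.2, None, None],
    [None, None, None],
]
dot = toDot(filtered, ['A', 'B', 'C'])
print(dot)
```

The resulting text can be written to a `.gv` file and rendered with the `dot` command-line tool; whether the actual `showGraph` labels edges or lays the graph out this way is not confirmed by the source.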

+%% Cell type:markdown id:78b98423 tags:
+
+We can now iterate over our parameters, as we previously did for the lexical similarities, and generate the corresponding graphs.
+
 %% Cell type:code id:3d0f3709 tags:

-``` /gnu/store/2rpsj69fzmcnafz4rml0blrynfayxqzr-python-wrapper-3.9.9/bin/python
+``` /gnu/store/fby6l226w8kh2mwkzpjpajmgy0q1kxli-python-wrapper-3.9.9/bin/python
 for n in range(1, 4):
    for ranks in [10, 50, 100]:
        for name in metrics:
-            showGraph(n, ranks, name)
+            topNGramsGraph(source, n, ranks, name)
 ```