Similarity_graphs.ipynb

{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "45d83498",
   "metadata": {},
   "source": [
    "# Similarity graphs\n",
    "\n",
    "Looking at the confusion matrices on the [lexical features](Lexical_similarities.ipynb) provides some interesting insight, but it is hard to follow the flows across the whole graph given the quadratic number of edges involved.\n",
    "\n",
    "We keep the same nodes but draw only the subgraph made of the strongest edge(s) going from each node to a different node. Edges from a node to itself are ignored because they carry no information: their value is always 100%, since there is no point comparing the n-grams of a domain with themselves. In the resulting graph, there is therefore an edge from a source node to a target node if the top-ranking n-grams of the source domain are most similar to those of the target domain."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fc7a6e69",
   "metadata": {},
   "outputs": [],
   "source": [
    "from EDdA import data\n",
    "from EDdA.classification import confusionMatrix, metrics, showGraph, topNGrams\n",
    "from IPython.display import Image"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "f3bf7fe2",
   "metadata": {},
   "source": [
    "We first load the articles."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "f49c39b5",
   "metadata": {},
   "outputs": [],
   "source": [
    "source = data.load('training_set')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8e0b222a",
   "metadata": {},
   "source": [
    "We implement the relation described above as the function `keepOnlyNearest`. Note that when a node's n-grams are exactly as similar to the n-grams of several other nodes, all the corresponding edges are drawn: contrary to what the description above may suggest, several edges may leave one node. Conversely, a node whose similarity with every other node is zero gets no outgoing edge."
   ]
  },
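  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3a37bfa1",
   "metadata": {},
   "outputs": [],
   "source": [
    "def keepOnlyNearest(matrix):\n",
    "    # Keep, for each row, only the strongest off-diagonal link(s); None means no edge.\n",
    "    m = []\n",
    "    dimension = len(matrix)\n",
    "    for i in range(dimension):\n",
    "        # Strongest similarity from node i to any other node (0 if there is no other node).\n",
    "        link = max((matrix[i][j] for j in range(dimension) if j != i), default=0)\n",
    "        if link > 0:\n",
    "            # Ties are all kept, so several edges may leave node i.\n",
    "            m.append([link if i != j and matrix[i][j] == link else None for j in range(dimension)])\n",
    "        else:\n",
    "            m.append([None] * dimension)\n",
    "    return m"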
   ]
  },
  {
   "cell_type": "markdown",
   "id": "8611d521",
   "metadata": {},
   "source": [
    "To assess the relevance of the previous function, we define another adjacency matrix filter for comparison: this time we keep every edge between two different nodes as long as their similarity is non-zero, no matter how small (the \"no-loops\" constraint still applies)."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ebc4db15",
   "metadata": {},
   "outputs": [],
   "source": [
    "def allNotNull(matrix):\n",
    "    dimension = len(matrix)\n",
    "    m = []\n",
    "    for i in range(0, dimension):\n",
    "        m.append([])\n",
    "        for j in range(0, len(matrix[i])):\n",
    "            link = matrix[i][j]\n",
    "            m[i].append(link if j != i and link > 0 else None)\n",
    "    return m"
   ]
  },
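  {
   "cell_type": "markdown",
   "id": "toy-filters-text",
   "metadata": {},
   "source": [
    "To make the difference between the two filters concrete, the cell below applies both of them to a small hand-made matrix. This is only a sketch: the matrix `toy` and its values are hypothetical and do not come from the corpus. Row 0 contains a tie between two targets, and row 2 has no similarity with any other node. The expected output is given in the comments."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "toy-filters-code",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Hypothetical similarity scores (in %) between four made-up domains;\n",
    "# these values are illustrative only and were not computed from the articles.\n",
    "toy = [\n",
    "    [100, 40, 40, 10],  # row 0: tie between columns 1 and 2\n",
    "    [30, 100, 0, 20],\n",
    "    [0, 0, 100, 0],     # row 2: nothing in common with the other nodes\n",
    "    [5, 60, 15, 100],\n",
    "]\n",
    "\n",
    "# Only the strongest off-diagonal value(s) of each row survive.\n",
    "for row in keepOnlyNearest(toy):\n",
    "    print(row)\n",
    "# [None, 40, 40, None]\n",
    "# [30, None, None, None]\n",
    "# [None, None, None, None]\n",
    "# [None, 60, None, None]\n",
    "\n",
    "print()\n",
    "\n",
    "# Every non-zero off-diagonal value survives.\n",
    "for row in allNotNull(toy):\n",
    "    print(row)\n",
    "# [None, 40, 40, 10]\n",
    "# [30, None, None, 20]\n",
    "# [None, None, None, None]\n",
    "# [5, 60, 15, None]"
   ]
  },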
  {
   "cell_type": "markdown",
   "id": "f7cc5565",
   "metadata": {},
   "source": [
    "The two functions above provide filters that transform the adjacency matrix of a graph before it is rendered with our graphviz-based primitive `showGraph`. The helper below applies `keepOnlyNearest`."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ebe2245e",
   "metadata": {},
   "outputs": [],
   "source": [
    "def topNGramsGraph(source, n, ranks, metricsName):\n",
    "    return Image(showGraph(\n",
    "            keepOnlyNearest(confusionMatrix(topNGrams(source, n, ranks), metrics[metricsName])),\n",
    "            f'./graph/{source.hash}/{n}grams_top{ranks}_{metricsName}.gv'\n",
    "        ))"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "78b98423",
   "metadata": {},
   "source": [
    "We can now iterate over our parameters, as we did previously for the lexical similarities, and generate the corresponding graphs."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "3d0f3709",
   "metadata": {},
   "outputs": [],
   "source": [
    "for n in range(1, 4):\n",
    "    for ranks in [10, 50, 100]:\n",
    "        for name in metrics:\n",
    "            topNGramsGraph(source, n, ranks, name)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "/gnu/store/fby6l226w8kh2mwkzpjpajmgy0q1kxli-python-wrapper-3.9.9/bin/python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}