{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "51c9d98d",
   "metadata": {},
   "source": [
    "# Spectral analysis\n",
    "\n",
    "Drawing the [similarity graphs](Similarity_graphs.ipynb) was one way to explore the resemblances between domains and explain the errors made by our model, but it offers no way to handle the density of links within the graph. For this reason, we turn to linear algebra to study the dynamics of the corresponding graph."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c6562f65",
   "metadata": {},
   "outputs": [],
   "source": [
    "from EDdA import domains\n",
    "from EDdA.data import shortDomain\n",
    "from EDdA.classification import heatmap, histogram\n",
    "from EDdA.store import preparePath\n",
    "import numpy\n",
    "import numpy.linalg as alg\n",
    "import pandas\n",
    "from sklearn.metrics import confusion_matrix"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "13e0d75e",
   "metadata": {},
   "source": [
    "Let us first load the predictions data from the SGD+TF-IDF model without domain sampling. We will work on the corresponding confusion matrix."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8c62f0df",
   "metadata": {},
   "outputs": [],
   "source": [
    "predictions_sgd_tf_idf_10k = pandas.read_csv('predictions/predictions_test_sgd_tf_idf_s10000.csv', index_col=0)\n",
    "Confusion_matrix_SGD_TFIDF = confusion_matrix(predictions_sgd_tf_idf_10k['labels'], predictions_sgd_tf_idf_10k['predictions'], normalize='true')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2d22b19c",
   "metadata": {},
   "outputs": [],
   "source": [
    "heatmap(Confusion_matrix_SGD_TFIDF, 'confusionMatrix/sgd_tf_idf_s10000.png')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3aa1d427",
   "metadata": {},
   "source": [
    "Each coefficient on a row is the proportion of articles from that row's class that the model predicts to be in the column's class. The coefficients of a row must therefore sum to 1: the model doesn't \"lose\" articles, there are as many articles before and after the prediction. In a sense, each row can be viewed as a probability distribution, and we say that `Confusion_matrix_SGD_TFIDF` is right-stochastic."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "99c148de",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Multiplying by an all-ones column sums the coefficients of each row\n",
    "all1Column = [1]*len(Confusion_matrix_SGD_TFIDF)\n",
    "numpy.dot(Confusion_matrix_SGD_TFIDF, all1Column)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "93eba454",
   "metadata": {},
   "source": [
    "Hence `all1Column` is a right eigenvector of `Confusion_matrix_SGD_TFIDF`, associated with the eigenvalue 1, because we've just checked that `Confusion_matrix_SGD_TFIDF × all1Column = all1Column`.\n",
    "\n",
    "A matrix and its transpose have the same eigenvalues. Eigenvalues are the roots of the characteristic polynomial, which is the determinant of a linear combination of the matrix and the identity matrix; transposing that combination leaves the identity unchanged, and a matrix and its transpose have the same determinant, so both share the same characteristic polynomial. We therefore know that `Confusion_matrix_SGD_TFIDF` must also admit a left eigenvector for this same eigenvalue, 1 (having a left eigenvector is the same as the transpose having a right eigenvector).\n",
    "\n",
    "Such a vector would be a fixed point of `Confusion_matrix_SGD_TFIDF` and would hence represent a distribution of articles left *statistically* unchanged by the model. If we were to sample articles according to the distribution given by the vector's coefficients (no matter their absolute count, since eigenvectors are defined up to a scalar factor) and apply the model to predict the classes they belong to, the errors would on average cancel out: the number of false negatives from, say, class geography (geography articles predicted outside this class by the model) would equal the number of false positives for the same class (articles from other classes wrongly classified as geography by the model). The number of articles to pick from each class to reach this equilibrium depends on the precision and recall of the model for that class, but also on the links existing between classes. As such, it gives information about the structure of the graph, which is why computing this stable vector is one way to define a *centrality* measure.\n",
    "\n",
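    "To make this balance explicit, here is a short derivation. It writes $P$ for `Confusion_matrix_SGD_TFIDF` and $\\pi$ for the left eigenvector, two symbols introduced here only for convenience, with the coefficients of $\\pi$ normalized to sum to 1. The fixed-point condition $\\pi P = \\pi$ reads, for each class $j$:\n",
    "\n",
    "$$\\pi_j = \\sum_i \\pi_i P_{ij} \\quad\\Longleftrightarrow\\quad \\pi_j (1 - P_{jj}) = \\sum_{i \\neq j} \\pi_i P_{ij}$$\n",
    "\n",
    "Because each row of $P$ sums to 1, the left-hand side of the second equality is the share of articles leaving class $j$ (its false negatives), while the right-hand side is the share of articles entering it from other classes (its false positives).\n",
    "\n",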
    "Gershgorin's [circle theorem](https://en.wikipedia.org/wiki/Gershgorin_circle_theorem) guarantees that all eigenvalues have a modulus lower than or equal to 1 because `Confusion_matrix_SGD_TFIDF` is right-stochastic. There could be several linearly independent eigenvectors associated with the eigenvalue 1 (when several groups of classes are closed, meaning that no article ever leaves them), or other eigenvalues of modulus 1, corresponding to oscillations in the system. But if 1 is a simple eigenvalue and the only one of modulus 1, then the successive powers of the above matrix converge to a matrix containing the left eigenvector on each row. Iterating is thus a convenient way to check for this stability and compute the vector at the same time."
   ]
  },
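  {
   "cell_type": "markdown",
   "id": "b4d7a1c9",
   "metadata": {},
   "source": [
    "The bound given by the circle theorem can be spelled out as follows (a sketch, writing $P$ for `Confusion_matrix_SGD_TFIDF`, a notation introduced here only for brevity). Every eigenvalue $\\lambda$ of $P$ lies in at least one disc centered on a diagonal coefficient $P_{ii}$, whose radius is the sum of the other coefficients of row $i$. Since the coefficients are non-negative and each row sums to 1, that radius is $1 - P_{ii}$, hence:\n",
    "\n",
    "$$|\\lambda - P_{ii}| \\leq \\sum_{j \\neq i} |P_{ij}| = 1 - P_{ii} \\quad\\Longrightarrow\\quad |\\lambda| \\leq |\\lambda - P_{ii}| + P_{ii} \\leq 1$$"
   ]
  },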
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e6e31cb6",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Render successive powers of the matrix to watch them stabilize\n",
    "for n in [1, 2, 5, 10, 20, 50, 100, 200, 500, 1000]:\n",
    "    heatmap(\n",
    "        alg.matrix_power(Confusion_matrix_SGD_TFIDF, n),\n",
    "        preparePath(f'iterates/m{n}.png')\n",
    "    )"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fc8d49e7",
   "metadata": {},
   "source": [
    "The matrices look identical from the 200th power on, so the iteration converges and our centrality eigenvector can simply be read from any row (its coefficients are of course only approximations, but that's good enough)."
   ]
  },
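  {
   "cell_type": "markdown",
   "id": "c2e8f5a0",
   "metadata": {},
   "source": [
    "One way to see why the powers settle is the following sketch. It assumes that the matrix, written $P$ here, is diagonalizable, with eigenvalues $\\lambda_1 = 1, \\lambda_2, \\dots, \\lambda_k$ counted with multiplicity, right eigenvectors $v_i$ and left eigenvectors $w_i$ chosen so that $w_i^\\top v_j = 1$ if $i = j$ and $0$ otherwise. Neither this notation nor the diagonalizability assumption comes from the computations above. Then:\n",
    "\n",
    "$$P^n = \\sum_{i=1}^{k} \\lambda_i^n \\, v_i w_i^\\top = v_1 w_1^\\top + \\sum_{i=2}^{k} \\lambda_i^n \\, v_i w_i^\\top$$\n",
    "\n",
    "If $|\\lambda_i| < 1$ for all $i \\geq 2$, the second sum vanishes as $n$ grows, at a speed set by the largest of these moduli. Since $v_1$ is the all-ones column, every row of the limit $v_1 w_1^\\top$ is the left eigenvector $w_1$, with coefficients summing to 1."
   ]
  },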
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "2c04c29f",
   "metadata": {},
   "outputs": [],
   "source": [
    "# Row 0 of a high power of the matrix approximates the centrality vector\n",
    "centrality = pandas.DataFrame(\n",
    "    zip(map(shortDomain, domains), alg.matrix_power(Confusion_matrix_SGD_TFIDF, 1000)[0]),\n",
    "    columns=['class', 'value']\n",
    ").sort_values(by='value', ascending=False)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e343fa0e",
   "metadata": {},
   "outputs": [],
   "source": [
    "centrality"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "e3c91698",
   "metadata": {},
   "outputs": [],
   "source": [
    "histogram(centrality['class'], centrality.value, 'result/centralities_distribution.png')"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3",
   "language": "/gnu/store/fby6l226w8kh2mwkzpjpajmgy0q1kxli-python-wrapper-3.9.9/bin/python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.9"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}