Adding the notebook used to conduct the spectral analysis

Commit fb30c80f authored 3 years ago by Alice Brenon
--- a/guix.scm
+++ b/guix.scm
-(use-modules ((gnu packages python-science) #:select (python-pandas))
+(use-modules ((gnu packages machine-learning) #:select (python-scikit-learn))
+             ((gnu packages python-science) #:select (python-pandas))
             ((gnu packages python-xyz) #:select (python-matplotlib
                                                  python-nltk
                                                  python-numpy
@@ -27,6 +28,7 @@
            python-nltk
            python-numpy
            python-pandas
+            python-scikit-learn
            python-seaborn
            ))
    (home-page "https://gitlab.liris.cnrs.fr/geode/pyedda")

--- a/notebooks/Spectral_analysis.ipynb
+++ b/notebooks/Spectral_analysis.ipynb
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "id": "51c9d98d",
+   "metadata": {},
+   "source": [
+    "# Spectral analysis\n",
+    "\n",
+    "Drawing the [similarity graphs](Similarity_graphs.ipynb) was one way to explore the resemblances between domains and explain the errors made by our model but it lacks a way to handle the density of links within the graph. For this reason, we turn to algebra to study the dynamics of the corresponding graph."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "c6562f65",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from EDdA import domains\n",
+    "from EDdA.data import shortDomain\n",
+    "from EDdA.classification import heatmap, histogram\n",
+    "from EDdA.store import preparePath\n",
+    "import numpy\n",
+    "import numpy.linalg as alg\n",
+    "import pandas\n",
+    "from sklearn.metrics import confusion_matrix"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "13e0d75e",
+   "metadata": {},
+   "source": [
+    "Let us first load the predictions data from the SGD+TF-IDF model without domain sampling. We will work on the corresponding confusion matrix."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "8c62f0df",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "predictions_sgd_tf_idf_10k = pandas.read_csv('predictions/predictions_test_sgd_tf_idf_s10000.csv', index_col=0)\n",
+    "Confusion_matrix_SGD_TFIDF = confusion_matrix(predictions_sgd_tf_idf_10k['labels'], predictions_sgd_tf_idf_10k['predictions'], normalize='true')"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "2d22b19c",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "heatmap(Confusion_matrix_SGD_TFIDF, 'confusionMatrix/sgd_tf_idf_s10000.png')"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "3aa1d427",
+   "metadata": {},
+   "source": [
+    "Since the coefficients on each row represent the proportion of articles from that row's class which the model predicts to be in the class of the corresponding column, they must sum to 1: the model doesn't \"lose\" articles, there are as many articles before and after the prediction. In a sense, each row can be viewed as a probability distribution; we say that `Confusion_matrix_SGD_TFIDF` is right-stochastic."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "99c148de",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "all1Column = [1]*len(Confusion_matrix_SGD_TFIDF)\n",
+    "numpy.dot(Confusion_matrix_SGD_TFIDF, all1Column)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "93eba454",
+   "metadata": {},
+   "source": [
+    "Hence `all1Column` is a right eigenvector of `Confusion_matrix_SGD_TFIDF`, associated with the eigenvalue 1, because we've just checked that `Confusion_matrix_SGD_TFIDF × all1Column = all1Column`.\n",
+    "\n",
+    "A matrix and its transpose have the same eigenvalues: the eigenvalues are the roots of the characteristic polynomial, which is the determinant of a linear combination of the matrix and the identity matrix, and a matrix and its transpose have the same determinant. Hence `Confusion_matrix_SGD_TFIDF` must also admit a left eigenvector for this same eigenvalue, 1 (having a left eigenvector is the same as the transpose having a right eigenvector).\n",
+    "\n",
+    "Such a vector would be a fixed point of `Confusion_matrix_SGD_TFIDF` and would hence represent a distribution of articles left *statistically* unchanged by the model. If we were to sample articles according to the distribution given by the vector's coefficients (no matter their absolute count, since eigenvectors are defined up to a scalar factor) and apply the model to predict the classes they belong to, the errors would on average compensate each other: the number of false negatives from, say, class geography (geography articles predicted outside this class by the model) would equal the number of false positives for the same class (articles from other classes wrongly classified as geography by the model). The number of articles to pick from each class to reach this equilibrium depends on the precision and recall of the model for that class but also on the links existing between classes. As such, it gives information about the structure of the graph, which is why computing this stable vector is one way to define a *centrality* measure.\n",
+    "\n",
+    "Gershgorin's [circle theorem](https://en.wikipedia.org/wiki/Gershgorin_circle_theorem) guarantees that no eigenvalue has a modulus greater than 1, because `Confusion_matrix_SGD_TFIDF` is right-stochastic. There could be several independent eigenvectors associated with the eigenvalue 1, or other eigenvalues of modulus 1 corresponding to oscillations in the system, but if 1 is a simple eigenvalue and the only one of modulus 1, then taking successive powers of the above matrix converges to a matrix containing the left eigenvector (normalized to sum to 1) on each row. This is a convenient way to check for this stability and compute the vector at the same time."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e6e31cb6",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "for n in [1, 2, 5, 10, 20, 50, 100, 200, 500, 1000]:\n",
+    "    heatmap(\n",
+    "            alg.matrix_power(Confusion_matrix_SGD_TFIDF, n),\n",
+    "            preparePath(f'iterates/m{n}.png')\n",
+    "        )"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "fc8d49e7",
+   "metadata": {},
+   "source": [
+    "The matrices look identical from the 200th power onward, so iterating converges to our centrality eigenvector, which can simply be read from any row (its coefficients are of course only approximations, but that's good enough)."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "2c04c29f",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "centrality = pandas.DataFrame(\n",
+    "    zip(map(shortDomain, domains), alg.matrix_power(Confusion_matrix_SGD_TFIDF, 1000)[0]),\n",
+    "    columns=['class', 'value']\n",
+    ").sort_values(by='value', ascending=False)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e343fa0e",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "centrality"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "e3c91698",
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "histogram(centrality['class'], centrality.value, 'result/centralities_distribution.png')"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "/gnu/store/fby6l226w8kh2mwkzpjpajmgy0q1kxli-python-wrapper-3.9.9/bin/python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.9.9"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 5
+}
+%% Cell type:markdown id:51c9d98d tags:
+
+# Spectral analysis
+
+Drawing the [similarity graphs](Similarity_graphs.ipynb) was one way to explore the resemblances between domains and explain the errors made by our model but it lacks a way to handle the density of links within the graph. For this reason, we turn to algebra to study the dynamics of the corresponding graph.
+
+%% Cell type:code id:c6562f65 tags:
+
+``` /gnu/store/fby6l226w8kh2mwkzpjpajmgy0q1kxli-python-wrapper-3.9.9/bin/python
+from EDdA import domains
+from EDdA.data import shortDomain
+from EDdA.classification import heatmap, histogram
+from EDdA.store import preparePath
+import numpy
+import numpy.linalg as alg
+import pandas
+from sklearn.metrics import confusion_matrix
+```
+
+%% Cell type:markdown id:13e0d75e tags:
+
+Let us first load the predictions data from the SGD+TF-IDF model without domain sampling. We will work on the corresponding confusion matrix.
+
+%% Cell type:code id:8c62f0df tags:
+
+``` /gnu/store/fby6l226w8kh2mwkzpjpajmgy0q1kxli-python-wrapper-3.9.9/bin/python
+predictions_sgd_tf_idf_10k = pandas.read_csv('predictions/predictions_test_sgd_tf_idf_s10000.csv', index_col=0)
+Confusion_matrix_SGD_TFIDF = confusion_matrix(predictions_sgd_tf_idf_10k['labels'], predictions_sgd_tf_idf_10k['predictions'], normalize='true')
+```
+
+%% Cell type:code id:2d22b19c tags:
+
+``` /gnu/store/fby6l226w8kh2mwkzpjpajmgy0q1kxli-python-wrapper-3.9.9/bin/python
+heatmap(Confusion_matrix_SGD_TFIDF, 'confusionMatrix/sgd_tf_idf_s10000.png')
+```
+
+%% Cell type:markdown id:3aa1d427 tags:
+
+Since the coefficients on each row represent the proportion of articles from that row's class which the model predicts to be in the class of the corresponding column, they must sum to 1: the model doesn't "lose" articles, there are as many articles before and after the prediction. In a sense, each row can be viewed as a probability distribution; we say that `Confusion_matrix_SGD_TFIDF` is right-stochastic.
+
+%% Cell type:code id:99c148de tags:
+
+``` /gnu/store/fby6l226w8kh2mwkzpjpajmgy0q1kxli-python-wrapper-3.9.9/bin/python
+all1Column = [1]*len(Confusion_matrix_SGD_TFIDF)
+numpy.dot(Confusion_matrix_SGD_TFIDF, all1Column)
+```
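
The same check can be made tolerant to floating-point rounding. The following is only a minimal sketch on a hypothetical 3-class table of counts (`toy_counts` and its values are invented for illustration and are not taken from the predictions file); dividing each row by its total mirrors what `normalize='true'` does in `confusion_matrix`.

```python
import numpy

# Hypothetical raw counts: rows are true classes, columns are predicted classes.
toy_counts = numpy.array([
    [8, 1, 1],
    [2, 6, 2],
    [0, 3, 7],
])

# Normalizing each row by its total turns it into a probability distribution.
toy_matrix = toy_counts / toy_counts.sum(axis=1, keepdims=True)

# Multiplying by the all-ones column must give back the all-ones column.
ones = numpy.ones(len(toy_matrix))
product = toy_matrix @ ones
is_right_stochastic = bool(numpy.allclose(product, ones))
print(product, is_right_stochastic)
```

Comparing with `numpy.allclose` rather than `==` avoids false alarms when a row sums to something like 0.9999999999999999.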
+
+%% Cell type:markdown id:93eba454 tags:
+
+Hence `all1Column` is a right eigenvector of `Confusion_matrix_SGD_TFIDF`, associated with the eigenvalue 1, because we've just checked that `Confusion_matrix_SGD_TFIDF × all1Column = all1Column`.
+
+A matrix and its transpose have the same eigenvalues: the eigenvalues are the roots of the characteristic polynomial, which is the determinant of a linear combination of the matrix and the identity matrix, and a matrix and its transpose have the same determinant. Hence `Confusion_matrix_SGD_TFIDF` must also admit a left eigenvector for this same eigenvalue, 1 (having a left eigenvector is the same as the transpose having a right eigenvector).
+
+Such a vector would be a fixed point of `Confusion_matrix_SGD_TFIDF` and would hence represent a distribution of articles left *statistically* unchanged by the model. If we were to sample articles according to the distribution given by the vector's coefficients (no matter their absolute count, since eigenvectors are defined up to a scalar factor) and apply the model to predict the classes they belong to, the errors would on average compensate each other: the number of false negatives from, say, class geography (geography articles predicted outside this class by the model) would equal the number of false positives for the same class (articles from other classes wrongly classified as geography by the model). The number of articles to pick from each class to reach this equilibrium depends on the precision and recall of the model for that class but also on the links existing between classes. As such, it gives information about the structure of the graph, which is why computing this stable vector is one way to define a *centrality* measure.
+
+Gershgorin's [circle theorem](https://en.wikipedia.org/wiki/Gershgorin_circle_theorem) guarantees that no eigenvalue has a modulus greater than 1, because `Confusion_matrix_SGD_TFIDF` is right-stochastic. There could be several independent eigenvectors associated with the eigenvalue 1, or other eigenvalues of modulus 1 corresponding to oscillations in the system, but if 1 is a simple eigenvalue and the only one of modulus 1, then taking successive powers of the above matrix converges to a matrix containing the left eigenvector (normalized to sum to 1) on each row. This is a convenient way to check for this stability and compute the vector at the same time.
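
To make the equilibrium concrete, here is a small sketch on a hypothetical 3-class right-stochastic matrix (`toy_matrix` is invented for illustration, it is not the notebook's confusion matrix). It computes the left eigenvector directly from the transpose with `numpy.linalg.eig`, which is an alternative to the approach used in this notebook, and checks that false negatives and false positives balance for one class.

```python
import numpy
import numpy.linalg as alg

# Hypothetical right-stochastic matrix (each row sums to 1), for illustration only.
toy_matrix = numpy.array([
    [0.7, 0.2, 0.1],
    [0.3, 0.6, 0.1],
    [0.2, 0.2, 0.6],
])

# Left eigenvectors of a matrix are the right eigenvectors of its transpose.
eigenvalues, eigenvectors = alg.eig(toy_matrix.T)
closest_to_one = numpy.argmin(numpy.abs(eigenvalues - 1))
stable = numpy.real(eigenvectors[:, closest_to_one])
stable = stable / stable.sum()  # rescale so that the coefficients sum to 1

# Fixed point: sampling classes according to `stable` and applying the
# matrix gives back the same distribution.
print(numpy.allclose(stable @ toy_matrix, stable))

# Balance for class 0: share of articles predicted out of it
# versus share of articles from other classes predicted into it.
false_negatives = stable[0] * (1 - toy_matrix[0, 0])
false_positives = stable @ toy_matrix[:, 0] - stable[0] * toy_matrix[0, 0]
print(numpy.isclose(false_negatives, false_positives))
```

The rescaling step relies on eigenvectors being defined up to a scalar factor: dividing by the sum picks the representative that reads as a probability distribution.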
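
Both the bound on the eigenvalues and the uniqueness condition can be inspected numerically. The sketch below uses two hypothetical matrices invented for illustration: `toy_matrix`, whose powers settle, and `swap`, a permutation matrix whose powers oscillate forever because -1 is also an eigenvalue of modulus 1.

```python
import numpy
import numpy.linalg as alg

# Hypothetical right-stochastic matrix, for illustration only.
toy_matrix = numpy.array([
    [0.7, 0.2, 0.1],
    [0.3, 0.6, 0.1],
    [0.2, 0.2, 0.6],
])

moduli = numpy.abs(alg.eigvals(toy_matrix))

# Gershgorin's bound: no eigenvalue lies outside the unit disc
# (a small tolerance absorbs floating-point rounding).
all_in_unit_disc = bool((moduli <= 1 + 1e-12).all())

# The powers converge when exactly one eigenvalue has modulus 1.
on_unit_circle = int(numpy.isclose(moduli, 1).sum())
print(all_in_unit_disc, on_unit_circle)

# Counter-example: swapping two classes is right-stochastic too,
# but its eigenvalues are 1 and -1, so the powers alternate.
swap = numpy.array([[0.0, 1.0], [1.0, 0.0]])
swap_on_unit_circle = int(numpy.isclose(numpy.abs(alg.eigvals(swap)), 1).sum())
print(swap_on_unit_circle)
print(alg.matrix_power(swap, 2))
print(alg.matrix_power(swap, 3))
```

Counting the eigenvalues on the unit circle before iterating tells whether the heatmaps of successive powers can be expected to stabilize at all.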
+
+%% Cell type:code id:e6e31cb6 tags:
+
+``` /gnu/store/fby6l226w8kh2mwkzpjpajmgy0q1kxli-python-wrapper-3.9.9/bin/python
+for n in [1, 2, 5, 10, 20, 50, 100, 200, 500, 1000]:
+    heatmap(
+            alg.matrix_power(Confusion_matrix_SGD_TFIDF, n),
+            preparePath(f'iterates/m{n}.png')
+        )
+```
+
+%% Cell type:markdown id:fc8d49e7 tags:
+
+The matrices look identical from the 200th power onward, so iterating converges to our centrality eigenvector, which can simply be read from any row (its coefficients are of course only approximations, but that's good enough).
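
Convergence can also be measured instead of judged by eye on the heatmaps. In this sketch, `row_spread` is a hypothetical helper (not part of `EDdA`) and `toy_matrix` is an invented 3-class matrix; the spread is the largest gap, within a column, between the coefficients of any two rows, so it drops to 0 when all rows carry the same vector.

```python
import numpy
import numpy.linalg as alg

# Hypothetical right-stochastic matrix, for illustration only.
toy_matrix = numpy.array([
    [0.7, 0.2, 0.1],
    [0.3, 0.6, 0.1],
    [0.2, 0.2, 0.6],
])

def row_spread(matrix):
    """Largest gap between two rows' coefficients within the same column."""
    return float((matrix.max(axis=0) - matrix.min(axis=0)).max())

# The spread shrinks as the power grows when the iteration converges.
for n in [1, 2, 5, 10, 20, 50]:
    print(n, row_spread(alg.matrix_power(toy_matrix, n)))

# Once the spread is negligible, any row can be read as the stable vector.
limit = alg.matrix_power(toy_matrix, 50)
stable = limit[0]
print(stable)
```

Applied to the real confusion matrix, the same quantity would give a numeric threshold for deciding from which power the rows can be trusted.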
+
+%% Cell type:code id:2c04c29f tags:
+
+``` /gnu/store/fby6l226w8kh2mwkzpjpajmgy0q1kxli-python-wrapper-3.9.9/bin/python
+centrality = pandas.DataFrame(
+    zip(map(shortDomain, domains), alg.matrix_power(Confusion_matrix_SGD_TFIDF, 1000)[0]),
+    columns=['class', 'value']
+).sort_values(by='value', ascending=False)
+```
+
+%% Cell type:code id:e343fa0e tags:
+
+``` /gnu/store/fby6l226w8kh2mwkzpjpajmgy0q1kxli-python-wrapper-3.9.9/bin/python
+centrality
+```
+
+%% Cell type:code id:e3c91698 tags:
+
+``` /gnu/store/fby6l226w8kh2mwkzpjpajmgy0q1kxli-python-wrapper-3.9.9/bin/python
+histogram(centrality['class'], centrality.value, 'result/centralities_distribution.png')
+```