Commit fb30c80f authored by Alice Brenon

Adding the notebook used to conduct the spectral analysis

parent 72957cfc
(use-modules ((gnu packages python-science) #:select (python-pandas))
(use-modules ((gnu packages machine-learning) #:select (python-scikit-learn))
((gnu packages python-science) #:select (python-pandas))
((gnu packages python-xyz) #:select (python-matplotlib
python-nltk
python-numpy
......@@ -27,6 +28,7 @@
python-nltk
python-numpy
python-pandas
python-scikit-learn
python-seaborn
))
(home-page "https://gitlab.liris.cnrs.fr/geode/pyedda")
......
%% Cell type:markdown id:51c9d98d tags:
# Spectral analysis
Drawing the [similarity graphs](Similarity_graphs.ipynb) was one way to explore the resemblances between domains and explain the errors made by our model, but it offers no way to handle the density of links within the graph. For this reason, we turn to linear algebra to study the dynamics of the corresponding graph.
%% Cell type:code id:c6562f65 tags:
``` /gnu/store/fby6l226w8kh2mwkzpjpajmgy0q1kxli-python-wrapper-3.9.9/bin/python
from EDdA import domains
from EDdA.data import shortDomain
from EDdA.classification import heatmap, histogram
from EDdA.store import preparePath
import numpy
import numpy.linalg as alg
import pandas
from sklearn.metrics import confusion_matrix
```
%% Cell type:markdown id:13e0d75e tags:
Let us first load the predictions data from the SGD+TF-IDF model without domain sampling. We will work on the corresponding confusion matrix.
%% Cell type:code id:8c62f0df tags:
``` /gnu/store/fby6l226w8kh2mwkzpjpajmgy0q1kxli-python-wrapper-3.9.9/bin/python
predictions_sgd_tf_idf_10k = pandas.read_csv('predictions/predictions_test_sgd_tf_idf_s10000.csv', index_col=0)
Confusion_matrix_SGD_TFIDF = confusion_matrix(predictions_sgd_tf_idf_10k['labels'], predictions_sgd_tf_idf_10k['predictions'], normalize='true')
```
%% Cell type:code id:2d22b19c tags:
``` /gnu/store/fby6l226w8kh2mwkzpjpajmgy0q1kxli-python-wrapper-3.9.9/bin/python
heatmap(Confusion_matrix_SGD_TFIDF, 'confusionMatrix/sgd_tf_idf_s10000.png')
```
%% Cell type:markdown id:3aa1d427 tags:
Since the coefficients on each row give the proportion of articles from that row's class which the model predicts to be in each column's class, they must sum to 1: the model doesn't "lose" articles, so there are as many articles after the prediction as before. Each row can thus be viewed as a probability distribution, and we say that `Confusion_matrix_SGD_TFIDF` is right-stochastic.
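As a minimal sketch of this property, the cell below normalizes a small, made-up table of counts row by row, which mirrors what `normalize='true'` does in `confusion_matrix`, and checks that the result is right-stochastic. The `counts` values and the names `row_stochastic` and `is_right_stochastic` are hypothetical and only serve the illustration.

```python
import numpy

# Hypothetical raw counts for 3 classes: counts[i, j] is the number of
# articles of true class i that the model predicted as class j.
counts = numpy.array([
    [8, 1, 1],
    [2, 6, 2],
    [0, 3, 7],
])

# Divide each row by its total, so that row i becomes the distribution
# of predictions for articles of true class i.
row_stochastic = counts / counts.sum(axis=1, keepdims=True)

# A right-stochastic matrix has non-negative coefficients and rows summing to 1.
row_sums = row_stochastic.sum(axis=1)
is_right_stochastic = bool(
    numpy.all(row_stochastic >= 0) and numpy.allclose(row_sums, 1.0)
)
print(row_sums, is_right_stochastic)
```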
%% Cell type:code id:99c148de tags:
``` /gnu/store/fby6l226w8kh2mwkzpjpajmgy0q1kxli-python-wrapper-3.9.9/bin/python
all1Column = [1]*len(Confusion_matrix_SGD_TFIDF)
numpy.dot(Confusion_matrix_SGD_TFIDF, all1Column)
```
%% Cell type:markdown id:93eba454 tags:
Hence `all1Column` is a right eigenvector of `Confusion_matrix_SGD_TFIDF`, associated with the eigenvalue 1: the cell above checks numerically that `Confusion_matrix_SGD_TFIDF × all1Column = all1Column`.
A matrix and its transpose have the same eigenvalues. Indeed, the eigenvalues are the roots of the characteristic polynomial, which is the determinant of the matrix minus a scalar multiple of the identity matrix; since a matrix and its transpose have the same determinant, they also share the same characteristic polynomial. We therefore know that `Confusion_matrix_SGD_TFIDF` must also admit a left eigenvector for this same eigenvalue, 1 (having a left eigenvector is the same as the transpose having a right eigenvector).
Such a vector would be a fixed point of `Confusion_matrix_SGD_TFIDF` and would hence represent a distribution of articles left *statistically* unchanged by the model. If we were to sample articles according to the distribution given by the vector's coefficients (no matter their absolute count, since eigenvectors are defined up to a scalar factor) and apply the model to predict the classes they belong to, the errors would on average compensate each other: the number of false negatives for, say, class geography (geography articles predicted outside this class by the model) would equal the number of false positives for the same class (articles from other classes wrongly classified as geography by the model). The number of articles to pick from each class to reach this equilibrium depends on the precision and recall of the model for that class but also on the links existing between classes. As such, it gives information about the structure of the graph, which is why computing this stable vector is one way to define a *centrality* measure.
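The equilibrium described here can be illustrated on a toy example. The sketch below uses a hypothetical 3-class right-stochastic matrix `toy` (not the real confusion matrix), obtains a left eigenvector for the eigenvalue 1 as a right eigenvector of the transpose, and checks that false negatives and false positives balance for the first class. All names in it are assumptions made for the illustration.

```python
import numpy
import numpy.linalg as alg

# Hypothetical right-stochastic matrix for 3 classes (each row sums to 1).
toy = numpy.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
])

# Left eigenvectors of toy are right eigenvectors of its transpose.
eigenvalues, eigenvectors = alg.eig(toy.T)
k = numpy.argmin(numpy.abs(eigenvalues - 1))  # index of the eigenvalue closest to 1
stationary = numpy.real(eigenvectors[:, k])
stationary = stationary / stationary.sum()    # rescale into a distribution

# Fixed point: applying the model leaves the distribution unchanged.
after = stationary @ toy

# Balance for class 0: articles leaving the class vs. articles entering it.
false_negatives = stationary[0] * (1 - toy[0, 0])
false_positives = sum(stationary[i] * toy[i, 0] for i in (1, 2))
print(stationary, false_negatives, false_positives)
```

For this toy matrix the stable distribution works out to (0.625, 0.25, 0.125), and both error flows for class 0 are equal to 0.0625.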
Gershgorin's [circle theorem](https://en.wikipedia.org/wiki/Gershgorin_circle_theorem) guarantees that all eigenvalues have a modulus lower than or equal to 1 because `Confusion_matrix_SGD_TFIDF` is right-stochastic. There could be several independent eigenvectors associated with this eigenvalue of 1, or other eigenvalues of modulus 1 corresponding to oscillations in the system. But if 1 is a simple eigenvalue and the only one of modulus 1, then the successive powers of the above matrix converge to a matrix containing the left eigenvector on each row. Iterating the matrix is therefore a convenient way to check for this stability and compute the vector at the same time.
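The bound on the eigenvalues can be checked numerically. The sketch below, again on a hypothetical 3-class matrix `toy`, computes the Gershgorin discs (one per row, centered on the diagonal coefficient, with the sum of the other coefficients on the row as radius) and compares them with the actual eigenvalue moduli. The variable names are assumptions made for the illustration.

```python
import numpy
import numpy.linalg as alg

# Hypothetical right-stochastic matrix for 3 classes (each row sums to 1).
toy = numpy.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
])

# Gershgorin discs: every eigenvalue lies in at least one of them.
centers = numpy.diag(toy)
radii = numpy.abs(toy).sum(axis=1) - numpy.abs(centers)

# With non-negative rows summing to 1, center + radius is 1 on every row,
# so each disc is contained in the unit disc.
disc_reach = centers + radii

# Eigenvalue moduli, sorted from largest to smallest.
moduli = numpy.sort(numpy.abs(alg.eigvals(toy)))[::-1]

# Powers of the matrix converge when 1 is the only eigenvalue of modulus 1
# and it is simple, i.e. when the second largest modulus is strictly below 1.
dominant_is_simple = bool(numpy.isclose(moduli[0], 1.0) and moduli[1] < 1 - 1e-9)
print(disc_reach, moduli, dominant_is_simple)
```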
%% Cell type:code id:e6e31cb6 tags:
``` /gnu/store/fby6l226w8kh2mwkzpjpajmgy0q1kxli-python-wrapper-3.9.9/bin/python
for n in [1, 2, 5, 10, 20, 50, 100, 200, 500, 1000]:
    heatmap(
        alg.matrix_power(Confusion_matrix_SGD_TFIDF, n),
        preparePath(f'iterates/m{n}.png')
    )
```
%% Cell type:markdown id:fc8d49e7 tags:
The matrices look identical from n = 200 on, so iterating converges to our centrality eigenvector, which can simply be read from any row (its coefficients are of course only approximations, but that is good enough).
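Visual inspection of the heatmaps can be complemented by a numeric check. As a minimal sketch, on a hypothetical 3-class matrix `toy` rather than the real confusion matrix, the helper `row_spread` (a name introduced here for illustration) measures how far apart the rows of a matrix power still are; a value close to 0 means that every row holds the same vector.

```python
import numpy
import numpy.linalg as alg

# Hypothetical right-stochastic matrix for 3 classes (each row sums to 1).
toy = numpy.array([
    [0.9, 0.1, 0.0],
    [0.2, 0.6, 0.2],
    [0.1, 0.3, 0.6],
])

def row_spread(matrix):
    """Largest gap, over all columns, between the rows of a matrix."""
    return float((matrix.max(axis=0) - matrix.min(axis=0)).max())

# The spread shrinks as the powers converge to a matrix with identical rows.
spreads = {n: row_spread(alg.matrix_power(toy, n)) for n in [1, 10, 100, 1000]}

# Once the rows coincide, any of them approximates the stable vector.
limit_row = alg.matrix_power(toy, 1000)[0]
print(spreads, limit_row)
```

The same helper could be applied to `alg.matrix_power(Confusion_matrix_SGD_TFIDF, n)` to choose the number of iterations from a tolerance instead of by eye.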
%% Cell type:code id:2c04c29f tags:
``` /gnu/store/fby6l226w8kh2mwkzpjpajmgy0q1kxli-python-wrapper-3.9.9/bin/python
centrality = pandas.DataFrame(
    zip(map(shortDomain, domains), alg.matrix_power(Confusion_matrix_SGD_TFIDF, 1000)[0]),
    columns=['class', 'value']
).sort_values(by='value', ascending=False)
```
%% Cell type:code id:e343fa0e tags:
``` /gnu/store/fby6l226w8kh2mwkzpjpajmgy0q1kxli-python-wrapper-3.9.9/bin/python
centrality
```
%% Cell type:code id:e3c91698 tags:
``` /gnu/store/fby6l226w8kh2mwkzpjpajmgy0q1kxli-python-wrapper-3.9.9/bin/python
histogram(centrality['class'], centrality.value, 'result/centralities_distribution.png')
```