Commit d7e9f12b authored by Alice Brenon

Finish the analysis of scalar products vs. common elements count

parent f24611d4
%% Cell type:markdown id:3401c7f3 tags:
# So what's wrong with scalar product vs. elements intersection?
A short notebook to explain the "shift" in class comparisons obtained from [that other notebook](Confusion%20Matrix.ipynb). Looking at the matrices produced, we observe that the scalar product, which intuitively seems a more restrictive metric than the elements intersection (counting the number of elements in the intersection of two sets can be viewed as a scalar product of vectors where all elements in each set are associated with the integer `1`, so you would expect the more nuanced scalar product to be systematically lower than that all-or-nothing metric), sometimes makes classes appear more similar than computing the number of n-grams they have in common. Let's see why, starting from an illustration of the issue at stake:
![The pairs (Belles-lettres / Philosophie) and (Architecture / Métiers)](../img/colinearity_vs_keys_intersections.png)
Both pictures are confusion matrices comparing the top 10 2-grams of each domain in the EDdA. The one on the left merely counts the number of 2-grams common to both classes, whereas the one on the right uses the scalar product computed on the vectors obtained for each domain by associating to each of its most frequent 2-grams its number of occurrences.
The cells circled in green, at the intersection between Belles-lettres and Philosophie, present the expected behaviour: the cell is darker on the left, where the elements are simply counted, and appears lighter when the matrix is computed with the metric derived from the scalar product. The pair circled in purple (Architecture / Métiers), however, is counter-intuitive: the scalar product method makes them appear closer than they were with the elements intersection method. To understand how this is possible, let's look at the top 10 2-grams of the classes involved.
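Before diving into the real data, the inversion can be reproduced on toy vectors. This is a self-contained sketch (the dictionaries and helper names are illustrative, not part of the EDdA library): counting shared keys ranks pair A above pair B, while the normalized scalar product ranks them the other way around.

```python
import math

def intersection_count(u, v):
    """All-or-nothing metric: how many keys two sparse count vectors share."""
    return len(set(u) & set(v))

def cosine(u, v):
    """Normalized scalar product: dot product over the shared keys,
    divided by the product of the Euclidean norms."""
    dot = sum(u[k] * v[k] for k in set(u) & set(v))
    norm = lambda w: math.sqrt(sum(x * x for x in w.values()))
    return dot / (norm(u) * norm(v))

# Pair A: all four keys are shared, but the counts are "in phase opposition".
a1 = {'p': 500, 'q': 60, 'r': 55, 's': 50}
a2 = {'p': 80, 'q': 190, 'r': 130, 's': 110}

# Pair B: only two shared keys, but they dominate both vectors
# with nearly proportional counts.
b1 = {'x': 320, 'y': 190, 'u': 40, 'v': 30}
b2 = {'x': 630, 'y': 440, 'w': 160, 'z': 140}
```

With these toys, `intersection_count` favours pair A (4 vs. 2 shared keys) while `cosine` favours pair B, which is exactly the "shift" observed in the matrices above.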
%% Cell type:code id:7cf51dd1 tags:
``` /gnu/store/2rpsj69fzmcnafz4rml0blrynfayxqzr-python-wrapper-3.9.9/bin/python
from EDdA import data
from EDdA.classification import topNGrams
```
%% Cell type:code id:30dfceaa tags:
``` /gnu/store/2rpsj69fzmcnafz4rml0blrynfayxqzr-python-wrapper-3.9.9/bin/python
source = data.load('training_set')
top10_2grams = topNGrams(source, 2, 10)
architecture = 'Architecture'
belles_lettres = 'Belles-lettres - Poésie'
metiers = 'Métiers'
philo = 'Philosophie'
```
%% Cell type:markdown id:b6f895c4 tags:
We have everything ready to display the vectors. Let's start with Belles-lettres:
%% Cell type:code id:ee7a9ab9 tags:
``` /gnu/store/2rpsj69fzmcnafz4rml0blrynfayxqzr-python-wrapper-3.9.9/bin/python
top10_2grams(belles_lettres)
```
%% Output
{('d.', 'j.'): 485,
 ('s.', 'm.'): 196,
 ('s.', 'f.'): 145,
 ('chez', 'romain'): 71,
 ('a', 'point'): 67,
 ('-t', '-il'): 62,
 ('1', 'degré'): 58,
 ('grand', 'nombre'): 57,
 ('sans', 'doute'): 54,
 ('sou', 'nom'): 54}
%% Cell type:markdown id:3dcc0f86 tags:
We first notice the occurrences of the 2-grams ('s.', 'm.') and ('s.', 'f.'), for "substantif masculin" and "substantif féminin", which occur very frequently at the beginning of articles. Those are of course unwanted, irrelevant bigrams, but they haven't been filtered out (yet?). Let's look at the Philosophie domain now:
%% Cell type:code id:d3397e87 tags:
``` /gnu/store/2rpsj69fzmcnafz4rml0blrynfayxqzr-python-wrapper-3.9.9/bin/python
top10_2grams(philo)
```
%% Output
{('a', 'point'): 191,
 ('s.', 'f.'): 142,
 ('1', 'degré'): 136,
 ('2', 'degré'): 131,
 ('-t', '-il'): 131,
 ('grand', 'nombre'): 116,
 ('dieu', 'a'): 100,
 ('sans', 'doute'): 89,
 ('3', 'degré'): 88,
 ('d.', 'j.'): 82}
%% Cell type:markdown id:41d147a0 tags:
Interestingly enough, the Philosophie domain seems to contain far fewer masculine substantives, so that ('s.', 'm.') didn't even make the top 10. The ('d.', 'j.') bigram is there too (probably the signature of an author?), as well as ('-t', '-il'), ('1', 'degré'), ('a', 'point'), ('grand', 'nombre') and ('sans', 'doute'). Quite a populated intersection, which accounts for the relatively dark blue patch in the matrix on the left (7/10).
%% Cell type:code id:320cf92e tags:
``` /gnu/store/2rpsj69fzmcnafz4rml0blrynfayxqzr-python-wrapper-3.9.9/bin/python
set(top10_2grams(belles_lettres)).intersection(top10_2grams(philo))
```
%% Output
{('-t', '-il'),
 ('1', 'degré'),
 ('a', 'point'),
 ('d.', 'j.'),
 ('grand', 'nombre'),
 ('s.', 'f.'),
 ('sans', 'doute')}
%% Cell type:markdown id:6747e51b tags:
Now if we look at their (normalized) scalar product, though, the result is pretty average:
%% Cell type:code id:911d6a17 tags:
``` /gnu/store/2rpsj69fzmcnafz4rml0blrynfayxqzr-python-wrapper-3.9.9/bin/python
from EDdA.classification import colinearity
```
%% Cell type:code id:0fa76ac3 tags:
``` /gnu/store/2rpsj69fzmcnafz4rml0blrynfayxqzr-python-wrapper-3.9.9/bin/python
colinearity(top10_2grams(belles_lettres), top10_2grams(philo))
```
%% Output
0.4508503933694939
%% Cell type:markdown id:36532f01 tags:
Indeed, if we look at the counts associated to the 2-grams they share, they aren't synchronized but rather in phase opposition:
- the most frequent in belles_lettres, ('d.', 'j.') (485), is the least frequent in philo (82)
- the most frequent in philo, ('a', 'point') (191), is rather low in belles_lettres (67)
- the trend holds for the other 2-grams, except maybe ('s.', 'f.'), which scores about the same in both domains
As a result, their contributions, if they do not entirely cancel out, at least do not manage to produce a dot product value high enough to overcome their norms, and the normalized scalar product is not very high.
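Assuming `colinearity` is the usual cosine of the sparse count vectors (dot product over the shared keys, divided by the product of the Euclidean norms), the value above can be reproduced directly from the two top-10 dictionaries printed earlier. A self-contained sketch:

```python
import math

def cosine(u, v):
    """Dot product over shared keys, normalized by the Euclidean norms.
    Presumably equivalent to EDdA.classification.colinearity."""
    dot = sum(u[k] * v[k] for k in set(u) & set(v))
    norm = lambda w: math.sqrt(sum(x * x for x in w.values()))
    return dot / (norm(u) * norm(v))

# Top 10 2-grams copied from the outputs above.
belles_lettres = {('d.', 'j.'): 485, ('s.', 'm.'): 196, ('s.', 'f.'): 145,
                  ('chez', 'romain'): 71, ('a', 'point'): 67, ('-t', '-il'): 62,
                  ('1', 'degré'): 58, ('grand', 'nombre'): 57,
                  ('sans', 'doute'): 54, ('sou', 'nom'): 54}
philo = {('a', 'point'): 191, ('s.', 'f.'): 142, ('1', 'degré'): 136,
         ('2', 'degré'): 131, ('-t', '-il'): 131, ('grand', 'nombre'): 116,
         ('dieu', 'a'): 100, ('sans', 'doute'): 89, ('3', 'degré'): 88,
         ('d.', 'j.'): 82}

print(cosine(belles_lettres, philo))  # ≈ 0.45085, matching colinearity above
```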
%% Cell type:markdown id:e1fccadd tags:
Now looking at the other pair reveals a different story:
%% Cell type:code id:a26a0f41 tags:
``` /gnu/store/2rpsj69fzmcnafz4rml0blrynfayxqzr-python-wrapper-3.9.9/bin/python
top10_2grams(architecture)
```
%% Output
{('d.', 'j.'): 325,
 ('s.', 'm.'): 323,
 ('s.', 'f.'): 194,
 ('daviler', 'd.'): 65,
 ('plate', 'bande'): 56,
 ('vers', 'act'): 50,
 ('piece', 'bois'): 40,
 ('pierre', 'dure'): 35,
 ('porte', 'croisée'): 30,
 ('a', 'donné'): 30}
%% Cell type:code id:ad912f85 tags:
``` /gnu/store/2rpsj69fzmcnafz4rml0blrynfayxqzr-python-wrapper-3.9.9/bin/python
top10_2grams(metiers)
```
%% Output
{('d.', 'j.'): 631,
 ('s.', 'm.'): 612,
 ('s.', 'f.'): 438,
 ('morceau', 'bois'): 200,
 ('a', 'b'): 164,
 ('pieces', 'bois'): 161,
 ('piece', 'bois'): 155,
 ('fig', '1'): 139,
 ('fig', '2'): 139,
 ('or', 'argent'): 137}
%% Cell type:code id:4d8495df tags:
``` /gnu/store/2rpsj69fzmcnafz4rml0blrynfayxqzr-python-wrapper-3.9.9/bin/python
set(top10_2grams(architecture)).intersection(top10_2grams(metiers))
```
%% Output
{('d.', 'j.'), ('piece', 'bois'), ('s.', 'f.'), ('s.', 'm.')}
%% Cell type:markdown id:a75a2b67 tags:
They only have four 2-grams in common, but now three out of the four are the top 3 of each domain. That is to say, the "coordinates" that contribute the most to their norms are (almost all) the ones they have in common. Moreover, even the distribution within that top 3 is similar from one domain to the other:
- they are in the same order: ('d.', 'j.'), ('s.', 'm.'), then ('s.', 'f.')
- their frequencies are almost proportional, with a ratio close to 0.5
%% Cell type:code id:8ba0d6e3 tags:
``` /gnu/store/2rpsj69fzmcnafz4rml0blrynfayxqzr-python-wrapper-3.9.9/bin/python
list(map(lambda p: p[0] / p[1], zip([325, 323, 194], [631, 612, 438])))
```
%% Output
[0.5150554675118859, 0.5277777777777778, 0.4429223744292237]
%% Cell type:markdown id:7919c2c6 tags:
For these reasons, their scalar product is much closer to 1:
%% Cell type:code id:6d9bffda tags:
``` /gnu/store/2rpsj69fzmcnafz4rml0blrynfayxqzr-python-wrapper-3.9.9/bin/python
colinearity(top10_2grams(architecture), top10_2grams(metiers))
```
%% Output
0.9041105011801019
%% Cell type:markdown id:e63ef397 tags:
Of course, looking at these top 3 2-grams themselves is also very telling: they are not specific at all. They are the two substantive-related ones we've previously noticed and the possible signature ('d.', 'j.'), so in fact, in addition to being common to Architecture and Métiers, they are also shared by Belles-lettres, and two of them are found in Philosophie. They behave like noise, found in apparently many domains.
In any case, we have an explanation of how this counter-intuitive situation can occur: in terms of algebra, the two vectors appear to live in very different (10-D) spaces, but they actually have very low components on the 7 dimensions they don't share, on which they hardly depart from their main, 3-D components, which they do share. In that 3-D space where they almost live, they moreover have a very similar direction, and thus their scalar product is very high.
Put more simply, it means that both vectors have very little in common, but what they have in common outweighs the rest, so that their most important feature is what they share. Let's generalize this approach by defining a generic routine to perform the comparison, evaluating the number of n-grams two domains share, their scalar product, and the contribution of each common n-gram to the norm of the vector.
We'll first need to be able to project a vector on a given space (in practice, we'll use it to project on the n-grams it shares with another vector, that is, their "common subspace").
%% Cell type:code id:3254c7c6 tags:
``` /gnu/store/2rpsj69fzmcnafz4rml0blrynfayxqzr-python-wrapper-3.9.9/bin/python
def projector(vector, base):
    return dict([(k, vector[k]) for k in base if k in vector])
```
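As a quick sanity check of what this projection captures, here is a self-contained sketch (the vectors are copied from the outputs above, and the Euclidean norm is assumed to stand in for the `norm` imported from `EDdA.classification.classSimilarities` further down): projecting Architecture onto its intersection with Métiers barely reduces its norm.

```python
import math

def projector(vector, base):
    # Keep only the coordinates of `vector` that also belong to `base`.
    return {k: vector[k] for k in base if k in vector}

def norm(vector):
    # Assumed Euclidean norm of a sparse count vector.
    return math.sqrt(sum(v * v for v in vector.values()))

# Top 10 2-grams copied from the outputs above.
architecture = {('d.', 'j.'): 325, ('s.', 'm.'): 323, ('s.', 'f.'): 194,
                ('daviler', 'd.'): 65, ('plate', 'bande'): 56,
                ('vers', 'act'): 50, ('piece', 'bois'): 40,
                ('pierre', 'dure'): 35, ('porte', 'croisée'): 30,
                ('a', 'donné'): 30}
metiers = {('d.', 'j.'): 631, ('s.', 'm.'): 612, ('s.', 'f.'): 438,
           ('morceau', 'bois'): 200, ('a', 'b'): 164, ('pieces', 'bois'): 161,
           ('piece', 'bois'): 155, ('fig', '1'): 139, ('fig', '2'): 139,
           ('or', 'argent'): 137}

shared = set(architecture) & set(metiers)
projected = projector(architecture, shared)
print(len(shared), norm(projected) / norm(architecture))  # 4 dimensions, ratio ≈ 0.975
```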
%% Cell type:markdown id:c950da4c tags:
Given a vectorizer (`top10_2grams` in the above, for instance), we can now define our main tool:
%% Cell type:code id:c69b90e6 tags:
``` /gnu/store/2rpsj69fzmcnafz4rml0blrynfayxqzr-python-wrapper-3.9.9/bin/python
from EDdA.classification.classSimilarities import norm

def studyIntersection(vectorizer):
    def compare(domain1, domain2):
        vector1, vector2 = vectorizer(domain1), vectorizer(domain2)
        intersection = set(vector1).intersection(vector2)
        projected1, projected2 = projector(vector1, intersection), projector(vector2, intersection)
        print(f"Intersection: {len(intersection)}")
        print(f"Scalar product: {colinearity(vector1, vector2)}\n")
        print(f"{'n-gram': <30}\t{domain1[:20]: <20}\t{domain2[:20]: <20}")
        for ngram in intersection:
            print(f"{str(ngram)[:30]: <30}\t{vector1[ngram]: <20}\t{vector2[ngram]: <20}")
        norm1, norm2 = norm(vector1), norm(vector2)
        projNorm1, projNorm2 = norm(projected1), norm(projected2)
        print("")
        print(f"{'Norm': <30}\t{norm1: <20}\t{norm2: <20}")
        print(f"{'Projected norm': <30}\t{projNorm1: <20}\t{projNorm2: <20}")
        print(f"{'Projection ratio': <30}\t{projNorm1 / norm1: <20}\t{projNorm2 / norm2: <20}")
    return compare
```
%% Cell type:markdown id:daef48e6 tags:
So let us declare other vectorizers to be able to study the influence of the size of n-grams and the number of ranks kept for comparison.
%% Cell type:code id:d152e0fc tags:
``` /gnu/store/2rpsj69fzmcnafz4rml0blrynfayxqzr-python-wrapper-3.9.9/bin/python
top100_2grams = topNGrams(source, 2, 100)
top10_3grams = topNGrams(source, 3, 10)
top100_3grams = topNGrams(source, 3, 100)
```
%% Cell type:markdown id:cd10a9f6 tags:
Also, let's work on more domains:
%% Cell type:code id:ea9174b6 tags:
``` /gnu/store/2rpsj69fzmcnafz4rml0blrynfayxqzr-python-wrapper-3.9.9/bin/python
maths = 'Mathématiques'
mesure = 'Mesure'
physique = 'Physique - [Sciences physico-mathématiques]'
```
%% Cell type:markdown id:6e6b8cbd tags:
We can get a different view of the situation we've studied above with our new tool.
%% Cell type:code id:c4ed3787 tags:
``` /gnu/store/2rpsj69fzmcnafz4rml0blrynfayxqzr-python-wrapper-3.9.9/bin/python
compareTop10_2gram = studyIntersection(top10_2grams)
compareTop100_2gram = studyIntersection(top100_2grams)
compareTop10_3gram = studyIntersection(top10_3grams)
compareTop100_3gram = studyIntersection(top100_3grams)
```
%% Cell type:code id:7abde2b3 tags:
``` /gnu/store/2rpsj69fzmcnafz4rml0blrynfayxqzr-python-wrapper-3.9.9/bin/python
compareTop10_2gram(belles_lettres, philo)
```
%% Output
Intersection: 7
Scalar product: 0.4508503933694939

n-gram                          Belles-lettres - Poé  Philosophie
('-t', '-il')                   62                    131
('s.', 'f.')                    145                   142
('sans', 'doute')               54                    89
('d.', 'j.')                    485                   82
('a', 'point')                  67                    191
('grand', 'nombre')             57                    116
('1', 'degré')                  58                    136

Norm                            566.1139461274558     394.0913599661886
Projected norm                  523.5570647025976     346.991354359154
Projection ratio                0.9248262973983034    0.8804845515743467
%% Cell type:code id:2621ff10 tags:
``` /gnu/store/2rpsj69fzmcnafz4rml0blrynfayxqzr-python-wrapper-3.9.9/bin/python
compareTop10_2gram(architecture, metiers)
```
%% Output
Intersection: 4
Scalar product: 0.9041105011801019

n-gram                          Architecture          Métiers
('s.', 'm.')                    323                   612
('d.', 'j.')                    325                   631
('piece', 'bois')               40                    155
('s.', 'f.')                    194                   438

Norm                            511.93358944300576    1067.1466628350574
Projected norm                  499.18934283496077    994.2705869128383
Projection ratio                0.9751056643462075    0.9317094093434061
%% Cell type:markdown id:3559d324 tags:
We see two cases perfectly illustrated:
1. Belles-lettres and Philosophie live in a very similar space but are differently oriented, so they have a low normalized scalar product
2. Architecture and Métiers spread over more separate subspaces, but are almost entirely concentrated on the 4 dimensions they share
But does increasing the number of top ranks efficiently weaken the effect of the noise?
%% Cell type:code id:667fde5c tags:
``` /gnu/store/2rpsj69fzmcnafz4rml0blrynfayxqzr-python-wrapper-3.9.9/bin/python
compareTop100_2gram(architecture, metiers)
```
%% Output
Intersection: 22
Scalar product: 0.7998536585283141

n-gram                          Architecture          Métiers
('haut', 'bas')                 11                    66
('m.', 'pl')                    29                    51
('barre', 'fer')                20                    70
('a', 'plusieurs')              14                    45
('vers', 'act')                 50                    110
('endroit', 'où')               29                    87
('sou', 'nom')                  29                    56
('s.', 'm.')                    323                   612
('m.', 'espece')                14                    40
('piece', 'bois')               40                    155
('où', 'a')                     26                    93
('chaque', 'côté')              13                    73
('pouce', 'épaisseur')          15                    52
('a', 'point')                  17                    83
('a', 'b')                      22                    164
('1', 'degré')                  14                    95
('a', 'donné')                  30                    49
('s.', 'f.')                    194                   438
('pieces', 'bois')              26                    161
('d.', 'j.')                    325                   631
('grand', 'nombre')             18                    67
('2', 'degré')                  13                    95

Norm                            535.4941643006018     1230.2991506133783
Projected norm                  509.15027251293895    1062.1624169589131
Projection ratio                0.9508045212368091    0.8633367067102022
%% Cell type:markdown id:ed4c2139 tags:
Of course, the top 10 is included in the top 100, so the most prominent n-grams remain unchanged. What the above shows is that the "long tail" of less frequent 2-grams isn't enough to counterbalance the main, noisy dimensions.
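A self-contained sketch of that effect (toy vectors, not EDdA data): even when two vectors each receive a long tail of small, completely disjoint coordinates, a few large shared coordinates keep the cosine high.

```python
import math

def cosine(u, v):
    """Dot product over shared keys, normalized by the Euclidean norms."""
    dot = sum(u[k] * v[k] for k in set(u) & set(v))
    norm = lambda w: math.sqrt(sum(x * x for x in w.values()))
    return dot / (norm(u) * norm(v))

# Three dominant "noise" dimensions shared by both toy domains...
head = {f'noise{i}': 300 for i in range(3)}
# ...plus 90 small coordinates each, with no overlap at all between the tails.
u = {**head, **{f'u{i}': 20 for i in range(90)}}
v = {**head, **{f'v{i}': 20 for i in range(90)}}

print(cosine(u, v))  # 270000 / 306000 ≈ 0.88: the tail barely dents it
```

Here 90 of each vector's 93 dimensions contribute nothing to the dot product, yet the cosine stays close to 0.9 because the shared head carries most of the squared norm.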
Let's try and find less noisy top n-grams in the new domains we've made available.
%% Cell type:code id:b7a352d7 tags:
``` /gnu/store/2rpsj69fzmcnafz4rml0blrynfayxqzr-python-wrapper-3.9.9/bin/python
top10_2grams(maths)
```
%% Output
{('a', 'b'): 649,
('b', 'a'): 202,
('ligne', 'droite'): 193,
('a', 'a'): 149,
('1', 'degré'): 142,
('2', 'degré'): 129,
('b', 'b'): 110,
('s.', 'f.'): 110,
('angle', 'droit'): 102,
('2', '3'): 97}
%% Cell type:code id:20caa4fb tags:
``` /gnu/store/2rpsj69fzmcnafz4rml0blrynfayxqzr-python-wrapper-3.9.9/bin/python
top10_2grams(physique)
```
%% Output
{('a', 'b'): 370,
('1', 'degré'): 301,
('2', 'degré'): 298,
('ligne', 'droite'): 235,
('s.', 'm.'): 228,
('3', 'degré'): 208,
('s.', 'f.'): 197,
('m.', 'newton'): 180,
('quart', 'cercle'): 180,
('grand', 'nombre'): 166}
%% Cell type:markdown id:d00e8eac tags:
Except for `('s.', 'f.')` (and `('s.', 'm.')`, only for physics, which in itself is quite interesting: is maths vocabulary more feminine?), most of the top 10 2-grams seem actually related to their respective domains. Moreover, neither of these substantive-related 2-grams is the most frequent in its domain, so they don't carry virtually all the weight of the vectors like they did in the Architecture and Métiers domains above.
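That last claim can be checked numerically (a self-contained sketch; the counts are copied from the outputs above): taking the squared norm as the measure of a coordinate's contribution, the substantive bigrams carry over half of Architecture's weight but only a sliver of Mathématiques'.

```python
# Top 10 counts copied from the outputs above.
architecture = [325, 323, 194, 65, 56, 50, 40, 35, 30, 30]  # s.m. = 323, s.f. = 194
maths = [649, 202, 193, 149, 142, 129, 110, 110, 102, 97]   # s.f. = 110 only

def share(noisy, values):
    """Fraction of the squared norm carried by the `noisy` coordinates."""
    return sum(x * x for x in noisy) / sum(x * x for x in values)

print(share([323, 194], architecture))  # ≈ 0.54: over half the squared norm
print(share([110], maths))              # ≈ 0.02: marginal
```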
%% Cell type:code id:8a3b1aba tags:
``` /gnu/store/2rpsj69fzmcnafz4rml0blrynfayxqzr-python-wrapper-3.9.9/bin/python
compareTop10_2gram(maths, physique)
```
%% Output
Intersection: 5
Scalar product: 0.6471193972004591

n-gram                          Mathématiques         Physique - [Sciences
('s.', 'f.')                    110                   197
('1', 'degré')                  142                   301
('ligne', 'droite')             193                   235
('a', 'b')                      649                   370
('2', 'degré')                  129                   298

Norm                            776.062497483289      773.267741471219
Projected norm                  712.2885651195027     640.5770835738663
Projection ratio                0.9178237157824269    0.8284026983397814
%% Cell type:markdown id:b40c55f4 tags:
Here something very interesting appears: the same phenomenon as before occurs, but this time it seems legitimate. Indeed, their scalar product is slightly greater than what their intersection suggests (5 out of 10 is 50% or 0.50, so not too far below 0.64), but their intersection is almost entirely populated by relevant 2-grams. This behaviour is even more pronounced if we increase the size of the top ranking:
%% Cell type:code id:0fdc33ac tags:
``` /gnu/store/2rpsj69fzmcnafz4rml0blrynfayxqzr-python-wrapper-3.9.9/bin/python
compareTop100_2gram(maths, physique)
```
%% Output
Intersection: 27
Scalar product: 0.5766009312352989

n-gram                          Mathématiques         Physique - [Sciences
('a', 'donné')                  86                    122
('b', 'e')                      93                    51
('infiniment', 'petit')         64                    46
('point', 'a')                  42                    84
('où', 'ensuit')                29                    72
('5', 'degré')                  37                    82
('1', '2')                      94                    50
('1', 'degré')                  142                   301
('point', 'où')                 35                    80
('partie', 'égal')              59                    81
('a', 'a')                      149                   57
('m.', 'newton')                48                    180
('2', 'degré')                  129                   298
('2', '3')                      97                    51
('grand', 'nombre')             43                    166
('s.', 'f.')                    110                   197
('s.', 'm.')                    91                    228
('3', 'degré')                  86                    208
('ligne', 'droite')             193                   235
('a', 'point')                  57                    121
('angle', 'droit')              102                   52
('4', 'degré')                  52                    109
('sinus', 'angle')              45                    67
('ci', 'dessus')                78                    64
('e', 'f')                      44                    96
('a', 'b')                      649                   370
('quart', 'cercle')             61                    180

Norm                            923.9372273049722     1006.8286845337691
Projected norm                  791.5649057405211     839.5510705132833
Projection ratio                0.8567301785743959    0.8338569246286951
%% Cell type:markdown id:57a1ee20 tags:
Here the intersection is proportionately much smaller but the scalar product has only slightly diminished. This corresponds to the situation we previously observed between Architecture and Métiers, where the cell at their intersection in the confusion matrix will be darker using the scalar product than simply counting the number of most frequent n-grams they have in common. However, contrary to the Architecture and Métiers case, this is not due to noise but to relevant n-grams. Like Architecture and Métiers, they too share their most fundamental features, but this time those features are the ones which actually define them.
This shows how studying the colinearity of the most frequent n-grams in addition to counting the common n-grams can help refine the analysis.