{
"cells": [
{
"cell_type": "markdown",
"id": "3401c7f3",
"metadata": {},
"source": [
"# So what's wrong with scalar product vs. elements intersection ?\n",
"\n",
"A short notebook to explain the \"shift\" in classes comparisons obtained from [that other notebook](Confusion%20Matrix.ipynb). Looking at the matrices produced we observe that the scalar product, which can be seen intuitively as more restrictive a metrics than elements intersection (counting the number of elements in the intersections of two sets can be viewed as a scalar product of vectors where all elements in each set are associated to the integer `1`, so you'd expect the more nuanced scalar product to be systematically inferior to that all-or-nothing metrics), sometimes make classes appear more similar than computing the number of n-grams they have in common. Let's see why, starting from an illustration of the issue at stake:\n",
"\n",
"\n",
"\n",
"Both pictures are confusion matrices comparing the top 10 2-grams of each domain in the EDdA. The one on the left merely counts the number of 2-grams common between both classes whereas the one on the right uses the scalar product computed on the vectors obtained for each domain, by associating to each one of the most frequent 2-grams found in that domain its number of occurrences.\n",
"\n",
"The cells circled in green, at the intersection between Belles-lettres and Philosophie, presents the expected behaviour: The blue circle is darker on the left, where they elements are simply counted, and appear lighter when the matrix is computed with the metrics derived from the scalar product. The one in purple (Architecture / Métiers), however, is counter-intuitive: the scalar product method makes them appear closer than they were with the elements intersection method. To understand how this is possible, let's look at the top 10 2-grams of the involved classes."
]
},
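{
"cell_type": "markdown",
"id": "0a1b2c3d",
"metadata": {},
"source": [
"To make that intuition concrete, here is a minimal sketch (plain Python, independent of the EDdA code) showing that counting the elements of an intersection is exactly a scalar product of indicator vectors:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0a1b2c3e",
"metadata": {},
"outputs": [],
"source": [
"# Minimal sketch: the size of a set intersection equals the dot product\n",
"# of the sets' indicator vectors (1 if an element is present, 0 otherwise)\n",
"a = {'x', 'y', 'z'}\n",
"b = {'y', 'z', 't'}\n",
"\n",
"universe = a.union(b)\n",
"indicator_a = {e: 1 if e in a else 0 for e in universe}\n",
"indicator_b = {e: 1 if e in b else 0 for e in universe}\n",
"\n",
"dot_product = sum(indicator_a[e] * indicator_b[e] for e in universe)\n",
"assert dot_product == len(a.intersection(b))  # both equal 2"
]
},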
{
"cell_type": "code",
"execution_count": null,
"id": "7cf51dd1",
"metadata": {},
"outputs": [],
"source": [
"from EDdA import data\n",
"from EDdA.classification import topNGrams"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "30dfceaa",
"metadata": {},
"outputs": [],
"source": [
"source = data.load('training_set')\n",
"top10_2grams = topNGrams(source, 2, 10)\n",
"\n",
"architecture = 'Architecture'\n",
"belles_lettres = 'Belles-lettres - Poésie'\n",
"metiers = 'Métiers'\n",
"philo = 'Philosophie'"
]
},
{
"cell_type": "markdown",
"id": "b6f895c4",
"metadata": {},
"source": [
"We have everything ready to display the vectors. Let's start with the Belles-lettres:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ee7a9ab9",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{('d.', 'j.'): 485,\n",
" ('s.', 'm.'): 196,\n",
" ('s.', 'f.'): 145,\n",
" ('chez', 'romain'): 71,\n",
" ('a', 'point'): 67,\n",
" ('-t', '-il'): 62,\n",
" ('1', 'degré'): 58,\n",
" ('grand', 'nombre'): 57,\n",
" ('sans', 'doute'): 54,\n",
" ('sou', 'nom'): 54}"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"top10_2grams(belles_lettres)"
]
},
{
"cell_type": "markdown",
"id": "3dcc0f86",
"metadata": {},
"source": [
"We first notice the occurrences of the 2-gram ('s.', 'm.') and ('s.', 'f.') for \"substantif masculin\" and \"substantif féminin\", which occur very frequently at the begining of article. Those are of course unwanted, irrelevant bigrams but they haven't been filtered out (yet ?). Let's look at the Philosophie domain now:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d3397e87",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{('a', 'point'): 191,\n",
" ('s.', 'f.'): 142,\n",
" ('1', 'degré'): 136,\n",
" ('2', 'degré'): 131,\n",
" ('-t', '-il'): 131,\n",
" ('grand', 'nombre'): 116,\n",
" ('dieu', 'a'): 100,\n",
" ('sans', 'doute'): 89,\n",
" ('3', 'degré'): 88,\n",
" ('d.', 'j.'): 82}"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"top10_2grams(philo)"
]
},
{
"cell_type": "markdown",
"id": "41d147a0",
"metadata": {},
"source": [
"Interestingly enough, the Philosophie domain seems to comprise much fewer masculine substantives, so that ('s.', 'm.') didn't even make the top 10. The ('d.', 'j.') bigram is there too (probably the signature of an author ?), as well as ('-t', '-il'), ('1', 'degré'), ('a', 'point'), ('grand', 'nombre') and ('sans', 'doute'). Quite a populated intersection which account for the relatively dark blue patch in the matrix on the left (7/10)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "320cf92e",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{('-t', '-il'),\n",
" ('1', 'degré'),\n",
" ('a', 'point'),\n",
" ('d.', 'j.'),\n",
" ('grand', 'nombre'),\n",
" ('s.', 'f.'),\n",
" ('sans', 'doute')}"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"set(top10_2grams(belles_lettres)).intersection(top10_2grams(philo))"
]
},
{
"cell_type": "markdown",
"id": "6747e51b",
"metadata": {},
"source": [
"Now if we look at their (normalized) scalar product, though, the result is pretty average:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "911d6a17",
"metadata": {},
"outputs": [],
"source": [
"from EDdA.classification import colinearity"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0fa76ac3",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.4508503933694939"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"colinearity(top10_2grams(belles_lettres), top10_2grams(philo))"
]
},
{
"cell_type": "markdown",
"id": "36532f01",
"metadata": {},
"source": [
"Indeed, if we look at the cardinalities associated to each 2-gram they share, they aren't synchronized but rather in phase opposition instead:\n",
"\n",
"- the most frequent in belles_lettres, ('d.', 'j.') (485) is the least frequent in philo (82)\n",
"- the most frequent in philo, ('a', 'point') (191) is rather low in belles_lettres (67)\n",
"- the trend is general to the other 2-grams, except maybe ('s.', 'f.') which scores about the same in both domains\n",
"\n",
"As a result, their contributions, if they do not entirely cancel out, at least do not manage to produce a dot product value high enough to overcome their norms, and the normalized scalar product is not very high."
]
},
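{
"cell_type": "markdown",
"id": "1b2c3d4e",
"metadata": {},
"source": [
"We can check this by recomputing the normalized scalar product by hand from the two dictionaries printed above (a sketch of what we assume `colinearity` computes, namely a cosine similarity in which only the shared 2-grams contribute to the dot product):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1b2c3d4f",
"metadata": {},
"outputs": [],
"source": [
"from math import sqrt\n",
"\n",
"# Recompute the normalized scalar product by hand\n",
"v1, v2 = top10_2grams(belles_lettres), top10_2grams(philo)\n",
"\n",
"# Only the 7 shared 2-grams contribute to the dot product\n",
"dot = sum(v1[ngram] * v2[ngram] for ngram in set(v1).intersection(v2))\n",
"norm1 = sqrt(sum(count**2 for count in v1.values()))\n",
"norm2 = sqrt(sum(count**2 for count in v2.values()))\n",
"\n",
"dot / (norm1 * norm2)  # ~ 0.4508, the value colinearity returned above"
]
},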
{
"cell_type": "markdown",
"id": "e1fccadd",
"metadata": {},
"source": [
"Now looking at the other pair reveals a different story:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a26a0f41",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{('d.', 'j.'): 325,\n",
" ('s.', 'm.'): 323,\n",
" ('s.', 'f.'): 194,\n",
" ('daviler', 'd.'): 65,\n",
" ('plate', 'bande'): 56,\n",
" ('vers', 'act'): 50,\n",
" ('piece', 'bois'): 40,\n",
" ('pierre', 'dure'): 35,\n",
" ('porte', 'croisée'): 30,\n",
" ('a', 'donné'): 30}"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"top10_2grams(architecture)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ad912f85",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{('d.', 'j.'): 631,\n",
" ('s.', 'm.'): 612,\n",
" ('s.', 'f.'): 438,\n",
" ('morceau', 'bois'): 200,\n",
" ('a', 'b'): 164,\n",
" ('pieces', 'bois'): 161,\n",
" ('piece', 'bois'): 155,\n",
" ('fig', '1'): 139,\n",
" ('fig', '2'): 139,\n",
" ('or', 'argent'): 137}"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"top10_2grams(metiers)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4d8495df",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{('d.', 'j.'), ('piece', 'bois'), ('s.', 'f.'), ('s.', 'm.')}"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"set(top10_2grams(architecture)).intersection(top10_2grams(metiers))"
]
},
{
"cell_type": "markdown",
"id": "a75a2b67",
"metadata": {},
"source": [
"They only have four 2-grams in commons, but now three out of the four are the top 3 of each domain. That is to say, the \"coordinates\" that contribute the most to their norms are (almost all) the ones they have in common. Moreover, even the distribution in that top 3 is similar from one domain to the other:\n",
"\n",
"- they are in the same order: ('d.', 'j.'), ('s.', 'm.'), then ('s.', 'f.')\n",
"- their frequencies are almost proportional, with a ratio close to 0.5"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8ba0d6e3",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[0.5150554675118859, 0.5277777777777778, 0.4429223744292237]"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"list(map(lambda p: p[0] / p[1], zip([325, 323, 194], [631, 612, 438])))"
]
},
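{
"cell_type": "markdown",
"id": "2c3d4e5f",
"metadata": {},
"source": [
"The same ratios can be read directly from the vectors instead of hard-coding the counts:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2c3d4e60",
"metadata": {},
"outputs": [],
"source": [
"# Same ratios, read directly from the two vectors\n",
"arch, met = top10_2grams(architecture), top10_2grams(metiers)\n",
"[arch[ngram] / met[ngram] for ngram in [('d.', 'j.'), ('s.', 'm.'), ('s.', 'f.')]]"
]
},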
{
"cell_type": "markdown",
"id": "7919c2c6",
"metadata": {},
"source": [
"For these reasons, their scalar product is much closer to 1:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6d9bffda",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.9041105011801019"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"colinearity(top10_2grams(architecture), top10_2grams(metiers))"
]
},
{
"cell_type": "markdown",
"id": "e63ef397",
"metadata": {},
"source": [
"Of course, looking at these top 3 2-grams themselves is also very telling: they are not specific at all, there are the two substantive one we've previously noticed and the possible signature ('d.', 'j.'), so in fact, in addition to being common between Architecture and Métier, they are also shared by Belles-lettres, and two of them are found in Philosophie. They are rather like noise, found in apparently many domains.\n",
"\n",
"In any case, we have an explanation of how this counter-intuitive situation can occur: in terms of algebra, they appear to live in very different (10-D) spaces, but they actually have very low components on the 7 dimensions they don't share, on which they hardly depart from their main, 3-D components, which they share. In that 3-D space where they almost live, they have in addition a very similar direction, and thus their scalar product is very high.\n",
"\n",
"Put more simply, it means that both vectors have very little in common, but what they have in common outweights the rest, so that their most important feature is what they share. Let's generalize this approach by defining a generic routine to perform the comparison, evaluating the number of n-grams two domains share, their scalar product, and the contribution of each common n-gram to the norm of the vector.\n",
"\n",
"We'll first need to be able to project a vector on a given space (in practice, we'll use it to project on the n-grams it shares with another vector, that is, their \"common subspace\")."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3254c7c6",
"metadata": {},
"outputs": [],
"source": [
"def projector(vector, base):\n",
" return dict([(k, vector[k]) for k in base if k in vector])"
]
},
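{
"cell_type": "markdown",
"id": "3d4e5f6a",
"metadata": {},
"source": [
"As a quick check, projecting the Architecture vector onto the 2-grams it shares with Métiers keeps only the four shared coordinates:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3d4e5f6b",
"metadata": {},
"outputs": [],
"source": [
"# Only the four 2-grams common to both domains survive the projection\n",
"projector(top10_2grams(architecture), set(top10_2grams(metiers)))"
]
},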
{
"cell_type": "markdown",
"id": "c950da4c",
"metadata": {},
"source": [
"Given a vectorizer (`top10_2grams` in the above for instance), we can now define our main tool:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c69b90e6",
"metadata": {},
"outputs": [],
"source": [
"from EDdA.classification.classSimilarities import norm\n",
"\n",
"def studyIntersection(vectorizer):\n",
" def compare(domain1, domain2):\n",
" vector1, vector2 = vectorizer(domain1), vectorizer(domain2)\n",
" intersection = set(vector1).intersection(vector2)\n",
" projected1, projected2 = projector(vector1, intersection), projector(vector2, intersection)\n",
" print(f\"Intersection: {len(intersection)}\")\n",
" print(f\"Scalar product: {colinearity(vector1, vector2)}\\n\")\n",
" print(f\"{'n-gram': <30}\\t{domain1[:20]: <20}\\t{domain2[:20]: <20}\")\n",
" for ngram in intersection:\n",
" print(f\"{str(ngram)[:30]: <30}\\t{vector1[ngram]: <20}\\t{vector2[ngram]: <20}\")\n",
" norm1, norm2 = norm(vector1), norm(vector2)\n",
" projNorm1, projNorm2 = norm(projected1), norm(projected2)\n",
" print(\"\")\n",
" print(f\"{'Norm': <30}\\t{norm1: <20}\\t{norm2: <20}\")\n",
" print(f\"{'Projected norm': <30}\\t{projNorm1: <20}\\t{projNorm2: <20}\")\n",
" print(f\"{'Projection ratio': <30}\\t{projNorm1 / norm1: <20}\\t{projNorm2 / norm2: <20}\")\n",
" return compare"
]
},
{
"cell_type": "markdown",
"id": "daef48e6",
"metadata": {},
"source": [
"So let us declare other vectorizers to be able to study the influence of the size of n-grams and the number of ranks kept for comparison."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d152e0fc",
"metadata": {},
"outputs": [],
"source": [
"top100_2grams = topNGrams(source, 2, 100)\n",
"top10_3grams = topNGrams(source, 3, 10)\n",
"top100_3grams = topNGrams(source, 3, 100)"
]
},
{
"cell_type": "markdown",
"id": "cd10a9f6",
"metadata": {},
"source": [
"Also, let's work on more domains:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ea9174b6",
"metadata": {},
"outputs": [],
"source": [
"maths = 'Mathématiques'\n",
"mesure = 'Mesure'\n",
"physique = 'Physique - [Sciences physico-mathématiques]'"
]
},
{
"cell_type": "markdown",
"id": "6e6b8cbd",
"metadata": {},
"source": [
"We can get a different view of the situation we've studied above with our new tool."
]
},
{
"cell_type": "code",
"execution_count": 15,
"id": "c4ed3787",
"metadata": {},
"outputs": [],
"source": [
"compareTop10_2gram = studyIntersection(top10_2grams)\n",
"compareTop100_2gram = studyIntersection(top100_2grams)\n",
"compareTop10_3gram = studyIntersection(top10_3grams)\n",
"compareTop100_3gram = studyIntersection(top100_3grams)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7abde2b3",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Intersection: 7\n",
"Scalar product: 0.4508503933694939\n",
"\n",
"n-gram \tBelles-lettres - Poé\tPhilosophie \n",
"('-t', '-il') \t62 \t131 \n",
"('s.', 'f.') \t145 \t142 \n",
"('sans', 'doute') \t54 \t89 \n",
"('d.', 'j.') \t485 \t82 \n",
"('a', 'point') \t67 \t191 \n",
"('grand', 'nombre') \t57 \t116 \n",
"('1', 'degré') \t58 \t136 \n",
"\n",
"Norm \t566.1139461274558 \t394.0913599661886 \n",
"Projected norm \t523.5570647025976 \t346.991354359154 \n",
"Projection ratio \t0.9248262973983034 \t0.8804845515743467 \n"
]
}
],
"source": [
"compareTop10_2gram(belles_lettres, philo)"
]
},
{
"cell_type": "code",
"execution_count": 17,
"id": "2621ff10",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Intersection: 4\n",
"Scalar product: 0.9041105011801019\n",
"\n",
"n-gram \tArchitecture \tMétiers \n",
"('s.', 'm.') \t323 \t612 \n",
"('d.', 'j.') \t325 \t631 \n",
"('piece', 'bois') \t40 \t155 \n",
"('s.', 'f.') \t194 \t438 \n",
"\n",
"Norm \t511.93358944300576 \t1067.1466628350574 \n",
"Projected norm \t499.18934283496077 \t994.2705869128383 \n",
"Projection ratio \t0.9751056643462075 \t0.9317094093434061 \n"
]
}
],
"source": [
"compareTop10_2gram(architecture, metiers)"
]
},
{
"cell_type": "markdown",
"id": "3559d324",
"metadata": {},
"source": [
"We see two cases perfectly illustrated:\n",
"\n",
"1. Belles-lettres and philo live in a very similar but are differently oriented, so they have a low normalized scalar-product\n",
"2. Architecture and Métiers spread on more separate subspaces, but are almost entirely concentrated on the 4 dimensions they share\n",
"\n",
"But does increasing the number of top ranks efficiently weakens the effect of the noise ?"
]
},
{
"cell_type": "code",
"execution_count": 18,
"id": "667fde5c",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Intersection: 22\n",
"Scalar product: 0.7998536585283141\n",
"\n",
"n-gram \tArchitecture \tMétiers \n",
"('haut', 'bas') \t11 \t66 \n",
"('m.', 'pl') \t29 \t51 \n",
"('barre', 'fer') \t20 \t70 \n",
"('a', 'plusieurs') \t14 \t45 \n",
"('vers', 'act') \t50 \t110 \n",
"('endroit', 'où') \t29 \t87 \n",
"('sou', 'nom') \t29 \t56 \n",
"('s.', 'm.') \t323 \t612 \n",
"('m.', 'espece') \t14 \t40 \n",
"('piece', 'bois') \t40 \t155 \n",
"('où', 'a') \t26 \t93 \n",
"('chaque', 'côté') \t13 \t73 \n",
"('pouce', 'épaisseur') \t15 \t52 \n",
"('a', 'point') \t17 \t83 \n",
"('a', 'b') \t22 \t164 \n",
"('1', 'degré') \t14 \t95 \n",
"('a', 'donné') \t30 \t49 \n",
"('s.', 'f.') \t194 \t438 \n",
"('pieces', 'bois') \t26 \t161 \n",
"('d.', 'j.') \t325 \t631 \n",
"('grand', 'nombre') \t18 \t67 \n",
"('2', 'degré') \t13 \t95 \n",
"\n",
"Norm \t535.4941643006018 \t1230.2991506133783 \n",
"Projected norm \t509.15027251293895 \t1062.1624169589131 \n",
"Projection ratio \t0.9508045212368091 \t0.8633367067102022 \n"
]
}
],
"source": [
"compareTop100_2gram(architecture, metiers)"
]
},
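{
"cell_type": "markdown",
"id": "4e5f6a7b",
"metadata": {},
"source": [
"We can quantify how much the three unspecific bigrams dominate by computing the share of each vector's squared norm they carry on their own (a quick check reusing `norm` and `projector` from above):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4e5f6a7c",
"metadata": {},
"outputs": [],
"source": [
"# Share of each vector's squared norm carried by the three noisy bigrams\n",
"noisy = {('d.', 'j.'), ('s.', 'm.'), ('s.', 'f.')}\n",
"for vectorizer in [top10_2grams, top100_2grams]:\n",
"    for domain in [architecture, metiers]:\n",
"        vector = vectorizer(domain)\n",
"        share = (norm(projector(vector, noisy)) / norm(vector)) ** 2\n",
"        print(domain, share)"
]
},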
{
"cell_type": "markdown",
"id": "ed4c2139",
"metadata": {},
"source": [
"Of course, the top10 is included in the top100, so the most prominent n-grams remain unchanged. What the above shows is that the \"long-tail\" of less-frequent 2-grams isn't enough to counter-balance the main, noisy, dimensions."
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "/gnu/store/2rpsj69fzmcnafz4rml0blrynfayxqzr-python-wrapper-3.9.9/bin/python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.9"
}
},
"nbformat": 4,
"nbformat_minor": 5
}