Commit d7e9f12b authored by Alice Brenon

Finish the analysis of scalar products vs. common elements count

parent f24611d4
%% Cell type:markdown id:3401c7f3 tags:
# So what's wrong with scalar product vs. elements intersection?
A short notebook to explain the "shift" in class comparisons obtained from [that other notebook](Confusion%20Matrix.ipynb). Looking at the matrices produced, we observe that the scalar product, which intuitively seems a more restrictive metric than the elements intersection (counting the number of elements in the intersection of two sets can be viewed as a scalar product of vectors where all elements in each set are associated with the integer `1`, so you would expect the more nuanced scalar product to be systematically lower than that all-or-nothing metric), sometimes makes classes appear more similar than computing the number of n-grams they have in common. Let's see why, starting from an illustration of the issue at stake:
![The pairs (Belles-lettres / Philosophie) and (Architecture / Métiers)](../img/colinearity_vs_keys_intersections.png)
Both pictures are confusion matrices comparing the top 10 2-grams of each domain in the EDdA. The one on the left merely counts the number of 2-grams common to both classes, whereas the one on the right uses the scalar product computed on the vectors obtained for each domain by associating to each of its most frequent 2-grams its number of occurrences.
The cells circled in green, at the intersection between Belles-lettres and Philosophie, present the expected behaviour: the cell is darker on the left, where the elements are simply counted, and appears lighter when the matrix is computed with the metric derived from the scalar product. The pair circled in purple (Architecture / Métiers), however, is counter-intuitive: the scalar product method makes them appear closer than they were with the elements intersection method. To understand how this is possible, let's look at the top 10 2-grams of the classes involved.
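Before diving into the real data, the inversion can be reproduced on toy vectors. This is a self-contained sketch (the dictionaries and helper names are illustrative, not part of the EDdA library): counting shared keys ranks pair A above pair B, while the normalized scalar product ranks them the other way around.

```python
import math

def intersection_count(u, v):
    """All-or-nothing metric: how many keys two sparse count vectors share."""
    return len(set(u) & set(v))

def cosine(u, v):
    """Normalized scalar product: dot product over the shared keys,
    divided by the product of the Euclidean norms."""
    dot = sum(u[k] * v[k] for k in set(u) & set(v))
    norm = lambda w: math.sqrt(sum(x * x for x in w.values()))
    return dot / (norm(u) * norm(v))

# Pair A: all four keys are shared, but the counts are "in phase opposition".
a1 = {'p': 500, 'q': 60, 'r': 55, 's': 50}
a2 = {'p': 80, 'q': 190, 'r': 130, 's': 110}

# Pair B: only two shared keys, but they dominate both vectors
# with nearly proportional counts.
b1 = {'x': 320, 'y': 190, 'u': 40, 'v': 30}
b2 = {'x': 630, 'y': 440, 'w': 160, 'z': 140}
```

With these toys, `intersection_count` favours pair A (4 vs. 2 shared keys) while `cosine` favours pair B, which is exactly the "shift" observed in the matrices above.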
%% Cell type:code id:7cf51dd1 tags:
``` /gnu/store/2rpsj69fzmcnafz4rml0blrynfayxqzr-python-wrapper-3.9.9/bin/python
from EDdA import data
from EDdA.classification import topNGrams
```
%% Cell type:code id:30dfceaa tags:
``` /gnu/store/2rpsj69fzmcnafz4rml0blrynfayxqzr-python-wrapper-3.9.9/bin/python
source = data.load('training_set')
top10_2grams = topNGrams(source, 2, 10)
architecture = 'Architecture'
belles_lettres = 'Belles-lettres - Poésie'
metiers = 'Métiers'
philo = 'Philosophie'
```
%% Cell type:markdown id:b6f895c4 tags:
We have everything ready to display the vectors. Let's start with Belles-lettres:
%% Cell type:code id:ee7a9ab9 tags:
``` /gnu/store/2rpsj69fzmcnafz4rml0blrynfayxqzr-python-wrapper-3.9.9/bin/python
top10_2grams(belles_lettres)
```
%% Output
{('d.', 'j.'): 485,
 ('s.', 'm.'): 196,
 ('s.', 'f.'): 145,
 ('chez', 'romain'): 71,
 ('a', 'point'): 67,
 ('-t', '-il'): 62,
 ('1', 'degré'): 58,
 ('grand', 'nombre'): 57,
 ('sans', 'doute'): 54,
 ('sou', 'nom'): 54}
%% Cell type:markdown id:3dcc0f86 tags:
We first notice the occurrences of the 2-grams ('s.', 'm.') and ('s.', 'f.'), for "substantif masculin" and "substantif féminin", which occur very frequently at the beginning of articles. Those are of course unwanted, irrelevant bigrams, but they haven't been filtered out (yet?). Let's look at the Philosophie domain now:
%% Cell type:code id:d3397e87 tags:
``` /gnu/store/2rpsj69fzmcnafz4rml0blrynfayxqzr-python-wrapper-3.9.9/bin/python
top10_2grams(philo)
```
%% Output
{('a', 'point'): 191,
 ('s.', 'f.'): 142,
 ('1', 'degré'): 136,
 ('2', 'degré'): 131,
 ('-t', '-il'): 131,
 ('grand', 'nombre'): 116,
 ('dieu', 'a'): 100,
 ('sans', 'doute'): 89,
 ('3', 'degré'): 88,
 ('d.', 'j.'): 82}
%% Cell type:markdown id:41d147a0 tags:
Interestingly enough, the Philosophie domain seems to contain far fewer masculine substantives, so that ('s.', 'm.') didn't even make the top 10. The ('d.', 'j.') bigram is there too (probably the signature of an author?), as well as ('-t', '-il'), ('1', 'degré'), ('a', 'point'), ('grand', 'nombre') and ('sans', 'doute'). Quite a populated intersection, which accounts for the relatively dark blue patch in the matrix on the left (7/10).
%% Cell type:code id:320cf92e tags:
``` /gnu/store/2rpsj69fzmcnafz4rml0blrynfayxqzr-python-wrapper-3.9.9/bin/python
set(top10_2grams(belles_lettres)).intersection(top10_2grams(philo))
```
%% Output
{('-t', '-il'),
 ('1', 'degré'),
 ('a', 'point'),
 ('d.', 'j.'),
 ('grand', 'nombre'),
 ('s.', 'f.'),
 ('sans', 'doute')}
%% Cell type:markdown id:6747e51b tags:
Now if we look at their (normalized) scalar product, though, the result is pretty average:
%% Cell type:code id:911d6a17 tags:
``` /gnu/store/2rpsj69fzmcnafz4rml0blrynfayxqzr-python-wrapper-3.9.9/bin/python
from EDdA.classification import colinearity
```
%% Cell type:code id:0fa76ac3 tags:
``` /gnu/store/2rpsj69fzmcnafz4rml0blrynfayxqzr-python-wrapper-3.9.9/bin/python
colinearity(top10_2grams(belles_lettres), top10_2grams(philo))
```
%% Output
0.4508503933694939
%% Cell type:markdown id:36532f01 tags:
Indeed, if we look at the counts associated to the 2-grams they share, they aren't synchronized but rather in phase opposition:
- the most frequent in belles_lettres, ('d.', 'j.') (485), is the least frequent in philo (82)
- the most frequent in philo, ('a', 'point') (191), is rather low in belles_lettres (67)
- the trend holds for the other 2-grams, except maybe ('s.', 'f.'), which scores about the same in both domains
As a result, their contributions, if they do not entirely cancel out, at least do not manage to produce a dot product value high enough to overcome their norms, and the normalized scalar product is not very high.
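Assuming `colinearity` is the usual cosine of the sparse count vectors (dot product over the shared keys, divided by the product of the Euclidean norms), the value above can be reproduced directly from the two top-10 dictionaries printed earlier. A self-contained sketch:

```python
import math

def cosine(u, v):
    """Dot product over shared keys, normalized by the Euclidean norms.
    Presumably equivalent to EDdA.classification.colinearity."""
    dot = sum(u[k] * v[k] for k in set(u) & set(v))
    norm = lambda w: math.sqrt(sum(x * x for x in w.values()))
    return dot / (norm(u) * norm(v))

# Top 10 2-grams copied from the outputs above.
belles_lettres = {('d.', 'j.'): 485, ('s.', 'm.'): 196, ('s.', 'f.'): 145,
                  ('chez', 'romain'): 71, ('a', 'point'): 67, ('-t', '-il'): 62,
                  ('1', 'degré'): 58, ('grand', 'nombre'): 57,
                  ('sans', 'doute'): 54, ('sou', 'nom'): 54}
philo = {('a', 'point'): 191, ('s.', 'f.'): 142, ('1', 'degré'): 136,
         ('2', 'degré'): 131, ('-t', '-il'): 131, ('grand', 'nombre'): 116,
         ('dieu', 'a'): 100, ('sans', 'doute'): 89, ('3', 'degré'): 88,
         ('d.', 'j.'): 82}

print(cosine(belles_lettres, philo))  # ≈ 0.45085, matching colinearity above
```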
%% Cell type:markdown id:e1fccadd tags:
Now looking at the other pair reveals a different story:
%% Cell type:code id:a26a0f41 tags:
``` /gnu/store/2rpsj69fzmcnafz4rml0blrynfayxqzr-python-wrapper-3.9.9/bin/python
top10_2grams(architecture)
```
%% Output
{('d.', 'j.'): 325,
 ('s.', 'm.'): 323,
 ('s.', 'f.'): 194,
 ('daviler', 'd.'): 65,
 ('plate', 'bande'): 56,
 ('vers', 'act'): 50,
 ('piece', 'bois'): 40,
 ('pierre', 'dure'): 35,
 ('porte', 'croisée'): 30,
 ('a', 'donné'): 30}
%% Cell type:code id:ad912f85 tags:
``` /gnu/store/2rpsj69fzmcnafz4rml0blrynfayxqzr-python-wrapper-3.9.9/bin/python
top10_2grams(metiers)
```
%% Output
{('d.', 'j.'): 631,
 ('s.', 'm.'): 612,
 ('s.', 'f.'): 438,
 ('morceau', 'bois'): 200,
 ('a', 'b'): 164,
 ('pieces', 'bois'): 161,
 ('piece', 'bois'): 155,
 ('fig', '1'): 139,
 ('fig', '2'): 139,
 ('or', 'argent'): 137}
%% Cell type:code id:4d8495df tags:
``` /gnu/store/2rpsj69fzmcnafz4rml0blrynfayxqzr-python-wrapper-3.9.9/bin/python
set(top10_2grams(architecture)).intersection(top10_2grams(metiers))
```
%% Output
{('d.', 'j.'), ('piece', 'bois'), ('s.', 'f.'), ('s.', 'm.')}
%% Cell type:markdown id:a75a2b67 tags:
They only have four 2-grams in common, but now three out of the four are the top 3 of each domain. That is to say, the "coordinates" that contribute the most to their norms are (almost all) the ones they have in common. Moreover, even the distribution within that top 3 is similar from one domain to the other:
- they are in the same order: ('d.', 'j.'), ('s.', 'm.'), then ('s.', 'f.')
- their frequencies are almost proportional, with a ratio close to 0.5
%% Cell type:code id:8ba0d6e3 tags:
``` /gnu/store/2rpsj69fzmcnafz4rml0blrynfayxqzr-python-wrapper-3.9.9/bin/python
list(map(lambda p: p[0] / p[1], zip([325, 323, 194], [631, 612, 438])))
```
%% Output
[0.5150554675118859, 0.5277777777777778, 0.4429223744292237]
%% Cell type:markdown id:7919c2c6 tags:
For these reasons, their scalar product is much closer to 1:
%% Cell type:code id:6d9bffda tags:
``` /gnu/store/2rpsj69fzmcnafz4rml0blrynfayxqzr-python-wrapper-3.9.9/bin/python
colinearity(top10_2grams(architecture), top10_2grams(metiers))
```
%% Output
0.9041105011801019
%% Cell type:markdown id:e63ef397 tags:
Of course, looking at these top 3 2-grams themselves is also very telling: they are not specific at all. They are the two substantive-related ones we've previously noticed and the possible signature ('d.', 'j.'), so in fact, in addition to being common to Architecture and Métiers, they are also shared by Belles-lettres, and two of them are found in Philosophie. They behave like noise, found in apparently many domains.
In any case, we have an explanation of how this counter-intuitive situation can occur: in terms of algebra, the two vectors appear to live in very different (10-D) spaces, but they actually have very low components on the 7 dimensions they don't share, on which they hardly depart from their main, 3-D components, which they do share. In that 3-D space where they almost live, they moreover have a very similar direction, and thus their scalar product is very high.
Put more simply, it means that both vectors have very little in common, but what they have in common outweighs the rest, so that their most important feature is what they share. Let's generalize this approach by defining a generic routine to perform the comparison, evaluating the number of n-grams two domains share, their scalar product, and the contribution of each common n-gram to the norm of the vector.
We'll first need to be able to project a vector on a given space (in practice, we'll use it to project on the n-grams it shares with another vector, that is, their "common subspace").
%% Cell type:code id:3254c7c6 tags:
``` /gnu/store/2rpsj69fzmcnafz4rml0blrynfayxqzr-python-wrapper-3.9.9/bin/python
def projector(vector, base):
    return dict([(k, vector[k]) for k in base if k in vector])
```
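As a quick sanity check of what this projection captures, here is a self-contained sketch (the vectors are copied from the outputs above, and the Euclidean norm is assumed to stand in for the `norm` imported from `EDdA.classification.classSimilarities` further down): projecting Architecture onto its intersection with Métiers barely reduces its norm.

```python
import math

def projector(vector, base):
    # Keep only the coordinates of `vector` that also belong to `base`.
    return {k: vector[k] for k in base if k in vector}

def norm(vector):
    # Assumed Euclidean norm of a sparse count vector.
    return math.sqrt(sum(v * v for v in vector.values()))

# Top 10 2-grams copied from the outputs above.
architecture = {('d.', 'j.'): 325, ('s.', 'm.'): 323, ('s.', 'f.'): 194,
                ('daviler', 'd.'): 65, ('plate', 'bande'): 56,
                ('vers', 'act'): 50, ('piece', 'bois'): 40,
                ('pierre', 'dure'): 35, ('porte', 'croisée'): 30,
                ('a', 'donné'): 30}
metiers = {('d.', 'j.'): 631, ('s.', 'm.'): 612, ('s.', 'f.'): 438,
           ('morceau', 'bois'): 200, ('a', 'b'): 164, ('pieces', 'bois'): 161,
           ('piece', 'bois'): 155, ('fig', '1'): 139, ('fig', '2'): 139,
           ('or', 'argent'): 137}

shared = set(architecture) & set(metiers)
projected = projector(architecture, shared)
print(len(shared), norm(projected) / norm(architecture))  # 4 dimensions, ratio ≈ 0.975
```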
%% Cell type:markdown id:c950da4c tags:
Given a vectorizer (`top10_2grams` in the above, for instance), we can now define our main tool:
%% Cell type:code id:c69b90e6 tags:
``` /gnu/store/2rpsj69fzmcnafz4rml0blrynfayxqzr-python-wrapper-3.9.9/bin/python
from EDdA.classification.classSimilarities import norm

def studyIntersection(vectorizer):
    def compare(domain1, domain2):
        vector1, vector2 = vectorizer(domain1), vectorizer(domain2)
        intersection = set(vector1).intersection(vector2)
        projected1, projected2 = projector(vector1, intersection), projector(vector2, intersection)
        print(f"Intersection: {len(intersection)}")
        print(f"Scalar product: {colinearity(vector1, vector2)}\n")
        print(f"{'n-gram': <30}\t{domain1[:20]: <20}\t{domain2[:20]: <20}")
        for ngram in intersection:
            print(f"{str(ngram)[:30]: <30}\t{vector1[ngram]: <20}\t{vector2[ngram]: <20}")
        norm1, norm2 = norm(vector1), norm(vector2)
        projNorm1, projNorm2 = norm(projected1), norm(projected2)
        print("")
        print(f"{'Norm': <30}\t{norm1: <20}\t{norm2: <20}")
        print(f"{'Projected norm': <30}\t{projNorm1: <20}\t{projNorm2: <20}")
        print(f"{'Projection ratio': <30}\t{projNorm1 / norm1: <20}\t{projNorm2 / norm2: <20}")
    return compare
```
%% Cell type:markdown id:daef48e6 tags:
So let us declare other vectorizers to be able to study the influence of the size of n-grams and the number of ranks kept for comparison.
%% Cell type:code id:d152e0fc tags:
``` /gnu/store/2rpsj69fzmcnafz4rml0blrynfayxqzr-python-wrapper-3.9.9/bin/python
top100_2grams = topNGrams(source, 2, 100)
top10_3grams = topNGrams(source, 3, 10)
top100_3grams = topNGrams(source, 3, 100)
```
%% Cell type:markdown id:cd10a9f6 tags:
Also, let's work on more domains:
%% Cell type:code id:ea9174b6 tags:
``` /gnu/store/2rpsj69fzmcnafz4rml0blrynfayxqzr-python-wrapper-3.9.9/bin/python
maths = 'Mathématiques'
mesure = 'Mesure'
physique = 'Physique - [Sciences physico-mathématiques]'
```
%% Cell type:markdown id:6e6b8cbd tags:
We can get a different view of the situation we've studied above with our new tool.
%% Cell type:code id:c4ed3787 tags:
``` /gnu/store/2rpsj69fzmcnafz4rml0blrynfayxqzr-python-wrapper-3.9.9/bin/python
compareTop10_2gram = studyIntersection(top10_2grams)
compareTop100_2gram = studyIntersection(top100_2grams)
compareTop10_3gram = studyIntersection(top10_3grams)
compareTop100_3gram = studyIntersection(top100_3grams)
```
%% Cell type:code id:7abde2b3 tags:
``` /gnu/store/2rpsj69fzmcnafz4rml0blrynfayxqzr-python-wrapper-3.9.9/bin/python
compareTop10_2gram(belles_lettres, philo)
```
%% Output
Intersection: 7
Scalar product: 0.4508503933694939

n-gram                          Belles-lettres - Poé  Philosophie
('-t', '-il')                   62                    131
('s.', 'f.')                    145                   142
('sans', 'doute')               54                    89
('d.', 'j.')                    485                   82
('a', 'point')                  67                    191
('grand', 'nombre')             57                    116
('1', 'degré')                  58                    136

Norm                            566.1139461274558     394.0913599661886
Projected norm                  523.5570647025976     346.991354359154
Projection ratio                0.9248262973983034    0.8804845515743467
%% Cell type:code id:2621ff10 tags:
``` /gnu/store/2rpsj69fzmcnafz4rml0blrynfayxqzr-python-wrapper-3.9.9/bin/python
compareTop10_2gram(architecture, metiers)
```
%% Output
Intersection: 4
Scalar product: 0.9041105011801019

n-gram                          Architecture          Métiers
('s.', 'm.')                    323                   612
('d.', 'j.')                    325                   631
('piece', 'bois')               40                    155
('s.', 'f.')                    194                   438

Norm                            511.93358944300576    1067.1466628350574
Projected norm                  499.18934283496077    994.2705869128383
Projection ratio                0.9751056643462075    0.9317094093434061
%% Cell type:markdown id:3559d324 tags:
We see two cases perfectly illustrated:
1. Belles-lettres and Philosophie live in a very similar space but are differently oriented, so they have a low normalized scalar product
2. Architecture and Métiers spread over more separate subspaces, but are almost entirely concentrated on the 4 dimensions they share
But does increasing the number of top ranks efficiently weaken the effect of the noise?
%% Cell type:code id:667fde5c tags:
``` /gnu/store/2rpsj69fzmcnafz4rml0blrynfayxqzr-python-wrapper-3.9.9/bin/python
compareTop100_2gram(architecture, metiers)
```
%% Output
Intersection: 22
Scalar product: 0.7998536585283141

n-gram                          Architecture          Métiers
('haut', 'bas')                 11                    66
('m.', 'pl')                    29                    51
('barre', 'fer')                20                    70
('a', 'plusieurs')              14                    45
('vers', 'act')                 50                    110
('endroit', 'où')               29                    87
('sou', 'nom')                  29                    56
('s.', 'm.')                    323                   612
('m.', 'espece')                14                    40
('piece', 'bois')               40                    155
('où', 'a')                     26                    93
('chaque', 'côté')              13                    73
('pouce', 'épaisseur')          15                    52
('a', 'point')                  17                    83
('a', 'b')                      22                    164
('1', 'degré')                  14                    95
('a', 'donné')                  30                    49
('s.', 'f.')                    194                   438
('pieces', 'bois')              26                    161
('d.', 'j.')                    325                   631
('grand', 'nombre')             18                    67
('2', 'degré')                  13                    95

Norm                            535.4941643006018     1230.2991506133783
Projected norm                  509.15027251293895    1062.1624169589131
Projection ratio                0.9508045212368091    0.8633367067102022
%% Cell type:markdown id:ed4c2139 tags:
Of course, the top 10 is included in the top 100, so the most prominent n-grams remain unchanged. What the above shows is that the "long tail" of less frequent 2-grams isn't enough to counterbalance the main, noisy dimensions.
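A self-contained sketch of that effect (toy vectors, not EDdA data): even when two vectors each receive a long tail of small, completely disjoint coordinates, a few large shared coordinates keep the cosine high.

```python
import math

def cosine(u, v):
    """Dot product over shared keys, normalized by the Euclidean norms."""
    dot = sum(u[k] * v[k] for k in set(u) & set(v))
    norm = lambda w: math.sqrt(sum(x * x for x in w.values()))
    return dot / (norm(u) * norm(v))

# Three dominant "noise" dimensions shared by both toy domains...
head = {f'noise{i}': 300 for i in range(3)}
# ...plus 90 small coordinates each, with no overlap at all between the tails.
u = {**head, **{f'u{i}': 20 for i in range(90)}}
v = {**head, **{f'v{i}': 20 for i in range(90)}}

print(cosine(u, v))  # 270000 / 306000 ≈ 0.88: the tail barely dents it
```

Here 90 of each vector's 93 dimensions contribute nothing to the dot product, yet the cosine stays close to 0.9 because the shared head carries most of the squared norm.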
Let's try and find less noisy top n-grams in the new domains we've made available.
%% Cell type:code id:b7a352d7 tags:
``` /gnu/store/2rpsj69fzmcnafz4rml0blrynfayxqzr-python-wrapper-3.9.9/bin/python
top10_2grams(maths)
```
%% Output
{('a', 'b'): 649,
('b', 'a'): 202,
('ligne', 'droite'): 193,
('a', 'a'): 149,
('1', 'degré'): 142,
('2', 'degré'): 129,
('b', 'b'): 110,
('s.', 'f.'): 110,
('angle', 'droit'): 102,
('2', '3'): 97}
%% Cell type:code id:20caa4fb tags:
``` /gnu/store/2rpsj69fzmcnafz4rml0blrynfayxqzr-python-wrapper-3.9.9/bin/python
top10_2grams(physique)
```
%% Output
{('a', 'b'): 370,
('1', 'degré'): 301,
('2', 'degré'): 298,
('ligne', 'droite'): 235,
('s.', 'm.'): 228,
('3', 'degré'): 208,
('s.', 'f.'): 197,
('m.', 'newton'): 180,
('quart', 'cercle'): 180,
('grand', 'nombre'): 166}
%% Cell type:markdown id:d00e8eac tags:
Except for `('s.', 'f.')` (and `('s.', 'm.')`, only for physics, which in itself is quite interesting: is maths vocabulary more feminine?), most of the top 10 2-grams seem actually related to their respective domains. Moreover, neither of these substantive-related 2-grams is the most frequent in its domain, so they don't carry virtually all the weight of the vectors like they did in the Architecture and Métiers domains above.
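That last claim can be checked numerically (a self-contained sketch; the counts are copied from the outputs above): taking the squared norm as the measure of a coordinate's contribution, the substantive bigrams carry over half of Architecture's weight but only a sliver of Mathématiques'.

```python
# Top 10 counts copied from the outputs above.
architecture = [325, 323, 194, 65, 56, 50, 40, 35, 30, 30]  # s.m. = 323, s.f. = 194
maths = [649, 202, 193, 149, 142, 129, 110, 110, 102, 97]   # s.f. = 110 only

def share(noisy, values):
    """Fraction of the squared norm carried by the `noisy` coordinates."""
    return sum(x * x for x in noisy) / sum(x * x for x in values)

print(share([323, 194], architecture))  # ≈ 0.54: over half the squared norm
print(share([110], maths))              # ≈ 0.02: marginal
```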
%% Cell type:code id:8a3b1aba tags:
``` /gnu/store/2rpsj69fzmcnafz4rml0blrynfayxqzr-python-wrapper-3.9.9/bin/python
compareTop10_2gram(maths, physique)
```
%% Output
Intersection: 5
Scalar product: 0.6471193972004591

n-gram                          Mathématiques         Physique - [Sciences
('s.', 'f.')                    110                   197
('1', 'degré')                  142                   301
('ligne', 'droite')             193                   235
('a', 'b')                      649                   370
('2', 'degré')                  129                   298

Norm                            776.062497483289      773.267741471219
Projected norm                  712.2885651195027     640.5770835738663
Projection ratio                0.9178237157824269    0.8284026983397814
%% Cell type:markdown id:b40c55f4 tags:
Here something very interesting appears: the same phenomenon as before occurs, but this time it seems legitimate. Indeed, their scalar product is slightly greater than what their intersection suggests (5 out of 10 is 50% or 0.50, so not too far below 0.64), but their intersection is almost entirely populated by relevant 2-grams. This behaviour is even more pronounced if we increase the size of the top ranking:
%% Cell type:code id:0fdc33ac tags:
``` /gnu/store/2rpsj69fzmcnafz4rml0blrynfayxqzr-python-wrapper-3.9.9/bin/python
compareTop100_2gram(maths, physique)
```
%% Output
Intersection: 27
Scalar product: 0.5766009312352989

n-gram                          Mathématiques         Physique - [Sciences
('a', 'donné')                  86                    122
('b', 'e')                      93                    51
('infiniment', 'petit')         64                    46
('point', 'a')                  42                    84
('où', 'ensuit')                29                    72
('5', 'degré')                  37                    82
('1', '2')                      94                    50
('1', 'degré')                  142                   301
('point', 'où')                 35                    80
('partie', 'égal')              59                    81
('a', 'a')                      149                   57
('m.', 'newton')                48                    180
('2', 'degré')                  129                   298
('2', '3')                      97                    51
('grand', 'nombre')             43                    166
('s.', 'f.')                    110                   197
('s.', 'm.')                    91                    228
('3', 'degré')                  86                    208
('ligne', 'droite')             193                   235
('a', 'point')                  57                    121
('angle', 'droit')              102                   52
('4', 'degré')                  52                    109
('sinus', 'angle')              45                    67
('ci', 'dessus')                78                    64
('e', 'f')                      44                    96
('a', 'b')                      649                   370
('quart', 'cercle')             61                    180

Norm                            923.9372273049722     1006.8286845337691
Projected norm                  791.5649057405211     839.5510705132833
Projection ratio                0.8567301785743959    0.8338569246286951
%% Cell type:markdown id:57a1ee20 tags:
Here the intersection is proportionately much smaller but the scalar product has only slightly diminished. This corresponds to the situation we previously observed between Architecture and Métiers, where the cell at their intersection in the confusion matrix will be darker using the scalar product than simply counting the number of most frequent n-grams they have in common. However, contrary to the Architecture and Métiers case, this is not due to noise but to relevant n-grams. Like Architecture and Métiers, they too share their most fundamental features, but this time those features are the ones which actually define them.
This shows how studying the colinearity of the most frequent n-grams in addition to counting the common n-grams can help refine the analysis.