{
"cells": [
{
"cell_type": "markdown",
"id": "3401c7f3",
"metadata": {},
"source": [
"# So what's wrong with scalar product vs. elements intersection ?\n",
"\n",
"A short notebook to explain the \"shift\" in classes comparisons obtained from [that other notebook](Confusion%20Matrix.ipynb). Looking at the matrices produced we observe that the scalar product, which can be seen intuitively as more restrictive a metrics than elements intersection (counting the number of elements in the intersections of two sets can be viewed as a scalar product of vectors where all elements in each set are associated to the integer `1`, so you'd expect the more nuanced scalar product to be systematically inferior to that all-or-nothing metrics), sometimes make classes appear more similar than computing the number of n-grams they have in common. Let's see why, starting from an illustration of the issue at stake:\n",
"*(figure: two confusion matrices over the EDdA domains, n-gram intersection count on the left, scalar product on the right)*\n",
"\n",
"\n",
"\n",
"Both pictures are confusion matrices comparing the top 10 2-grams of each domain in the EDdA. The one on the left merely counts the number of 2-grams common between both classes whereas the one on the right uses the scalar product computed on the vectors obtained for each domain, by associating to each one of the most frequent 2-grams found in that domain its number of occurrences.\n",
"\n",
"The cells circled in green, at the intersection between Belles-lettres and Philosophie, presents the expected behaviour: The blue circle is darker on the left, where they elements are simply counted, and appear lighter when the matrix is computed with the metrics derived from the scalar product. The one in purple (Architecture / Métiers), however, is counter-intuitive: the scalar product method makes them appear closer than they were with the elements intersection method. To understand how this is possible, let's look at the top 10 2-grams of the involved classes."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7cf51dd1",
"metadata": {},
"outputs": [],
"source": [
"from EDdA import data\n",
"from EDdA.classification import topNGrams"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "30dfceaa",
"metadata": {},
"outputs": [],
"source": [
"source = data.load('training_set')\n",
"top10_2grams = topNGrams(source, 2, 10)\n",
"\n",
"architecture = 'Architecture'\n",
"belles_lettres = 'Belles-lettres - Poésie'\n",
"metiers = 'Métiers'\n",
"philo = 'Philosophie'"
]
},
{
"cell_type": "markdown",
"id": "b6f895c4",
"metadata": {},
"source": [
"We have everything ready to display the vectors. Let's start with the Belles-lettres:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ee7a9ab9",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{('d.', 'j.'): 485,\n",
" ('s.', 'm.'): 196,\n",
" ('s.', 'f.'): 145,\n",
" ('chez', 'romain'): 71,\n",
" ('a', 'point'): 67,\n",
" ('-t', '-il'): 62,\n",
" ('1', 'degré'): 58,\n",
" ('grand', 'nombre'): 57,\n",
" ('sans', 'doute'): 54,\n",
" ('sou', 'nom'): 54}"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"top10_2grams(belles_lettres)"
]
},
{
"cell_type": "markdown",
"id": "3dcc0f86",
"metadata": {},
"source": [
"We first notice the occurrences of the 2-gram ('s.', 'm.') and ('s.', 'f.') for \"substantif masculin\" and \"substantif féminin\", which occur very frequently at the begining of article. Those are of course unwanted, irrelevant bigrams but they haven't been filtered out (yet ?). Let's look at the Philosophie domain now:"
]
},
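{
"cell_type": "markdown",
"id": "a1f2e3d4",
"metadata": {},
"source": [
"As an aside, here is a minimal sketch of how such noisy bigrams could be dropped before comparison. The stoplist is a handpicked assumption for illustration, not something the EDdA library is known to provide:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b2c3d4e5",
"metadata": {},
"outputs": [],
"source": [
"# hypothetical stoplist of editorial-noise bigrams\n",
"NOISY_BIGRAMS = {('s.', 'm.'), ('s.', 'f.'), ('d.', 'j.')}\n",
"\n",
"def withoutNoise(vector):\n",
"    # drop the coordinates corresponding to known noise\n",
"    return {k: v for k, v in vector.items() if k not in NOISY_BIGRAMS}\n",
"\n",
"withoutNoise(top10_2grams(belles_lettres))"
]
},
{
"cell_type": "markdown",
"id": "c3d4e5f6",
"metadata": {},
"source": [
"Now let's look at the Philosophie domain:"
]
},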
{
"cell_type": "code",
"execution_count": null,
"id": "d3397e87",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{('a', 'point'): 191,\n",
" ('s.', 'f.'): 142,\n",
" ('1', 'degré'): 136,\n",
" ('2', 'degré'): 131,\n",
" ('-t', '-il'): 131,\n",
" ('grand', 'nombre'): 116,\n",
" ('dieu', 'a'): 100,\n",
" ('sans', 'doute'): 89,\n",
" ('3', 'degré'): 88,\n",
" ('d.', 'j.'): 82}"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"top10_2grams(philo)"
]
},
{
"cell_type": "markdown",
"id": "41d147a0",
"metadata": {},
"source": [
"Interestingly enough, the Philosophie domain seems to comprise much fewer masculine substantives, so that ('s.', 'm.') didn't even make the top 10. The ('d.', 'j.') bigram is there too (probably the signature of an author ?), as well as ('-t', '-il'), ('1', 'degré'), ('a', 'point'), ('grand', 'nombre') and ('sans', 'doute'). Quite a populated intersection which account for the relatively dark blue patch in the matrix on the left (7/10)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "320cf92e",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{('-t', '-il'),\n",
" ('1', 'degré'),\n",
" ('a', 'point'),\n",
" ('d.', 'j.'),\n",
" ('grand', 'nombre'),\n",
" ('s.', 'f.'),\n",
" ('sans', 'doute')}"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"set(top10_2grams(belles_lettres)).intersection(top10_2grams(philo))"
]
},
{
"cell_type": "markdown",
"id": "6747e51b",
"metadata": {},
"source": [
"Now if we look at their (normalized) scalar product, though, the result is pretty average:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "911d6a17",
"metadata": {},
"outputs": [],
"source": [
"from EDdA.classification import colinearity"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0fa76ac3",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.4508503933694939"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"colinearity(top10_2grams(belles_lettres), top10_2grams(philo))"
]
},
{
"cell_type": "markdown",
"id": "36532f01",
"metadata": {},
"source": [
"Indeed, if we look at the cardinalities associated to each 2-gram they share, they aren't synchronized but rather in phase opposition instead:\n",
"\n",
"- the most frequent in belles_lettres, ('d.', 'j.') (485) is the least frequent in philo (82)\n",
"- the most frequent in philo, ('a', 'point') (191) is rather low in belles_lettres (67)\n",
"- the trend is general to the other 2-grams, except maybe ('s.', 'f.') which scores about the same in both domains\n",
"\n",
"As a result, their contributions, if they do not entirely cancel out, at least do not manage to produce a dot product value high enough to overcome their norms, and the normalized scalar product is not very high."
]
},
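{
"cell_type": "markdown",
"id": "d4e5f6a7",
"metadata": {},
"source": [
"To make the computation concrete, here is a minimal sketch of a normalized scalar product over such count dictionaries. It assumes `colinearity` is essentially a cosine similarity on sparse count vectors, which matches the value obtained above but is not taken from the library's actual source:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e5f6a7b8",
"metadata": {},
"outputs": [],
"source": [
"import math\n",
"\n",
"def cosine(vector1, vector2):\n",
"    # the dot product only involves the keys both dictionaries share\n",
"    dot = sum(vector1[k] * vector2[k] for k in set(vector1) & set(vector2))\n",
"    # each norm, however, uses all of a vector's own coordinates\n",
"    norm1 = math.sqrt(sum(v * v for v in vector1.values()))\n",
"    norm2 = math.sqrt(sum(v * v for v in vector2.values()))\n",
"    return dot / (norm1 * norm2)\n",
"\n",
"cosine(top10_2grams(belles_lettres), top10_2grams(philo))"
]
},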
{
"cell_type": "markdown",
"id": "e1fccadd",
"metadata": {},
"source": [
"Now looking at the other pair reveals a different story:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a26a0f41",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{('d.', 'j.'): 325,\n",
" ('s.', 'm.'): 323,\n",
" ('s.', 'f.'): 194,\n",
" ('daviler', 'd.'): 65,\n",
" ('plate', 'bande'): 56,\n",
" ('vers', 'act'): 50,\n",
" ('piece', 'bois'): 40,\n",
" ('pierre', 'dure'): 35,\n",
" ('porte', 'croisée'): 30,\n",
" ('a', 'donné'): 30}"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"top10_2grams(architecture)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ad912f85",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{('d.', 'j.'): 631,\n",
" ('s.', 'm.'): 612,\n",
" ('s.', 'f.'): 438,\n",
" ('morceau', 'bois'): 200,\n",
" ('a', 'b'): 164,\n",
" ('pieces', 'bois'): 161,\n",
" ('piece', 'bois'): 155,\n",
" ('fig', '1'): 139,\n",
" ('fig', '2'): 139,\n",
" ('or', 'argent'): 137}"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"top10_2grams(metiers)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4d8495df",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{('d.', 'j.'), ('piece', 'bois'), ('s.', 'f.'), ('s.', 'm.')}"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"set(top10_2grams(architecture)).intersection(top10_2grams(metiers))"
]
},
{
"cell_type": "markdown",
"id": "a75a2b67",
"metadata": {},
"source": [
"They only have four 2-grams in commons, but now three out of the four are the top 3 of each domain. That is to say, the \"coordinates\" that contribute the most to their norms are (almost all) the ones they have in common. Moreover, even the distribution in that top 3 is similar from one domain to the other:\n",
"\n",
"- they are in the same order: ('d.', 'j.'), ('s.', 'm.'), then ('s.', 'f.')\n",
"- their frequencies are almost proportional, with a ratio close to 0.5"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8ba0d6e3",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[0.5150554675118859, 0.5277777777777778, 0.4429223744292237]"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"list(map(lambda p: p[0] / p[1], zip([325, 323, 194], [631, 612, 438])))"
]
},
{
"cell_type": "markdown",
"id": "7919c2c6",
"metadata": {},
"source": [
"For these reasons, their scalar product is much closer to 1:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6d9bffda",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.9041105011801019"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"colinearity(top10_2grams(architecture), top10_2grams(metiers))"
]
},
{
"cell_type": "markdown",
"id": "e63ef397",
"metadata": {},
"source": [
"Of course, looking at these top 3 2-grams themselves is also very telling: they are not specific at all, there are the two substantive one we've previously noticed and the possible signature ('d.', 'j.'), so in fact, in addition to being common between Architecture and Métier, they are also shared by Belles-lettres, and two of them are found in Philosophie. They are rather like noise, found in apparently many domains.\n",
"\n",
"In any case, we have an explanation of how this counter-intuitive situation can occur: in terms of algebra, they appear to live in very different (10-D) spaces, but they actually have very low components on the 7 dimensions they don't share, on which they hardly depart from their main, 3-D components, which they share. In that 3-D space where they almost live, they have in addition a very similar direction, and thus their scalar product is very high.\n",
"\n",
"Put more simply, it means that both vectors have very little in common, but what they have in common outweights the rest, so that their most important feature is what they share. Let's generalize this approach by defining a generic routine to perform the comparison, evaluating the number of n-grams two domains share, their scalar product, and the contribution of each common n-gram to the norm of the vector.\n",
"\n",
"We'll first need to be able to project a vector on a given space (in practice, we'll use it to project on the n-grams it shares with another vector, that is, their \"common subspace\")."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3254c7c6",
"metadata": {},
"outputs": [],
"source": [
"def projector(vector, base):\n",
" return dict([(k, vector[k]) for k in base if k in vector])"
]
},
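{
"cell_type": "markdown",
"id": "f6a7b8c9",
"metadata": {},
"source": [
"A quick sanity check on toy dictionaries (values chosen arbitrarily for illustration):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a7b8c9d0",
"metadata": {},
"outputs": [],
"source": [
"# only the coordinates listed in the base survive the projection\n",
"projector({'a': 3, 'b': 5, 'c': 1}, {'a', 'c', 'd'})"
]
},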
{
"cell_type": "markdown",
"id": "c950da4c",
"metadata": {},
"source": [
"Given a vectorizer (`top10_2grams` in the above for instance), we can now define our main tool:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c69b90e6",
"metadata": {},
"outputs": [],
"source": [
"from EDdA.classification.classSimilarities import norm\n",
"\n",
"def studyIntersection(vectorizer):\n",
" def compare(domain1, domain2):\n",
" vector1, vector2 = vectorizer(domain1), vectorizer(domain2)\n",
" intersection = set(vector1).intersection(vector2)\n",
" projected1, projected2 = projector(vector1, intersection), projector(vector2, intersection)\n",
" print(f\"Intersection: {len(intersection)}\")\n",
" print(f\"Scalar product: {colinearity(vector1, vector2)}\\n\")\n",
" print(f\"{'n-gram': <30}\\t{domain1[:20]: <20}\\t{domain2[:20]: <20}\")\n",
" for ngram in intersection:\n",
" print(f\"{str(ngram)[:30]: <30}\\t{vector1[ngram]: <20}\\t{vector2[ngram]: <20}\")\n",
" norm1, norm2 = norm(vector1), norm(vector2)\n",
" projNorm1, projNorm2 = norm(projected1), norm(projected2)\n",
" print(\"\")\n",
" print(f\"{'Norm': <30}\\t{norm1: <20}\\t{norm2: <20}\")\n",
" print(f\"{'Projected norm': <30}\\t{projNorm1: <20}\\t{projNorm2: <20}\")\n",
" print(f\"{'Projection ratio': <30}\\t{projNorm1 / norm1: <20}\\t{projNorm2 / norm2: <20}\")\n",
" return compare"
]
},
{
"cell_type": "markdown",
"id": "daef48e6",
"metadata": {},
"source": [
"So let us declare other vectorizers to be able to study the influence of the size of n-grams and the number of ranks kept for comparison."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d152e0fc",
"metadata": {},
"outputs": [],
"source": [
"top100_2grams = topNGrams(source, 2, 100)\n",
"top10_3grams = topNGrams(source, 3, 10)\n",
"top100_3grams = topNGrams(source, 3, 100)"
]
},
{
"cell_type": "markdown",
"id": "cd10a9f6",
"metadata": {},
"source": [
"Also, let's work on more domains:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ea9174b6",
"metadata": {},
"outputs": [],
"source": [
"maths = 'Mathématiques'\n",
"mesure = 'Mesure'\n",
"physique = 'Physique - [Sciences physico-mathématiques]'"
]
},
{
"cell_type": "markdown",
"id": "6e6b8cbd",
"metadata": {},
"source": [
"We can get a different view of the situation we've studied above with our new tool."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c4ed3787",
"metadata": {},
"outputs": [],
"source": [
"compareTop10_2gram = studyIntersection(top10_2grams)\n",
"compareTop100_2gram = studyIntersection(top100_2grams)\n",
"compareTop10_3gram = studyIntersection(top10_3grams)\n",
"compareTop100_3gram = studyIntersection(top100_3grams)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7abde2b3",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Intersection: 7\n",
"Scalar product: 0.4508503933694939\n",
"\n",
"n-gram \tBelles-lettres - Poé\tPhilosophie \n",
"('-t', '-il') \t62 \t131 \n",
"('s.', 'f.') \t145 \t142 \n",
"('sans', 'doute') \t54 \t89 \n",
"('d.', 'j.') \t485 \t82 \n",
"('a', 'point') \t67 \t191 \n",
"('grand', 'nombre') \t57 \t116 \n",
"('1', 'degré') \t58 \t136 \n",
"\n",
"Norm \t566.1139461274558 \t394.0913599661886 \n",
"Projected norm \t523.5570647025976 \t346.991354359154 \n",
"Projection ratio \t0.9248262973983034 \t0.8804845515743467 \n"
]
}
],
"source": [
"compareTop10_2gram(belles_lettres, philo)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2621ff10",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Intersection: 4\n",
"Scalar product: 0.9041105011801019\n",
"\n",
"n-gram \tArchitecture \tMétiers \n",
"('s.', 'm.') \t323 \t612 \n",
"('d.', 'j.') \t325 \t631 \n",
"('piece', 'bois') \t40 \t155 \n",
"('s.', 'f.') \t194 \t438 \n",
"\n",
"Norm \t511.93358944300576 \t1067.1466628350574 \n",
"Projected norm \t499.18934283496077 \t994.2705869128383 \n",
"Projection ratio \t0.9751056643462075 \t0.9317094093434061 \n"
]
}
],
"source": [
"compareTop10_2gram(architecture, metiers)"
]
},
{
"cell_type": "markdown",
"id": "3559d324",
"metadata": {},
"source": [
"We see two cases perfectly illustrated:\n",
"\n",
"1. Belles-lettres and philo live in a very similar space but are differently oriented, so they have a low normalized scalar-product\n",
"2. Architecture and Métiers spread on more separate subspaces, but are almost entirely concentrated on the 4 dimensions they share\n",
"\n",
"But does increasing the number of top ranks efficiently weakens the effect of the noise ?"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "667fde5c",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Intersection: 22\n",
"Scalar product: 0.7998536585283141\n",
"\n",
"n-gram \tArchitecture \tMétiers \n",
"('haut', 'bas') \t11 \t66 \n",
"('m.', 'pl') \t29 \t51 \n",
"('barre', 'fer') \t20 \t70 \n",
"('a', 'plusieurs') \t14 \t45 \n",
"('vers', 'act') \t50 \t110 \n",
"('endroit', 'où') \t29 \t87 \n",
"('sou', 'nom') \t29 \t56 \n",
"('s.', 'm.') \t323 \t612 \n",
"('m.', 'espece') \t14 \t40 \n",
"('piece', 'bois') \t40 \t155 \n",
"('où', 'a') \t26 \t93 \n",
"('chaque', 'côté') \t13 \t73 \n",
"('pouce', 'épaisseur') \t15 \t52 \n",
"('a', 'point') \t17 \t83 \n",
"('a', 'b') \t22 \t164 \n",
"('1', 'degré') \t14 \t95 \n",
"('a', 'donné') \t30 \t49 \n",
"('s.', 'f.') \t194 \t438 \n",
"('pieces', 'bois') \t26 \t161 \n",
"('d.', 'j.') \t325 \t631 \n",
"('grand', 'nombre') \t18 \t67 \n",
"('2', 'degré') \t13 \t95 \n",
"\n",
"Norm \t535.4941643006018 \t1230.2991506133783 \n",
"Projected norm \t509.15027251293895 \t1062.1624169589131 \n",
"Projection ratio \t0.9508045212368091 \t0.8633367067102022 \n"
]
}
],
"source": [
"compareTop100_2gram(architecture, metiers)"
]
},
{
"cell_type": "markdown",
"id": "ed4c2139",
"metadata": {},
"source": [
"Of course, the top10 is included in the top100, so the most prominent n-grams remain unchanged. What the above shows is that the \"long-tail\" of less-frequent 2-grams isn't enough to counter-balance the main, noisy, dimensions.\n",
"\n",
"Let's try and find less noisy top n-grams in the new domains we've made available."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b7a352d7",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{('a', 'b'): 649,\n",
" ('b', 'a'): 202,\n",
" ('ligne', 'droite'): 193,\n",
" ('a', 'a'): 149,\n",
" ('1', 'degré'): 142,\n",
" ('2', 'degré'): 129,\n",
" ('b', 'b'): 110,\n",
" ('s.', 'f.'): 110,\n",
" ('angle', 'droit'): 102,\n",
" ('2', '3'): 97}"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"top10_2grams(maths)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "20caa4fb",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{('a', 'b'): 370,\n",
" ('1', 'degré'): 301,\n",
" ('2', 'degré'): 298,\n",
" ('ligne', 'droite'): 235,\n",
" ('s.', 'm.'): 228,\n",
" ('3', 'degré'): 208,\n",
" ('s.', 'f.'): 197,\n",
" ('m.', 'newton'): 180,\n",
" ('quart', 'cercle'): 180,\n",
" ('grand', 'nombre'): 166}"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"top10_2grams(physique)"
]
},
{
"cell_type": "markdown",
"id": "d00e8eac",
"metadata": {},
"source": [
"Except for `('s.', 'f.')` (and `('s.', 'm.')` only for physics, which in itself is quite interesting — is maths vocabulary more feminine ?), most of the top 10 2-grams seem actually related to their respective domains. Moreover, neither of these substantive-related 2-grams are the most frequent in their domains, so they don't carry virtually all the weight of the vectors like they used to in the Architecture and Métiers domains above."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8a3b1aba",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Intersection: 5\n",
"Scalar product: 0.6471193972004591\n",
"\n",
"n-gram \tMathématiques \tPhysique - [Sciences\n",
"('s.', 'f.') \t110 \t197 \n",
"('1', 'degré') \t142 \t301 \n",
"('ligne', 'droite') \t193 \t235 \n",
"('a', 'b') \t649 \t370 \n",
"('2', 'degré') \t129 \t298 \n",
"\n",
"Norm \t776.062497483289 \t773.267741471219 \n",
"Projected norm \t712.2885651195027 \t640.5770835738663 \n",
"Projection ratio \t0.9178237157824269 \t0.8284026983397814 \n"
]
}
],
"source": [
"compareTop10_2gram(maths, physique)"
]
},
{
"cell_type": "markdown",
"id": "b40c55f4",
"metadata": {},
"source": [
"Here something very interesting appears: the same phenomenon as before occurs, but it seems legitimate this time. Indeed, their scalar product is slightly greater than their intersection (5 out of 10 is 50% or 0.50, so not too far below 0.64), but their intersection is almost entirely populated by relevant 2-grams. This behaviour is even more pronounced if we increase the size of the top ranking:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0fdc33ac",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Intersection: 27\n",
"Scalar product: 0.5766009312352989\n",
"\n",
"n-gram \tMathématiques \tPhysique - [Sciences\n",
"('a', 'donné') \t86 \t122 \n",
"('b', 'e') \t93 \t51 \n",
"('infiniment', 'petit') \t64 \t46 \n",
"('point', 'a') \t42 \t84 \n",
"('où', 'ensuit') \t29 \t72 \n",
"('5', 'degré') \t37 \t82 \n",
"('1', '2') \t94 \t50 \n",
"('1', 'degré') \t142 \t301 \n",
"('point', 'où') \t35 \t80 \n",
"('partie', 'égal') \t59 \t81 \n",
"('a', 'a') \t149 \t57 \n",
"('m.', 'newton') \t48 \t180 \n",
"('2', 'degré') \t129 \t298 \n",
"('2', '3') \t97 \t51 \n",
"('grand', 'nombre') \t43 \t166 \n",
"('s.', 'f.') \t110 \t197 \n",
"('s.', 'm.') \t91 \t228 \n",
"('3', 'degré') \t86 \t208 \n",
"('ligne', 'droite') \t193 \t235 \n",
"('a', 'point') \t57 \t121 \n",
"('angle', 'droit') \t102 \t52 \n",
"('4', 'degré') \t52 \t109 \n",
"('sinus', 'angle') \t45 \t67 \n",
"('ci', 'dessus') \t78 \t64 \n",
"('e', 'f') \t44 \t96 \n",
"('a', 'b') \t649 \t370 \n",
"('quart', 'cercle') \t61 \t180 \n",
"\n",
"Norm \t923.9372273049722 \t1006.8286845337691 \n",
"Projected norm \t791.5649057405211 \t839.5510705132833 \n",
"Projection ratio \t0.8567301785743959 \t0.8338569246286951 \n"
]
}
],
"source": [
"compareTop100_2gram(maths, physique)"
]
},
{
"cell_type": "markdown",
"id": "57a1ee20",
"metadata": {},
"source": [
"Here the intersection is proportionately much smaller but the scalar product has only slightly diminshed, corresponding to the situation we had previously observed between Architecture and Métiers where the cell at their intersection in the confusion matrix will be darker using the scalar product instead of simply counting the number of most frequent n-grams they have in common. However, contrary to the explanation between Architecture and Métiers, this is not due to noise but to relevant n-grams. Like Architecture and Métiers, they too share their most fundamental features, but this time they are the ones which actually define them.\n",
"\n",
"This shows how studying the colinearity of the most frequent n-grams in addition to counting the common n-grams can help refine the analysis."
]
}
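,
{
"cell_type": "markdown",
"id": "b8c9d0e1",
"metadata": {},
"source": [
"To close, here is a small sketch, reusing only the helpers already defined above, that prints both metrics side by side for every pair of domains studied in this notebook, making the divergences between the two easy to scan:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c9d0e1f2",
"metadata": {},
"outputs": [],
"source": [
"from itertools import combinations\n",
"\n",
"domains = [architecture, belles_lettres, metiers, philo, maths, physique]\n",
"\n",
"for domain1, domain2 in combinations(domains, 2):\n",
"    vector1, vector2 = top10_2grams(domain1), top10_2grams(domain2)\n",
"    # share of the top 10 2-grams found in both domains\n",
"    shared = len(set(vector1).intersection(vector2)) / 10\n",
"    print(f\"{domain1[:20]: <20}\\t{domain2[:20]: <20}\\t{shared:.1f}\\t{colinearity(vector1, vector2):.2f}\")"
]
}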
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "/gnu/store/2rpsj69fzmcnafz4rml0blrynfayxqzr-python-wrapper-3.9.9/bin/python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.9"
}
},
"nbformat": 4,
"nbformat_minor": 5
}