From d7e9f12b3a2da61db935b3bdefef3429fab92863 Mon Sep 17 00:00:00 2001
From: Alice BRENON <alice.brenon@ens-lyon.fr>
Date: Thu, 24 Mar 2022 16:01:01 +0100
Subject: [PATCH] Finish the analysis of scalar products vs. common elements
 count

---
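Note for reviewers: the figures printed by `compareTop10_2gram(maths, physique)` in this patch can be cross-checked with the minimal sketch below. It is not the notebook's implementation: `norm` and `compare` are hypothetical helpers written for this note, and the counts are copied from the top-10 output cells added here. It assumes that the vectors hold raw 2-gram counts, that the scalar product is normalized by both full norms, and that the projection ratio is the norm restricted to the shared 2-grams divided by the full norm.

```python
from math import sqrt

# Top-10 2-gram counts copied from the output cells added by this patch.
maths = {
    ('a', 'b'): 649, ('b', 'a'): 202, ('ligne', 'droite'): 193,
    ('a', 'a'): 149, ('1', 'degré'): 142, ('2', 'degré'): 129,
    ('b', 'b'): 110, ('s.', 'f.'): 110, ('angle', 'droit'): 102,
    ('2', '3'): 97,
}
physique = {
    ('a', 'b'): 370, ('1', 'degré'): 301, ('2', 'degré'): 298,
    ('ligne', 'droite'): 235, ('s.', 'm.'): 228, ('3', 'degré'): 208,
    ('s.', 'f.'): 197, ('m.', 'newton'): 180, ('quart', 'cercle'): 180,
    ('grand', 'nombre'): 166,
}


def norm(vector, keys=None):
    """Euclidean norm of a count vector, optionally restricted to `keys`."""
    keys = vector.keys() if keys is None else keys
    return sqrt(sum(vector[k] ** 2 for k in keys))


def compare(left, right):
    """Hypothetical helper: intersection size, normalized scalar product
    and projection ratios of two {n-gram: count} dictionaries."""
    shared = left.keys() & right.keys()
    # Only shared n-grams contribute to the dot product.
    dot = sum(left[k] * right[k] for k in shared)
    return {
        'intersection': len(shared),
        'scalar_product': dot / (norm(left) * norm(right)),
        'projection_ratio': (
            norm(left, shared) / norm(left),
            norm(right, shared) / norm(right),
        ),
    }


result = compare(maths, physique)
print(result)
```

With these inputs the sketch gives an intersection of 5, a normalized scalar product of about 0.647 and projection ratios of about 0.918 and 0.828, in line with the corresponding output cell.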
 ...lar_product_vs_elements_intersection.ipynb | 180 +++++++++++++++++-
 1 file changed, 175 insertions(+), 5 deletions(-)

diff --git a/notebooks/Scalar_product_vs_elements_intersection.ipynb b/notebooks/Scalar_product_vs_elements_intersection.ipynb
index 830714c..4a04604 100644
--- a/notebooks/Scalar_product_vs_elements_intersection.ipynb
+++ b/notebooks/Scalar_product_vs_elements_intersection.ipynb
@@ -468,7 +468,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 15,
+   "execution_count": null,
    "id": "c4ed3787",
    "metadata": {},
    "outputs": [],
@@ -513,7 +513,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 17,
+   "execution_count": null,
    "id": "2621ff10",
    "metadata": {},
    "outputs": [
@@ -547,7 +547,7 @@
    "source": [
     "We see two cases perfectly illustrated:\n",
     "\n",
-    "1. Belles-lettres and philo live in a very similar but are differently oriented, so they have a low normalized scalar-product\n",
+    "1. Belles-lettres and philo live in a very similar space but are differently oriented, so they have a low normalized scalar-product\n",
     "2. Architecture and Métiers spread on more separate subspaces, but are almost entirely concentrated on the 4 dimensions they share\n",
     "\n",
     "But does increasing the number of top ranks efficiently weakens the effect of the noise ?"
@@ -555,7 +555,7 @@
   },
   {
    "cell_type": "code",
-   "execution_count": 18,
+   "execution_count": null,
    "id": "667fde5c",
    "metadata": {},
    "outputs": [
@@ -605,7 +605,177 @@
    "id": "ed4c2139",
    "metadata": {},
    "source": [
-    "Of course, the top10 is included in the top100, so the most prominent n-grams remain unchanged. What the above shows is that the \"long-tail\" of less-frequent 2-grams isn't enough to counter-balance the main, noisy, dimensions."
+    "Of course, the top10 is included in the top100, so the most prominent n-grams remain unchanged. What the above shows is that the \"long-tail\" of less-frequent 2-grams isn't enough to counter-balance the main, noisy, dimensions.\n",
+    "\n",
+    "Let's try and find less noisy top n-grams in the new domains we've made available."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "b7a352d7",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "{('a', 'b'): 649,\n",
+       " ('b', 'a'): 202,\n",
+       " ('ligne', 'droite'): 193,\n",
+       " ('a', 'a'): 149,\n",
+       " ('1', 'degré'): 142,\n",
+       " ('2', 'degré'): 129,\n",
+       " ('b', 'b'): 110,\n",
+       " ('s.', 'f.'): 110,\n",
+       " ('angle', 'droit'): 102,\n",
+       " ('2', '3'): 97}"
+      ]
+     },
+     "execution_count": null,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "top10_2grams(maths)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "20caa4fb",
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "{('a', 'b'): 370,\n",
+       " ('1', 'degré'): 301,\n",
+       " ('2', 'degré'): 298,\n",
+       " ('ligne', 'droite'): 235,\n",
+       " ('s.', 'm.'): 228,\n",
+       " ('3', 'degré'): 208,\n",
+       " ('s.', 'f.'): 197,\n",
+       " ('m.', 'newton'): 180,\n",
+       " ('quart', 'cercle'): 180,\n",
+       " ('grand', 'nombre'): 166}"
+      ]
+     },
+     "execution_count": null,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "top10_2grams(physique)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "d00e8eac",
+   "metadata": {},
+   "source": [
+    "Except for `('s.', 'f.')` (and `('s.', 'm.')`, which appears only for physics, quite interesting in itself: is maths vocabulary more feminine?), most of the top 10 2-grams actually seem related to their respective domains. Moreover, neither of these substantive-related 2-grams is the most frequent in its domain, so they no longer carry virtually all the weight of the vectors as they did in the Architecture and Métiers domains above."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "8a3b1aba",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Intersection: 5\n",
+      "Scalar product: 0.6471193972004591\n",
+      "\n",
+      "n-gram                        \tMathématiques       \tPhysique - [Sciences\n",
+      "('s.', 'f.')                  \t110                 \t197                 \n",
+      "('1', 'degré')                \t142                 \t301                 \n",
+      "('ligne', 'droite')           \t193                 \t235                 \n",
+      "('a', 'b')                    \t649                 \t370                 \n",
+      "('2', 'degré')                \t129                 \t298                 \n",
+      "\n",
+      "Norm                          \t776.062497483289    \t773.267741471219    \n",
+      "Projected norm                \t712.2885651195027   \t640.5770835738663   \n",
+      "Projection ratio              \t0.9178237157824269  \t0.8284026983397814  \n"
+     ]
+    }
+   ],
+   "source": [
+    "compareTop10_2gram(maths, physique)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "b40c55f4",
+   "metadata": {},
+   "source": [
+    "Here something very interesting appears: the same phenomenon as before occurs, but it seems legitimate this time. Indeed, the scalar product (about 0.65) is somewhat greater than the share of common 2-grams (5 out of 10, i.e. 0.50), but the intersection is almost entirely populated by relevant 2-grams. This behaviour is even more pronounced if we increase the size of the top ranking:"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": null,
+   "id": "0fdc33ac",
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Intersection: 27\n",
+      "Scalar product: 0.5766009312352989\n",
+      "\n",
+      "n-gram                        \tMathématiques       \tPhysique - [Sciences\n",
+      "('a', 'donné')                \t86                  \t122                 \n",
+      "('b', 'e')                    \t93                  \t51                  \n",
+      "('infiniment', 'petit')       \t64                  \t46                  \n",
+      "('point', 'a')                \t42                  \t84                  \n",
+      "('où', 'ensuit')              \t29                  \t72                  \n",
+      "('5', 'degré')                \t37                  \t82                  \n",
+      "('1', '2')                    \t94                  \t50                  \n",
+      "('1', 'degré')                \t142                 \t301                 \n",
+      "('point', 'où')               \t35                  \t80                  \n",
+      "('partie', 'égal')            \t59                  \t81                  \n",
+      "('a', 'a')                    \t149                 \t57                  \n",
+      "('m.', 'newton')              \t48                  \t180                 \n",
+      "('2', 'degré')                \t129                 \t298                 \n",
+      "('2', '3')                    \t97                  \t51                  \n",
+      "('grand', 'nombre')           \t43                  \t166                 \n",
+      "('s.', 'f.')                  \t110                 \t197                 \n",
+      "('s.', 'm.')                  \t91                  \t228                 \n",
+      "('3', 'degré')                \t86                  \t208                 \n",
+      "('ligne', 'droite')           \t193                 \t235                 \n",
+      "('a', 'point')                \t57                  \t121                 \n",
+      "('angle', 'droit')            \t102                 \t52                  \n",
+      "('4', 'degré')                \t52                  \t109                 \n",
+      "('sinus', 'angle')            \t45                  \t67                  \n",
+      "('ci', 'dessus')              \t78                  \t64                  \n",
+      "('e', 'f')                    \t44                  \t96                  \n",
+      "('a', 'b')                    \t649                 \t370                 \n",
+      "('quart', 'cercle')           \t61                  \t180                 \n",
+      "\n",
+      "Norm                          \t923.9372273049722   \t1006.8286845337691  \n",
+      "Projected norm                \t791.5649057405211   \t839.5510705132833   \n",
+      "Projection ratio              \t0.8567301785743959  \t0.8338569246286951  \n"
+     ]
+    }
+   ],
+   "source": [
+    "compareTop100_2gram(maths, physique)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "id": "57a1ee20",
+   "metadata": {},
+   "source": [
+    "Here the intersection is proportionately much smaller (27 out of 100) but the scalar product has only slightly diminished (from about 0.65 to about 0.58). This corresponds to the situation we previously observed between Architecture and Métiers, where the cell at their intersection in the confusion matrix is darker when using the scalar product than when simply counting the most frequent n-grams they have in common. However, unlike the Architecture and Métiers case, this is not due to noise but to relevant n-grams. Like Architecture and Métiers, Mathématiques and Physique share their most fundamental features, but this time those features are the ones which actually define them.\n",
+    "\n",
+    "This shows how studying the collinearity of the most frequent n-grams, in addition to counting the common n-grams, can help refine the analysis."
    ]
   }
  ],
-- 
GitLab