From d7e9f12b3a2da61db935b3bdefef3429fab92863 Mon Sep 17 00:00:00 2001 From: Alice BRENON <alice.brenon@ens-lyon.fr> Date: Thu, 24 Mar 2022 16:01:01 +0100 Subject: [PATCH] Finish the analysis of scalar products vs. common elements count --- ...lar_product_vs_elements_intersection.ipynb | 180 +++++++++++++++++- 1 file changed, 175 insertions(+), 5 deletions(-) diff --git a/notebooks/Scalar_product_vs_elements_intersection.ipynb b/notebooks/Scalar_product_vs_elements_intersection.ipynb index 830714c..4a04604 100644 --- a/notebooks/Scalar_product_vs_elements_intersection.ipynb +++ b/notebooks/Scalar_product_vs_elements_intersection.ipynb @@ -468,7 +468,7 @@ }, { "cell_type": "code", - "execution_count": 15, + "execution_count": null, "id": "c4ed3787", "metadata": {}, "outputs": [], @@ -513,7 +513,7 @@ }, { "cell_type": "code", - "execution_count": 17, + "execution_count": null, "id": "2621ff10", "metadata": {}, "outputs": [ @@ -547,7 +547,7 @@ "source": [ "We see two cases perfectly illustrated:\n", "\n", - "1. Belles-lettres and philo live in a very similar but are differently oriented, so they have a low normalized scalar-product\n", + "1. Belles-lettres and philo live in a very similar space but are differently oriented, so they have a low normalized scalar-product\n", "2. Architecture and Métiers spread on more separate subspaces, but are almost entirely concentrated on the 4 dimensions they share\n", "\n", "But does increasing the number of top ranks efficiently weakens the effect of the noise ?" @@ -555,7 +555,7 @@ }, { "cell_type": "code", - "execution_count": 18, + "execution_count": null, "id": "667fde5c", "metadata": {}, "outputs": [ @@ -605,7 +605,177 @@ "id": "ed4c2139", "metadata": {}, "source": [ - "Of course, the top10 is included in the top100, so the most prominent n-grams remain unchanged. What the above shows is that the \"long-tail\" of less-frequent 2-grams isn't enough to counter-balance the main, noisy, dimensions." 
+ "Of course, the top10 is included in the top100, so the most prominent n-grams remain unchanged. What the above shows is that the \"long-tail\" of less-frequent 2-grams isn't enough to counter-balance the main, noisy, dimensions.\n", + "\n", + "Let's try and find less noisy top n-grams in the new domains we've made available." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b7a352d7", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{('a', 'b'): 649,\n", + " ('b', 'a'): 202,\n", + " ('ligne', 'droite'): 193,\n", + " ('a', 'a'): 149,\n", + " ('1', 'degré'): 142,\n", + " ('2', 'degré'): 129,\n", + " ('b', 'b'): 110,\n", + " ('s.', 'f.'): 110,\n", + " ('angle', 'droit'): 102,\n", + " ('2', '3'): 97}" + ] + }, + "execution_count": null, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "top10_2grams(maths)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "20caa4fb", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{('a', 'b'): 370,\n", + " ('1', 'degré'): 301,\n", + " ('2', 'degré'): 298,\n", + " ('ligne', 'droite'): 235,\n", + " ('s.', 'm.'): 228,\n", + " ('3', 'degré'): 208,\n", + " ('s.', 'f.'): 197,\n", + " ('m.', 'newton'): 180,\n", + " ('quart', 'cercle'): 180,\n", + " ('grand', 'nombre'): 166}" + ] + }, + "execution_count": null, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "top10_2grams(physique)" + ] + }, + { + "cell_type": "markdown", + "id": "d00e8eac", + "metadata": {}, + "source": [ + "Except for `('s.', 'f.')` (and `('s.', 'm.')` only for physics, which in itself is quite interesting — is maths vocabulary more feminine ?), most of the top 10 2-grams seem actually related to their respective domains. 
Moreover, neither of these substantive-related 2-grams is the most frequent in its domain, so they no longer carry virtually all the weight of the vectors as they did in the Architecture and Métiers domains above." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8a3b1aba", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Intersection: 5\n", + "Scalar product: 0.6471193972004591\n", + "\n", + "n-gram \tMathématiques \tPhysique - [Sciences\n", + "('s.', 'f.') \t110 \t197 \n", + "('1', 'degré') \t142 \t301 \n", + "('ligne', 'droite') \t193 \t235 \n", + "('a', 'b') \t649 \t370 \n", + "('2', 'degré') \t129 \t298 \n", + "\n", + "Norm \t776.062497483289 \t773.267741471219 \n", + "Projected norm \t712.2885651195027 \t640.5770835738663 \n", + "Projection ratio \t0.9178237157824269 \t0.8284026983397814 \n" + ] + } + ], + "source": [ + "compareTop10_2gram(maths, physique)" + ] + }, + { + "cell_type": "markdown", + "id": "b40c55f4", + "metadata": {}, + "source": [ + "Here something very interesting appears: the same phenomenon as before occurs, but it seems legitimate this time. Indeed, their normalized scalar product (about 0.65) is somewhat greater than their intersection ratio (5 out of 10, i.e. 0.50), but their intersection is almost entirely populated by relevant 2-grams. 
This behaviour is even more pronounced if we increase the size of the top ranking:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0fdc33ac", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Intersection: 27\n", + "Scalar product: 0.5766009312352989\n", + "\n", + "n-gram \tMathématiques \tPhysique - [Sciences\n", + "('a', 'donné') \t86 \t122 \n", + "('b', 'e') \t93 \t51 \n", + "('infiniment', 'petit') \t64 \t46 \n", + "('point', 'a') \t42 \t84 \n", + "('où', 'ensuit') \t29 \t72 \n", + "('5', 'degré') \t37 \t82 \n", + "('1', '2') \t94 \t50 \n", + "('1', 'degré') \t142 \t301 \n", + "('point', 'où') \t35 \t80 \n", + "('partie', 'égal') \t59 \t81 \n", + "('a', 'a') \t149 \t57 \n", + "('m.', 'newton') \t48 \t180 \n", + "('2', 'degré') \t129 \t298 \n", + "('2', '3') \t97 \t51 \n", + "('grand', 'nombre') \t43 \t166 \n", + "('s.', 'f.') \t110 \t197 \n", + "('s.', 'm.') \t91 \t228 \n", + "('3', 'degré') \t86 \t208 \n", + "('ligne', 'droite') \t193 \t235 \n", + "('a', 'point') \t57 \t121 \n", + "('angle', 'droit') \t102 \t52 \n", + "('4', 'degré') \t52 \t109 \n", + "('sinus', 'angle') \t45 \t67 \n", + "('ci', 'dessus') \t78 \t64 \n", + "('e', 'f') \t44 \t96 \n", + "('a', 'b') \t649 \t370 \n", + "('quart', 'cercle') \t61 \t180 \n", + "\n", + "Norm \t923.9372273049722 \t1006.8286845337691 \n", + "Projected norm \t791.5649057405211 \t839.5510705132833 \n", + "Projection ratio \t0.8567301785743959 \t0.8338569246286951 \n" + ] + } + ], + "source": [ + "compareTop100_2gram(maths, physique)" + ] + }, + { + "cell_type": "markdown", + "id": "57a1ee20", + "metadata": {}, + "source": [ + "Here the intersection is proportionately much smaller (27 out of 100) but the scalar product has only slightly diminished (about 0.58 against 0.65), corresponding to the situation we had previously observed between Architecture and Métiers, where the cell at their intersection in the confusion matrix will be darker when using the scalar product instead of simply counting the 
number of most frequent n-grams they have in common. However, unlike in the Architecture and Métiers case, this is not due to noise but to relevant n-grams. Like Architecture and Métiers, they too share their most fundamental features, but this time these features are the ones which actually define them.\n", + "\n", + "This shows how studying the collinearity of the most frequent n-grams in addition to counting the common n-grams can help refine the analysis." ] } ], -- GitLab