diff --git a/notebooks/Scalar_product_vs_elements_intersection.ipynb b/notebooks/Scalar_product_vs_elements_intersection.ipynb index 830714c644b7b3a19ae9b784d492f8c08c8b1cbd..4a0460400914c82b95884e4005138259ccbb4da2 100644 --- a/notebooks/Scalar_product_vs_elements_intersection.ipynb +++ b/notebooks/Scalar_product_vs_elements_intersection.ipynb @@ -468,7 +468,7 @@ }, { "cell_type": "code", - "execution_count": 15, + "execution_count": null, "id": "c4ed3787", "metadata": {}, "outputs": [], @@ -513,7 +513,7 @@ }, { "cell_type": "code", - "execution_count": 17, + "execution_count": null, "id": "2621ff10", "metadata": {}, "outputs": [ @@ -547,7 +547,7 @@ "source": [ "We see two cases perfectly illustrated:\n", "\n", - "1. Belles-lettres and philo live in a very similar but are differently oriented, so they have a low normalized scalar-product\n", + "1. Belles-lettres and philo live in a very similar space but are differently oriented, so they have a low normalized scalar-product\n", "2. Architecture and Métiers spread on more separate subspaces, but are almost entirely concentrated on the 4 dimensions they share\n", "\n", "But does increasing the number of top ranks efficiently weakens the effect of the noise ?" @@ -555,7 +555,7 @@ }, { "cell_type": "code", - "execution_count": 18, + "execution_count": null, "id": "667fde5c", "metadata": {}, "outputs": [ @@ -605,7 +605,177 @@ "id": "ed4c2139", "metadata": {}, "source": [ - "Of course, the top10 is included in the top100, so the most prominent n-grams remain unchanged. What the above shows is that the \"long-tail\" of less-frequent 2-grams isn't enough to counter-balance the main, noisy, dimensions." + "Of course, the top10 is included in the top100, so the most prominent n-grams remain unchanged. 
What the above shows is that the \"long-tail\" of less-frequent 2-grams isn't enough to counter-balance the main, noisy dimensions.\n", + "\n", + "Let's try and find less noisy top n-grams in the new domains we've made available." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "b7a352d7", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{('a', 'b'): 649,\n", + " ('b', 'a'): 202,\n", + " ('ligne', 'droite'): 193,\n", + " ('a', 'a'): 149,\n", + " ('1', 'degré'): 142,\n", + " ('2', 'degré'): 129,\n", + " ('b', 'b'): 110,\n", + " ('s.', 'f.'): 110,\n", + " ('angle', 'droit'): 102,\n", + " ('2', '3'): 97}" + ] + }, + "execution_count": null, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "top10_2grams(maths)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "20caa4fb", + "metadata": {}, + "outputs": [ + { + "data": { + "text/plain": [ + "{('a', 'b'): 370,\n", + " ('1', 'degré'): 301,\n", + " ('2', 'degré'): 298,\n", + " ('ligne', 'droite'): 235,\n", + " ('s.', 'm.'): 228,\n", + " ('3', 'degré'): 208,\n", + " ('s.', 'f.'): 197,\n", + " ('m.', 'newton'): 180,\n", + " ('quart', 'cercle'): 180,\n", + " ('grand', 'nombre'): 166}" + ] + }, + "execution_count": null, + "metadata": {}, + "output_type": "execute_result" + } + ], + "source": [ + "top10_2grams(physique)" + ] + }, + { + "cell_type": "markdown", + "id": "d00e8eac", + "metadata": {}, + "source": [ + "Except for `('s.', 'f.')` (and `('s.', 'm.')`, which reaches the top 10 only for physics, which in itself is quite interesting: is maths vocabulary more feminine?), most of the top 10 2-grams seem genuinely related to their respective domains. Moreover, neither of these substantive-related 2-grams is the most frequent in its domain, so they don't carry virtually all the weight of the vectors like they did in the Architecture and Métiers domains above."
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8a3b1aba", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Intersection: 5\n", + "Scalar product: 0.6471193972004591\n", + "\n", + "n-gram \tMathématiques \tPhysique - [Sciences\n", + "('s.', 'f.') \t110 \t197 \n", + "('1', 'degré') \t142 \t301 \n", + "('ligne', 'droite') \t193 \t235 \n", + "('a', 'b') \t649 \t370 \n", + "('2', 'degré') \t129 \t298 \n", + "\n", + "Norm \t776.062497483289 \t773.267741471219 \n", + "Projected norm \t712.2885651195027 \t640.5770835738663 \n", + "Projection ratio \t0.9178237157824269 \t0.8284026983397814 \n" + ] + } + ], + "source": [ + "compareTop10_2gram(maths, physique)" + ] + }, + { + "cell_type": "markdown", + "id": "b40c55f4", + "metadata": {}, + "source": [ + "Here something very interesting appears: the same phenomenon as before occurs, but it seems legitimate this time. Indeed, their scalar product is slightly greater than their intersection (5 out of 10 is 50% or 0.50, so not too far below 0.64), but their intersection is almost entirely populated by relevant 2-grams. 
This behaviour is even more pronounced if we increase the size of the top ranking:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0fdc33ac", + "metadata": {}, + "outputs": [ + { + "name": "stdout", + "output_type": "stream", + "text": [ + "Intersection: 27\n", + "Scalar product: 0.5766009312352989\n", + "\n", + "n-gram \tMathématiques \tPhysique - [Sciences\n", + "('a', 'donné') \t86 \t122 \n", + "('b', 'e') \t93 \t51 \n", + "('infiniment', 'petit') \t64 \t46 \n", + "('point', 'a') \t42 \t84 \n", + "('où', 'ensuit') \t29 \t72 \n", + "('5', 'degré') \t37 \t82 \n", + "('1', '2') \t94 \t50 \n", + "('1', 'degré') \t142 \t301 \n", + "('point', 'où') \t35 \t80 \n", + "('partie', 'égal') \t59 \t81 \n", + "('a', 'a') \t149 \t57 \n", + "('m.', 'newton') \t48 \t180 \n", + "('2', 'degré') \t129 \t298 \n", + "('2', '3') \t97 \t51 \n", + "('grand', 'nombre') \t43 \t166 \n", + "('s.', 'f.') \t110 \t197 \n", + "('s.', 'm.') \t91 \t228 \n", + "('3', 'degré') \t86 \t208 \n", + "('ligne', 'droite') \t193 \t235 \n", + "('a', 'point') \t57 \t121 \n", + "('angle', 'droit') \t102 \t52 \n", + "('4', 'degré') \t52 \t109 \n", + "('sinus', 'angle') \t45 \t67 \n", + "('ci', 'dessus') \t78 \t64 \n", + "('e', 'f') \t44 \t96 \n", + "('a', 'b') \t649 \t370 \n", + "('quart', 'cercle') \t61 \t180 \n", + "\n", + "Norm \t923.9372273049722 \t1006.8286845337691 \n", + "Projected norm \t791.5649057405211 \t839.5510705132833 \n", + "Projection ratio \t0.8567301785743959 \t0.8338569246286951 \n" + ] + } + ], + "source": [ + "compareTop100_2gram(maths, physique)" + ] + }, + { + "cell_type": "markdown", + "id": "57a1ee20", + "metadata": {}, + "source": [ + "Here the intersection is proportionately much smaller (27 out of 100, against 5 out of 10) but the scalar product has only slightly diminished (from about 0.65 to about 0.58). This corresponds to the situation we had previously observed between Architecture and Métiers, where the cell at their intersection in the confusion matrix will be darker using the scalar product instead of simply counting the 
number of most frequent n-grams they have in common. However, unlike in the Architecture and Métiers case, this is not due to noise but to relevant n-grams. Like Architecture and Métiers, maths and physics share their most fundamental features, but this time those features are the ones which actually define them.\n", + "\n", + "This shows how studying the collinearity of the most frequent n-grams, in addition to counting the common n-grams, can help refine the analysis." ] } ],