{
"cells": [
{
"cell_type": "markdown",
"id": "3401c7f3",
"metadata": {},
"source": [
"# So what's wrong with scalar product vs. elements intersection ?\n",
"\n",
"A short notebook to explain the \"shift\" in classes comparisons obtained from [that other notebook](Confusion%20Matrix.ipynb). Looking at the matrices produced we observe that the scalar product, which can be seen intuitively as more restrictive a metrics than elements intersection (counting the number of elements in the intersections of two sets can be viewed as a scalar product of vectors where all elements in each set are associated to the integer `1`, so you'd expect the more nuanced scalar product to be systematically inferior to that all-or-nothing metrics), sometimes make classes appear more similar than computing the number of n-grams they have in common. Let's see why, starting from an illustration of the issue at stake:\n",
"*(figure: two confusion matrices over the EDdA domains, n-gram intersection count on the left, scalar product on the right)*\n",
"\n",
"\n",
"\n",
"Both pictures are confusion matrices comparing the top 10 2-grams of each domain in the EDdA. The one on the left merely counts the number of 2-grams common between both classes whereas the one on the right uses the scalar product computed on the vectors obtained for each domain, by associating to each one of the most frequent 2-grams found in that domain its number of occurrences.\n",
"\n",
"The cells circled in green, at the intersection between Belles-lettres and Philosophie, presents the expected behaviour: The blue circle is darker on the left, where they elements are simply counted, and appear lighter when the matrix is computed with the metrics derived from the scalar product. The one in purple (Architecture / Métiers), however, is counter-intuitive: the scalar product method makes them appear closer than they were with the elements intersection method. To understand how this is possible, let's look at the top 10 2-grams of the involved classes."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7cf51dd1",
"metadata": {},
"outputs": [],
"source": [
"from EDdA import data\n",
"from EDdA.classification import topNGrams"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "30dfceaa",
"metadata": {},
"outputs": [],
"source": [
"source = data.load('training_set')\n",
"top10_2grams = topNGrams(source, 2, 10)\n",
"\n",
"architecture = 'Architecture'\n",
"belles_lettres = 'Belles-lettres - Poésie'\n",
"metiers = 'Métiers'\n",
"philo = 'Philosophie'"
]
},
{
"cell_type": "markdown",
"id": "b6f895c4",
"metadata": {},
"source": [
"We have everything ready to display the vectors. Let's start with the Belles-lettres:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ee7a9ab9",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{('d.', 'j.'): 485,\n",
" ('s.', 'm.'): 196,\n",
" ('s.', 'f.'): 145,\n",
" ('chez', 'romain'): 71,\n",
" ('a', 'point'): 67,\n",
" ('-t', '-il'): 62,\n",
" ('1', 'degré'): 58,\n",
" ('grand', 'nombre'): 57,\n",
" ('sans', 'doute'): 54,\n",
" ('sou', 'nom'): 54}"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"top10_2grams(belles_lettres)"
]
},
{
"cell_type": "markdown",
"id": "3dcc0f86",
"metadata": {},
"source": [
"We first notice the occurrences of the 2-gram ('s.', 'm.') and ('s.', 'f.') for \"substantif masculin\" and \"substantif féminin\", which occur very frequently at the begining of article. Those are of course unwanted, irrelevant bigrams but they haven't been filtered out (yet ?). Let's look at the Philosophie domain now:"
]
},
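{
"cell_type": "markdown",
"id": "a1f2e3d4",
"metadata": {},
"source": [
"As an aside, here is a minimal sketch of how such noisy bigrams could be dropped before comparison. The stoplist is a handpicked assumption for illustration, not something the EDdA library is known to provide:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b2c3d4e5",
"metadata": {},
"outputs": [],
"source": [
"# hypothetical stoplist of editorial-noise bigrams\n",
"NOISY_BIGRAMS = {('s.', 'm.'), ('s.', 'f.'), ('d.', 'j.')}\n",
"\n",
"def withoutNoise(vector):\n",
"    # drop the coordinates corresponding to known noise\n",
"    return {k: v for k, v in vector.items() if k not in NOISY_BIGRAMS}\n",
"\n",
"withoutNoise(top10_2grams(belles_lettres))"
]
},
{
"cell_type": "markdown",
"id": "c3d4e5f6",
"metadata": {},
"source": [
"Now let's look at the Philosophie domain:"
]
},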
{
"cell_type": "code",
"execution_count": null,
"id": "d3397e87",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{('a', 'point'): 191,\n",
" ('s.', 'f.'): 142,\n",
" ('1', 'degré'): 136,\n",
" ('2', 'degré'): 131,\n",
" ('-t', '-il'): 131,\n",
" ('grand', 'nombre'): 116,\n",
" ('dieu', 'a'): 100,\n",
" ('sans', 'doute'): 89,\n",
" ('3', 'degré'): 88,\n",
" ('d.', 'j.'): 82}"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"top10_2grams(philo)"
]
},
{
"cell_type": "markdown",
"id": "41d147a0",
"metadata": {},
"source": [
"Interestingly enough, the Philosophie domain seems to comprise much fewer masculine substantives, so that ('s.', 'm.') didn't even make the top 10. The ('d.', 'j.') bigram is there too (probably the signature of an author ?), as well as ('-t', '-il'), ('1', 'degré'), ('a', 'point'), ('grand', 'nombre') and ('sans', 'doute'). Quite a populated intersection which account for the relatively dark blue patch in the matrix on the left (7/10)."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "320cf92e",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{('-t', '-il'),\n",
" ('1', 'degré'),\n",
" ('a', 'point'),\n",
" ('d.', 'j.'),\n",
" ('grand', 'nombre'),\n",
" ('s.', 'f.'),\n",
" ('sans', 'doute')}"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"set(top10_2grams(belles_lettres)).intersection(top10_2grams(philo))"
]
},
{
"cell_type": "markdown",
"id": "6747e51b",
"metadata": {},
"source": [
"Now if we look at their (normalized) scalar product, though, the result is pretty average:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "911d6a17",
"metadata": {},
"outputs": [],
"source": [
"from EDdA.classification import colinearity"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0fa76ac3",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.4508503933694939"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"colinearity(top10_2grams(belles_lettres), top10_2grams(philo))"
]
},
{
"cell_type": "markdown",
"id": "36532f01",
"metadata": {},
"source": [
"Indeed, if we look at the cardinalities associated to each 2-gram they share, they aren't synchronized but rather in phase opposition instead:\n",
"\n",
"- the most frequent in belles_lettres, ('d.', 'j.') (485) is the least frequent in philo (82)\n",
"- the most frequent in philo, ('a', 'point') (191) is rather low in belles_lettres (67)\n",
"- the trend is general to the other 2-grams, except maybe ('s.', 'f.') which scores about the same in both domains\n",
"\n",
"As a result, their contributions, if they do not entirely cancel out, at least do not manage to produce a dot product value high enough to overcome their norms, and the normalized scalar product is not very high."
]
},
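{
"cell_type": "markdown",
"id": "d4e5f6a7",
"metadata": {},
"source": [
"To make the computation concrete, here is a minimal sketch of a normalized scalar product over such count dictionaries. It assumes `colinearity` is essentially a cosine similarity on sparse count vectors, which matches the value obtained above but is not taken from the library's actual source:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "e5f6a7b8",
"metadata": {},
"outputs": [],
"source": [
"import math\n",
"\n",
"def cosine(vector1, vector2):\n",
"    # the dot product only involves the keys both dictionaries share\n",
"    dot = sum(vector1[k] * vector2[k] for k in set(vector1) & set(vector2))\n",
"    # each norm, however, uses all of a vector's own coordinates\n",
"    norm1 = math.sqrt(sum(v * v for v in vector1.values()))\n",
"    norm2 = math.sqrt(sum(v * v for v in vector2.values()))\n",
"    return dot / (norm1 * norm2)\n",
"\n",
"cosine(top10_2grams(belles_lettres), top10_2grams(philo))"
]
},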
{
"cell_type": "markdown",
"id": "e1fccadd",
"metadata": {},
"source": [
"Now looking at the other pair reveals a different story:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a26a0f41",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{('d.', 'j.'): 325,\n",
" ('s.', 'm.'): 323,\n",
" ('s.', 'f.'): 194,\n",
" ('daviler', 'd.'): 65,\n",
" ('plate', 'bande'): 56,\n",
" ('vers', 'act'): 50,\n",
" ('piece', 'bois'): 40,\n",
" ('pierre', 'dure'): 35,\n",
" ('porte', 'croisée'): 30,\n",
" ('a', 'donné'): 30}"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"top10_2grams(architecture)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ad912f85",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{('d.', 'j.'): 631,\n",
" ('s.', 'm.'): 612,\n",
" ('s.', 'f.'): 438,\n",
" ('morceau', 'bois'): 200,\n",
" ('a', 'b'): 164,\n",
" ('pieces', 'bois'): 161,\n",
" ('piece', 'bois'): 155,\n",
" ('fig', '1'): 139,\n",
" ('fig', '2'): 139,\n",
" ('or', 'argent'): 137}"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"top10_2grams(metiers)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "4d8495df",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{('d.', 'j.'), ('piece', 'bois'), ('s.', 'f.'), ('s.', 'm.')}"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"set(top10_2grams(architecture)).intersection(top10_2grams(metiers))"
]
},
{
"cell_type": "markdown",
"id": "a75a2b67",
"metadata": {},
"source": [
"They only have four 2-grams in commons, but now three out of the four are the top 3 of each domain. That is to say, the \"coordinates\" that contribute the most to their norms are (almost all) the ones they have in common. Moreover, even the distribution in that top 3 is similar from one domain to the other:\n",
"\n",
"- they are in the same order: ('d.', 'j.'), ('s.', 'm.'), then ('s.', 'f.')\n",
"- their frequencies are almost proportional, with a ratio close to 0.5"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8ba0d6e3",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"[0.5150554675118859, 0.5277777777777778, 0.4429223744292237]"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"list(map(lambda p: p[0] / p[1], zip([325, 323, 194], [631, 612, 438])))"
]
},
{
"cell_type": "markdown",
"id": "7919c2c6",
"metadata": {},
"source": [
"For these reasons, their scalar product is much closer to 1:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "6d9bffda",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"0.9041105011801019"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"colinearity(top10_2grams(architecture), top10_2grams(metiers))"
]
},
{
"cell_type": "markdown",
"id": "e63ef397",
"metadata": {},
"source": [
"Of course, looking at these top 3 2-grams themselves is also very telling: they are not specific at all, there are the two substantive one we've previously noticed and the possible signature ('d.', 'j.'), so in fact, in addition to being common between Architecture and Métier, they are also shared by Belles-lettres, and two of them are found in Philosophie. They are rather like noise, found in apparently many domains.\n",
"\n",
"In any case, we have an explanation of how this counter-intuitive situation can occur: in terms of algebra, they appear to live in very different (10-D) spaces, but they actually have very low components on the 7 dimensions they don't share, on which they hardly depart from their main, 3-D components, which they share. In that 3-D space where they almost live, they have in addition a very similar direction, and thus their scalar product is very high.\n",
"\n",
"Put more simply, it means that both vectors have very little in common, but what they have in common outweights the rest, so that their most important feature is what they share. Let's generalize this approach by defining a generic routine to perform the comparison, evaluating the number of n-grams two domains share, their scalar product, and the contribution of each common n-gram to the norm of the vector.\n",
"\n",
"We'll first need to be able to project a vector on a given space (in practice, we'll use it to project on the n-grams it shares with another vector, that is, their \"common subspace\")."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "3254c7c6",
"metadata": {},
"outputs": [],
"source": [
"def projector(vector, base):\n",
" return dict([(k, vector[k]) for k in base if k in vector])"
]
},
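{
"cell_type": "markdown",
"id": "f6a7b8c9",
"metadata": {},
"source": [
"A quick sanity check on toy dictionaries (values chosen arbitrarily for illustration):"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "a7b8c9d0",
"metadata": {},
"outputs": [],
"source": [
"# only the coordinates listed in the base survive the projection\n",
"projector({'a': 3, 'b': 5, 'c': 1}, {'a', 'c', 'd'})"
]
},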
{
"cell_type": "markdown",
"id": "c950da4c",
"metadata": {},
"source": [
"Given a vectorizer (`top10_2grams` in the above for instance), we can now define our main tool:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c69b90e6",
"metadata": {},
"outputs": [],
"source": [
"from EDdA.classification.classSimilarities import norm\n",
"\n",
"def studyIntersection(vectorizer):\n",
" def compare(domain1, domain2):\n",
" vector1, vector2 = vectorizer(domain1), vectorizer(domain2)\n",
" intersection = set(vector1).intersection(vector2)\n",
" projected1, projected2 = projector(vector1, intersection), projector(vector2, intersection)\n",
" print(f\"Intersection: {len(intersection)}\")\n",
" print(f\"Scalar product: {colinearity(vector1, vector2)}\\n\")\n",
" print(f\"{'n-gram': <30}\\t{domain1[:20]: <20}\\t{domain2[:20]: <20}\")\n",
" for ngram in intersection:\n",
" print(f\"{str(ngram)[:30]: <30}\\t{vector1[ngram]: <20}\\t{vector2[ngram]: <20}\")\n",
" norm1, norm2 = norm(vector1), norm(vector2)\n",
" projNorm1, projNorm2 = norm(projected1), norm(projected2)\n",
" print(\"\")\n",
" print(f\"{'Norm': <30}\\t{norm1: <20}\\t{norm2: <20}\")\n",
" print(f\"{'Projected norm': <30}\\t{projNorm1: <20}\\t{projNorm2: <20}\")\n",
" print(f\"{'Projection ratio': <30}\\t{projNorm1 / norm1: <20}\\t{projNorm2 / norm2: <20}\")\n",
" return compare"
]
},
{
"cell_type": "markdown",
"id": "daef48e6",
"metadata": {},
"source": [
"So let us declare other vectorizers to be able to study the influence of the size of n-grams and the number of ranks kept for comparison."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "d152e0fc",
"metadata": {},
"outputs": [],
"source": [
"top100_2grams = topNGrams(source, 2, 100)\n",
"top10_3grams = topNGrams(source, 3, 10)\n",
"top100_3grams = topNGrams(source, 3, 100)"
]
},
{
"cell_type": "markdown",
"id": "cd10a9f6",
"metadata": {},
"source": [
"Also, let's work on more domains:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "ea9174b6",
"metadata": {},
"outputs": [],
"source": [
"maths = 'Mathématiques'\n",
"mesure = 'Mesure'\n",
"physique = 'Physique - [Sciences physico-mathématiques]'"
]
},
{
"cell_type": "markdown",
"id": "6e6b8cbd",
"metadata": {},
"source": [
"We can get a different view of the situation we've studied above with our new tool."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c4ed3787",
"metadata": {},
"outputs": [],
"source": [
"compareTop10_2gram = studyIntersection(top10_2grams)\n",
"compareTop100_2gram = studyIntersection(top100_2grams)\n",
"compareTop10_3gram = studyIntersection(top10_3grams)\n",
"compareTop100_3gram = studyIntersection(top100_3grams)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "7abde2b3",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Intersection: 7\n",
"Scalar product: 0.4508503933694939\n",
"\n",
"n-gram \tBelles-lettres - Poé\tPhilosophie \n",
"('-t', '-il') \t62 \t131 \n",
"('s.', 'f.') \t145 \t142 \n",
"('sans', 'doute') \t54 \t89 \n",
"('d.', 'j.') \t485 \t82 \n",
"('a', 'point') \t67 \t191 \n",
"('grand', 'nombre') \t57 \t116 \n",
"('1', 'degré') \t58 \t136 \n",
"\n",
"Norm \t566.1139461274558 \t394.0913599661886 \n",
"Projected norm \t523.5570647025976 \t346.991354359154 \n",
"Projection ratio \t0.9248262973983034 \t0.8804845515743467 \n"
]
}
],
"source": [
"compareTop10_2gram(belles_lettres, philo)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "2621ff10",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Intersection: 4\n",
"Scalar product: 0.9041105011801019\n",
"\n",
"n-gram \tArchitecture \tMétiers \n",
"('s.', 'm.') \t323 \t612 \n",
"('d.', 'j.') \t325 \t631 \n",
"('piece', 'bois') \t40 \t155 \n",
"('s.', 'f.') \t194 \t438 \n",
"\n",
"Norm \t511.93358944300576 \t1067.1466628350574 \n",
"Projected norm \t499.18934283496077 \t994.2705869128383 \n",
"Projection ratio \t0.9751056643462075 \t0.9317094093434061 \n"
]
}
],
"source": [
"compareTop10_2gram(architecture, metiers)"
]
},
{
"cell_type": "markdown",
"id": "3559d324",
"metadata": {},
"source": [
"We see two cases perfectly illustrated:\n",
"\n",
"1. Belles-lettres and philo live in a very similar space but are differently oriented, so they have a low normalized scalar-product\n",
"2. Architecture and Métiers spread on more separate subspaces, but are almost entirely concentrated on the 4 dimensions they share\n",
"\n",
"But does increasing the number of top ranks efficiently weakens the effect of the noise ?"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "667fde5c",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Intersection: 22\n",
"Scalar product: 0.7998536585283141\n",
"\n",
"n-gram \tArchitecture \tMétiers \n",
"('haut', 'bas') \t11 \t66 \n",
"('m.', 'pl') \t29 \t51 \n",
"('barre', 'fer') \t20 \t70 \n",
"('a', 'plusieurs') \t14 \t45 \n",
"('vers', 'act') \t50 \t110 \n",
"('endroit', 'où') \t29 \t87 \n",
"('sou', 'nom') \t29 \t56 \n",
"('s.', 'm.') \t323 \t612 \n",
"('m.', 'espece') \t14 \t40 \n",
"('piece', 'bois') \t40 \t155 \n",
"('où', 'a') \t26 \t93 \n",
"('chaque', 'côté') \t13 \t73 \n",
"('pouce', 'épaisseur') \t15 \t52 \n",
"('a', 'point') \t17 \t83 \n",
"('a', 'b') \t22 \t164 \n",
"('1', 'degré') \t14 \t95 \n",
"('a', 'donné') \t30 \t49 \n",
"('s.', 'f.') \t194 \t438 \n",
"('pieces', 'bois') \t26 \t161 \n",
"('d.', 'j.') \t325 \t631 \n",
"('grand', 'nombre') \t18 \t67 \n",
"('2', 'degré') \t13 \t95 \n",
"\n",
"Norm \t535.4941643006018 \t1230.2991506133783 \n",
"Projected norm \t509.15027251293895 \t1062.1624169589131 \n",
"Projection ratio \t0.9508045212368091 \t0.8633367067102022 \n"
]
}
],
"source": [
"compareTop100_2gram(architecture, metiers)"
]
},
{
"cell_type": "markdown",
"id": "ed4c2139",
"metadata": {},
"source": [
"Of course, the top10 is included in the top100, so the most prominent n-grams remain unchanged. What the above shows is that the \"long-tail\" of less-frequent 2-grams isn't enough to counter-balance the main, noisy, dimensions.\n",
"\n",
"Let's try and find less noisy top n-grams in the new domains we've made available."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b7a352d7",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{('a', 'b'): 649,\n",
" ('b', 'a'): 202,\n",
" ('ligne', 'droite'): 193,\n",
" ('a', 'a'): 149,\n",
" ('1', 'degré'): 142,\n",
" ('2', 'degré'): 129,\n",
" ('b', 'b'): 110,\n",
" ('s.', 'f.'): 110,\n",
" ('angle', 'droit'): 102,\n",
" ('2', '3'): 97}"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"top10_2grams(maths)"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "20caa4fb",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"{('a', 'b'): 370,\n",
" ('1', 'degré'): 301,\n",
" ('2', 'degré'): 298,\n",
" ('ligne', 'droite'): 235,\n",
" ('s.', 'm.'): 228,\n",
" ('3', 'degré'): 208,\n",
" ('s.', 'f.'): 197,\n",
" ('m.', 'newton'): 180,\n",
" ('quart', 'cercle'): 180,\n",
" ('grand', 'nombre'): 166}"
]
},
"execution_count": null,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"top10_2grams(physique)"
]
},
{
"cell_type": "markdown",
"id": "d00e8eac",
"metadata": {},
"source": [
"Except for `('s.', 'f.')` (and `('s.', 'm.')` only for physics, which in itself is quite interesting — is maths vocabulary more feminine ?), most of the top 10 2-grams seem actually related to their respective domains. Moreover, neither of these substantive-related 2-grams are the most frequent in their domains, so they don't carry virtually all the weight of the vectors like they used to in the Architecture and Métiers domains above."
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "8a3b1aba",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Intersection: 5\n",
"Scalar product: 0.6471193972004591\n",
"\n",
"n-gram \tMathématiques \tPhysique - [Sciences\n",
"('s.', 'f.') \t110 \t197 \n",
"('1', 'degré') \t142 \t301 \n",
"('ligne', 'droite') \t193 \t235 \n",
"('a', 'b') \t649 \t370 \n",
"('2', 'degré') \t129 \t298 \n",
"\n",
"Norm \t776.062497483289 \t773.267741471219 \n",
"Projected norm \t712.2885651195027 \t640.5770835738663 \n",
"Projection ratio \t0.9178237157824269 \t0.8284026983397814 \n"
]
}
],
"source": [
"compareTop10_2gram(maths, physique)"
]
},
{
"cell_type": "markdown",
"id": "b40c55f4",
"metadata": {},
"source": [
"Here something very interesting appears: the same phenomenon as before occurs, but it seems legitimate this time. Indeed, their scalar product is slightly greater than their intersection (5 out of 10 is 50% or 0.50, so not too far below 0.64), but their intersection is almost entirely populated by relevant 2-grams. This behaviour is even more pronounced if we increase the size of the top ranking:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "0fdc33ac",
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Intersection: 27\n",
"Scalar product: 0.5766009312352989\n",
"\n",
"n-gram \tMathématiques \tPhysique - [Sciences\n",
"('a', 'donné') \t86 \t122 \n",
"('b', 'e') \t93 \t51 \n",
"('infiniment', 'petit') \t64 \t46 \n",
"('point', 'a') \t42 \t84 \n",
"('où', 'ensuit') \t29 \t72 \n",
"('5', 'degré') \t37 \t82 \n",
"('1', '2') \t94 \t50 \n",
"('1', 'degré') \t142 \t301 \n",
"('point', 'où') \t35 \t80 \n",
"('partie', 'égal') \t59 \t81 \n",
"('a', 'a') \t149 \t57 \n",
"('m.', 'newton') \t48 \t180 \n",
"('2', 'degré') \t129 \t298 \n",
"('2', '3') \t97 \t51 \n",
"('grand', 'nombre') \t43 \t166 \n",
"('s.', 'f.') \t110 \t197 \n",
"('s.', 'm.') \t91 \t228 \n",
"('3', 'degré') \t86 \t208 \n",
"('ligne', 'droite') \t193 \t235 \n",
"('a', 'point') \t57 \t121 \n",
"('angle', 'droit') \t102 \t52 \n",
"('4', 'degré') \t52 \t109 \n",
"('sinus', 'angle') \t45 \t67 \n",
"('ci', 'dessus') \t78 \t64 \n",
"('e', 'f') \t44 \t96 \n",
"('a', 'b') \t649 \t370 \n",
"('quart', 'cercle') \t61 \t180 \n",
"\n",
"Norm \t923.9372273049722 \t1006.8286845337691 \n",
"Projected norm \t791.5649057405211 \t839.5510705132833 \n",
"Projection ratio \t0.8567301785743959 \t0.8338569246286951 \n"
]
}
],
"source": [
"compareTop100_2gram(maths, physique)"
]
},
{
"cell_type": "markdown",
"id": "57a1ee20",
"metadata": {},
"source": [
"Here the intersection is proportionately much smaller but the scalar product has only slightly diminshed, corresponding to the situation we had previously observed between Architecture and Métiers where the cell at their intersection in the confusion matrix will be darker using the scalar product instead of simply counting the number of most frequent n-grams they have in common. However, contrary to the explanation between Architecture and Métiers, this is not due to noise but to relevant n-grams. Like Architecture and Métiers, they too share their most fundamental features, but this time they are the ones which actually define them.\n",
"\n",
"This shows how studying the colinearity of the most frequent n-grams in addition to counting the common n-grams can help refine the analysis."
]
}
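,
{
"cell_type": "markdown",
"id": "b8c9d0e1",
"metadata": {},
"source": [
"To close, here is a small sketch, reusing only the helpers already defined above, that prints both metrics side by side for every pair of domains studied in this notebook, making the divergences between the two easy to scan:"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "c9d0e1f2",
"metadata": {},
"outputs": [],
"source": [
"from itertools import combinations\n",
"\n",
"domains = [architecture, belles_lettres, metiers, philo, maths, physique]\n",
"\n",
"for domain1, domain2 in combinations(domains, 2):\n",
"    vector1, vector2 = top10_2grams(domain1), top10_2grams(domain2)\n",
"    # share of the top 10 2-grams found in both domains\n",
"    shared = len(set(vector1).intersection(vector2)) / 10\n",
"    print(f\"{domain1[:20]: <20}\\t{domain2[:20]: <20}\\t{shared:.1f}\\t{colinearity(vector1, vector2):.2f}\")"
]
}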
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "/gnu/store/2rpsj69fzmcnafz4rml0blrynfayxqzr-python-wrapper-3.9.9/bin/python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.9"
}
},
"nbformat": 4,
"nbformat_minor": 5
}