%% Cell type:markdown id: tags:

# BERT Predict classification

## 1. Setup the environment

### 1.1 Setup colab environment

#### 1.1.1 Install packages

%% Cell type:code id: tags:

``` python
!pip install transformers==4.10.3
!pip install sentencepiece
```

%% Cell type:markdown id: tags:

#### 1.1.2 Use more RAM

%% Cell type:code id: tags:

``` python
from psutil import virtual_memory

ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

if ram_gb < 20:
    print('Not using a high-RAM runtime')
else:
    print('You are using a high-RAM runtime!')
```

%% Cell type:markdown id: tags:

#### 1.1.3 Mount GoogleDrive

%% Cell type:code id: tags:

``` python
from google.colab import drive
drive.mount('/content/drive')
```

%% Cell type:markdown id: tags:

### 1.2 Setup GPU

%% Cell type:code id: tags:

``` python
import torch

# If there's a GPU available...
if torch.cuda.is_available():
    # Tell PyTorch to use the GPU.
    device = torch.device("cuda")
    print('There are %d GPU(s) available.' % torch.cuda.device_count())
    print('We will use the GPU:', torch.cuda.get_device_name(0))
# for MacOS
elif torch.backends.mps.is_available() and torch.backends.mps.is_built():
    device = torch.device("mps")
    print('We will use the GPU')
else:
    device = torch.device("cpu")
    print('No GPU available, using the CPU instead.')
```

%% Output

We will use the GPU

%% Cell type:markdown id: tags:
### 1.3 Import libraries
%% Cell type:code id: tags:

``` python
import pandas as pd
import numpy as np
from transformers import BertTokenizer, BertForSequenceClassification, CamembertTokenizer, CamembertForSequenceClassification
from torch.utils.data import TensorDataset, DataLoader, SequentialSampler
```

%% Cell type:markdown id: tags:

## 2. Load Data

%% Cell type:code id: tags:

``` python
#!wget https://geode.liris.cnrs.fr/files/datasets/EDdA/Classification/LGE_withContent.tsv
#!wget https://geode.liris.cnrs.fr/EDdA-Classification/datasets/EDdA_dataset_articles_no_superdomain.tsv
#!wget https://geode.liris.cnrs.fr/EDdA-Classification/datasets/Parallel_datatset_articles_230215.tsv
```

%% Cell type:code id: tags:

``` python
#drive_path = "drive/MyDrive/Classification-EDdA/"
drive_path = "../"

#path = "/Users/lmoncla/git/gitlab.liris/GEODE/EDdA/output/"
path = "/Users/lmoncla/git/gitlab.liris/GEODE/LGE/output/"

#filepath = "Parallel_datatset_articles_230215.tsv"
#filepath = "EDdA_dataset_articles.tsv"
filepath = "LGE_dataset_articles_230314.tsv"

corpus = 'lge'
#corpus = ''
```

%% Cell type:code id: tags:

``` python
df = pd.read_csv(path + filepath, sep="\t")
df.head()
```

%% Output

         uid  lge-volume  lge-numero lge-head  lge-page lge-id  \
0  lge_1_a-0           1           1        A         0    a-0
1  lge_1_a-1           1           2        A         1    a-1
2  lge_1_a-2           1           3        A         4    a-2
3  lge_1_a-3           1           4        A         4    a-3
4  lge_1_a-4           1           5        A         4    a-4

                                         lge-content  lge-nbWords
0  A(Ling.). Son vocal et première lettre de notr...       1761.0
1  A(Paléogr.). C’est à l’alphabet phénicien, on ...        839.0
2  A(Log.). Cette voyelle désigne les proposition...         56.0
3  A(Mus.). La lettre a est employée par les musi...        267.0
4  A(Numis.). Dans la numismatique grecque, la le...         67.0

%% Cell type:code id: tags:

``` python
data = df[corpus+'-content'].values
```
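
%% Cell type:markdown id: tags:

Note: `tokenizer.encode()` (used below) only accepts strings, so any article whose content loaded as NaN would raise a TypeError. If the corpus can contain such rows, a minimal safeguard, assuming we simply drop them, is:

%% Cell type:code id: tags:

``` python
# Optional safeguard (assumption: the TSV may contain articles without content).
# NaN values are floats, and tokenizer.encode() only accepts strings.
df = df.dropna(subset=[corpus + '-content'])
data = df[corpus + '-content'].values
```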
%% Cell type:markdown id: tags:

## 3. Load model and predict

### 3.1 BERT / CamemBERT

%% Cell type:code id: tags:

``` python
model_name = "bert-base-multilingual-cased"
#model_name = "camembert-base"

#model_path = path + "models/model_" + model_name + "_s10000.pt"
model_path = drive_path + "models/model_" + model_name + "_s10000_superdomains.pt"
```
%% Cell type:code id: tags:

``` python
def generate_dataloader(tokenizer, sentences, batch_size=8, max_len=512):

    # Tokenize all of the sentences and map the tokens to their word IDs.
    input_ids_test = []
    # For every sentence...
    for sent in sentences:
        # `encode` will:
        #   (1) Tokenize the sentence.
        #   (2) Prepend the `[CLS]` token to the start.
        #   (3) Append the `[SEP]` token to the end.
        #   (4) Map tokens to their IDs.
        encoded_sent = tokenizer.encode(
            sent,                     # Sentence to encode.
            add_special_tokens=True,  # Add '[CLS]' and '[SEP]'
            # This function also supports truncation and conversion
            # to pytorch tensors, but I need to do padding, so I
            # can't use these features.
            #max_length = max_len,    # Truncate all sentences.
            #return_tensors = 'pt',   # Return pytorch tensors.
        )
        input_ids_test.append(encoded_sent)

    # Truncate long sequences to max_len and pad shorter ones with 0s.
    padded_test = []
    for i in input_ids_test:
        if len(i) > max_len:
            padded_test.append(i[:max_len])
        else:
            padded_test.append(i + [0] * (max_len - len(i)))
    input_ids_test = np.array(padded_test)

    # Create attention masks: 1 for each real token, 0 for each padding token.
    attention_masks = []
    for seq in input_ids_test:
        seq_mask = [float(i > 0) for i in seq]
        attention_masks.append(seq_mask)

    # Convert to tensors.
    inputs = torch.tensor(input_ids_test)
    masks = torch.tensor(attention_masks)

    # Create the DataLoader.
    data = TensorDataset(inputs, masks)
    prediction_sampler = SequentialSampler(data)
    return DataLoader(data, sampler=prediction_sampler, batch_size=batch_size)


def predict(model, dataloader, device):

    # Put model in evaluation mode
    model.eval()

    # Tracking variables
    predictions_test = []
    pred_labels_ = []

    # Predict
    for batch in dataloader:
        # Add batch to GPU
        batch = tuple(t.to(device) for t in batch)
        # Unpack the inputs from the dataloader
        b_input_ids, b_input_mask = batch
        # Telling the model not to compute or store gradients, saving memory and
        # speeding up prediction
        with torch.no_grad():
            # Forward pass, calculate logit predictions
            outputs = model(b_input_ids, token_type_ids=None,
                            attention_mask=b_input_mask)
        logits = outputs[0]
        # Move logits to CPU
        logits = logits.detach().cpu().numpy()
        # Store predictions
        predictions_test.append(logits)

    pred_labels = []
    for i in range(len(predictions_test)):
        # The predictions for a batch are an ndarray with one column per class.
        # Pick the label with the highest value and turn this into a list of
        # class indices.
        pred_labels_i = np.argmax(predictions_test[i], axis=1).flatten()
        pred_labels.append(pred_labels_i)

    pred_labels_ += [item for sublist in pred_labels for item in sublist]
    return pred_labels_

# https://discuss.huggingface.co/t/i-have-trained-my-classifier-now-how-do-i-do-predictions/3625/3
```
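
%% Cell type:markdown id: tags:

Since transformers 4.x, the tokenizer's batch call can do the truncation, padding and tensor conversion that `generate_dataloader` does by hand. A minimal equivalent sketch (an alternative, not what the runs below used):

%% Cell type:code id: tags:

``` python
# Sketch: the same dataloader built with the tokenizer's batch interface,
# which handles truncation, padding and tensor conversion in one call.
def generate_dataloader_v2(tokenizer, sentences, batch_size=8, max_len=512):
    enc = tokenizer(list(sentences), add_special_tokens=True,
                    truncation=True, max_length=max_len,
                    padding='max_length', return_tensors='pt')
    data = TensorDataset(enc['input_ids'], enc['attention_mask'])
    return DataLoader(data, sampler=SequentialSampler(data), batch_size=batch_size)
```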
%% Cell type:code id: tags:

``` python
if model_name == 'bert-base-multilingual-cased':
    print('Loading Bert Tokenizer...')
    tokenizer = BertTokenizer.from_pretrained(model_name)
elif model_name == 'camembert-base':
    print('Loading Camembert Tokenizer...')
    tokenizer = CamembertTokenizer.from_pretrained(model_name)
```

%% Output

Loading Bert Tokenizer...

%% Cell type:code id: tags:

``` python
data_loader = generate_dataloader(tokenizer, data)
```

%% Output

Token indices sequence length is longer than the specified maximum sequence length for this model (3408 > 512). Running this sequence through the model will result in indexing errors
%% Cell type:markdown id: tags:

https://discuss.huggingface.co/t/an-efficient-way-of-loading-a-model-that-was-saved-with-torch-save/9814

https://github.com/huggingface/transformers/issues/2094
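
%% Cell type:markdown id: tags:

The threads above discuss reloading models that were saved with `torch.save`; the pattern they converge on is saving with `save_pretrained()` and reloading with `from_pretrained()`, as the next cell does. A minimal sketch, assuming a fine-tuned `model` in memory and a hypothetical target directory:

%% Cell type:code id: tags:

``` python
# Sketch (hypothetical path): persist a fine-tuned model as a directory...
save_dir = drive_path + "models/bert_superdomains/"
model.save_pretrained(save_dir)
# ...and reload it later without torch.save/torch.load:
model = BertForSequenceClassification.from_pretrained(save_dir).to(device)
```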
%% Cell type:code id: tags:

``` python
#model = torch.load(model_path, map_location=torch.device('mps'))
#model.load_state_dict(torch.load(model_path, map_location=torch.device('mps')))
#model = BertForSequenceClassification.from_pretrained(model_path).to("cuda")
model = BertForSequenceClassification.from_pretrained(model_path).to("mps")
```

%% Cell type:code id: tags:

``` python
pred = predict(model, data_loader, device)
```
%% Cell type:code id: tags:

``` python
## TEST
# https://huggingface.co/docs/transformers/main_classes/pipelines
from transformers import TextClassificationPipeline

def data():
    for i in range(1000):
        yield f"Lyon, petite ville de France. {i}"

pipe = TextClassificationPipeline(model=model, tokenizer=tokenizer, return_all_scores=True, device=device)
```

%% Output

/opt/homebrew/Caskroom/miniforge/base/envs/geode-classification-py39/lib/python3.9/site-packages/transformers/pipelines/text_classification.py:89: UserWarning: `return_all_scores` is now deprecated, if want a similar funcionality use `top_k=None` instead of `return_all_scores=True` or `top_k=1` instead of `return_all_scores=False`.
  warnings.warn(

%% Cell type:code id: tags:

``` python
cpt = 0
for out in pipe(data()):
    print(out)
    # outputs a list of dicts like [[{'label': 'NEGATIVE', 'score': 0.0001223755971295759}, {'label': 'POSITIVE', 'score': 0.9998776316642761}]]
    # probability of the Géographie class: label index 6
    print(out[6]['label'][6:])  ### TODO: modify here
    cpt += 1
    if cpt == 6:
        break
```

%% Output

[{'label': 'LABEL_0', 'score': 9.43058475968428e-05}, {'label': 'LABEL_1', 'score': 0.00013377856521401554}, {'label': 'LABEL_2', 'score': 6.315444625215605e-05}, {'label': 'LABEL_3', 'score': 9.087997023016214e-05}, {'label': 'LABEL_4', 'score': 0.00012772278569173068}, {'label': 'LABEL_5', 'score': 0.00012729596346616745}, {'label': 'LABEL_6', 'score': 0.9983708262443542}, {'label': 'LABEL_7', 'score': 0.00015073739632498473}, {'label': 'LABEL_8', 'score': 0.00013310853682924062}, {'label': 'LABEL_9', 'score': 0.0001363410265184939}, {'label': 'LABEL_10', 'score': 0.00011535766680026427}, {'label': 'LABEL_11', 'score': 4.8044770665001124e-05}, {'label': 'LABEL_12', 'score': 7.562591781606898e-05}, {'label': 'LABEL_13', 'score': 4.6062668843660504e-05}, {'label': 'LABEL_14', 'score': 0.00012537441216409206}, {'label': 'LABEL_15', 'score': 9.473998215980828e-05}, {'label': 'LABEL_16', 'score': 6.669617869192734e-05}]
6
[{'label': 'LABEL_0', 'score': 9.85840815701522e-05}, {'label': 'LABEL_1', 'score': 0.0001410262193530798}, {'label': 'LABEL_2', 'score': 6.340965774143115e-05}, {'label': 'LABEL_3', 'score': 9.572453564032912e-05}, {'label': 'LABEL_4', 'score': 0.00011747579992515966}, {'label': 'LABEL_5', 'score': 0.00012954592239111662}, {'label': 'LABEL_6', 'score': 0.9982858300209045}, {'label': 'LABEL_7', 'score': 0.0001560843811603263}, {'label': 'LABEL_8', 'score': 0.00015996283036656678}, {'label': 'LABEL_9', 'score': 0.0001614005013834685}, {'label': 'LABEL_10', 'score': 0.00010834677232196555}, {'label': 'LABEL_11', 'score': 4.9881378799909726e-05}, {'label': 'LABEL_12', 'score': 7.358138827839866e-05}, {'label': 'LABEL_13', 'score': 5.4664047638652846e-05}, {'label': 'LABEL_14', 'score': 0.00013466033851727843}, {'label': 'LABEL_15', 'score': 9.780169057194144e-05}, {'label': 'LABEL_16', 'score': 7.196604565251619e-05}]
6
[{'label': 'LABEL_0', 'score': 9.556901204632595e-05}, {'label': 'LABEL_1', 'score': 0.0001365469943266362}, {'label': 'LABEL_2', 'score': 6.268925790209323e-05}, {'label': 'LABEL_3', 'score': 9.737971413414925e-05}, {'label': 'LABEL_4', 'score': 0.00012014496314805001}, {'label': 'LABEL_5', 'score': 0.00012252115993760526}, {'label': 'LABEL_6', 'score': 0.9983487129211426}, {'label': 'LABEL_7', 'score': 0.0001454231096431613}, {'label': 'LABEL_8', 'score': 0.00014558130351360887}, {'label': 'LABEL_9', 'score': 0.00014958814426790923}, {'label': 'LABEL_10', 'score': 0.00011634181282715872}, {'label': 'LABEL_11', 'score': 4.5097345719113946e-05}, {'label': 'LABEL_12', 'score': 8.068335591815412e-05}, {'label': 'LABEL_13', 'score': 4.724525933852419e-05}, {'label': 'LABEL_14', 'score': 0.00012563375639729202}, {'label': 'LABEL_15', 'score': 9.24634441616945e-05}, {'label': 'LABEL_16', 'score': 6.83424441376701e-05}]
6
[{'label': 'LABEL_0', 'score': 9.575629519531503e-05}, {'label': 'LABEL_1', 'score': 0.00013479188783094287}, {'label': 'LABEL_2', 'score': 6.24070453341119e-05}, {'label': 'LABEL_3', 'score': 9.491511445958167e-05}, {'label': 'LABEL_4', 'score': 0.00011898632510565221}, {'label': 'LABEL_5', 'score': 0.00012223367230035365}, {'label': 'LABEL_6', 'score': 0.9983828067779541}, {'label': 'LABEL_7', 'score': 0.00014901417307555676}, {'label': 'LABEL_8', 'score': 0.0001293729292228818}, {'label': 'LABEL_9', 'score': 0.00014636504056397825}, {'label': 'LABEL_10', 'score': 0.00011709715909091756}, {'label': 'LABEL_11', 'score': 4.3970183469355106e-05}, {'label': 'LABEL_12', 'score': 7.832375558791682e-05}, {'label': 'LABEL_13', 'score': 4.6482757170451805e-05}, {'label': 'LABEL_14', 'score': 0.00011872482718899846}, {'label': 'LABEL_15', 'score': 9.005393803818151e-05}, {'label': 'LABEL_16', 'score': 6.87053834553808e-05}]
6
[{'label': 'LABEL_0', 'score': 9.33124974835664e-05}, {'label': 'LABEL_1', 'score': 0.00012642868387047201}, {'label': 'LABEL_2', 'score': 6.495929847005755e-05}, {'label': 'LABEL_3', 'score': 9.773051715455949e-05}, {'label': 'LABEL_4', 'score': 0.00011607634951360524}, {'label': 'LABEL_5', 'score': 0.00012188677646918222}, {'label': 'LABEL_6', 'score': 0.9983865022659302}, {'label': 'LABEL_7', 'score': 0.0001447165122954175}, {'label': 'LABEL_8', 'score': 0.00012925465125590563}, {'label': 'LABEL_9', 'score': 0.0001489764981670305}, {'label': 'LABEL_10', 'score': 0.0001232580398209393}, {'label': 'LABEL_11', 'score': 4.4239117414690554e-05}, {'label': 'LABEL_12', 'score': 7.944685057736933e-05}, {'label': 'LABEL_13', 'score': 4.5822369429515675e-05}, {'label': 'LABEL_14', 'score': 0.00011649943189695477}, {'label': 'LABEL_15', 'score': 9.088807564694434e-05}, {'label': 'LABEL_16', 'score': 6.998340541031212e-05}]
6
[{'label': 'LABEL_0', 'score': 9.538340236758813e-05}, {'label': 'LABEL_1', 'score': 0.00013363973994273692}, {'label': 'LABEL_2', 'score': 6.720751116517931e-05}, {'label': 'LABEL_3', 'score': 0.00010068194387713447}, {'label': 'LABEL_4', 'score': 0.00011288334644632414}, {'label': 'LABEL_5', 'score': 0.00012565024371724576}, {'label': 'LABEL_6', 'score': 0.9983444213867188}, {'label': 'LABEL_7', 'score': 0.00015267464914359152}, {'label': 'LABEL_8', 'score': 0.00014014744374435395}, {'label': 'LABEL_9', 'score': 0.00014672863471787423}, {'label': 'LABEL_10', 'score': 0.0001220486665260978}, {'label': 'LABEL_11', 'score': 4.699776036432013e-05}, {'label': 'LABEL_12', 'score': 7.61943738325499e-05}, {'label': 'LABEL_13', 'score': 4.92853214382194e-05}, {'label': 'LABEL_14', 'score': 0.00012135148426750675}, {'label': 'LABEL_15', 'score': 9.276873606722802e-05}, {'label': 'LABEL_16', 'score': 7.18621740816161e-05}]
6
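
%% Cell type:markdown id: tags:

The UserWarning above gives the non-deprecated equivalent of `return_all_scores=True`. A minimal sketch of the same pipeline with `top_k=None` (same model and tokenizer assumed):

%% Cell type:code id: tags:

``` python
# Sketch: the non-deprecated form suggested by the UserWarning above.
pipe = TextClassificationPipeline(model=model, tokenizer=tokenizer,
                                  top_k=None, device=device)
scores = pipe("Lyon, petite ville de France, dans la région Rhone-Alpes.")
# `scores` is a list of {'label': 'LABEL_k', 'score': p} dicts per input;
# the exact nesting depends on the transformers version.
```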
%% Cell type:code id: tags:

``` python
prob[0][0]['label'][6:]
```

%% Output

'0'

%% Cell type:code id: tags:

``` python
## TEST
encoder.inverse_transform([int(prob[0][0]['label'][6:])])
```

%% Output

array(['Agriculture'], dtype=object)

%% Cell type:code id: tags:

``` python
pred
```
%% Output

[13,
 6,
 13,
 10,
 7,
 4,
 6,
 6,
 6,
 6,
 6,
 6,
 11,
 7,
 8,
 8,
 8,
 7,
 7,
 7,
 6,
 6,
 7,
 7,
 ...]
%% Cell type:code id: tags:

``` python
import pickle

#encoder_filename = "models/label_encoder.pkl"
encoder_filename = "models/label_encoder_superdomains.pkl"

with open(drive_path + encoder_filename, 'rb') as file:
    encoder = pickle.load(file)
```

%% Output

/opt/homebrew/Caskroom/miniforge/base/envs/geode-classification-py39/lib/python3.9/site-packages/sklearn/base.py:329: UserWarning: Trying to unpickle estimator LabelEncoder from version 1.0.2 when using version 1.1.3. This might lead to breaking code or invalid results. Use at your own risk. For more info please refer to:
https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations
  warnings.warn(

%% Cell type:code id: tags:

``` python
p2 = list(encoder.inverse_transform(pred))
```
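
%% Cell type:markdown id: tags:

The same encoder can decode the raw `LABEL_k` strings printed by the pipeline above. A small sketch, using the `LABEL_6` result seen earlier:

%% Cell type:code id: tags:

``` python
# Sketch: decode a pipeline label such as 'LABEL_6' back to a class name.
label_id = int('LABEL_6'[6:])                    # strip the 'LABEL_' prefix -> 6
print(encoder.inverse_transform([label_id])[0])  # e.g. 'Géographie'
```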
%% Cell type:code id: tags:

``` python
df[corpus+'-superdomainBert'] = p2
```

%% Cell type:code id: tags:

``` python
df.head(10)
```

%% Output

          uid  lge-volume  lge-numero lge-head  lge-page lge-id  \
0   lge_1_a-0           1           1        A         0    a-0
1   lge_1_a-1           1           2        A         1    a-1
2   lge_1_a-2           1           3        A         4    a-2
3   lge_1_a-3           1           4        A         4    a-3
4   lge_1_a-4           1           5        A         4    a-4
5  lge_1_aa-0           1           6       AA         4   aa-0
6  lge_1_aa-1           1           7       AA         4   aa-1
7  lge_1_aa-2           1           8       AA         5   aa-2
8  lge_1_aa-3           1           9       AA         5   aa-3
9  lge_1_aa-4           1          10       AA         5   aa-4

                                         lge-content  lge-nbWords  \
0  A(Ling.). Son vocal et première lettre de notr...       1761.0
1  A(Paléogr.). C’est à l’alphabet phénicien, on ...        839.0
2  A(Log.). Cette voyelle désigne les proposition...         56.0
3  A(Mus.). La lettre a est employée par les musi...        267.0
4  A(Numis.). Dans la numismatique grecque, la le...         67.0
5  AA. Ces deux lettres désignent l’atelier monét...         14.0
6  AA. Nom de plusieurs cours d’eau de l’Europe o...         75.0
7  AA. Rivière de France, prend sa source aux Tro...        165.0
8  AA. Rivière de Hollande, affluent de la Dommel...         17.0
9  AA. Nom de deux fleuves de la Russie. Le premi...         71.0

  lge-superdomainBert
0         Philosophie
1          Géographie
2         Philosophie
3             Musique
4            Histoire
5            Commerce
6          Géographie
7          Géographie
8          Géographie
9          Géographie

%% Cell type:code id: tags:

``` python
#df.to_csv(drive_path + "predictions/EDdA_dataset_articles_superdomainBERT_230313.tsv", sep="\t")
df.to_csv(drive_path + "predictions/LGE_dataset_articles_superdomainBERT_230314.tsv", sep="\t", index=False)
```

%% Cell type:code id: tags:

``` python
#df.drop(columns=['contentLGE', 'contentEDdA'], inplace=True)
```

%% Cell type:code id: tags:

``` python
df.loc[(df[corpus+'-superdomainBert'] == 'Géographie')]
```

%% Output

                         uid  lge-volume  lge-numero     lge-head  lge-page  \
1                  lge_1_a-1           1           2            A         1
6                 lge_1_aa-1           1           7           AA         4
7                 lge_1_aa-2           1           8           AA         5
8                 lge_1_aa-3           1           9           AA         5
9                 lge_1_aa-4           1          10           AA         5
...                      ...         ...         ...          ...       ...
134800      lge_31_zvornix-0          31        7757      ZVORNIX      1370
134801  lge_31_zweibrücken-0          31        7758  ZWEIBRÜCKEN      1370
134803      lge_31_zwickau-0          31        7760      ZWICKAU      1370
134806       lge_31_zwolle-0          31        7763       ZWOLLE      1371
134819        lge_31_zyrmi-0          31        7776        ZYRMI      1372

               lge-id                                        lge-content  \
1                 a-1  A(Paléogr.). C’est à l’alphabet phénicien, on ...
6                aa-1  AA. Nom de plusieurs cours d’eau de l’Europe o...
7                aa-2  AA. Rivière de France, prend sa source aux Tro...
8                aa-3  AA. Rivière de Hollande, affluent de la Dommel...
9                aa-4  AA. Nom de deux fleuves de la Russie. Le premi...
...               ...                                                ...
134800      zvornix-0  ZVORNIX. Ville de Bosnie, sur la r. g. de la D...
134801  zweibrücken-0   ZWEIBRÜCKEN. Ville de Bavière (V. Deux-Ponts).\n
134803      zwickau-0  ZWICKAU. Ville de Saxe, ch.-l. d’un cercle, su...
134806       zwolle-0  ZWOLLE. Ville des Pays-Bas, ch.-l. de la prov....
134819        zyrmi-0  ZYRMI. Ville du Soudan. Ancienne capitale du p...

        lge-nbWords lge-superdomainBert
1             839.0          Géographie
6              75.0          Géographie
7             165.0          Géographie
8              17.0          Géographie
9              71.0          Géographie
...             ...                 ...
134800         27.0          Géographie
134801          6.0          Géographie
134803         92.0          Géographie
134806        115.0          Géographie
134819         16.0          Géographie

[50917 rows x 9 columns]
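
%% Cell type:markdown id: tags:

Beyond filtering one class, a one-liner gives the full distribution of predicted superdomains:

%% Cell type:code id: tags:

``` python
# Count how many articles were assigned to each superdomain.
df[corpus + '-superdomainBert'].value_counts()
```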
%% Cell type:code id: tags:

``` python
df.shape
```

%% Output

(134820, 9)

%% Cell type:code id: tags:

``` python
```