GitHub Repository: huggingface/notebooks
Path: blob/main/course/fr/chapter8/section4_tf.ipynb
Kernel: Python 3

Debugging the training pipeline

Since this chapter is about debugging, the language of the notebook matters little here. What we are interested in is the logic of the code, in order to understand where the error comes from.

Install the 🤗 Transformers and 🤗 Datasets libraries to run this notebook.

!pip install datasets transformers[sentencepiece]
# NOTE: this training script is deliberately broken; the rest of the notebook
# walks through debugging it step by step.
from datasets import load_dataset, load_metric
from transformers import (
    AutoTokenizer,
    TFAutoModelForSequenceClassification,
)

raw_datasets = load_dataset("glue", "mnli")

model_checkpoint = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)


def preprocess_function(examples):
    return tokenizer(examples["premise"], examples["hypothesis"], truncation=True)


tokenized_datasets = raw_datasets.map(preprocess_function, batched=True)

train_dataset = tokenized_datasets["train"].to_tf_dataset(
    columns=["input_ids", "labels"], batch_size=16, shuffle=True
)

validation_dataset = tokenized_datasets["validation_matched"].to_tf_dataset(
    columns=["input_ids", "labels"], batch_size=16, shuffle=True
)

model = TFAutoModelForSequenceClassification.from_pretrained(model_checkpoint)

model.compile(loss="sparse_categorical_crossentropy", optimizer="adam")

model.fit(train_dataset)
# Grab a single batch from the dataset so we can inspect it
for batch in train_dataset:
    break
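When fit() fails, the first thing to check is what the dataset is actually feeding the model. A minimal inspection sketch, assuming batch is the dict grabbed above:

# Print the keys, shapes, and dtypes of everything in the batch
for key, value in batch.items():
    print(key, value.shape, value.dtype)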
# Recompile without an explicit loss: when the batch contains labels,
# 🤗 Transformers TF models compute the appropriate loss internally
# (the string loss above assumed probabilities, but the model outputs logits)
model.compile(optimizer="adam")
# Run a forward pass on the single batch, outside of fit(), to see the raw outputs
model(batch)
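The forward pass returns an output object rather than a bare tensor; printing its fields shows what the model computed for this batch. A small sketch (TF models in 🤗 Transformers return one loss value per sample when labels are included in the input dict):

outputs = model(batch)
print(outputs.loss)          # per-sample losses, since "labels" is in the batch
print(outputs.logits.shape)  # (batch_size, num_labels)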
# Reload a fresh model and try the forward pass again: nan losses during
# fit may have corrupted the weights of the previous one
model = TFAutoModelForSequenceClassification.from_pretrained(model_checkpoint)
model(batch)
import numpy as np

# TF models return one loss per sample; find the samples whose loss is nan
loss = model(batch).loss.numpy()
indices = np.flatnonzero(np.isnan(loss))
indices
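The samples that produce a nan loss often have something in common. Since indices points at them, we can compare their labels to the rest of the batch; a quick check, reusing the variables above:

# Do the nan samples share a label?
labels = batch["labels"].numpy()
print(labels[indices])  # labels of the samples whose loss is nan
print(labels)           # all labels in the batch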
# Look at the input IDs of the samples that produced a nan loss
input_ids = batch["input_ids"].numpy()
input_ids[indices]
# How many classes does the model think there are?
model.config.num_labels
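num_labels comes back as 2, the default, but MNLI is a three-class task (entailment, neutral, contradiction), so any example labelled 2 falls outside the model's output range and yields a nan loss. A sketch of the fix is to pass the right number of classes when loading the model:

# Reload the model with the correct number of classes for MNLI
model = TFAutoModelForSequenceClassification.from_pretrained(
    model_checkpoint, num_labels=3
)
model.compile(optimizer="adam")
model.fit(train_dataset)  # the loss should no longer be nan

If the loss is now finite but decreases very slowly, the next suspect is the learning rate.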
from tensorflow.keras.optimizers import Adam

# Reload a fresh model (keeping the num_labels fix) and lower the learning rate:
# Keras' default for Adam (1e-3) is much too high for fine-tuning a transformer
model = TFAutoModelForSequenceClassification.from_pretrained(
    model_checkpoint, num_labels=3
)
model.compile(optimizer=Adam(5e-5))
model.fit(train_dataset)
# Decode the first sample in the batch to check that the input text looks right
input_ids = batch["input_ids"].numpy()
tokenizer.decode(input_ids[0])
# Check the label that goes with that first sample
labels = batch["labels"].numpy()
label = labels[0]
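To interpret that integer, the dataset's ClassLabel feature stores the class names; a small sketch, assuming raw_datasets is still in scope:

# Map the integer label back to its MNLI class name
label_names = raw_datasets["train"].features["label"].names
print(label, "->", label_names[label])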
for batch in train_dataset:
    break

# Make sure you have run model.compile() and set your optimizer,
# and your losses/metrics if you are using them
model.fit(batch, epochs=20)
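A healthy model should be able to memorize a single batch: after these 20 epochs the loss should be close to zero and the predictions on that batch nearly perfect. A sanity check on the predictions, reusing batch from above:

import numpy as np

# After overfitting, the predicted classes should match the batch labels almost exactly
preds = np.argmax(model(batch).logits.numpy(), axis=-1)
print((preds == batch["labels"].numpy()).mean())  # should be close to 1.0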