
GitHub Repository: huggingface/notebooks
Path: blob/main/course/fr/chapter6/section4.ipynb
Kernel: Python 3

Normalization and pre-tokenization

Install the 🤗 Transformers and 🤗 Datasets libraries to run this notebook.

!pip install datasets transformers[sentencepiece]
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("camembert-base")
print(type(tokenizer.backend_tokenizer))
print(tokenizer.backend_tokenizer.normalizer.normalize_str("Héllò hôw are ü?"))
# Does not seem to work on French
tokenizer_fr = AutoTokenizer.from_pretrained("camembert-base")
tokenizer_fr.backend_tokenizer.normalizer.normalize_str("Bönjoùr commènt vas tü ?")
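For comparison, you can inspect the normalizer of an English uncased checkpoint. The sketch below is not part of the original notebook; it assumes the bert-base-uncased checkpoint, whose BERT-style normalizer lowercases text and strips accents, which is exactly what makes it a poor fit for accented French input.

# Sketch (not in the original notebook): bert-base-uncased is an illustrative choice.
# Its normalizer lowercases and strips accents, losing information on French text.
tokenizer_en = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tokenizer_en.backend_tokenizer.normalizer.normalize_str("Bönjoùr commènt vas tü ?"))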
tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str("Hello, how are you?")
tokenizer = AutoTokenizer.from_pretrained("gpt2") tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str("Hello, how are you?")
tokenizer = AutoTokenizer.from_pretrained("t5-small") tokenizer.backend_tokenizer.pre_tokenizer.pre_tokenize_str("Hello, how are you?")