
Exploring Chat Templates with SmolLM2

This notebook demonstrates how to use chat templates with the SmolLM2 model. Chat templates help structure interactions between users and AI models, ensuring consistent and contextually appropriate responses.

# Install the requirements in Google Colab
# !pip install transformers datasets trl huggingface_hub

# Authenticate to Hugging Face
from huggingface_hub import login

login()

# for convenience you can create an environment variable containing your hub token as HF_TOKEN

# Import necessary libraries
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import setup_chat_format
import torch

SmolLM2 Chat Template

Let's explore how to use a chat template with the SmolLM2 model. We'll define a simple conversation and apply the chat template.

# Dynamically set the device
device = (
    "cuda"
    if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available() else "cpu"
)

model_name = "HuggingFaceTB/SmolLM2-135M"
model = AutoModelForCausalLM.from_pretrained(
    pretrained_model_name_or_path=model_name
).to(device)
tokenizer = AutoTokenizer.from_pretrained(pretrained_model_name_or_path=model_name)
model, tokenizer = setup_chat_format(model=model, tokenizer=tokenizer)

# Define messages for SmolLM2
messages = [
    {"role": "user", "content": "Hello, how are you?"},
    {
        "role": "assistant",
        "content": "I'm doing well, thank you! How can I assist you today?",
    },
]

Apply chat template without tokenization

The tokenizer represents the conversation as a string, using special tokens to mark where each user and assistant turn begins and ends.

input_text = tokenizer.apply_chat_template(messages, tokenize=False)

print("Conversation with template:", input_text)

Conversation with template: <|im_start|>user
Hello, how are you?<|im_end|>
<|im_start|>assistant
I'm doing well, thank you! How can I assist you today?<|im_end|>

Decode the conversation

Note that the conversation is represented exactly as above, but with an additional assistant turn appended. Because we pass add_generation_prompt=True, the template ends with an empty <|im_start|>assistant header that cues the model to begin its reply.

input_text = tokenizer.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True
)

print("Conversation decoded:", tokenizer.decode(token_ids=input_text))

Conversation decoded: <|im_start|>user
Hello, how are you?<|im_end|>
<|im_start|>assistant
I'm doing well, thank you! How can I assist you today?<|im_end|>
<|im_start|>assistant

Tokenize the conversation

Of course, the tokenizer also converts the conversation and its special tokens into ids from the model's vocabulary.

input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)

print("Conversation tokenized:", input_text)
Conversation tokenized: [1, 4093, 198, 19556, 28, 638, 359, 346, 47, 2, 198, 1, 520, 9531, 198, 57, 5248, 2567, 876, 28, 9984, 346, 17, 1073, 416, 339, 4237, 346, 1834, 47, 2, 198, 1, 520, 9531, 198]
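These ids can be fed directly to the model. The following is a minimal generation sketch that is not part of the original notebook; SmolLM2-135M is a small base model, so don't expect a polished reply.

# Minimal generation sketch (an addition, not part of the original notebook):
# feed the templated token ids to the model and decode only its continuation.
input_ids = torch.tensor([input_text]).to(device)
outputs = model.generate(
    input_ids, max_new_tokens=50, pad_token_id=tokenizer.eos_token_id
)
# Slice off the prompt so only the newly generated tokens are printed
print(tokenizer.decode(outputs[0][input_ids.shape[1]:], skip_special_tokens=True))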

Exercise: Process a dataset for SFT

Take a dataset from the Hugging Face hub and process it for SFT.

Difficulty Levels

🐢 Convert the `HuggingFaceTB/smoltalk` dataset into chatml format.

🐕 Convert the `openai/gsm8k` dataset into chatml format.

from IPython.display import display, HTML

display(
    HTML(
        """<iframe
  src="https://huggingface.co/datasets/HuggingFaceTB/smoltalk/embed/viewer/all/train?row=0"
  frameborder="0"
  width="100%"
  height="360px"
></iframe>
"""
    )
)
from datasets import load_dataset

ds = load_dataset("HuggingFaceTB/smoltalk", "everyday-conversations")


def process_dataset(sample):
    # TODO: 🐢 Convert the sample into a chat format
    # use the tokenizer's method to apply the chat template
    return sample


ds = ds.map(process_dataset)
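If you want to check your work, here is one possible 🐢 solution. It assumes each sample stores the conversation in a "messages" column of role/content dicts, as the viewer above shows; the "chatml_text" output column is just an illustrative name.

# One possible solution (🐢); "chatml_text" is an illustrative column name
def process_dataset(sample):
    # Render the list of role/content dicts as a ChatML-formatted string
    sample["chatml_text"] = tokenizer.apply_chat_template(
        sample["messages"], tokenize=False
    )
    return sample


ds = ds.map(process_dataset)
print(ds["train"][0]["chatml_text"])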
display(
    HTML(
        """<iframe
  src="https://huggingface.co/datasets/openai/gsm8k/embed/viewer/main/train"
  frameborder="0"
  width="100%"
  height="360px"
></iframe>
"""
    )
)

ds = load_dataset("openai/gsm8k", "main")


def process_dataset(sample):
    # TODO: 🐕 Convert the sample into a chat format
    # 1. create a message format with the role and content
    # 2. apply the chat template to the samples using the tokenizer's method
    return sample


ds = ds.map(process_dataset)
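One possible 🐕 solution, assuming the "main" config exposes "question" and "answer" columns as shown in the viewer above; "chatml_text" is again an illustrative column name.

# One possible solution (🐕); builds a conversation, then applies the template
def process_dataset(sample):
    # 1. Build a two-turn conversation from the question/answer pair
    messages = [
        {"role": "user", "content": sample["question"]},
        {"role": "assistant", "content": sample["answer"]},
    ]
    # 2. Render it as a ChatML-formatted string with the tokenizer
    sample["chatml_text"] = tokenizer.apply_chat_template(messages, tokenize=False)
    return sample


ds = ds.map(process_dataset)
print(ds["train"][0]["chatml_text"])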

Conclusion

This notebook demonstrated how to apply chat templates to a model, SmolLM2. By structuring interactions with chat templates, we can ensure that AI models provide consistent and contextually relevant responses.

In the exercise, you tried out converting a dataset into ChatML format. Luckily, TRL will do this for you, but it's useful to understand what's going on under the hood.
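For reference, a minimal sketch of what that looks like. Argument names vary across TRL versions, so treat this as an assumption-laden outline rather than the exact API.

# Sketch only: recent TRL versions let SFTTrainer apply the chat template
# itself when the training dataset has a "messages" column.
from trl import SFTConfig, SFTTrainer

trainer = SFTTrainer(
    model=model,
    args=SFTConfig(output_dir="./sft-smollm2", max_steps=10),
    train_dataset=ds["train"],  # e.g. the smoltalk split loaded above
    processing_class=tokenizer,  # named `tokenizer=` in older TRL releases
)
# trainer.train()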