
torchao is a PyTorch architecture optimization library with support for custom high-performance data types, quantization, and sparsity. It is composable with native PyTorch features such as torch.compile for even faster inference and training.

To quantize a model, install torchao and follow the examples below.

!pip install torchao

If you are running on a GPU, the example below quantizes the model on the GPU with device_map="auto".

import torch
from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer
from torchao.quantization import Int8WeightOnlyConfig

# int8 weight-only quantization, applied per group of 128 weights
quant_config = Int8WeightOnlyConfig(group_size=128)
quantization_config = TorchAoConfig(quant_type=quant_config)

quantized_model = AutoModelForCausalLM.from_pretrained(
    "unsloth/Llama-3.2-1B-Instruct",
    torch_dtype="auto",
    device_map="auto",
    quantization_config=quantization_config,
)

tokenizer = AutoTokenizer.from_pretrained("unsloth/Llama-3.2-1B-Instruct")
input_text = "What are we having for dinner?"
input_ids = tokenizer(input_text, return_tensors="pt").to("cuda")

output = quantized_model.generate(**input_ids, max_new_tokens=10)
print(tokenizer.decode(output[0], skip_special_tokens=True))
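
Because torchao composes with torch.compile (as noted above), you can also compile the quantized model for faster decoding. The following is a minimal sketch that assumes the quantized_model, tokenizer, and input_ids from the GPU cell above; the actual speedup depends on the model and hardware.

import torch

# Sketch: wrap the quantized model's forward pass with torch.compile.
# Assumes `quantized_model`, `tokenizer`, and `input_ids` from the cell above.
quantized_model.forward = torch.compile(quantized_model.forward)

output = quantized_model.generate(**input_ids, max_new_tokens=10)
print(tokenizer.decode(output[0], skip_special_tokens=True))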

The example below quantizes the model on the CPU with device_map="cpu".

import torch
from transformers import TorchAoConfig, AutoModelForCausalLM, AutoTokenizer
from torchao.quantization import Int8WeightOnlyConfig

# same int8 weight-only config as above, but the model stays on the CPU
quant_config = Int8WeightOnlyConfig(group_size=128)
quantization_config = TorchAoConfig(quant_type=quant_config)

quantized_model = AutoModelForCausalLM.from_pretrained(
    "unsloth/Llama-3.2-1B-Instruct",
    torch_dtype="auto",
    device_map="cpu",
    quantization_config=quantization_config,
)

tokenizer = AutoTokenizer.from_pretrained("unsloth/Llama-3.2-1B-Instruct")
input_text = "What are we having for dinner?"
input_ids = tokenizer(input_text, return_tensors="pt")

output = quantized_model.generate(**input_ids, max_new_tokens=10)
print(tokenizer.decode(output[0], skip_special_tokens=True))
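
To see what int8 weight-only quantization saves in practice, you can compare memory footprints. This is a minimal sketch that loads an unquantized baseline of the same checkpoint alongside the quantized_model from the cell above; get_memory_footprint is a standard transformers model method.

from transformers import AutoModelForCausalLM

# Sketch: compare the int8 weight-only model against an unquantized load
# of the same checkpoint (assumes `quantized_model` from the cell above).
baseline = AutoModelForCausalLM.from_pretrained(
    "unsloth/Llama-3.2-1B-Instruct", torch_dtype="auto", device_map="cpu"
)
print(f"baseline:  {baseline.get_memory_footprint() / 1e6:.0f} MB")
print(f"quantized: {quantized_model.get_memory_footprint() / 1e6:.0f} MB")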