
GitHub Repository: huggingface/notebooks
Path: blob/main/sagemaker/29_deploy_llms_on_inferentia2/sagemaker-notebook.ipynb

Deploy Zephyr 7B on AWS Inferentia2 using Amazon SageMaker

This tutorial will show how easy it is to deploy Zephyr 7B on AWS Inferentia2 using Amazon SageMaker. Zephyr is a 7B parameter LLM fine-tuned version of mistralai/Mistral-7B-v0.1 that was trained on a mix of publicly available, synthetic datasets using Direct Preference Optimization (DPO). More details are in the technical report. The model is released under the Apache 2.0 license, ensuring wide accessibility and use. We are going to show you how to:

  1. Setup development environment

  2. Retrieve the TGI Neuronx Image

  3. Deploy Zephyr 7B to Amazon SageMaker

  4. Run inference and chat with the model

Let’s get started.

1. Setup development environment

We are going to use the sagemaker python SDK to deploy Zephyr 7B to Amazon SageMaker. We need to make sure to have an AWS account configured and the sagemaker python SDK installed.

!pip install transformers "sagemaker>=2.206.0" --upgrade --quiet
/bin/pip:6: DeprecationWarning: pkg_resources is deprecated as an API. See https://setuptools.pypa.io/en/latest/pkg_resources.html from pkg_resources import load_entry_point ERROR: sagemaker 2.206.0 has requirement PyYAML~=6.0, but you'll have pyyaml 5.3.1 which is incompatible.

If you are going to use SageMaker in a local environment, you need access to an IAM Role with the required permissions for SageMaker. You can find more about it here.

import sagemaker
import boto3

sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client('iam')
    role = iam.get_role(RoleName='sagemaker_execution_role')['Role']['Arn']

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker session region: {sess.boto_region_name}")
sagemaker.config INFO - Not applying SDK defaults from location: /etc/xdg/sagemaker/config.yaml sagemaker.config INFO - Not applying SDK defaults from location: /home/ubuntu/.config/sagemaker/config.yaml
Couldn't call 'get_role' to get Role ARN from role name philippschmid to get Role path.
sagemaker role arn: arn:aws:iam::558105141721:role/sagemaker_execution_role sagemaker session region: us-east-1

2. Retrieve TGI Neuronx Image

The new Hugging Face TGI Neuronx DLC can be used to run inference on AWS Inferentia2. To retrieve the URI for the desired Hugging Face TGI Neuronx DLC, we can use the get_huggingface_llm_image_uri method provided by the sagemaker SDK, which returns the URI based on the specified backend, session, region, and version. You can find the available versions here.

Note: At the time of writing this blog post the latest version of the Hugging Face LLM DLC is not yet available via the get_huggingface_llm_image_uri method. We are going to use the raw container uri instead.
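In practice this means you can pin the container URI yourself instead of relying on the helper. A minimal sketch, using the URI printed further down in this notebook (the account ID and region are specific to us-east-1; look up the matching DLC URI for other regions):

# Fallback sketch: hard-code the TGI Neuronx container URI instead of resolving it via the SDK helper
llm_image = "763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference:1.13.1-optimum0.0.17-neuronx-py310-ubuntu22.04"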

from sagemaker.huggingface import get_huggingface_llm_image_uri

# retrieve the llm image uri
llm_image = get_huggingface_llm_image_uri(
  "huggingface-neuronx",
  version="0.0.17"
)

# print ecr image uri
print(f"llm image uri: {llm_image}")
llm image uri: 763104351884.dkr.ecr.us-east-1.amazonaws.com/huggingface-pytorch-tgi-inference:1.13.1-optimum0.0.17-neuronx-py310-ubuntu22.04

3. Deploy Zephyr 7B to Amazon SageMaker

Text Generation Inference (TGI) on Inferentia2 supports popular open LLMs, including Llama, Mistral, and more. You can check the full list of supported models (text-generation) here. In this example, we will deploy Hugging Face Zephyr to Amazon SageMaker. Zephyr is a 7B parameter LLM fine-tuned version of mistralai/Mistral-7B-v0.1 that was trained on a mix of publicly available, synthetic datasets using Direct Preference Optimization (DPO). You can find more details in the technical report.

Compiling LLMs for Inferentia2

At the time of writing, AWS Inferentia2 does not support dynamic shapes for inference, which means that we need to specify our sequence length and batch size in advance. To make it easier for customers to utilize the full power of Inferentia2, we created a neuron model cache, which contains pre-compiled configurations for the most popular LLMs. A cached configuration is defined through a model architecture (Mistral), model size (7B), neuron version (2.16), number of inferentia cores (2), batch size (2), and sequence length (2048). This means compiling fine-tuned checkpoints for Mistral 7B with the same configuration will take only a few minutes. Examples of this are mistralai/Mistral-7B-v0.1 and HuggingFaceH4/zephyr-7b-beta.

Note: Currently, TGI can only load compiled checkpoints and models. We are working on on-the-fly compilation based on the cache. This means that you would be able to pass any model ID from the Hugging Face Hub, e.g., HuggingFaceH4/zephyr-7b-beta, as long as a cached configuration exists. This should be added in the next release. We will update the blog here once it is released.

For this blog, we compiled HuggingFaceH4/zephyr-7b-beta using the following command and parameters on an inf2.8xlarge instance and pushed it to the Hugging Face Hub:

# compile model with optimum for batch size 4 and sequence length 2048
optimum-cli export neuron -m HuggingFaceH4/zephyr-7b-beta --batch_size 4 --sequence_length 2048 --num_cores 2 --auto_cast_type bf16 ./zephyr-7b-beta-neuron

# push model to hub [repo_id] [local_path] [path_in_repo]
huggingface-cli upload aws-neuron/zephyr-7b-seqlen-2048-bs-4 ./zephyr-7b-beta-neuron ./ --exclude "checkpoint/**"

# Move tokenizer to neuron model repository
python -c "from transformers import AutoTokenizer; AutoTokenizer.from_pretrained('HuggingFaceH4/zephyr-7b-beta').push_to_hub('aws-neuron/zephyr-7b-seqlen-2048-bs-4')"
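Optionally, the compiled artifact can be sanity-checked locally on the same inf2 instance before uploading. A minimal sketch, assuming optimum-neuron is installed and the export above has finished (the prompt and generation settings are just illustrative):

# Sketch: load the compiled checkpoint with optimum-neuron and generate a few tokens locally
from optimum.neuron import NeuronModelForCausalLM
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")
model = NeuronModelForCausalLM.from_pretrained("./zephyr-7b-beta-neuron")

inputs = tokenizer("What is AWS Inferentia2?", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))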

If you are trying to compile an LLM with a configuration that is not yet cached, it can take up to 45 minutes.
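To check whether a configuration is already cached before starting a long compilation, recent optimum-neuron releases ship a cache lookup command; the exact subcommand and output format may differ between versions, so treat this as a sketch:

# Sketch: look up cached Neuron compilation configurations for a model on the Hugging Face Hub
optimum-cli neuron cache lookup HuggingFaceH4/zephyr-7b-beta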

Deploying TGI Neuronx Endpoint

Before deploying the model to Amazon SageMaker, we must define the TGI Neuronx endpoint configuration. Due to the current constraints of Inferentia2, we need to make sure that the following parameters are set consistently with the configuration the model was compiled with (see the sketch after the list):

  • MAX_CONCURRENT_REQUESTS: Equal to the batch size used to compile the model.

  • MAX_INPUT_LENGTH: Equal to or lower than the sequence length used to compile the model.

  • MAX_TOTAL_TOKENS: Equal to the sequence length used to compile the model.

  • MAX_BATCH_PREFILL_TOKENS: Half of the max tokens, i.e. (batch_size * sequence_length) / 2.

  • MAX_BATCH_TOTAL_TOKENS: Equal to the max tokens, i.e. batch_size * sequence_length.

In addition, we need to set the HF_MODEL_ID pointing to the Hugging Face model ID.
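To make the relationship explicit, these values can be derived from the compilation parameters with a few lines of plain Python. A minimal sketch (the tgi_neuronx_env helper is ours, not part of the sagemaker SDK):

def tgi_neuronx_env(model_id, batch_size, sequence_length, max_input_length):
    # Derive the TGI Neuronx environment from the parameters the model was compiled with
    assert max_input_length <= sequence_length, "MAX_INPUT_LENGTH must not exceed the compiled sequence length"
    return {
        "HF_MODEL_ID": model_id,
        "MAX_CONCURRENT_REQUESTS": str(batch_size),                          # = compiled batch size
        "MAX_INPUT_LENGTH": str(max_input_length),                           # <= compiled sequence length
        "MAX_TOTAL_TOKENS": str(sequence_length),                            # = compiled sequence length
        "MAX_BATCH_PREFILL_TOKENS": str(batch_size * sequence_length // 2),  # half of the max tokens
        "MAX_BATCH_TOTAL_TOKENS": str(batch_size * sequence_length),         # = max tokens
    }

# e.g. tgi_neuronx_env("aws-neuron/zephyr-7b-seqlen-2048-bs-4-cores-2", 4, 2048, 1512)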

import json
from sagemaker.huggingface import HuggingFaceModel

# sagemaker config & model config
instance_type = "ml.inf2.8xlarge"
health_check_timeout = 900
batch_size = 4
sequence_length = 2048

# Define Model and Endpoint configuration parameter
config = {
  'HF_MODEL_ID': "aws-neuron/zephyr-7b-seqlen-2048-bs-4-cores-2",
  'MAX_CONCURRENT_REQUESTS': json.dumps(batch_size),
  'MAX_INPUT_LENGTH': json.dumps(1512),
  'MAX_TOTAL_TOKENS': json.dumps(sequence_length),
  'MAX_BATCH_PREFILL_TOKENS': json.dumps(int(sequence_length*batch_size / 2)),
  'MAX_BATCH_TOTAL_TOKENS': json.dumps(sequence_length*batch_size),
}

# create HuggingFaceModel with the image uri
llm_model = HuggingFaceModel(
  role=role,
  image_uri=llm_image,
  env=config
)

After we have created the HuggingFaceModel, we can deploy it to Amazon SageMaker using the deploy method. We will deploy the model with the ml.inf2.8xlarge instance type.

# Deploy model to an endpoint
# https://sagemaker.readthedocs.io/en/stable/api/inference/model.html#sagemaker.model.Model.deploy
llm = llm_model.deploy(
  initial_instance_count=1,
  instance_type=instance_type,
  container_startup_health_check_timeout=health_check_timeout,  # 15 minutes to be able to load the model
)
Your model is not compiled. Please compile your model before using Inferentia.
------------------------!

SageMaker will create our endpoint and deploy the model to it. This can take 10-15 minutes.
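If the notebook kernel restarts after the endpoint has been created, there is no need to redeploy; you can reattach to the running endpoint by name. A small sketch (the endpoint name is a placeholder; use llm.endpoint_name from the deploy step or the name shown in the SageMaker console):

# Sketch: reconnect to an already-deployed endpoint from a fresh session
from sagemaker.huggingface import HuggingFacePredictor

llm = HuggingFacePredictor(
    endpoint_name="<your-endpoint-name>",  # placeholder
    sagemaker_session=sess,
)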

4. Run inference and chat with the model

After our endpoint is deployed, we can run inference on it. We will use the predict method from the predictor to run inference on our endpoint. We can use different parameters to influence the generation, defined in the parameters attribute of the payload. You can find the supported parameters here, or in the open API specification of TGI in the swagger documentation.

HuggingFaceH4/zephyr-7b-beta is a conversational chat model, meaning we can chat with it using the following prompt format:

<|system|>\nYou are a friendly.</s>\n<|user|>\nInstruction</s>\n<|assistant|>\n

To avoid drafting the prompt ourselves, we can use the apply_chat_template method from the tokenizer, which takes a list of messages in the familiar OpenAI format and converts it into the correct prompt format for the model. Let's see if Zephyr knows some facts about AWS.

from transformers import AutoTokenizer

# load the tokenizer
tokenizer = AutoTokenizer.from_pretrained("aws-neuron/zephyr-7b-seqlen-2048-bs-4-cores-2")
# Prompt to generate
messages = [
    {"role": "system", "content": "You are the AWS expert"},
    {"role": "user", "content": "Can you tell me an interesting fact about AWS?"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Generation arguments
payload = {
    "do_sample": True,
    "top_p": 0.6,
    "temperature": 0.9,
    "top_k": 50,
    "max_new_tokens": 256,
    "repetition_penalty": 1.03,
    "return_full_text": False,
    "stop": ["</s>"]
}

chat = llm.predict({"inputs": prompt, "parameters": payload})

print(chat[0]["generated_text"][len(prompt):])
# Sure, here's an interesting fact about AWS: As of 2021, AWS has more than 200 services in its portfolio, ranging from compute power and storage to databases,
Sure, here's an interesting fact about AWS: As of 2021, AWS has more than 200 services in its portfolio, ranging from compute power and storage to databases, analytics, and machine learning. This vast array of services allows developers and businesses to build and deploy complex applications and workflows with flexibility and agility, without having to manage the underlying infrastructure. In fact, AWS's extensive service offerings have contributed to its dominance in the cloud computing market, with a market share of over 30% as of 2021.</s>
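Because apply_chat_template works on the full message history, continuing the conversation only requires appending the assistant reply and a follow-up question before re-applying the template. A short sketch reusing the payload from above (the follow-up question is just an example):

# Sketch: multi-turn chat by extending the message history
messages.append({"role": "assistant", "content": chat[0]["generated_text"][len(prompt):]})
messages.append({"role": "user", "content": "And when was AWS launched?"})

prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
chat = llm.predict({"inputs": prompt, "parameters": payload})
print(chat[0]["generated_text"][len(prompt):])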

Awesome, we have successfully deployed Zephyr to Amazon SageMaker on Inferentia2 and chatted with it.

5. Clean up

To clean up, we can delete the model and endpoint.

llm.delete_model()
llm.delete_endpoint()