Deploy Zephyr 7B on AWS Inferentia2 using Amazon SageMaker
This tutorial will show how easy it is to deploy Zephyr 7B on AWS Inferentia2 using Amazon SageMaker. Zephyr is a 7B parameter LLM fine-tuned version of mistralai/Mistral-7B-v0.1 that was trained on a mix of publicly available, synthetic datasets using Direct Preference Optimization (DPO). More details are in the technical report. The model is released under the Apache 2.0 license, ensuring wide accessibility and use. We are going to show you how to:
Setup development environment
Retrieve the TGI Neuronx Image
Deploy Zephyr 7B to Amazon SageMaker
Run inference and chat with the model
Let’s get started.
1. Setup development environment
We are going to use the sagemaker Python SDK to deploy Zephyr to Amazon SageMaker. We need to make sure to have an AWS account configured and the sagemaker Python SDK installed.
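A minimal sketch of the install step (the version pin is an assumption; any recent release of the SDK should work):

```python
# Install/upgrade the SageMaker Python SDK
!pip install "sagemaker>=2.199.0" --upgrade --quiet
```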
If you are going to use SageMaker in a local environment, you need access to an IAM Role with the required permissions for SageMaker. You can find more about it here.
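A minimal sketch of the session and role setup; the fallback role name is an assumption and should be replaced with your own IAM Role:

```python
import sagemaker
import boto3

try:
    # Works when running inside SageMaker (Studio or notebook instances)
    role = sagemaker.get_execution_role()
except ValueError:
    # Running locally: look up an execution role by name
    # (the role name "sagemaker_execution_role" is an assumption; use your own)
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

sess = sagemaker.Session()
print(f"sagemaker role arn: {role}")
print(f"sagemaker session region: {sess.boto_region_name}")
```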
2. Retrieve TGI Neuronx Image
The new Hugging Face TGI Neuronx DLC can be used to run inference on AWS Inferentia2. To retrieve the URI for the desired Hugging Face TGI Neuronx DLC we can use the get_huggingface_llm_image_uri method provided by the sagemaker SDK. This method allows us to retrieve the URI for the desired Hugging Face TGI Neuronx DLC based on the specified backend, session, region, and version. You can find the available versions here.
Note: At the time of writing this blog post the latest version of the Hugging Face LLM DLC is not yet available via the get_huggingface_llm_image_uri method. We are going to use the raw container URI instead.
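A sketch of how the raw container URI can be assembled; 763104351884 is the public AWS Deep Learning Containers registry account, but the repository name and tag below are placeholders, so check the available Hugging Face TGI Neuronx DLC releases for the exact image:

```python
# Assemble the raw container URI for the Hugging Face TGI Neuronx DLC
# (repository name and tag are placeholders/assumptions)
region = sess.boto_region_name  # SageMaker session from the setup section

llm_image = (
    f"763104351884.dkr.ecr.{region}.amazonaws.com/"
    "huggingface-pytorch-tgi-inference:<neuronx-version-tag>"
)
print(f"llm image uri: {llm_image}")
```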
3. Deploy Zephyr 7B to Amazon SageMaker
Text Generation Inference (TGI) on Inferentia2 supports popular open LLMs, including Llama, Mistral, and more. You can check the full list of supported models (text-generation) here. In this example, we will deploy Hugging Face Zephyr to Amazon SageMaker. Zephyr is a 7B parameter LLM fine-tuned version of mistralai/Mistral-7B-v0.1 that was trained on a mix of publicly available, synthetic datasets using Direct Preference Optimization (DPO). You can find more details in the technical report.
Compiling LLMs for Inferentia2
At the time of writing, AWS Inferentia2 does not support dynamic shapes for inference, which means that we need to specify our sequence length and batch size in advance. To make it easier for customers to utilize the full power of Inferentia2, we created a neuron model cache, which contains pre-compiled configurations for the most popular LLMs. A cached configuration is defined through a model architecture (Mistral), model size (7B), neuron version (2.16), number of Inferentia cores (2), batch size (2), and sequence length (2048). This means compiling fine-tuned checkpoints for Mistral 7B with the same configuration will take only a few minutes. Examples of this are mistralai/Mistral-7B-v0.1 and HuggingFaceH4/zephyr-7b-beta.
Note: Currently, TGI can only load compiled checkpoints and models. We are working on an on-the-fly compilation based on the cache. This means that you can pass any model ID from the Hugging Face Hub, e.g., HuggingFaceH4/zephyr-7b-beta, if there is a cached configuration. This should be added in the next release. We will update the blog here once released.
For the blog we compiled HuggingFaceH4/zephyr-7b-beta using the following command and parameters on an inf2.8xlarge instance and pushed it to the Hub at:
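A hedged sketch of an equivalent compilation using optimum-neuron's export API; the batch size, core count, and cast type below are assumptions matching the cached configuration described above, and the output directory name is a placeholder:

```python
from optimum.neuron import NeuronModelForCausalLM
from transformers import AutoTokenizer

# Export/compile the checkpoint for Inferentia2. These shapes and compiler
# arguments are assumptions matching the cached configuration described above
# (2 cores, batch size 2, sequence length 2048, bf16).
compiler_args = {"num_cores": 2, "auto_cast_type": "bf16"}
input_shapes = {"batch_size": 2, "sequence_length": 2048}

model = NeuronModelForCausalLM.from_pretrained(
    "HuggingFaceH4/zephyr-7b-beta",
    export=True,
    **compiler_args,
    **input_shapes,
)

# Save the compiled model and tokenizer locally; the directory can then be
# uploaded to the Hugging Face Hub under a repository of your choice.
save_dir = "zephyr-7b-beta-neuron"
model.save_pretrained(save_dir)
AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta").save_pretrained(save_dir)
```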
If you are trying to compile an LLM with a configuration that is not yet cached, it can take up to 45 minutes.
Deploying TGI Neuronx Endpoint
Before deploying the model to Amazon SageMaker, we must define the TGI Neuronx endpoint configuration. Due to the current limitations of Inferentia2, we need to make sure that the following parameters are set to match the configuration the model was compiled with:
MAX_CONCURRENT_REQUESTS: Equals the batch size, which was used to compile the model.
MAX_INPUT_LENGTH: Equal to or lower than the sequence length, which was used to compile the model.
MAX_TOTAL_TOKENS: Equals the sequence length, which was used to compile the model.
MAX_BATCH_PREFILL_TOKENS: Half of the max tokens: [batch_size * sequence_length] / 2
MAX_BATCH_TOTAL_TOKENS: Equals the max tokens: batch_size * sequence_length
In addition, we need to set the HF_MODEL_ID pointing to the Hugging Face model ID.
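A minimal sketch of the endpoint configuration and model creation; the compiled-checkpoint repository name is a placeholder, llm_image is the container URI from the previous section, and the concrete values assume batch size 2 and sequence length 2048 as in the compilation example above:

```python
from sagemaker.huggingface import HuggingFaceModel

# Values assume the model was compiled with batch size 2 and sequence length 2048
batch_size = 2
sequence_length = 2048

config = {
    # placeholder: point HF_MODEL_ID to your compiled checkpoint on the Hub
    "HF_MODEL_ID": "<your-username>/zephyr-7b-beta-neuron",
    "MAX_CONCURRENT_REQUESTS": str(batch_size),
    "MAX_INPUT_LENGTH": str(sequence_length - 1),            # <= sequence length
    "MAX_TOTAL_TOKENS": str(sequence_length),                 # = sequence length
    "MAX_BATCH_PREFILL_TOKENS": str(batch_size * sequence_length // 2),
    "MAX_BATCH_TOTAL_TOKENS": str(batch_size * sequence_length),
}

# Create the HuggingFaceModel with the TGI Neuronx image and the configuration
llm_model = HuggingFaceModel(
    role=role,            # IAM role from the setup section
    image_uri=llm_image,  # container URI from the previous section
    env=config,
)
```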
After we have created the HuggingFaceModel we can deploy it to Amazon SageMaker using the deploy method. We will deploy the model with the ml.inf2.8xlarge instance type.
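A sketch of the deploy call; the health-check timeout and volume size are assumptions intended to give the compiled model enough time and disk space to load:

```python
# Deploy the model to a real-time endpoint on an Inferentia2 instance
llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type="ml.inf2.8xlarge",
    container_startup_health_check_timeout=1800,  # assumption: allow time for model loading
    volume_size=512,                              # assumption: EBS volume size in GB
)
```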
SageMaker will create our endpoint and deploy the model to it. This can take 10-15 minutes.
4. Run inference and chat with the model
After our endpoint is deployed, we can run inference on it. We will use the predict method from the predictor to run inference on our endpoint. We can run inference with different parameters to impact the generation; parameters can be defined in the parameters attribute of the payload. You can find the supported parameters here, or in the open API specification of TGI in the swagger documentation.
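A minimal sketch of a predict call with generation parameters; the prompt and parameter values are illustrative:

```python
# Send a request to the endpoint; generation parameters go into "parameters"
res = llm.predict(
    {
        "inputs": "What is the capital of France?",
        "parameters": {
            "do_sample": True,
            "max_new_tokens": 128,
            "temperature": 0.7,
            "top_k": 50,
            "top_p": 0.95,
        },
    }
)
print(res[0]["generated_text"])
```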
The HuggingFaceH4/zephyr-7b-beta is a conversational chat model, meaning we can chat with it using the following prompt:
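The structure below follows Zephyr's chat template; the system and user turns are illustrative:

```
<|system|>
You are a friendly chatbot.</s>
<|user|>
What is Amazon SageMaker?</s>
<|assistant|>
```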
To avoid drafting the prompt ourselves, we can use the apply_chat_template method from the tokenizer, which expects a list of messages in the known OpenAI format and converts it into the correct format for the model. Let's see if Zephyr knows some facts about AWS.
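A sketch using the tokenizer's apply_chat_template together with the endpoint; the messages and generation parameters are illustrative:

```python
from transformers import AutoTokenizer

# Build the prompt with the model's own chat template
tokenizer = AutoTokenizer.from_pretrained("HuggingFaceH4/zephyr-7b-beta")

# Messages in the OpenAI chat format
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What services does AWS offer for training machine learning models?"},
]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

# Run inference against the SageMaker endpoint
res = llm.predict(
    {
        "inputs": prompt,
        "parameters": {"max_new_tokens": 256, "stop": ["</s>"]},
    }
)
print(res[0]["generated_text"])
```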
Awesome, we have successfully deployed Zephyr to Amazon SageMaker on Inferentia2 and chatted with it.
5. Clean up
To clean up, we can delete the model and endpoint.
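A minimal sketch of the cleanup calls:

```python
# Delete the model and endpoint to stop incurring costs
llm.delete_model()
llm.delete_endpoint()
```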