Accelerate BERT Inference with Hugging Face Transformers and AWS Inferentia
In this end-to-end tutorial, you will learn how to speed up BERT inference for text classification with Hugging Face Transformers, Amazon SageMaker, and AWS Inferentia.
You will learn how to:
- Convert your Hugging Face Transformer to AWS Neuron (Inferentia) 
- Create a custom inference.py script for text-classification
- Create and upload the neuron model and inference script to Amazon S3 
- Deploy a Real-time Inference Endpoint on Amazon SageMaker 
- Run and evaluate Inference performance of BERT on Inferentia 
Let's get started! 🚀
If you are going to use SageMaker in a local environment (not SageMaker Studio or Notebook Instances), you need access to an IAM role with the required permissions for SageMaker. You can find more about it here.
1. Convert your Hugging Face Transformer to AWS Neuron
We are going to use the AWS Neuron SDK for AWS Inferentia. The Neuron SDK includes a deep learning compiler, runtime, and tools for converting and compiling PyTorch and TensorFlow models to neuron compatible models, which can be run on EC2 Inf1 instances.
As a first step, we need to install the Neuron SDK and the required packages.
Tip: If you are using Amazon SageMaker Notebook Instances or Studio you can go with the conda_python3 conda kernel.
After we have installed the Neuron SDK, we can load and convert our model. Neuron models are converted using torch_neuron with its trace method, similar to TorchScript. You can find more information in our documentation.
To be able to convert our model, we first need to select the model we want to use for our text classification pipeline from hf.co/models. For this example, let's go with distilbert-base-uncased-finetuned-sst-2-english, but this can easily be swapped for other BERT-like models.
At the time of writing, the AWS Neuron SDK does not support dynamic shapes, which means that the input size needs to be static for compiling and inference.
In simpler terms, when the model is compiled with an input of batch size 1 and sequence length 16, it can only run inference on inputs with that same shape. Shorter inputs must be padded and longer inputs truncated to the compiled length.
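To make the static-shape constraint concrete, here is a minimal pure-Python sketch of forcing a list of token IDs to a fixed length. The helper name `to_fixed_length`, the pad ID of 0, and the example token IDs are assumptions for illustration only; in practice the tokenizer does this for you through `padding="max_length"` and `truncation=True`.

```python
from typing import List, Tuple


def to_fixed_length(
    token_ids: List[int], max_length: int = 16, pad_id: int = 0
) -> Tuple[List[int], List[int]]:
    """Pad or truncate token_ids to exactly max_length.

    Returns the fixed-length IDs and the matching attention mask
    (1 for real tokens, 0 for padding). This is a simplification:
    a real tokenizer also keeps the special end token when truncating.
    """
    ids = token_ids[:max_length]          # truncate inputs that are too long
    mask = [1] * len(ids)
    padding = max_length - len(ids)       # 0 when the input was truncated
    return ids + [pad_id] * padding, mask + [0] * padding


ids, mask = to_fixed_length([101, 2023, 2003, 102], max_length=16)
print(len(ids), len(mask))
```

Every input, short or long, leaves this function with the same shape, which is what a compiled Neuron model expects.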
When using a t2.medium instance, compiling takes around 2-3 minutes.
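The conversion step could look roughly like the sketch below. It assumes the `torch-neuron` and `transformers` packages are installed, a static sequence length of 128, a `tmp/` output directory, and a `traced_sequence_length` config entry that is simply a convention chosen here to remember the compiled length. The function name `convert_to_neuron` is hypothetical, and the heavy imports are placed inside it so the file can be loaded on machines without the Neuron SDK.

```python
import os

MODEL_ID = "distilbert-base-uncased-finetuned-sst-2-english"
MAX_LENGTH = 128   # assumed static sequence length used for compilation
SAVE_DIR = "tmp"   # assumed output directory for the model artifacts


def convert_to_neuron(model_id=MODEL_ID, max_length=MAX_LENGTH, save_dir=SAVE_DIR):
    """Trace a sequence-classification model for Inferentia and save the artifacts."""
    # Imported lazily: these require the Neuron SDK and transformers to be installed.
    import torch
    import torch.neuron  # provided by the torch-neuron package
    from transformers import AutoModelForSequenceClassification, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    # torchscript=True makes the model return tuples, which tracing requires.
    model = AutoModelForSequenceClassification.from_pretrained(model_id, torchscript=True)

    # Build a dummy input padded to the static shape the model will be compiled for.
    embeddings = tokenizer(
        "dummy input which will be padded later",
        max_length=max_length,
        padding="max_length",
        return_tensors="pt",
    )
    neuron_inputs = tuple(embeddings.values())

    # Compile the model for Neuron; this is the slow step.
    model_neuron = torch.neuron.trace(model, neuron_inputs)

    # Record the compiled length so the inference code can pad to the same shape.
    model.config.update({"traced_sequence_length": max_length})

    os.makedirs(save_dir, exist_ok=True)
    weights_path = os.path.join(save_dir, "neuron_model.pt")
    model_neuron.save(weights_path)
    tokenizer.save_pretrained(save_dir)
    model.config.save_pretrained(save_dir)
    return weights_path
```

Calling `convert_to_neuron()` on a machine with the Neuron SDK would leave the traced weights, tokenizer files, and config in `tmp/`.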
2. Create a custom inference.py script for text-classification
The Hugging Face Inference Toolkit supports zero-code deployments on top of the pipeline feature from 🤗 Transformers. This allows users to deploy Hugging Face transformers without an inference script [Example].
Currently, this feature is not supported with AWS Inferentia, which means we need to provide an inference.py script for running inference.
If you are interested in support for zero-code deployments on Inferentia, let us know on the forum.
In our inference.py script, we are going to override model_fn to load our Neuron model and predict_fn to create a text-classification pipeline.
If you want to know more about the inference.py script check out this example. It explains amongst other things what the model_fn and predict_fn are.
We set NEURON_RT_NUM_CORES=1 so that each HTTP worker uses exactly one Neuron core, which maximizes throughput.
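A minimal sketch of such an inference.py is shown below. It assumes the model directory contains a traced `neuron_model.pt`, the tokenizer files, and a config carrying a `traced_sequence_length` entry recorded at compile time; that config key is an assumption of this sketch, not a Transformers standard. The imports sit inside the functions only so the file can be inspected without the Neuron SDK installed; in a deployed script they can live at the top.

```python
import os

# Use one Neuron core per HTTP worker.
os.environ["NEURON_RT_NUM_CORES"] = "1"

# Assumed file name of the traced weights inside the model directory.
AWS_NEURON_TRACED_WEIGHTS_NAME = "neuron_model.pt"


def model_fn(model_dir):
    """Load the tokenizer, the traced Neuron model, and the config from model_dir."""
    import torch
    import torch.neuron  # noqa: F401  (needed so torch.jit.load can resolve Neuron ops)
    from transformers import AutoConfig, AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained(model_dir)
    model = torch.jit.load(os.path.join(model_dir, AWS_NEURON_TRACED_WEIGHTS_NAME))
    model_config = AutoConfig.from_pretrained(model_dir)
    return model, tokenizer, model_config


def predict_fn(data, model_tokenizer_model_config):
    """Run text classification on data["inputs"] and return JSON-serializable results."""
    import torch

    model, tokenizer, model_config = model_tokenizer_model_config

    inputs = data.pop("inputs", data)
    # Pad and truncate to the static shape the model was compiled with.
    embeddings = tokenizer(
        inputs,
        return_tensors="pt",
        max_length=model_config.traced_sequence_length,
        padding="max_length",
        truncation=True,
    )
    neuron_inputs = tuple(embeddings.values())

    with torch.no_grad():
        logits = model(*neuron_inputs)[0]
        scores = torch.nn.Softmax(dim=1)(logits)

    return [
        {"label": model_config.id2label[item.argmax().item()], "score": item.max().item()}
        for item in scores
    ]
```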
3. Create and upload the neuron model and inference script to Amazon S3
Before we can deploy our Neuron model to Amazon SageMaker, we need to create a model.tar.gz archive containing all model artifacts saved into tmp/ (e.g. neuron_model.pt) and upload it to Amazon S3.
To do this we need to set up our permissions.
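One possible way to set up the session and role is sketched below. It assumes the `sagemaker` and `boto3` packages are installed and AWS credentials are configured. The helper name `get_session_and_role` and the fallback role name `sagemaker_execution_role` are placeholders for illustration; substitute the role that exists in your account.

```python
def get_session_and_role(fallback_role_name="sagemaker_execution_role"):
    """Return a SageMaker session and an execution role ARN."""
    # Imported lazily: both packages need AWS credentials to be useful.
    import boto3
    import sagemaker

    sess = sagemaker.Session()
    try:
        # Works inside SageMaker Studio and Notebook Instances.
        role = sagemaker.get_execution_role()
    except ValueError:
        # Local environment: look the role up by name through IAM.
        iam = boto3.client("iam")
        role = iam.get_role(RoleName=fallback_role_name)["Role"]["Arn"]
    return sess, role
```

With credentials in place, `sess, role = get_session_and_role()` gives you the two objects the later steps need, and `sess.default_bucket()` returns the session's S3 bucket.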
Next, we create our model.tar.gz. The inference.py script will be placed into a code/ folder.
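The archive can also be built with Python's standard `tarfile` module, as in the sketch below. The function name `build_model_archive` and the default paths (`tmp/` for artifacts, `code/inference.py` for the script) are assumptions chosen to match the layout described here.

```python
import os
import tarfile


def build_model_archive(model_dir="tmp", script_path="code/inference.py",
                        archive_path="model.tar.gz"):
    """Pack the model artifacts at the archive root and the script under code/."""
    with tarfile.open(archive_path, "w:gz") as tar:
        for name in sorted(os.listdir(model_dir)):
            if name == "code":
                continue  # the script is added explicitly below
            # arcname drops the model_dir prefix so files sit at the archive root
            tar.add(os.path.join(model_dir, name), arcname=name)
        tar.add(script_path, arcname="code/inference.py")
    return archive_path
```

The resulting archive holds neuron_model.pt and the tokenizer and config files at the top level, with the script at code/inference.py.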
Now we can upload our model.tar.gz to our session S3 bucket with sagemaker.
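The upload could be sketched as follows, using `S3Uploader` from the SageMaker Python SDK. The helper names and the key prefix `neuron/distilbert` are assumptions for illustration; `sess` is expected to be a `sagemaker.Session`.

```python
def model_s3_prefix(bucket, key_prefix="neuron/distilbert"):
    """Build the S3 URI prefix under which the archive will be stored."""
    return f"s3://{bucket}/{key_prefix}"


def upload_model(sess, archive_path="model.tar.gz", key_prefix="neuron/distilbert"):
    """Upload the archive to the session's default bucket and return its S3 URI."""
    # Imported lazily: requires the sagemaker package and AWS credentials.
    from sagemaker.s3 import S3Uploader

    return S3Uploader.upload(
        local_path=archive_path,
        desired_s3_uri=model_s3_prefix(sess.default_bucket(), key_prefix),
        sagemaker_session=sess,
    )
```

The returned URI is what the model class later receives as `model_data`.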
4. Deploy a Real-time Inference Endpoint on Amazon SageMaker
After we have uploaded our model.tar.gz to Amazon S3, we can create a custom HuggingFaceModel. This class will be used to create and deploy our real-time inference endpoint on Amazon SageMaker.
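A deployment sketch is shown below. The framework versions (`transformers` 4.12, PyTorch 1.9, Python 3.7) and the `ml.inf1.xlarge` instance type are assumptions, not values stated in this tutorial; check which Inferentia containers your SDK version supports. Setting `_is_compiled_model` relies on a private attribute of the SDK and may change between releases, so treat it as an assumption as well.

```python
def deploy_neuron_model(s3_model_uri, role, instance_type="ml.inf1.xlarge"):
    """Create a HuggingFaceModel from the uploaded archive and deploy an endpoint."""
    # Imported lazily: requires the sagemaker package and AWS credentials.
    from sagemaker.huggingface.model import HuggingFaceModel

    huggingface_model = HuggingFaceModel(
        model_data=s3_model_uri,        # S3 URI of model.tar.gz
        role=role,                      # IAM role with SageMaker permissions
        transformers_version="4.12",    # assumed version
        pytorch_version="1.9",          # assumed version
        py_version="py37",              # assumed version
    )
    # Assumption: tells the SDK the model is already compiled for Neuron
    # (private attribute, not part of the documented API).
    huggingface_model._is_compiled_model = True

    return huggingface_model.deploy(
        initial_instance_count=1,
        instance_type=instance_type,
    )
```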
5. Run and evaluate Inference performance of BERT on Inferentia
The .deploy() method returns a HuggingFacePredictor object, which can be used to request inference.
We managed to deploy our Neuron-compiled BERT to AWS Inferentia on Amazon SageMaker. Now, let's test its performance. As a dummy load test, we will loop and send 10000 synchronous requests to our endpoint.
Let's inspect the performance in CloudWatch.
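A simple version of this load test is sketched below. It takes any object with a `predict` method, such as the predictor returned by `.deploy()`, and sends the same payload repeatedly. The function name `run_load_test` and the sample sentence are assumptions. The timings are measured on the client, so they include network overhead and will be higher than the model latency reported by the endpoint.

```python
import math
import statistics
import time


def run_load_test(predictor, n_requests=10_000,
                  text="Inferentia makes BERT inference fast."):
    """Send n_requests synchronous requests and summarize client-side latency."""
    if n_requests < 1:
        raise ValueError("n_requests must be at least 1")

    payload = {"inputs": text}
    latencies_ms = []
    for _ in range(n_requests):
        start = time.perf_counter()
        predictor.predict(payload)
        latencies_ms.append((time.perf_counter() - start) * 1000)

    ordered = sorted(latencies_ms)
    # Nearest-rank 95th percentile.
    p95 = ordered[math.ceil(0.95 * len(ordered)) - 1]
    return {
        "requests": n_requests,
        "mean_ms": statistics.mean(latencies_ms),
        "p95_ms": p95,
    }
```

For a quick smoke test before the full run, call `run_load_test(predictor, n_requests=10)` and check that the endpoint answers.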
The average latency for our BERT model is 5-6ms for a sequence length of 128.
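If you prefer to read the latency programmatically rather than in the console, a sketch with `boto3` could look like this. It assumes the endpoint was deployed with the default variant name `AllTraffic` and queries the `ModelLatency` metric in the `AWS/SageMaker` namespace, which CloudWatch reports in microseconds. The function names are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def microseconds_to_ms(value):
    """Convert a CloudWatch ModelLatency value (microseconds) to milliseconds."""
    return value / 1000.0


def fetch_model_latency_ms(endpoint_name, minutes=30, variant_name="AllTraffic"):
    """Return per-minute average model latency in ms for the last `minutes` minutes."""
    # Imported lazily: requires boto3 and AWS credentials.
    import boto3

    cloudwatch = boto3.client("cloudwatch")
    end = datetime.now(timezone.utc)
    response = cloudwatch.get_metric_statistics(
        Namespace="AWS/SageMaker",
        MetricName="ModelLatency",
        Dimensions=[
            {"Name": "EndpointName", "Value": endpoint_name},
            {"Name": "VariantName", "Value": variant_name},
        ],
        StartTime=end - timedelta(minutes=minutes),
        EndTime=end,
        Period=60,
        Statistics=["Average"],
    )
    points = sorted(response["Datapoints"], key=lambda p: p["Timestamp"])
    return [microseconds_to_ms(p["Average"]) for p in points]
```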

Delete model and endpoint
To clean up, we can delete the model and endpoint.
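A small helper for this, assuming `predictor` is the object returned by `.deploy()`; the function name `cleanup` is just for illustration.

```python
def cleanup(predictor):
    """Delete the SageMaker model and then the endpoint behind the predictor."""
    predictor.delete_model()
    predictor.delete_endpoint()
```

Running `cleanup(predictor)` once the tests are done stops the endpoint from incurring further cost.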