Path: blob/main/transformers_doc/en/video_text_to_text.ipynb
Video-text-to-text
Video-text-to-text models, also known as video language models or vision language models with video input, are language models that take a video input. These models can tackle various tasks, from video question answering to video captioning.
These models have nearly the same architecture as image-text-to-text models, with some changes to accept video data, since video data is essentially image frames with temporal dependencies. Some image-text-to-text models take in multiple images, but that alone is not enough for a model to accept videos. Moreover, video-text-to-text models are often trained on all vision modalities: a training example might contain a single video, multiple videos, a single image, or multiple images. Some of these models can also take interleaved inputs. For example, you can refer to a specific video inside a string of text by adding a video token, as in "What is happening in this video? <video>".
In this guide, we provide a brief overview of video LMs and show how to use them with Transformers for inference.
To begin with, there are multiple types of video LMs:
base models used for fine-tuning
chat fine-tuned models for conversation
instruction fine-tuned models
This guide focuses on inference with an instruction-tuned model, llava-hf/llava-interleave-qwen-7b-hf, which can take in interleaved data. Alternatively, you can try llava-hf/llava-interleave-qwen-0.5b-hf if your hardware doesn't allow running a 7B model.
Let's begin by installing the dependencies.
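As a sketch, a notebook environment for this guide could be set up as follows; the exact package set is an assumption, so adjust it to your setup (OpenCV and Pillow are only needed for the frame-sampling utility later on).

```python
# Notebook cell; the package list below is an assumption for this guide.
!pip install -q transformers accelerate opencv-python pillow
```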
Let's initialize the model and the processor.
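A minimal sketch of loading the checkpoint, assuming a CUDA device and half precision; swap in the 0.5B checkpoint if your hardware is limited.

```python
import torch
from transformers import LlavaProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-interleave-qwen-7b-hf"

processor = LlavaProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # half precision to reduce memory; assumes a GPU
)
model.to("cuda")
```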
Some models directly consume the <video> token, while others accept a number of <image> tokens equal to the number of sampled frames. This model handles videos in the latter fashion. We will write a simple utility to handle the image tokens, and another utility to download a video from a URL and sample frames from it.
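Below is one possible implementation of the two utilities, assuming OpenCV and Pillow are available. The helper names replace_video_with_images and sample_frames are introduced here for illustration; they are not part of the Transformers API.

```python
import uuid

import cv2  # opencv-python
import requests
from PIL import Image


def replace_video_with_images(text, num_frames):
    # Swap each <video> placeholder for as many <image> tokens as sampled frames.
    return text.replace("<video>", "<image>" * num_frames)


def sample_frames(url, num_frames):
    # Download the video to a temporary local file.
    response = requests.get(url)
    path = f"./{uuid.uuid4()}.mp4"
    with open(path, "wb") as f:
        f.write(response.content)

    video = cv2.VideoCapture(path)
    total_frames = int(video.get(cv2.CAP_PROP_FRAME_COUNT))
    interval = max(total_frames // num_frames, 1)

    frames = []
    for i in range(total_frames):
        ret, frame = video.read()
        if not ret:
            continue
        if i % interval == 0:
            # OpenCV reads frames as BGR; convert to RGB for PIL.
            frames.append(Image.fromarray(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB)))
    video.release()
    return frames[:num_frames]
```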
Let's get our inputs. We will sample frames and concatenate them.
Both videos have cats.
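A sketch of sampling six frames from each of two short cat videos and concatenating them into a single list of frames. The URLs below are placeholders, not the actual files used in this guide; point them at any publicly downloadable MP4s.

```python
# Placeholder URLs; replace them with real, downloadable MP4 links.
video_url_1 = "https://example.com/cats_1.mp4"
video_url_2 = "https://example.com/cats_2.mp4"

video_1 = sample_frames(video_url_1, 6)
video_2 = sample_frames(video_url_2, 6)

# Concatenate the frames of both videos into one list of PIL images.
videos = video_1 + video_2
```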
Now we can preprocess the inputs.
This model has a prompt template that looks like the following. First, we'll put all the sampled frames into one list. Since we sample six frames from each of the two videos, we will insert 12 <image> tokens into our prompt. Add assistant at the end of the prompt to trigger the model to give answers. Then we can preprocess.
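A sketch of building the prompt and preprocessing, assuming the checkpoint uses a ChatML-style template with <|im_start|> and <|im_end|> markers; check the model card or processor.apply_chat_template for the authoritative format.

```python
user_prompt = "Are these two cats in these two videos doing the same thing?"

# 12 <image> tokens: six sampled frames per video, two videos.
image_tokens = "<image>" * 12

# Assumed ChatML-style prompt format; ending with "assistant" cues the model to answer.
prompt = f"<|im_start|>user {image_tokens}\n{user_prompt}<|im_end|><|im_start|>assistant"

inputs = processor(text=prompt, images=videos, return_tensors="pt").to(model.device, model.dtype)
```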
We can now call generate() for inference. The model output contains both the question from our input and the answer, so we only keep the text that comes after the prompt and the assistant part of the output.
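A minimal sketch of generation and trimming, assuming greedy decoding; the string split below simply drops everything up to and including the assistant marker in the decoded text.

```python
output = model.generate(**inputs, max_new_tokens=100, do_sample=False)
decoded = processor.decode(output[0], skip_special_tokens=True)

# Keep only the text generated after the "assistant" marker.
answer = decoded.split("assistant")[-1].strip()
print(answer)
```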
And voila!
To learn more about chat templates and token streaming for video-text-to-text models, refer to the image-text-to-text task guide because these models work similarly.