GitHub Repository: huggingface/notebooks
Path: blob/main/transformers_doc/en/tensorflow/audio_classification.ipynb

In [ ]:

# Transformers installation
! pip install transformers datasets evaluate accelerate
# To install from source instead of the last release, comment the command above and uncomment the following one.
# ! pip install git+https://github.com/huggingface/transformers.git

Audio classification

In [ ]:

#@title
from IPython.display import HTML

HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/KWwzcmG98Ds?rel=0&amp;controls=0&amp;showinfo=0" frameborder="0" allowfullscreen></iframe>')

Audio classification, just like text classification, assigns a class label to the input data. The only difference is that instead of text inputs, you have raw audio waveforms. Some practical applications of audio classification include identifying speaker intent, classifying languages, and even identifying animal species by their sounds.

This guide will show you how to:

Fine-tune Wav2Vec2 on the MInDS-14 dataset to classify speaker intent.
Use your fine-tuned model for inference.

[removed]

To see all architectures and checkpoints compatible with this task, we recommend checking the task page.

Before you begin, make sure you have all the necessary libraries installed:

pip install transformers datasets evaluate

We encourage you to log in to your Hugging Face account so you can upload and share your model with the community. When prompted, enter your token to log in:

In [ ]:

from huggingface_hub import notebook_login

notebook_login()

Load MInDS-14 dataset

Start by loading the MInDS-14 dataset from the 🤗 Datasets library:

In [ ]:

from datasets import load_dataset, Audio

minds = load_dataset("PolyAI/minds14", name="en-US", split="train")

Split the dataset's train split into a smaller train and test set with the train_test_split method. This will give you a chance to experiment and make sure everything works before spending more time on the full dataset.

In [ ]:

minds = minds.train_test_split(test_size=0.2)

Then take a look at the dataset:

In [ ]:

minds

DatasetDict({
    train: Dataset({
        features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'],
        num_rows: 450
    })
    test: Dataset({
        features: ['path', 'audio', 'transcription', 'english_transcription', 'intent_class', 'lang_id'],
        num_rows: 113
    })
})
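The row counts above are consistent with an 80/20 split of 563 examples (450 + 113). As a minimal sketch, assuming the sizes are rounded the way scikit-learn's train_test_split rounds them (test size up, train size down), the arithmetic can be checked without downloading anything. The helper name split_sizes is hypothetical and is not part of 🤗 Datasets.

```python
import math


def split_sizes(n_rows, test_size):
    """Return (n_train, n_test) for a fractional test_size.

    Assumption: the test size is rounded up and the train size is
    rounded down.
    """
    n_test = math.ceil(test_size * n_rows)
    n_train = math.floor((1 - test_size) * n_rows)
    return n_train, n_test


n_train, n_test = split_sizes(563, 0.2)
print(n_train, n_test)
```

The split is random, so the rows that land in each set change between runs. train_test_split also accepts a seed argument if you need the same split every time.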

While the dataset contains a lot of useful information, like lang_id and english_transcription, you will focus on the audio and intent_class in this guide. Remove the other columns with the remove_columns method:

In [ ]:

minds = minds.remove_columns(["path", "transcription", "english_transcription", "lang_id"])

Here's an example:

In [ ]:

minds["train"][0]

{'audio': {'array': array([ 0.        ,  0.        ,  0.        , ..., -0.00048828,
         -0.00024414, -0.00024414], dtype=float32),
  'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~APP_ERROR/602b9a5fbb1e6d0fbce91f52.wav',
  'sampling_rate': 8000},
 'intent_class': 2}

There are two fields:

audio: a dictionary holding a 1-dimensional array of the speech signal, the file path, and the sampling rate. Accessing this column loads the audio file and, if necessary, resamples it.
intent_class: represents the class id of the speaker's intent.

To make it easier for the model to get the label name from the label id, create a dictionary that maps the label name to an integer and vice versa:

In [ ]:

labels = minds["train"].features["intent_class"].names
label2id, id2label = dict(), dict()
for i, label in enumerate(labels):
    label2id[label] = str(i)
    id2label[str(i)] = label

Now you can convert the label id to a label name:

In [ ]:

id2label[str(2)]

'app_error'
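The same two dictionaries can be built with comprehensions, and a round trip through both should return the original name. This is a small sketch on a stand-in list: only app_error at index 2 comes from the output above, and the other two names are assumed for illustration.

```python
# Illustrative stand-in; the real list comes from
# minds["train"].features["intent_class"].names.
labels = ["abroad", "address", "app_error"]

# Same mapping as the loop above, written as dictionary comprehensions.
# Ids are stored as strings, exactly as in the guide.
label2id = {label: str(i) for i, label in enumerate(labels)}
id2label = {str(i): label for i, label in enumerate(labels)}

# A round trip through both dictionaries returns the original names.
round_trip = [id2label[label2id[label]] for label in labels]
print(id2label["2"], round_trip == labels)
```

Because the ids are strings, a lookup with an integer key such as id2label[2] raises a KeyError; convert with str() first, as the guide does.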

Preprocess

The next step is to load a Wav2Vec2 feature extractor to process the audio signal:

In [ ]:

from transformers import AutoFeatureExtractor

feature_extractor = AutoFeatureExtractor.from_pretrained("facebook/wav2vec2-base")

The MInDS-14 dataset has a sampling rate of 8 kHz (you can find this information in its dataset card), which means you'll need to resample the dataset to 16 kHz to use the pretrained Wav2Vec2 model:

In [ ]:

minds = minds.cast_column("audio", Audio(sampling_rate=16_000))
minds["train"][0]

{'audio': {'array': array([ 2.2098757e-05,  4.6582241e-05, -2.2803260e-05, ...,
         -2.8419291e-04, -2.3305941e-04, -1.1425107e-04], dtype=float32),
  'path': '/root/.cache/huggingface/datasets/downloads/extracted/f14948e0e84be638dd7943ac36518a4cf3324e8b7aa331c5ab11541518e9368c/en-US~APP_ERROR/602b9a5fbb1e6d0fbce91f52.wav',
  'sampling_rate': 16000},
 'intent_class': 2}
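Resampling from 8 kHz to 16 kHz doubles the number of samples while the clip duration stays the same. The sketch below shows this with naive linear interpolation on a toy signal. It is not the resampler that 🤗 Datasets uses, and the function name upsample_2x_linear is hypothetical.

```python
def upsample_2x_linear(samples):
    """Double the sampling rate by inserting the midpoint between neighbours.

    Naive linear interpolation for illustration only; production resamplers
    use band-limited filters.
    """
    if not samples:
        return []
    out = []
    for current, following in zip(samples, samples[1:]):
        out.append(current)
        out.append((current + following) / 2)
    # Repeat the last sample so the output is exactly twice as long.
    out.extend([samples[-1], samples[-1]])
    return out


source_rate = 8_000
target_rate = 16_000
one_second_8k = [0.0, 0.5, 1.0, 0.5] * (source_rate // 4)  # toy 1-second signal
one_second_16k = upsample_2x_linear(one_second_8k)

# Duration is unchanged: samples / rate is one second at both rates.
print(len(one_second_8k) / source_rate, len(one_second_16k) / target_rate)
```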

Now create a preprocessing function that:

Calls the audio column to load and, if necessary, resample the audio file.
Checks if the sampling rate of the audio file matches the sampling rate of the audio data the model was pretrained with. You can find this information in the Wav2Vec2 model card.
Sets a maximum input length and truncates longer inputs to it so they can be batched.

In [ ]:

def preprocess_function(examples):
    audio_arrays = [x["array"] for x in examples["audio"]]
    inputs = feature_extractor(
        audio_arrays, sampling_rate=feature_extractor.sampling_rate, max_length=16000, truncation=True
    )
    return inputs
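With max_length=16000 and truncation=True at a 16 kHz sampling rate, each clip is cut to at most one second of audio. The sketch below mimics that step in plain Python on a synthetic tone, and also applies zero-mean, unit-variance scaling, on the assumption that the feature extractor normalizes its inputs (the default for Wav2Vec2 checkpoints as far as I know). The helper truncate_and_normalize is hypothetical and is not the library's code.

```python
import math


def truncate_and_normalize(samples, max_length=16_000, eps=1e-7):
    """Keep at most max_length samples, then scale to zero mean, unit variance.

    Assumes a non-empty input.
    """
    clipped = samples[:max_length]
    mean = sum(clipped) / len(clipped)
    variance = sum((x - mean) ** 2 for x in clipped) / len(clipped)
    scale = math.sqrt(variance + eps)
    return [(x - mean) / scale for x in clipped]


sampling_rate = 16_000
# Two seconds of a synthetic 440 Hz tone stands in for a real recording.
two_seconds = [
    math.sin(2 * math.pi * 440 * t / sampling_rate)
    for t in range(2 * sampling_rate)
]
processed = truncate_and_normalize(two_seconds)

kept_seconds = len(processed) / sampling_rate
print(kept_seconds)
```

If one second is too short for your intents, raising max_length keeps more of each clip at the cost of more memory per batch.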

To apply the preprocessing function over the entire dataset, use the 🤗 Datasets map function. You can speed up map by setting batched=True to process multiple elements of the dataset at once. Remove unnecessary columns and rename intent_class to label, as required by the model:

In [ ]:

encoded_minds = minds.map(preprocess_function, remove_columns="audio", batched=True)
encoded_minds = encoded_minds.rename_column("intent_class", "label")
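With batched=True, the function passed to map receives a dictionary of columns, where each value is a list with one entry per row, which is why preprocess_function iterates over examples["audio"]. The following is a toy stand-in that shows only this calling convention. toy_batched_map and count_samples are hypothetical names, and this is not how 🤗 Datasets is implemented.

```python
def toy_batched_map(rows, function, batch_size=2):
    """Apply `function` to column-wise batches and collect the new columns.

    Hypothetical stand-in for illustration only.
    """
    outputs = {}
    for start in range(0, len(rows), batch_size):
        chunk = rows[start:start + batch_size]
        # A batched function sees {"column": [value, value, ...]}.
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        for key, values in function(batch).items():
            outputs.setdefault(key, []).extend(values)
    return outputs


def count_samples(examples):
    # Same shape as preprocess_function: one output value per input row.
    return {"num_samples": [len(audio["array"]) for audio in examples["audio"]]}


rows = [
    {"audio": {"array": [0.0] * 3}, "intent_class": 2},
    {"audio": {"array": [0.0] * 5}, "intent_class": 0},
    {"audio": {"array": [0.0] * 4}, "intent_class": 1},
]
result = toy_batched_map(rows, count_samples)
print(result)
```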

Evaluate

Including a metric during training is often helpful for evaluating your model's performance. You can quickly load an evaluation method with the 🤗 Evaluate library. For this task, load the accuracy metric (see the 🤗 Evaluate quick tour to learn more about how to load and compute a metric):

In [ ]:

import evaluate

accuracy = evaluate.load("accuracy")

Then create a function that passes your predictions and labels to compute, which calculates the accuracy:

In [ ]:

import numpy as np


def compute_metrics(eval_pred):
    predictions = np.argmax(eval_pred.predictions, axis=1)
    return accuracy.compute(predictions=predictions, references=eval_pred.label_ids)
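To see what this function computes, here is a dependency-free sketch of the same two steps: take the argmax over each row of logits, then compare against the reference labels. The logits and labels are made up for illustration, and toy_accuracy is a hypothetical stand-in for the 🤗 Evaluate metric.

```python
def argmax(values):
    """Index of the largest value (the first one wins on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def toy_accuracy(logits, label_ids):
    """Fraction of rows whose highest-scoring class equals the label."""
    predictions = [argmax(row) for row in logits]
    correct = sum(int(p == y) for p, y in zip(predictions, label_ids))
    return {"accuracy": correct / len(label_ids)}


# Made-up logits for four clips and three classes.
logits = [
    [0.1, 2.0, -1.0],  # predicted class 1
    [1.5, 0.2, 0.3],   # predicted class 0
    [0.0, 0.1, 0.9],   # predicted class 2
    [0.4, 0.3, 0.2],   # predicted class 0
]
label_ids = [1, 0, 2, 1]
metrics = toy_accuracy(logits, label_ids)
print(metrics)
```

Three of the four predictions match their labels, so the sketch reports an accuracy of 0.75.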

Your compute_metrics function is ready to go now, and you'll return to it when you set up your training.

Train

[removed]

For a more in-depth example of how to fine-tune a model for audio classification, take a look at the corresponding PyTorch notebook.

Inference

Great, now that you've fine-tuned a model, you can use it for inference!

Load an audio file for inference. Remember to resample the audio file to match the model's sampling rate, if necessary.

In [ ]:

from datasets import load_dataset, Audio

dataset = load_dataset("PolyAI/minds14", name="en-US", split="train")
dataset = dataset.cast_column("audio", Audio(sampling_rate=16000))
sampling_rate = dataset.features["audio"].sampling_rate
audio_file = dataset[0]["audio"]["path"]

The simplest way to try out your fine-tuned model for inference is to use it in a pipeline(). Instantiate a pipeline for audio classification with your model, and pass your audio file to it:

In [ ]:

from transformers import pipeline

classifier = pipeline("audio-classification", model="stevhliu/my_awesome_minds_model")
classifier(audio_file)

[
    {'score': 0.09766869246959686, 'label': 'cash_deposit'},
    {'score': 0.07998877018690109, 'label': 'app_error'},
    {'score': 0.0781070664525032, 'label': 'joint_account'},
    {'score': 0.07667109370231628, 'label': 'pay_bill'},
    {'score': 0.0755252093076706, 'label': 'balance'}
]

You can also manually replicate the results of the pipeline if you'd like:
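The cells for this step are not included on this page. As a hedged sketch of the final post-processing only, assuming the model returns one logit per class and the pipeline reports softmax scores for the highest-ranked labels, the ranking can be reproduced in plain Python. The logits and the three-entry id2label below are hypothetical, and top_k is an illustrative helper rather than a library function.

```python
import math


def softmax(logits):
    """Turn raw logits into probabilities that sum to 1."""
    largest = max(logits)
    exps = [math.exp(x - largest) for x in logits]  # shift for stability
    total = sum(exps)
    return [e / total for e in exps]


def top_k(logits, id2label, k=5):
    """Return the k best labels with their scores, highest score first."""
    scores = softmax(logits)
    ranked = sorted(range(len(scores)), key=scores.__getitem__, reverse=True)[:k]
    return [{"score": scores[i], "label": id2label[str(i)]} for i in ranked]


# Hypothetical values; real logits come from the fine-tuned model.
id2label = {"0": "abroad", "1": "address", "2": "app_error"}
logits = [0.2, -0.1, 1.3]
results = top_k(logits, id2label, k=2)
print(results)
```

Scores that are close to each other, like those in the pipeline output above, indicate that the model is not very confident in any single intent.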
