GitHub Repository: huggingface/notebooks
Path: blob/main/transformers_doc/ja/pytorch/question_answering.ipynb

Question answering

In [ ]:

#@title
from IPython.display import HTML

HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/ajPx5LwJD-I?rel=0&amp;controls=0&amp;showinfo=0" frameborder="0" allowfullscreen></iframe>')

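Question answering tasks return an answer given a question. If you have ever asked a virtual assistant such as Alexa, Siri, or Google what the weather is, you have used a question answering model. There are two common types of question answering tasks:

Extractive: extract the answer from the given context.
Abstractive: generate an answer from the context that correctly answers the question.

This guide shows you how to:

Finetune DistilBERT on the SQuAD dataset for extractive question answering.
Use your finetuned model for inference.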

To see all architectures and checkpoints compatible with this task, we recommend checking the task page.

Before you begin, make sure you have all the necessary libraries installed:

pip install transformers datasets evaluate

We encourage you to log in to your Hugging Face account so you can upload and share your model with the community. When prompted, enter your token to log in:

In [ ]:

from huggingface_hub import notebook_login

notebook_login()

Load SQuAD dataset

Start by loading a smaller subset of the SQuAD dataset from the 🤗 Datasets library. This gives you a chance to experiment and make sure everything works before spending more time training on the full dataset.

In [ ]:

from datasets import load_dataset

squad = load_dataset("squad", split="train[:5000]")

Split the dataset's train split into a train set and a test set with the train_test_split method:

In [ ]:

squad = squad.train_test_split(test_size=0.2)

Then take a look at an example:

In [ ]:

squad["train"][0]

{'answers': {'answer_start': [515], 'text': ['Saint Bernadette Soubirous']},
 'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
 'id': '5733be284776f41900661182',
 'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
 'title': 'University_of_Notre_Dame'
}

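There are several important fields here:

answers: the answer text and the character position in the context where the answer starts.
context: the background information from which the model needs to extract the answer.
question: the question the model should answer.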
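
As a small check of how these fields relate, the sketch below slices a context with answer_start and the length of the answer text. It uses only the sentence of the example above that contains the answer, so the offset (93) is recomputed for this shortened, hypothetical record and is not the 515 stored in the dataset.

```python
# Hypothetical record: one sentence from the example context above.
record = {
    "context": (
        "It is a replica of the grotto at Lourdes, France where the Virgin Mary "
        "reputedly appeared to Saint Bernadette Soubirous in 1858."
    ),
    # Offset recomputed for this excerpt (the full record stores 515).
    "answers": {"answer_start": [93], "text": ["Saint Bernadette Soubirous"]},
}

start = record["answers"]["answer_start"][0]
text = record["answers"]["text"][0]

# answer_start is a character index into the context, not a token index.
extracted = record["context"][start : start + len(text)]
print(extracted)
```

Because the offset counts characters, it has to be converted to token positions before the model can be trained on it, which is what the preprocessing step does.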

Preprocess

In [ ]:

#@title
from IPython.display import HTML

HTML('<iframe width="560" height="315" src="https://www.youtube.com/embed/qgaM0weJHpA?rel=0&amp;controls=0&amp;showinfo=0" frameborder="0" allowfullscreen></iframe>')

The next step is to load a DistilBERT tokenizer to process the question and context fields:

In [ ]:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("distilbert/distilbert-base-uncased")

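There are a few preprocessing steps particular to question answering tasks that you should be aware of:

Some examples in a dataset may have a very long context that exceeds the maximum input length of the model. To handle longer sequences, truncate only the context by setting truncation="only_second".
Next, map the start and end positions of the answer to the original context by setting return_offsets_mapping=True.
With the mapping in hand, you can find the start and end tokens of the answer. Use the sequence_ids method to find which part of the offsets corresponds to the question and which part corresponds to the context.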
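
To make the offset mapping and sequence ids concrete, here is a toy sketch that imitates the shape of a fast tokenizer's output for a question/context pair. The whitespace "tokenizer" and the toy_encode_pair helper are assumptions for illustration only; the real DistilBERT tokenizer splits text into WordPiece tokens, so its tokens and offsets will differ.

```python
import re

def toy_encode_pair(question, context):
    """Whitespace 'tokenizer' that mimics [CLS] question [SEP] context [SEP]."""
    tokens, offsets, sequence_ids = ["[CLS]"], [(0, 0)], [None]
    for seq_id, text in enumerate((question, context)):
        for match in re.finditer(r"\S+", text):
            tokens.append(match.group())
            offsets.append(match.span())   # character span inside its own text
            sequence_ids.append(seq_id)    # 0 = question, 1 = context
        # Special tokens get a (0, 0) offset and no sequence id.
        tokens.append("[SEP]")
        offsets.append((0, 0))
        sequence_ids.append(None)
    return tokens, offsets, sequence_ids

question = "Where is the Grotto?"
context = "The Grotto is behind the basilica."
tokens, offsets, sequence_ids = toy_encode_pair(question, context)

for tok, off, sid in zip(tokens, offsets, sequence_ids):
    print(f"{tok:10} {str(off):10} {sid}")
```

Each context token carries the character span it came from, so a character-level answer position can be translated into token indices by scanning the offsets of the tokens whose sequence id is 1.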

Here is how you can create a function that truncates the input and maps the start and end tokens of the answer to the context:

In [ ]:

def preprocess_function(examples):
    questions = [q.strip() for q in examples["question"]]
    inputs = tokenizer(
        questions,
        examples["context"],
        max_length=384,
        truncation="only_second",
        return_offsets_mapping=True,
        padding="max_length",
    )

    offset_mapping = inputs.pop("offset_mapping")
    answers = examples["answers"]
    start_positions = []
    end_positions = []

    for i, offset in enumerate(offset_mapping):
        answer = answers[i]
        start_char = answer["answer_start"][0]
        end_char = answer["answer_start"][0] + len(answer["text"][0])
        sequence_ids = inputs.sequence_ids(i)

        # Find the start and end of the context
        idx = 0
        while sequence_ids[idx] != 1:
            idx += 1
        context_start = idx
        while sequence_ids[idx] == 1:
            idx += 1
        context_end = idx - 1

        # If the answer is not fully inside the context, label it (0, 0)
        if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
            start_positions.append(0)
            end_positions.append(0)
        else:
            # Otherwise it's the start and end token positions
            idx = context_start
            while idx <= context_end and offset[idx][0] <= start_char:
                idx += 1
            start_positions.append(idx - 1)

            idx = context_end
            while idx >= context_start and offset[idx][1] >= end_char:
                idx -= 1
            end_positions.append(idx + 1)

    inputs["start_positions"] = start_positions
    inputs["end_positions"] = end_positions
    return inputs
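
The search logic in the function can be tried in isolation on hand-written offsets. This is a minimal sketch, assuming toy whitespace tokens rather than real tokenizer output; find_token_span is a hypothetical helper, and its containment test labels any answer that is not fully inside the (possibly truncated) context as (0, 0), as the function's comment describes.

```python
def find_token_span(offsets, sequence_ids, start_char, end_char):
    """Return (start_token, end_token) for a character span, or (0, 0)."""
    # First and last token that belong to the context (sequence id 1).
    context_start = sequence_ids.index(1)
    context_end = len(sequence_ids) - 1 - sequence_ids[::-1].index(1)

    # Answer not fully inside the context, for example after truncation.
    if offsets[context_start][0] > start_char or offsets[context_end][1] < end_char:
        return 0, 0

    # Walk forward to the last token that starts at or before start_char.
    idx = context_start
    while idx <= context_end and offsets[idx][0] <= start_char:
        idx += 1
    start_token = idx - 1

    # Walk backward to the first token that ends at or after end_char.
    idx = context_end
    while idx >= context_start and offsets[idx][1] >= end_char:
        idx -= 1
    end_token = idx + 1
    return start_token, end_token

# Toy encoding of "Where is the Grotto?" / "The Grotto is behind the basilica."
offsets = [(0, 0), (0, 5), (6, 8), (9, 12), (13, 20), (0, 0),
           (0, 3), (4, 10), (11, 13), (14, 20), (21, 24), (25, 34), (0, 0)]
sequence_ids = [None, 0, 0, 0, 0, None, 1, 1, 1, 1, 1, 1, None]

# The answer "behind the basilica" covers characters 14..33 of the context.
span = find_token_span(offsets, sequence_ids, start_char=14, end_char=33)

# Same example with the context cut after "is": the answer is gone.
truncated = find_token_span(offsets[:9] + [(0, 0)], sequence_ids[:9] + [None], 14, 33)
print(span, truncated)
```

The second call shows why the (0, 0) label exists: once truncation removes the answer from the context, there is no valid token span left to point at.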

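To apply the preprocessing function over the entire dataset, use the 🤗 Datasets map function. You can speed up map by setting batched=True to process multiple elements of the dataset at once. Remove any columns you don't need: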

In [ ]:

tokenized_squad = squad.map(preprocess_function, batched=True, remove_columns=squad["train"].column_names)
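
With batched=True, the function receives a dictionary that maps each column name to a list of values, which is why preprocess_function indexes examples["question"] as a list. The following is a rough pure-Python sketch of that calling convention; batched_map and strip_questions are hypothetical helpers, not the 🤗 Datasets implementation.

```python
def batched_map(columns, fn, batch_size=2):
    """Apply `fn` to slices of a column dict and concatenate what it returns."""
    n_rows = len(next(iter(columns.values())))
    out = {}
    for start in range(0, n_rows, batch_size):
        # Each batch is a dict of column name -> list of values.
        batch = {name: values[start : start + batch_size] for name, values in columns.items()}
        for name, values in fn(batch).items():
            out.setdefault(name, []).extend(values)
    # This toy keeps only the returned columns, similar in effect to
    # passing remove_columns with all of the original column names.
    return out

def strip_questions(batch):
    # Receives lists and returns lists of the same length.
    return {"question": [q.strip() for q in batch["question"]]}

columns = {"question": [" Who? ", "What?", " Where?"], "context": ["a", "b", "c"]}
mapped = batched_map(columns, strip_questions)
print(mapped)
```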

Now create a batch of examples using DefaultDataCollator. Unlike other data collators in 🤗 Transformers, DefaultDataCollator does not apply any additional preprocessing such as padding.

In [ ]:

from transformers import DefaultDataCollator

data_collator = DefaultDataCollator()
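
Because the collator does not pad, every feature must already have the same length, which is what padding="max_length" in the preprocessing function guarantees. The sketch below illustrates the idea with plain lists; simple_collate is a hypothetical stand-in, whereas the real DefaultDataCollator returns tensors.

```python
def simple_collate(features):
    """Group per-example dicts into one dict of lists, without padding."""
    batch = {key: [f[key] for f in features] for key in features[0]}
    for key, rows in batch.items():
        # Ragged rows cannot be stacked into a rectangular batch.
        if isinstance(rows[0], list) and len({len(r) for r in rows}) != 1:
            raise ValueError(f"'{key}' has rows of different lengths; pad before collating")
    return batch

features = [
    {"input_ids": [101, 2040, 102, 0], "start_positions": 1, "end_positions": 1},
    {"input_ids": [101, 2054, 2003, 102], "start_positions": 2, "end_positions": 2},
]
batch = simple_collate(features)
print(batch["input_ids"])
```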

Train

If you aren't familiar with finetuning a model with the Trainer, take a look at the basic tutorial first.

You're ready to start training your model now. Load DistilBERT with AutoModelForQuestionAnswering:

In [ ]:

from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer

model = AutoModelForQuestionAnswering.from_pretrained("distilbert/distilbert-base-uncased")

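At this point, only three steps remain:

Define your training hyperparameters in TrainingArguments. The only required parameter is output_dir, which specifies where to save your model. Push this model to the Hub by setting push_to_hub=True (you need to be signed in to Hugging Face to upload your model).
Pass the training arguments to Trainer along with the model, dataset, tokenizer, and data collator.
Call train() to finetune your model.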

In [ ]:

training_args = TrainingArguments(
    output_dir="my_awesome_qa_model",
    eval_strategy="epoch",
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    num_train_epochs=3,
    weight_decay=0.01,
    push_to_hub=True,
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_squad["train"],
    eval_dataset=tokenized_squad["test"],
    processing_class=tokenizer,
    data_collator=data_collator,
)

trainer.train()
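
For a rough sense of how long this run is, the arithmetic below counts optimizer steps from the values used in this guide. It assumes a single device and no gradient accumulation, neither of which is stated in the guide, so treat the numbers as an estimate.

```python
import math

train_examples = 4000   # 5000 loaded examples minus the 20% test split
batch_size = 16         # per_device_train_batch_size
epochs = 3              # num_train_epochs
num_devices = 1         # assumption: one GPU or CPU

# The last partial batch still counts as a step, hence the ceiling.
steps_per_epoch = math.ceil(train_examples / (batch_size * num_devices))
total_steps = steps_per_epoch * epochs
print(steps_per_epoch, total_steps)
```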

Once training is complete, share your model to the Hub with the push_to_hub() method so everyone can use it:

In [ ]:

trainer.push_to_hub()

For a more in-depth example of how to finetune a model for question answering, take a look at the corresponding PyTorch notebook or TensorFlow notebook.

Evaluate

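Evaluation for question answering requires a significant amount of postprocessing. To avoid taking up too much of your time, this guide skips the evaluation step. The Trainer still calculates the evaluation loss during training, so you are not completely in the dark about your model's performance.

If you have more time and are interested in how to evaluate your model for question answering, take a look at the Question answering chapter of the 🤗 Hugging Face Course.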
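
Once predicted answer strings are available, extractive question answering is commonly scored with exact match and token-level F1. The sketch below is modeled on the SQuAD v1.1 style of normalization (lowercasing, stripping punctuation and articles); the function names are hypothetical, and it does not cover the postprocessing needed to turn logits into answer strings. In practice, the "squad" metric from the evaluate library would be the more likely choice.

```python
import re
import string
from collections import Counter

def normalize(text):
    """Lowercase, drop punctuation and articles, and collapse whitespace."""
    punctuation = set(string.punctuation)
    text = "".join(ch for ch in text.lower() if ch not in punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())

def exact_match(prediction, truth):
    return float(normalize(prediction) == normalize(truth))

def f1_score(prediction, truth):
    pred_tokens = normalize(prediction).split()
    truth_tokens = normalize(truth).split()
    # Multiset intersection counts each shared token at most as often as it occurs.
    overlap = sum((Counter(pred_tokens) & Counter(truth_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(truth_tokens)
    return 2 * precision * recall / (precision + recall)

truth = "Saint Bernadette Soubirous"
print(exact_match("saint bernadette soubirous.", truth))
print(f1_score("to Saint Bernadette Soubirous", truth))
```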

Inference

Now that you have finetuned a model, you can use it for inference.

Come up with a question and some context you would like the model to predict from:

In [ ]:

question = "How many programming languages does BLOOM support?"
context = "BLOOM has 176 billion parameters and can generate text in 46 languages natural languages and 13 programming languages."

The simplest way to try out your finetuned model for inference is to use it in a pipeline(). Instantiate a pipeline for question answering with your model, and pass your text to it:

In [ ]:

from transformers import pipeline

question_answerer = pipeline("question-answering", model="my_awesome_qa_model")
question_answerer(question=question, context=context)

{'score': 0.2058267742395401,
 'start': 10,
 'end': 95,
 'answer': '176 billion parameters and can generate text in 46 languages natural languages and 13'}
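
The start and end values in this output are character offsets into the context string, so slicing the context with them reproduces the answer field. The short check below hard-codes the result shown above instead of calling the pipeline, so it runs without the model.

```python
context = "BLOOM has 176 billion parameters and can generate text in 46 languages natural languages and 13 programming languages."

# Result copied from the pipeline output above.
result = {
    "score": 0.2058267742395401,
    "start": 10,
    "end": 95,
    "answer": "176 billion parameters and can generate text in 46 languages natural languages and 13",
}

# start is inclusive and end is exclusive, like a normal Python slice.
span = context[result["start"] : result["end"]]
print(span == result["answer"])
```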

You can also manually replicate the results of the pipeline if you'd like.

Tokenize the text and return PyTorch tensors:

In [ ]:

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("my_awesome_qa_model")
inputs = tokenizer(question, context, return_tensors="pt")

Pass your inputs to the model and return the logits:

In [ ]:

import torch
from transformers import AutoModelForQuestionAnswering

model = AutoModelForQuestionAnswering.from_pretrained("my_awesome_qa_model")
with torch.no_grad():
    outputs = model(**inputs)

Take the argmax of the start and end logits to get the most likely start and end token positions:

In [ ]:

answer_start_index = outputs.start_logits.argmax()
answer_end_index = outputs.end_logits.argmax()

Decode the predicted tokens to get the answer:

In [ ]:

predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
tokenizer.decode(predict_answer_tokens)

'176 billion parameters and can generate text in 46 languages natural languages and 13'
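
Taking the two argmax values independently can, in principle, yield an end position that comes before the start position, which decodes to an empty answer. One common alternative is to search for the best pair with start <= end. The sketch below is not what this guide does; best_span, the sum-of-logits score, and the max_answer_len limit are assumptions for illustration.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return ((start, end), score) maximizing start_logit + end_logit with start <= end."""
    best, best_score = (0, 0), float("-inf")
    for start, start_logit in enumerate(start_logits):
        # Only consider end positions at or after the start, within the length limit.
        last = min(start + max_answer_len, len(end_logits))
        for end in range(start, last):
            score = start_logit + end_logits[end]
            if score > best_score:
                best, best_score = (start, end), score
    return best, best_score

# Toy logits where the independent argmax gives start=2 but end=1.
start_logits = [0.1, 0.2, 5.0, 0.3, 0.1]
end_logits = [0.0, 4.0, 0.5, 3.0, 0.2]

span, score = best_span(start_logits, end_logits)
print(span, score)
```

With real model outputs, the logits would first be converted to plain lists, for example with outputs.start_logits[0].tolist().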
