GitHub Repository: huggingface/notebooks
Path: blob/main/course/ja/chapter7/section5_tf.ipynb

Summarization (TensorFlow)

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

!pip install datasets evaluate transformers[sentencepiece]
!apt install git-lfs

You will need to set up git; adapt your email and name in the following cell.

!git config --global user.email "[email protected]"
!git config --global user.name "Your Name"

You will also need to be logged in to the Hugging Face Hub. Execute the following and enter your credentials.

from huggingface_hub import notebook_login

notebook_login()

from datasets import load_dataset

spanish_dataset = load_dataset("amazon_reviews_multi", "es")
english_dataset = load_dataset("amazon_reviews_multi", "en")
english_dataset

DatasetDict({
    train: Dataset({
        features: ['review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category'],
        num_rows: 200000
    })
    validation: Dataset({
        features: ['review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category'],
        num_rows: 5000
    })
    test: Dataset({
        features: ['review_id', 'product_id', 'reviewer_id', 'stars', 'review_body', 'review_title', 'language', 'product_category'],
        num_rows: 5000
    })
})

def show_samples(dataset, num_samples=3, seed=42):
    sample = dataset["train"].shuffle(seed=seed).select(range(num_samples))
    for example in sample:
        print(f"\n'>> Title: {example['review_title']}'")
        print(f"'>> Review: {example['review_body']}'")


show_samples(english_dataset)

'>> Title: Worked in front position, not rear'
'>> Review: 3 stars because these are not rear brakes as stated in the item description. At least the mount adapter only worked on the front fork of the bike that I got it for.'

'>> Title: meh'
'>> Review: Does it’s job and it’s gorgeous but mine is falling apart, I had to basically put it together again with hot glue'

'>> Title: Can\'t beat these for the money'
'>> Review: Bought this for handling miscellaneous aircraft parts and hanger "stuff" that I needed to organize; it really fit the bill. The unit arrived quickly, was well packaged and arrived intact (always a good sign). There are five wall mounts-- three on the top and two on the bottom. I wanted to mount it on the wall, so all I had to do was to remove the top two layers of plastic drawers, as well as the bottom corner drawers, place it when I wanted and mark it; I then used some of the new plastic screw in wall anchors (the 50 pound variety) and it easily mounted to the wall. Some have remarked that they wanted dividers for the drawers, and that they made those. Good idea. My application was that I needed something that I can see the contents at about eye level, so I wanted the fuller-sized drawers. I also like that these are the new plastic that doesn\'t get brittle and split like my older plastic drawers did. I like the all-plastic construction. It\'s heavy duty enough to hold metal parts, but being made of plastic it\'s not as heavy as a metal frame, so you can easily mount it to the wall and still load it up with heavy stuff, or light stuff. No problem there. For the money, you can\'t beat it. Best one of these I\'ve bought to date-- and I\'ve been using some version of these for over forty years.'
english_dataset.set_format("pandas")
english_df = english_dataset["train"][:]
# Show counts for the 20 most common product categories
english_df["product_category"].value_counts()[:20]

home                      17679
apparel                   15951
wireless                  15717
other                     13418
beauty                    12091
drugstore                 11730
kitchen                   10382
toy                        8745
sports                     8277
automotive                 7506
lawn_and_garden            7327
home_improvement           7136
pet_products               7082
digital_ebook_purchase     6749
pc                         6401
electronics                6186
office_product             5521
shoes                      5197
grocery                    4730
book                       3756
Name: product_category, dtype: int64
def filter_books(example):
    return (
        example["product_category"] == "book"
        or example["product_category"] == "digital_ebook_purchase"
    )

english_dataset.reset_format()

spanish_books = spanish_dataset.filter(filter_books)
english_books = english_dataset.filter(filter_books)
show_samples(english_books)

'>> Title: I\'m dissapointed.'
'>> Review: I guess I had higher expectations for this book from the reviews. I really thought I\'d at least like it. The plot idea was great. I loved Ash but, it just didnt go anywhere. Most of the book was about their radio show and talking to callers. I wanted the author to dig deeper so we could really get to know the characters. All we know about Grace is that she is attractive looking, Latino and is kind of a brat. I\'m dissapointed.'

'>> Title: Good art, good price, poor design'
'>> Review: I had gotten the DC Vintage calendar the past two years, but it was on backorder forever this year and I saw they had shrunk the dimensions for no good reason. This one has good art choices but the design has the fold going through the picture, so it\'s less aesthetically pleasing, especially if you want to keep a picture to hang. For the price, a good calendar'

'>> Title: Helpful'
'>> Review: Nearly all the tips useful and. I consider myself an intermediate to advanced user of OneNote. I would highly recommend.'

from datasets import concatenate_datasets, DatasetDict

books_dataset = DatasetDict()

for split in english_books.keys():
    books_dataset[split] = concatenate_datasets(
        [english_books[split], spanish_books[split]]
    )
    books_dataset[split] = books_dataset[split].shuffle(seed=42)

# Peek at a few examples
show_samples(books_dataset)

'>> Title: Easy to follow!!!!'
'>> Review: I loved The dash diet weight loss Solution. Never hungry. I would recommend this diet. Also the menus are well rounded. Try it. Has lots of the information need thanks.'

'>> Title: PARCIALMENTE DAÑADO'
'>> Review: Me llegó el día que tocaba, junto a otros libros que pedí, pero la caja llegó en mal estado lo cual dañó las esquinas de los libros porque venían sin protección (forro).'

'>> Title: no lo he podido descargar'
'>> Review: igual que el anterior'
books_dataset = books_dataset.filter(lambda x: len(x["review_title"].split()) > 2)
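The filter above keeps only reviews whose title contains more than two whitespace-separated words. As a minimal sketch, the same predicate can be tried on plain strings; the titles below are taken from the sample outputs earlier in this notebook, and the helper name `has_long_title` is made up for illustration.

```python
# Hypothetical helper mirroring the lambda passed to `filter` above.
def has_long_title(title: str) -> bool:
    # str.split() with no arguments splits on any run of whitespace.
    return len(title.split()) > 2


titles = [
    "meh",
    "Helpful",
    "PARCIALMENTE DAÑADO",
    "Easy to follow!!!!",
    "Good art, good price, poor design",
]

# Only titles with three or more words survive the filter.
kept = [title for title in titles if has_long_title(title)]
print(kept)
```

One- and two-word titles such as "meh" are dropped, which removes targets that are too short to be informative summaries.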
from transformers import AutoTokenizer

model_checkpoint = "google/mt5-small"
tokenizer = AutoTokenizer.from_pretrained(model_checkpoint)

inputs = tokenizer("I loved reading the Hunger Games!")
inputs

{'input_ids': [336, 259, 28387, 11807, 287, 62893, 295, 12507, 1], 'attention_mask': [1, 1, 1, 1, 1, 1, 1, 1, 1]}

tokenizer.convert_ids_to_tokens(inputs.input_ids)

['▁I', '▁', 'loved', '▁reading', '▁the', '▁Hung', 'er', '▁Games', '</s>']

max_input_length = 512
max_target_length = 30


def preprocess_function(examples):
    model_inputs = tokenizer(
        examples["review_body"], max_length=max_input_length, truncation=True
    )
    # Set up the tokenizer for targets
    with tokenizer.as_target_tokenizer():
        labels = tokenizer(
            examples["review_title"], max_length=max_target_length, truncation=True
        )

    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized_datasets = books_dataset.map(preprocess_function, batched=True)
generated_summary = "I absolutely loved reading the Hunger Games"
reference_summary = "I loved reading the Hunger Games"

!pip install rouge_score

import evaluate

rouge_score = evaluate.load("rouge")

scores = rouge_score.compute(
    predictions=[generated_summary], references=[reference_summary]
)
scores

{'rouge1': AggregateScore(
     low=Score(precision=0.86, recall=1.0, fmeasure=0.92),
     mid=Score(precision=0.86, recall=1.0, fmeasure=0.92),
     high=Score(precision=0.86, recall=1.0, fmeasure=0.92)),
 'rouge2': AggregateScore(
     low=Score(precision=0.67, recall=0.8, fmeasure=0.73),
     mid=Score(precision=0.67, recall=0.8, fmeasure=0.73),
     high=Score(precision=0.67, recall=0.8, fmeasure=0.73)),
 'rougeL': AggregateScore(
     low=Score(precision=0.86, recall=1.0, fmeasure=0.92),
     mid=Score(precision=0.86, recall=1.0, fmeasure=0.92),
     high=Score(precision=0.86, recall=1.0, fmeasure=0.92)),
 'rougeLsum': AggregateScore(
     low=Score(precision=0.86, recall=1.0, fmeasure=0.92),
     mid=Score(precision=0.86, recall=1.0, fmeasure=0.92),
     high=Score(precision=0.86, recall=1.0, fmeasure=0.92))}

scores["rouge1"].mid

Score(precision=0.86, recall=1.0, fmeasure=0.92)
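The ROUGE-1 and ROUGE-2 numbers above can be reproduced by hand from n-gram overlap. The sketch below is a simplified illustration, not the `rouge_score` implementation: it assumes lowercase whitespace tokenization and no stemming, and the function names `ngrams` and `rouge_n` are invented for this example.

```python
from collections import Counter


def ngrams(tokens, n):
    # Multiset of all contiguous n-grams in the token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n(prediction, reference, n):
    # Simplified ROUGE-N: clipped n-gram overlap between prediction and reference.
    pred = ngrams(prediction.lower().split(), n)
    ref = ngrams(reference.lower().split(), n)
    overlap = sum((pred & ref).values())
    precision = overlap / max(sum(pred.values()), 1)
    recall = overlap / max(sum(ref.values()), 1)
    if overlap == 0:
        return 0.0, 0.0, 0.0
    fmeasure = 2 * precision * recall / (precision + recall)
    return precision, recall, fmeasure


generated_summary = "I absolutely loved reading the Hunger Games"
reference_summary = "I loved reading the Hunger Games"

rouge1 = tuple(round(x, 2) for x in rouge_n(generated_summary, reference_summary, 1))
rouge2 = tuple(round(x, 2) for x in rouge_n(generated_summary, reference_summary, 2))
print(rouge1)  # (0.86, 1.0, 0.92)
print(rouge2)  # (0.67, 0.8, 0.73)
```

For ROUGE-1, six of the seven generated words appear in the reference (precision 6/7) and all six reference words are covered (recall 6/6). For ROUGE-2, four of the six generated bigrams match four of the five reference bigrams.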
!pip install nltk

import nltk

nltk.download("punkt")

from nltk.tokenize import sent_tokenize


def three_sentence_summary(text):
    return "\n".join(sent_tokenize(text)[:3])


print(three_sentence_summary(books_dataset["train"][1]["review_body"]))

'I grew up reading Koontz, and years ago, I stopped,convinced i had "outgrown" him.'
'Still,when a friend was looking for something suspenseful too read, I suggested Koontz.'
'She found Strangers.'
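The baseline above simply takes the first three sentences of a review. A rough, dependency-free sketch of the same idea follows; it swaps `sent_tokenize` for a naive regular-expression split, which is only an approximation (it will mis-split abbreviations, for example). The function name `first_sentences` and the review text are made up for illustration.

```python
import re


def first_sentences(text: str, n: int = 3) -> str:
    # Naive stand-in for nltk's sent_tokenize: split on whitespace
    # that follows a sentence-ending '.', '!' or '?'.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    return "\n".join(sentences[:n])


# Made-up review text, used only to exercise the function.
review = "She found Strangers. She loved it! Would I read it again? Probably. Yes."
summary = first_sentences(review)
print(summary)
```

Joining the sentences with newlines matches the format used by `three_sentence_summary` above.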
def evaluate_baseline(dataset, metric):
    summaries = [three_sentence_summary(text) for text in dataset["review_body"]]
    return metric.compute(predictions=summaries, references=dataset["review_title"])

import pandas as pd

score = evaluate_baseline(books_dataset["validation"], rouge_score)
rouge_names = ["rouge1", "rouge2", "rougeL", "rougeLsum"]
rouge_dict = dict((rn, round(score[rn].mid.fmeasure * 100, 2)) for rn in rouge_names)
rouge_dict
{'rouge1': 16.74, 'rouge2': 8.83, 'rougeL': 15.6, 'rougeLsum': 15.96}
from transformers import TFAutoModelForSeq2SeqLM

model = TFAutoModelForSeq2SeqLM.from_pretrained(model_checkpoint)

from huggingface_hub import notebook_login

notebook_login()

from transformers import DataCollatorForSeq2Seq

data_collator = DataCollatorForSeq2Seq(tokenizer, model=model, return_tensors="tf")

tokenized_datasets = tokenized_datasets.remove_columns(
    books_dataset["train"].column_names
)

features = [tokenized_datasets["train"][i] for i in range(2)]
data_collator(features)

{'attention_mask': tensor([[1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0],
                           [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]]),
 'input_ids': tensor([[ 1494, 259, 8622, 390, 259, 262, 2316, 3435, 955, 772, 281, 772, 1617, 263, 305, 14701, 260, 1385, 3031, 259, 24146, 332, 1037, 259, 43906, 305, 336, 260, 1, 0, 0, 0, 0, 0, 0],
                      [ 259, 27531, 13483, 259, 7505, 260, 112240, 15192, 305, 53198, 276, 259, 74060, 263, 260, 459, 25640, 776, 2119, 336, 259, 2220, 259, 18896, 288, 4906, 288, 1037, 3931, 260, 7083, 101476, 1143, 260, 1]]),
 'labels': tensor([[ 7483, 259, 2364, 15695, 1, -100],
                   [ 259, 27531, 13483, 259, 7505, 1]]),
 'decoder_input_ids': tensor([[ 0, 7483, 259, 2364, 15695, 1],
                              [ 0, 259, 27531, 13483, 259, 7505]])}
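In the collator output above, the shorter label sequence is padded with -100 and `decoder_input_ids` looks like the labels shifted one position to the right. The pure-Python sketch below reproduces that relationship for the two examples shown. It is an illustration rather than the collator's actual code; the helper names are invented, and it assumes that 0 is both the padding id and the decoder start id, which is what the output above suggests.

```python
PAD_TOKEN_ID = 0        # assumption: pad id, also used as the decoder start token
LABEL_IGNORE_ID = -100  # value used to pad labels so the loss skips those positions


def pad_labels(labels, ignore_id=LABEL_IGNORE_ID):
    # Right-pad every label sequence to the longest one in the batch.
    longest = max(len(seq) for seq in labels)
    return [seq + [ignore_id] * (longest - len(seq)) for seq in labels]


def shift_right(padded_labels, start_id=PAD_TOKEN_ID, pad_id=PAD_TOKEN_ID):
    # Prepend the start token, drop the last position, and replace any
    # remaining ignore markers with the real padding id.
    shifted = []
    for seq in padded_labels:
        row = [start_id] + seq[:-1]
        shifted.append([pad_id if tok == LABEL_IGNORE_ID else tok for tok in row])
    return shifted


# Unpadded label ids read off the collator output above.
labels = [[7483, 259, 2364, 15695, 1], [259, 27531, 13483, 259, 7505, 1]]

padded = pad_labels(labels)
decoder_input_ids = shift_right(padded)
print(padded)
print(decoder_input_ids)
```

The shift means that at each step the decoder sees the previous target token and is trained to predict the next one.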
tf_train_dataset = tokenized_datasets["train"].to_tf_dataset(
    columns=["input_ids", "attention_mask", "labels"],
    collate_fn=data_collator,
    shuffle=True,
    batch_size=8,
)
tf_eval_dataset = tokenized_datasets["validation"].to_tf_dataset(
    columns=["input_ids", "attention_mask", "labels"],
    collate_fn=data_collator,
    shuffle=False,
    batch_size=8,
)

from transformers import create_optimizer
import tensorflow as tf

# The number of training steps is the number of samples in the dataset, divided by the batch size then multiplied
# by the total number of epochs. Note that the tf_train_dataset here is a batched tf.data.Dataset,
# not the original Hugging Face Dataset, so its len() is already num_samples // batch_size.
num_train_epochs = 8
num_train_steps = len(tf_train_dataset) * num_train_epochs
model_name = model_checkpoint.split("/")[-1]

optimizer, schedule = create_optimizer(
    init_lr=5.6e-5,
    num_warmup_steps=0,
    num_train_steps=num_train_steps,
    weight_decay_rate=0.01,
)

model.compile(optimizer=optimizer)

# Train in mixed-precision float16
tf.keras.mixed_precision.set_global_policy("mixed_float16")
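The step count passed to the learning-rate schedule is just batches per epoch times epochs. The sketch below works through that arithmetic with a made-up dataset size; `num_batches` is a hypothetical helper, and whether the final partial batch is dropped depends on how the batched dataset was built, so both cases are shown.

```python
import math


def num_batches(num_samples: int, batch_size: int, drop_remainder: bool) -> int:
    # Number of batches in one pass over the data.
    if drop_remainder:
        # The last incomplete batch is discarded.
        return num_samples // batch_size
    # The last incomplete batch is kept as a smaller batch.
    return math.ceil(num_samples / batch_size)


num_samples = 1003  # hypothetical size, not the real dataset size
batch_size = 8
num_train_epochs = 8

steps_dropping = num_batches(num_samples, batch_size, True) * num_train_epochs
steps_keeping = num_batches(num_samples, batch_size, False) * num_train_epochs
print(steps_dropping, steps_keeping)  # 1000 1008
```

Taking `len()` of the batched dataset, as the cell above does, avoids having to know which case applies.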
from transformers.keras_callbacks import PushToHubCallback

callback = PushToHubCallback(
    output_dir=f"{model_name}-finetuned-amazon-en-es", tokenizer=tokenizer
)

# Use num_train_epochs so training length matches the schedule built from num_train_steps
model.fit(
    tf_train_dataset,
    validation_data=tf_eval_dataset,
    callbacks=[callback],
    epochs=num_train_epochs,
)
from tqdm import tqdm
import numpy as np

all_preds = []
all_labels = []
for batch in tqdm(tf_eval_dataset):
    predictions = model.generate(**batch)
    decoded_preds = tokenizer.batch_decode(predictions, skip_special_tokens=True)
    labels = batch["labels"].numpy()
    # Replace the -100 padding in the labels so they can be decoded
    labels = np.where(labels != -100, labels, tokenizer.pad_token_id)
    decoded_labels = tokenizer.batch_decode(labels, skip_special_tokens=True)
    # ROUGE expects a newline after each sentence
    decoded_preds = ["\n".join(sent_tokenize(pred.strip())) for pred in decoded_preds]
    decoded_labels = ["\n".join(sent_tokenize(label.strip())) for label in decoded_labels]
    all_preds.extend(decoded_preds)
    all_labels.extend(decoded_labels)

# Score the predictions accumulated over all batches, not only the last batch
result = rouge_score.compute(
    predictions=all_preds, references=all_labels, use_stemmer=True
)
result = {key: value.mid.fmeasure * 100 for key, value in result.items()}
{k: round(v, 4) for k, v in result.items()}
from transformers import pipeline

hub_model_id = "huggingface-course/mt5-small-finetuned-amazon-en-es"
summarizer = pipeline("summarization", model=hub_model_id)

def print_summary(idx):
    review = books_dataset["test"][idx]["review_body"]
    title = books_dataset["test"][idx]["review_title"]
    summary = summarizer(books_dataset["test"][idx]["review_body"])[0]["summary_text"]
    print(f"'>>> Review: {review}'")
    print(f"\n'>>> Title: {title}'")
    print(f"\n'>>> Summary: {summary}'")

print_summary(100)

'>>> Review: Nothing special at all about this product... the book is too small and stiff and hard to write in. The huge sticker on the back doesn’t come off and looks super tacky. I would not purchase this again. I could have just bought a journal from the dollar store and it would be basically the same thing. It’s also really expensive for what it is.'

'>>> Title: Not impressed at all... buy something else'

'>>> Summary: Nothing special at all about this product'

print_summary(0)

'>>> Review: Es una trilogia que se hace muy facil de leer. Me ha gustado, no me esperaba el final para nada'

'>>> Title: Buena literatura para adolescentes'

'>>> Summary: Muy facil de leer'