Classifying TV Show Quotes using BERT
I worked on an NLP classification project to distinguish quotes from two pop-culture giants - Star Wars and Friends. The goal was to train a model to accurately categorize unseen quotes as coming from one source or the other.
The Adventure Begins…
Every good quest starts with assembling a crew. I needed to gather a dataset of quotes from both sources. This required venturing out into the wilderness of the internet to scrape relevant quotes from fan sites using Selenium and BeautifulSoup.
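For the curious, here is a minimal sketch of what that scraping step can look like. The URL and the `div.quote` selector are placeholders - every fan site lays out its pages differently:

```python
from selenium import webdriver
from bs4 import BeautifulSoup

# Placeholder URL and selector - adjust for the actual fan site's layout
driver = webdriver.Chrome()
driver.get("https://example-fan-site.com/star-wars-quotes")

# Hand the rendered page source to BeautifulSoup for parsing
soup = BeautifulSoup(driver.page_source, "html.parser")
scraped = [div.get_text(strip=True)
           for div in soup.find_all("div", class_="quote")]

driver.quit()
```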
After gathering over 100 quotes from each source, it was time to wrangle the unruly data into a form suitable for ML models. I cleaned and preprocessed the quotes, adding labels to denote where each one came from:

```python
# Each source lives in its own DataFrame here (names illustrative)
star_wars_df['label'] = 0  # Star Wars
friends_df['label'] = 1    # Friends
```
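The cleanup itself was mostly string hygiene plus stitching the two labeled frames together. A sketch, assuming the text sits in a `quote` column:

```python
import pandas as pd

# Combine the two labeled frames into one dataset
quotes = pd.concat([star_wars_df, friends_df], ignore_index=True)

# Basic cleanup: normalize whitespace, drop empties and duplicates
quotes['quote'] = quotes['quote'].str.strip()
quotes = quotes[quotes['quote'].str.len() > 0]
quotes = quotes.drop_duplicates(subset='quote').reset_index(drop=True)
```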
Training Time
Now for the fun part - teaching a model to discern quote origins! I turned to the mighty BERT architecture and leveraged a pretrained model. BERT has soaked up NLP knowledge from consuming piles of text data, so I just needed to fine-tune it on my quotes.
First, I had to encode the text into numeric tokens that BERT digests:
```python
from transformers import BertTokenizer

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

# The tokenizer wants a list of strings, not a DataFrame
encoded_quotes = tokenizer(list(quotes['quote']),
                           padding=True,
                           truncation=True,
                           return_tensors='tf')
```
Then I compiled a BERT model, fitted it on the training quotes, and evaluated it on a held-out test set.
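That sentence compresses a fair bit of work, so here is a hedged sketch of the compile-and-fit step using Hugging Face's TensorFlow BERT classifier. The hyperparameters (learning rate, epochs, batch size) and the `train_encodings` / `train_labels` names are illustrative, not my exact setup:

```python
import tensorflow as tf
from transformers import TFBertForSequenceClassification

# Pretrained BERT body with a fresh two-class classification head
model = TFBertForSequenceClassification.from_pretrained(
    'bert-base-uncased', num_labels=2)

# The model outputs raw logits, hence from_logits=True
model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=2e-5),
    loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    metrics=['accuracy'])

# Wrap the encoded training quotes and labels in a tf.data pipeline
train_ds = tf.data.Dataset.from_tensor_slices(
    (dict(train_encodings), train_labels)).shuffle(200).batch(16)

model.fit(train_ds, epochs=3)
```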
The Journey’s End
After fine-tuning the BERT model on the quote dataset, it was time to evaluate its performance and see how well it learned to classify the quotes. I tested the model on a held-out set of 40 quotes that it had never seen before - 20 Star Wars and 20 Friends. This would give an unbiased estimate of its real-world accuracy. The model achieved an overall test accuracy of 80%! This exceeded my expectations considering the small dataset size. Digging deeper into the results:
- Precision for Star Wars quotes was 77% and Friends 83%
- Recall for Star Wars was 85% and Friends 75%

The confusion matrix showed the model struggled slightly more with Friends quotes, mistaking 5 of them for Star Wars.
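Scikit-learn makes pulling these per-class numbers out straightforward. A sketch, assuming `test_encodings` and `test_labels` hold the 40 held-out quotes:

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Predicted class = argmax over the model's output logits
logits = model.predict(dict(test_encodings)).logits
pred_labels = np.argmax(logits, axis=1)

print(classification_report(test_labels, pred_labels,
                            target_names=['Star Wars', 'Friends']))
print(confusion_matrix(test_labels, pred_labels))
```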
This may be because Star Wars quotes contain more distinctive language, like "Use the Force" and mentions of "Jedi", whereas Friends quotes are everyday conversational lines that are harder to pin down.
Overall though, BERT was able to learn features that distinguish the two classes with moderate success.
There is definitely scope to improve accuracy further with:
- A larger dataset for better generalization
- Trying different pretrained models like RoBERTa
- Hyperparameter tuning - learning rate, epochs etc.
- Regularization techniques like dropout to reduce overfitting (see the sketch below)
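On that last point, one low-effort option is to raise the dropout probabilities baked into BERT's config when loading the pretrained weights. A sketch - the 0.3 values are illustrative, not tuned:

```python
from transformers import TFBertForSequenceClassification

# Same pretrained weights, but with more aggressive dropout in the
# hidden layers and attention probabilities (0.3 is illustrative)
model = TFBertForSequenceClassification.from_pretrained(
    'bert-base-uncased',
    num_labels=2,
    hidden_dropout_prob=0.3,
    attention_probs_dropout_prob=0.3)
```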
But as a first attempt, I am pleased with the model’s capabilities. It can classify unseen quotes with 80% accuracy - not bad! This was a fun first adventure in training BERT for text classification. I’m excited to keep exploring how transformers can be applied to NLP tasks going forward.
Until next time, may the force be with you!