Visualizing 1.4 Million Game of Thrones Words with BERT

Shivanand Roy
4 min read · Feb 24, 2020


This past weekend, while watching Game of Thrones at dinner, I had a thought: what about visualizing all the text of the GOT books with the mighty BERT (Bidirectional Encoder Representations from Transformers)?

How can we achieve this?

  1. First, extract the BERT embeddings for each word in all the GOT books.

  2. Then, reduce the dimensionality of the BERT embeddings so they can be visualized in 3D.

  3. Finally, create a web application to visualize them in the browser.

Let’s start —

1. Extracting BERT Embeddings for Game of Thrones Books

Extracting BERT embeddings for your custom data can be intimidating at first — but not anymore. Gary Lai has this awesome package bert-embedding which lets you extract token level embeddings without any hassle. The code looks as simple as:

# installing bert-embedding
!pip install bert-embedding
# importing bert_embedding
from bert_embedding import BertEmbedding
# text to be encoded
text = """We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers.
Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations by jointly conditioning on both left and right context in all layers.
"""
# generating sentences
sentences = text.split('\n')
# instantiating the BertEmbedding class
bert_embedding = BertEmbedding()
# passing sentences to bert_embedding model
result = bert_embedding(sentences)
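If you are curious about the output format: each element of result pairs a sentence's tokens with their vectors, one 768-dimensional array per token (assuming the package's default BERT base model). This is the structure we will rely on later when building a dataframe:

# each element of `result` is a (tokens, embeddings) pair for one sentence
tokens, vectors = result[0]
print(tokens[:5])                      # the first few tokens
print(len(vectors), vectors[0].shape)  # one 768-dimensional vector per token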

Let’s use this package for our data —

  • First, we will fetch the five Game of Thrones books using requests:

import requests

book1 = "https://raw.githubusercontent.com/llSourcell/word_vectors_game_of_thrones-LIVE/master/data/got1.txt"
book2 = "https://raw.githubusercontent.com/llSourcell/word_vectors_game_of_thrones-LIVE/master/data/got2.txt"
book3 = "https://raw.githubusercontent.com/llSourcell/word_vectors_game_of_thrones-LIVE/master/data/got3.txt"
book4 = "https://raw.githubusercontent.com/llSourcell/word_vectors_game_of_thrones-LIVE/master/data/got4.txt"
book5 = "https://raw.githubusercontent.com/llSourcell/word_vectors_game_of_thrones-LIVE/master/data/got5.txt"

b1 = requests.get(book1)
b2 = requests.get(book2)
b3 = requests.get(book3)
b4 = requests.get(book4)
b5 = requests.get(book5)

book1_content = [sent for sent in b1.text.splitlines() if sent != '']
book2_content = [sent for sent in b2.text.splitlines() if sent != '']
book3_content = [sent for sent in b3.text.splitlines() if sent != '']
book4_content = [sent for sent in b4.text.splitlines() if sent != '']
book5_content = [sent for sent in b5.text.splitlines() if sent != '']
  • Next, we will clean the content of each book and store it as a list of sentences:
import re

def sentence_to_wordlist(raw):
    clean = re.sub("[^a-zA-Z]", " ", raw)
    words = clean.split()
    return words
book1_sentences = []
for raw_sentence in book1_content:
    if len(raw_sentence) > 0:
        book1_sentences.append(' '.join(sentence_to_wordlist(raw_sentence)))

book2_sentences = []
for raw_sentence in book2_content:
    if len(raw_sentence) > 0:
        book2_sentences.append(' '.join(sentence_to_wordlist(raw_sentence)))

book3_sentences = []
for raw_sentence in book3_content:
    if len(raw_sentence) > 0:
        book3_sentences.append(' '.join(sentence_to_wordlist(raw_sentence)))

book4_sentences = []
for raw_sentence in book4_content:
    if len(raw_sentence) > 0:
        book4_sentences.append(' '.join(sentence_to_wordlist(raw_sentence)))

book5_sentences = []
for raw_sentence in book5_content:
    if len(raw_sentence) > 0:
        book5_sentences.append(' '.join(sentence_to_wordlist(raw_sentence)))
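As an aside, the five near-identical blocks above can be collapsed into a single loop; this sketch is equivalent:

# same cleaning as above, written once for all five books
all_sentences = [
    [' '.join(sentence_to_wordlist(s)) for s in content if len(s) > 0]
    for content in [book1_content, book2_content, book3_content,
                    book4_content, book5_content]
]
book1_sentences, book2_sentences, book3_sentences, \
    book4_sentences, book5_sentences = all_sentences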
  • Once we have a clean list of sentences for each book, we can extract BERT embeddings using the code below:
# importing dependencies
from bert_embedding import BertEmbedding
from tqdm import tqdm_notebook
import pandas as pd
import mxnet as mx

# bert_embedding supports GPU for faster processing
ctx = mx.gpu(0)

# This function extracts BERT embeddings and stores them in a
# structured format, i.e. a dataframe
def generate_bert_embeddings(sentences):
    bert_embedding = BertEmbedding(ctx=ctx)
    print("Encoding Sentences:")
    result = bert_embedding(sentences)
    print("Encoding Finished")
    df = pd.DataFrame()
    for i in tqdm_notebook(range(len(result))):
        embed = pd.DataFrame(result[i][1])
        embed['words'] = result[i][0]
        df = pd.concat([df, embed])
    return df
book1_embedding = generate_bert_embeddings(book1_sentences)
book2_embedding = generate_bert_embeddings(book2_sentences)
book3_embedding = generate_bert_embeddings(book3_sentences)
book4_embedding = generate_bert_embeddings(book4_sentences)
book5_embedding = generate_bert_embeddings(book5_sentences)
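Encoding five novels takes a while even on a GPU, so it is worth saving the embedding dataframes to disk and only paying that cost once. A minimal sketch (the file names are my own choice):

# persist each embedding dataframe so the encoding step never has to rerun
for name, df in [("book1", book1_embedding), ("book2", book2_embedding),
                 ("book3", book3_embedding), ("book4", book4_embedding),
                 ("book5", book5_embedding)]:
    df.to_pickle(name + "_embedding.pkl")

# reload later with: pd.read_pickle("book1_embedding.pkl")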

2. Reducing the BERT Embedding Dimension Using PCA and tSVD

  • BERT embeddings are 768-dimensional vectors, i.e. we have 768 numbers representing each word or token found in the books.
  • We will reduce the dimensionality of these vectors from 768 to 3, so that the tokens/words can be visualized in three dimensions, using the code below:
from sklearn.decomposition import TruncatedSVD
from sklearn.decomposition import PCA

# This function reduces the dimensionality of the embeddings using
# tSVD and PCA
def reduce_dimension(embedding_df):
    # Dimensionality reduction using tSVD
    tsvd = TruncatedSVD(n_components=3)
    tsvd_3d = pd.DataFrame(tsvd.fit_transform(embedding_df.drop('words', axis=1)))
    tsvd_3d['words'] = embedding_df['words'].values
    # Dimensionality reduction using PCA
    pca = PCA(3)
    pca_3d = pd.DataFrame(pca.fit_transform(embedding_df.drop('words', axis=1)))
    pca_3d['words'] = embedding_df['words'].values
    return tsvd_3d, pca_3d
  • Let's apply the above function to our embeddings:
tsvd_book1, pca_book1 = reduce_dimension(book1_embedding)
tsvd_book2, pca_book2 = reduce_dimension(book2_embedding)
tsvd_book3, pca_book3 = reduce_dimension(book3_embedding)
tsvd_book4, pca_book4 = reduce_dimension(book4_embedding)
tsvd_book5, pca_book5 = reduce_dimension(book5_embedding)
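Compressing 768 dimensions into 3 inevitably throws information away, so it is worth checking how much variance the three PCA components actually retain. A quick sanity check (this refits a PCA, since reduce_dimension does not return the fitted object):

from sklearn.decomposition import PCA

# refit a 3-component PCA just to inspect how much variance survives
pca_check = PCA(3).fit(book1_embedding.drop('words', axis=1))
print(pca_check.explained_variance_ratio_)        # per-component share
print(pca_check.explained_variance_ratio_.sum())  # total variance retained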

Voila! Now we have a 3-dimensional projection of every word in all the GOT books.

Complete Code —

Extraction of BERT embeddings and dimensionality reduction can be a time-consuming process — but not for you.

You can download Game of Thrones BERT Embeddings from here: Download

3. Building a Web App to Visualize the Embeddings in the Browser

This is the final part of the project. We will build a front end to visualize these embeddings in three dimensions, in pure Python.

To do this, we will use Dash. Dash is a Python framework that lets you build beautiful web-based analytical apps in pure Python. No JavaScript required.

You can install Dash with: pip install dash==1.8.0

A Dash application consists of three parts:

  • Dependencies and app instantiation: importing the required packages and starting the Dash app
Dependencies and Dash App Instantiation
  • Layout: defines what your web application looks like
Dash App Layout
  • Callbacks: add interactivity to your charts, visuals, and buttons
Dash App Callbacks

We will write all three parts of the Dash app in a single app.py file, as below:

Dash App Complete app.py file
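Since the embedded gists are not reproduced here, below is a minimal sketch of what such an app.py could look like. This is not the author's original file: the CSV file names (pca_book1.csv, etc.), the column layout, and the page layout are my own assumptions. It presumes the reduced embeddings from step 2 were saved first with, e.g., pca_book1.to_csv('pca_book1.csv', index=False).

import dash
import dash_core_components as dcc
import dash_html_components as html
from dash.dependencies import Input, Output
import pandas as pd
import plotly.graph_objs as go

# --- Dependencies and app instantiation ---
app = dash.Dash(__name__)

# hypothetical file names: one CSV of reduced embeddings per book,
# with columns '0', '1', '2' (the three components) and 'words'
books = {f"Book {i}": pd.read_csv(f"pca_book{i}.csv") for i in range(1, 6)}

# --- Layout ---
app.layout = html.Div([
    html.H2("Game of Thrones BERT Embeddings in 3D"),
    dcc.Dropdown(
        id="book-dropdown",
        options=[{"label": name, "value": name} for name in books],
        value="Book 1",
    ),
    dcc.Graph(id="embedding-graph"),
])

# --- Callback: redraw the 3D scatter when a different book is selected ---
@app.callback(Output("embedding-graph", "figure"),
              [Input("book-dropdown", "value")])
def update_graph(book_name):
    df = books[book_name]
    trace = go.Scatter3d(
        x=df["0"], y=df["1"], z=df["2"],
        mode="markers",
        marker=dict(size=2, opacity=0.6),
        text=df["words"],   # the word appears on hover
        hoverinfo="text",
    )
    return {"data": [trace], "layout": go.Layout(title=book_name)}

if __name__ == "__main__":
    app.run_server(debug=True)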

and run the app.py file from your terminal:

python app.py

And you're done! Now you can explore your GOT characters in 3D.

Thank you so much for reading —



Written by Shivanand Roy

AI Lead @ Ernst & Young | Sharing Unfiltered Stories of my Life’s Journey