Visualizing 1.4 Million Game of Thrones Words with BERT
This past weekend, while watching Game of Thrones at dinner, I had a thought: what if I visualized the text of all the GOT books with the mighty BERT (Bidirectional Encoder Representations from Transformers)?
How can we achieve this?
1. First, we will extract the BERT embeddings for each word in all the GOT books.
2. Then, we will reduce the dimensionality of the BERT embeddings so we can visualize them in 3D.
3. Finally, we will create a web application to visualize them in the browser.
Let's start!
1. Extracting BERT Embeddings for Game of Thrones Books
Extracting BERT embeddings for your own data can be intimidating at first, but it doesn't have to be. Gary Lai's awesome package bert-embedding lets you extract token-level embeddings without any hassle. The code is as simple as:
# installing bert-embedding
!pip install bert-embedding

# importing bert_embedding
from bert_embedding import BertEmbedding

# text to be encoded
text = """We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers.
Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations by jointly conditioning on both left and right context in all layers.
"""

# splitting the text into sentences
sentences = text.split('\n')

# instantiating the BertEmbedding class
bert_embedding = BertEmbedding()

# passing sentences to the bert_embedding model
result = bert_embedding(sentences)
Let's use this package on our data.
- First, we will download the text of 5 Game of Thrones books using requests:
import requests

book1 = "https://raw.githubusercontent.com/llSourcell/word_vectors_game_of_thrones-LIVE/master/data/got1.txt"
book2 = "https://raw.githubusercontent.com/llSourcell/word_vectors_game_of_thrones-LIVE/master/data/got2.txt"
book3 = "https://raw.githubusercontent.com/llSourcell/word_vectors_game_of_thrones-LIVE/master/data/got3.txt"
book4 = "https://raw.githubusercontent.com/llSourcell/word_vectors_game_of_thrones-LIVE/master/data/got4.txt"
book5 = "https://raw.githubusercontent.com/llSourcell/word_vectors_game_of_thrones-LIVE/master/data/got5.txt"

b1 = requests.get(book1)
b2 = requests.get(book2)
b3 = requests.get(book3)
b4 = requests.get(book4)
b5 = requests.get(book5)

book1_content = [sent for sent in b1.text.splitlines() if sent != '']
book2_content = [sent for sent in b2.text.splitlines() if sent != '']
book3_content = [sent for sent in b3.text.splitlines() if sent != '']
book4_content = [sent for sent in b4.text.splitlines() if sent != '']
book5_content = [sent for sent in b5.text.splitlines() if sent != '']
- Next, we will clean the content of each book and store it as a list of sentences:
import re

def sentence_to_wordlist(raw):
    # keep only alphabetic characters
    clean = re.sub("[^a-zA-Z]", " ", raw)
    words = clean.split()
    return words

book1_sentences = []
for raw_sentence in book1_content:
    if len(raw_sentence) > 0:
        book1_sentences.append(' '.join(sentence_to_wordlist(raw_sentence)))

book2_sentences = []
for raw_sentence in book2_content:
    if len(raw_sentence) > 0:
        book2_sentences.append(' '.join(sentence_to_wordlist(raw_sentence)))

book3_sentences = []
for raw_sentence in book3_content:
    if len(raw_sentence) > 0:
        book3_sentences.append(' '.join(sentence_to_wordlist(raw_sentence)))

book4_sentences = []
for raw_sentence in book4_content:
    if len(raw_sentence) > 0:
        book4_sentences.append(' '.join(sentence_to_wordlist(raw_sentence)))

book5_sentences = []
for raw_sentence in book5_content:
    if len(raw_sentence) > 0:
        book5_sentences.append(' '.join(sentence_to_wordlist(raw_sentence)))
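To see what the cleaning step actually does, here is the same function applied to a made-up line (the sample sentence below is my own, not from the books):

```python
import re

def sentence_to_wordlist(raw):
    # strip everything except letters, then split on whitespace
    clean = re.sub("[^a-zA-Z]", " ", raw)
    return clean.split()

sample = 'Winter is coming, "my lord" ... in 1996!'
cleaned = ' '.join(sentence_to_wordlist(sample))
print(cleaned)  # Winter is coming my lord in
```

Note that punctuation and digits are dropped entirely, which is fine for word-level visualization.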
- Once we have a clean list of sentences for each book, we can extract BERT embeddings using the code below:
# importing dependencies
from bert_embedding import BertEmbedding
from tqdm import tqdm_notebook
import pandas as pd
import mxnet as mx

# bert_embedding supports GPU for faster processing
ctx = mx.gpu(0)

# This function extracts BERT embeddings and stores them in a
# structured format, i.e. a dataframe
def generate_bert_embeddings(sentences):
    bert_embedding = BertEmbedding(ctx=ctx)
    print("Encoding Sentences:")
    result = bert_embedding(sentences)
    print("Encoding Finished")
    df = pd.DataFrame()
    for i in tqdm_notebook(range(len(result))):
        embed = pd.DataFrame(result[i][1])
        embed['words'] = result[i][0]
        df = pd.concat([df, embed])
    return df

book1_embedding = generate_bert_embeddings(book1_sentences)
book2_embedding = generate_bert_embeddings(book2_sentences)
book3_embedding = generate_bert_embeddings(book3_sentences)
book4_embedding = generate_bert_embeddings(book4_sentences)
book5_embedding = generate_bert_embeddings(book5_sentences)
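You can sanity-check the reshaping logic inside generate_bert_embeddings without a GPU or the model itself: bert_embedding returns one (tokens, vectors) pair per sentence, so a stand-in result in that shape is enough. The 3-dimensional toy vectors below are made up; the real ones have 768 dimensions:

```python
import pandas as pd

# stand-in for bert_embedding's output: one (tokens, vectors) pair per sentence
result = [(['winter', 'is'], [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]),
          (['coming'], [[0.7, 0.8, 0.9]])]

df = pd.DataFrame()
for tokens, vectors in result:
    embed = pd.DataFrame(vectors)   # one row per token
    embed['words'] = tokens         # attach each token next to its vector
    df = pd.concat([df, embed])

print(df.shape)  # (3, 4): three tokens, three embedding columns plus 'words'
```

Each row of the final dataframe pairs a token with its embedding, which is exactly the format the dimensionality-reduction step expects.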
2. Reducing BERT Embedding Dimensions Using PCA & tSVD
- BERT embeddings are 768-dimensional vectors, i.e. we have 768 numbers representing each word or token found in the books.
- We will reduce the dimensionality of these vectors from 768 to 3, so we can visualize the tokens/words in three dimensions, using the code below:
from sklearn.decomposition import TruncatedSVD
from sklearn.decomposition import PCA

# This function reduces the dimension of the embeddings using tSVD
# and PCA
def reduce_dimension(embedding_df):
    # Dimensionality reduction using tSVD
    tsvd = TruncatedSVD(n_components=3)
    tsvd_3d = pd.DataFrame(tsvd.fit_transform(embedding_df.drop('words', axis=1)))
    tsvd_3d['words'] = embedding_df['words'].values

    # Dimensionality reduction using PCA
    pca = PCA(3)
    pca_3d = pd.DataFrame(pca.fit_transform(embedding_df.drop('words', axis=1)))
    pca_3d['words'] = embedding_df['words'].values

    return tsvd_3d, pca_3d
- Let's apply the above function to our embeddings:
tsvd_book1, pca_book1 = reduce_dimension(book1_embedding)
tsvd_book2, pca_book2 = reduce_dimension(book2_embedding)
tsvd_book3, pca_book3 = reduce_dimension(book3_embedding)
tsvd_book4, pca_book4 = reduce_dimension(book4_embedding)
tsvd_book5, pca_book5 = reduce_dimension(book5_embedding)
Voila! Now we have a 3-dimensional projection of every word in all the GOT books.
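Before handing the reduced embeddings to the web app, it is convenient to stack the five per-book frames into one, with a column recording which book each word came from. This is a sketch using tiny synthetic stand-ins; real frames from reduce_dimension have columns 0, 1, 2, and 'words':

```python
import pandas as pd

def combine_books(book_frames):
    # label each frame with its book number, then stack them
    labelled = []
    for i, frame in enumerate(book_frames, start=1):
        frame = frame.copy()
        frame['book'] = 'Book {}'.format(i)
        labelled.append(frame)
    return pd.concat(labelled, ignore_index=True)

# synthetic stand-ins for tsvd_book1, tsvd_book2, ...
demo = [pd.DataFrame({0: [0.1], 1: [0.2], 2: [0.3], 'words': ['winter']}),
        pd.DataFrame({0: [0.4], 1: [0.5], 2: [0.6], 'words': ['snow']})]
combined = combine_books(demo)
print(list(combined['book']))  # ['Book 1', 'Book 2']
```

A single labelled frame like this makes the Dash callback in the next section a one-line filter.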
Complete Code —
Extracting BERT embeddings and reducing their dimensionality can be a time-consuming process, but not for you.
You can download Game of Thrones BERT Embeddings from here: Download
3. Building a Web App to Visualize in the Browser
This is the final part of the project. We will build a front end to visualize these embeddings in three dimensions, in pure Python.
To do this, we will use Dash. Dash is a Python framework that lets you build beautiful web-based analytical apps in pure Python. No JavaScript required.
You can install Dash with: pip install dash==1.8.0
A Dash application consists of 3 parts:
- Dependencies and app instantiation: importing the required packages and creating a Dash app
- Layout: defines what your web application will look like
- Callbacks: add interactivity to your charts, visuals, and buttons
We will write all 3 parts of the Dash app in a single app.py file,
and run the app.py file in your terminal as below —
>> python app.py
and you're done! Now you can explore your GOT characters in 3D.
Thank you so much for reading —