Visualizing 1.4 Million Game of Thrones Words with BERT
This past weekend, while watching Game of Thrones at dinner, I had a thought: what if I visualized the text of all the GOT books with the mighty BERT (Bidirectional Encoder Representations from Transformers)?
How can we achieve this?
1. First, we will extract the BERT embeddings for each word in all the GOT books.
2. Then, we will reduce the dimensionality of the BERT embeddings so we can visualize them in 3D.
3. Finally, we will create a web application to visualize them in the browser.
Let's start!
1. Extracting BERT Embeddings for Game of Thrones Books
Extracting BERT embeddings for your own data can be intimidating at first, but it doesn't have to be. Gary Lai's awesome package bert-embedding lets you extract token-level embeddings without any hassle. The code is as simple as:
# installing bert-embedding
!pip install bert-embedding

# importing bert_embedding
from bert_embedding import BertEmbedding

# text to be encoded
text = """We introduce a new language representation model called BERT, which stands for Bidirectional Encoder Representations from Transformers.
Unlike recent language representation models, BERT is designed to pre-train deep bidirectional representations by jointly conditioning on both left and right context in all layers.
"""

# splitting the text into sentences
sentences = text.split('\n')

# instantiating the BertEmbedding class
bert_embedding = BertEmbedding()

# passing sentences to the bert_embedding model
result = bert_embedding(sentences)
Let's use this package on our data.
- First, we will download the text of 5 Game of Thrones books using requests:
import requests

book1 = "https://raw.githubusercontent.com/llSourcell/word_vectors_game_of_thrones-LIVE/master/data/got1.txt"
book2 = "https://raw.githubusercontent.com/llSourcell/word_vectors_game_of_thrones-LIVE/master/data/got2.txt"
book3 = "https://raw.githubusercontent.com/llSourcell/word_vectors_game_of_thrones-LIVE/master/data/got3.txt"
book4 = "https://raw.githubusercontent.com/llSourcell/word_vectors_game_of_thrones-LIVE/master/data/got4.txt"
book5 = "https://raw.githubusercontent.com/llSourcell/word_vectors_game_of_thrones-LIVE/master/data/got5.txt"

b1 = requests.get(book1)
b2 = requests.get(book2)
b3 = requests.get(book3)
b4 = requests.get(book4)
b5 = requests.get(book5)

book1_content = [sent for sent in b1.text.splitlines() if sent != '']
book2_content = [sent for sent in b2.text.splitlines() if sent != '']
book3_content = [sent for sent in b3.text.splitlines() if sent != '']
book4_content = [sent for sent in b4.text.splitlines() if sent != '']
book5_content = [sent for sent in b5.text.splitlines() if sent != '']
- Next, we will clean the content of each book and store it as a list of sentences:
import re

def sentence_to_wordlist(raw):
    # keep only alphabetic characters
    clean = re.sub("[^a-zA-Z]", " ", raw)
    words = clean.split()
    return words

book1_sentences = []
for raw_sentence in book1_content:
    if len(raw_sentence) > 0:
        book1_sentences.append(' '.join(sentence_to_wordlist(raw_sentence)))

book2_sentences = []
for raw_sentence in book2_content:
    if len(raw_sentence) > 0:
        book2_sentences.append(' '.join(sentence_to_wordlist(raw_sentence)))

book3_sentences = []
for raw_sentence in book3_content:
    if len(raw_sentence) > 0:
        book3_sentences.append(' '.join(sentence_to_wordlist(raw_sentence)))

book4_sentences = []
for raw_sentence in book4_content:
    if len(raw_sentence) > 0:
        book4_sentences.append(' '.join(sentence_to_wordlist(raw_sentence)))

book5_sentences = []
for raw_sentence in book5_content:
    if len(raw_sentence) > 0:
        book5_sentences.append(' '.join(sentence_to_wordlist(raw_sentence)))
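To see what the cleaning step actually does, here is the same function applied to a made-up line (the sample sentence below is my own, not from the books):

```python
import re

def sentence_to_wordlist(raw):
    # strip everything except letters, then split on whitespace
    clean = re.sub("[^a-zA-Z]", " ", raw)
    return clean.split()

sample = 'Winter is coming, "my lord" ... in 1996!'
cleaned = ' '.join(sentence_to_wordlist(sample))
print(cleaned)  # Winter is coming my lord in
```

Note that punctuation and digits are dropped entirely, which is fine for word-level visualization.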
- Once we have a clean list of sentences for each book, we can extract BERT embeddings using the code below:
# importing dependencies
from bert_embedding import BertEmbedding
from tqdm import tqdm_notebook
import pandas as pd
import mxnet as mx

# bert_embedding supports GPU for faster processing
ctx = mx.gpu(0)

# This function extracts BERT embeddings and stores them in a
# structured format, i.e. a dataframe
def generate_bert_embeddings(sentences):
    bert_embedding = BertEmbedding(ctx=ctx)
    print("Encoding Sentences:")
    result = bert_embedding(sentences)
    print("Encoding Finished")
    df = pd.DataFrame()
    for i in tqdm_notebook(range(len(result))):
        embed = pd.DataFrame(result[i][1])
        embed['words'] = result[i][0]
        df = pd.concat([df, embed])
    return df

book1_embedding = generate_bert_embeddings(book1_sentences)
book2_embedding = generate_bert_embeddings(book2_sentences)
book3_embedding = generate_bert_embeddings(book3_sentences)
book4_embedding = generate_bert_embeddings(book4_sentences)
book5_embedding = generate_bert_embeddings(book5_sentences)
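You can sanity-check the reshaping logic inside generate_bert_embeddings without a GPU or the model itself: bert_embedding returns one (tokens, vectors) pair per sentence, so a stand-in result in that shape is enough. The 3-dimensional toy vectors below are made up; the real ones have 768 dimensions:

```python
import pandas as pd

# stand-in for bert_embedding's output: one (tokens, vectors) pair per sentence
result = [(['winter', 'is'], [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]),
          (['coming'], [[0.7, 0.8, 0.9]])]

df = pd.DataFrame()
for tokens, vectors in result:
    embed = pd.DataFrame(vectors)   # one row per token
    embed['words'] = tokens         # attach each token next to its vector
    df = pd.concat([df, embed])

print(df.shape)  # (3, 4): three tokens, three embedding columns plus 'words'
```

Each row of the final dataframe pairs a token with its embedding, which is exactly the format the dimensionality-reduction step expects.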
2. Reducing BERT Embedding Dimensions Using PCA & tSVD
- BERT embeddings are 768-dimensional vectors, i.e. we have 768 numbers representing each word or token found in the books.
- We will reduce the dimensionality of these vectors from 768 to 3, so we can visualize the tokens/words in three dimensions, using the code below:
from sklearn.decomposition import TruncatedSVD
from sklearn.decomposition import PCA

# This function reduces the dimension of the embeddings using tSVD
# and PCA
def reduce_dimension(embedding_df):
    # Dimensionality reduction using tSVD
    tsvd = TruncatedSVD(n_components=3)
    tsvd_3d = pd.DataFrame(tsvd.fit_transform(embedding_df.drop('words', axis=1)))
    tsvd_3d['words'] = embedding_df['words'].values

    # Dimensionality reduction using PCA
    pca = PCA(3)
    pca_3d = pd.DataFrame(pca.fit_transform(embedding_df.drop('words', axis=1)))
    pca_3d['words'] = embedding_df['words'].values

    return tsvd_3d, pca_3d
- Let's apply the above function to our embeddings:
tsvd_book1, pca_book1 = reduce_dimension(book1_embedding)
tsvd_book2, pca_book2 = reduce_dimension(book2_embedding)
tsvd_book3, pca_book3 = reduce_dimension(book3_embedding)
tsvd_book4, pca_book4 = reduce_dimension(book4_embedding)
tsvd_book5, pca_book5 = reduce_dimension(book5_embedding)
Voila! Now we have a 3-dimensional projection of every word in all the GOT books.
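Before handing the reduced embeddings to the web app, it is convenient to stack the five per-book frames into one, with a column recording which book each word came from. This is a sketch using tiny synthetic stand-ins; real frames from reduce_dimension have columns 0, 1, 2, and 'words':

```python
import pandas as pd

def combine_books(book_frames):
    # label each frame with its book number, then stack them
    labelled = []
    for i, frame in enumerate(book_frames, start=1):
        frame = frame.copy()
        frame['book'] = 'Book {}'.format(i)
        labelled.append(frame)
    return pd.concat(labelled, ignore_index=True)

# synthetic stand-ins for tsvd_book1, tsvd_book2, ...
demo = [pd.DataFrame({0: [0.1], 1: [0.2], 2: [0.3], 'words': ['winter']}),
        pd.DataFrame({0: [0.4], 1: [0.5], 2: [0.6], 'words': ['snow']})]
combined = combine_books(demo)
print(list(combined['book']))  # ['Book 1', 'Book 2']
```

A single labelled frame like this makes the Dash callback in the next section a one-line filter.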
Complete Code —
Extracting BERT embeddings and reducing their dimensionality can be a time-consuming process, but not for you.
You can download Game of Thrones BERT Embeddings from here: Download
3. Building a Web App to Visualize in the Browser
This is the final part of the project. We will build a front end to visualize these embeddings in three dimensions, in pure Python.
To do this, we will use Dash. Dash is a Python framework that lets you build beautiful web-based analytical apps in pure Python. No JavaScript required.
You can install Dash with: pip install dash==1.8.0
A Dash application consists of 3 parts:
- Dependencies and app instantiation: importing the required packages and creating a Dash app
- Layout: defines what your web application will look like
- Callbacks: add interactivity to your charts, visuals, and buttons
We will write all 3 parts of the Dash app in a single app.py file,
and run the app.py file in your terminal as below —
>> python app.py
and you're done! Now you can explore your GOT characters in 3D.
Thank you so much for reading —