The Week After: Game of Thrones S07 E06 Tweets and Press Reviews Analysis

The Week After: Game of Thrones S07 E06 Tweets and Press Reviews Analysis

Another week is gone, another "Game of Thrones" episode watched, only one left until the end of the 7th series.
The "incident" in Spain, with the episode released for few hours on Wednesday screwed all my plans to do a time-wise comparison between episodes across several countries.

Leaked

I was then forced to think about a new action plan in order avoid disappointing all the fans who enjoyed my previous blog post about the episode 5. What you'll read in today's analysis is based on the same technology as before: Kafka Connect source from Twitter and Sink to BigQuery with Tableau analysis on top.

Overall Flow

What I changed in the meantime is the data structure setup: in the previous part there was a BigQuery table rm_got containing #GoT tweets, an Excel table containing Keywords for each character together with the Name and the Family (or House). Finally there was a view on top of BigQuery rm_got table extracting all the words of each tweet in order to analyse their sentiment.
For this week analysis I tried to optimise the dataflow, mainly pushing data into BigQuery, and I added a new part to it: online press reviews analysis!

Optimization

As mentioned during my previous post, the setup described before was miming an analyst workflow, without writing access to datasource. However it was far from optimal performance wise, since there was a cartesian join between two data-sources, meaning that for every query all the dataset was extracted from BigQuery and then joined in memory in Tableau even if filters for a specific character were included.

The first change was pushing the characters Excel data in BigQuery, so at least we could use the same datasource joins instead of relying on Tableau's data-blend. This has the immediate benefit of running joins and filters in the datasource rather than retrieving all data and filtering locally in memory.
Pushing Excel data into BigQuery is really easy and can be done directly in the web GUI, we just need to transform the data in CSV which is one of allowed input data formats.

BigQuery Input Data Formats

Still this modification alone doesn't resolve the problem of the cartesian join between characters (stored in rm_characters) and the main rm_got table since also BigQuery native joining conditions don't allow the usage of the CONTAIN function we need to verify that the character Key is contained in the Tweet's Text.
Luckily I already had the rm_words view, used in the previous post, splitting the words contained in the Tweet Text into multiple rows. The view contained the Tweet's Id and could be joined with the characters data with a = condition.

However my over simplistic first implementation of the view was removing only # and @ characters from the Tweet text, leaving all the others punctuation signs in the words as you can see in the image below.

Word 1st Wiew

I replaced the old rm_words view code with the following

SELECT  id, TEXT, SPLIT(REGEXP_REPLACE(REPLACE(UPPER(TEXT), 'NIGHT KING', 'NIGHTKING'),'[^a-zA-Z]',' '),' ')  f0__group.word FROM [big-query-ftisiot:BigQueryFtisiotDataset.rm_got] 

Which has two benefits:

  • REPLACE(UPPER(TEXT), 'NIGHT KING', 'NIGHTKING'): Since I'm splitting words, I don't want to miss references to the Night King which is composed by two words that even if written separated point the same character.
  • REGEXP_REPLACE(..,'[^a-zA-Z]',' '): Replaces using regular expression, removing any character apart from the letters A-Z in lower and upper case from the Tweet Text.

The new view definition provides a clean set of words that can finally be joined with the list of characters keys. The last step I did to prepare the data was to create an unique view containing all the fields I was interested for my analysis with the following code:

SELECT
  rm_got.Id,
  rm_got.Text,
  rm_got.CreatedAt,
  [...]
  characters.Key,
  characters.Name,
  characters.Family
FROM
  [DataSet.rm_got] AS rm_got JOIN
  [DataSet.rm_words] AS rm_words ON rm_got.id=rm_words.id JOIN 
  (SELECT * FROM FLATTEN([DataSet.rm_words],f0__group.word)) AS rm_words_char ON rm_got.id=rm_words_char.id JOIN 
  [DataSet.rm_charachters] AS characters ON rm_words_char.f0__group.word = characters.Key

Two things to notice:

  • The view rm_words is used two times: one, as mentioned before, to join the Tweet with the character data and one to show all the words contained in a tweet.
  • The (SELECT * FROM FLATTEN([DataSet.rm_words],f0__group.word)) subselect is required since word column, contained in rm_words, was a repeated field, that can't be used in joining condition if not flatten.

RepeatedField

Please note that the SQL above will still duplicate the Tweet rows, in reality we'll have a row for each word and different character Key contained in the Text itself. Still this is a big improvement from the cartesian join we used in our first attempt.

One last mention to optimizations: currently the sentence and word sentiment is calculated on the fly in Tableau using the SCRIPT_INT function. This means that data is extracted from BigQuery into Tableau, then passed to R (running locally in my pc) which computes the score and then returns it to Tableau. In order to optimize Tableau performance I could pre-compute the scores in R and push them in a BigQuery Table but this would mean a pre-processing step that I wanted to avoid since a real-time analysis was one of my purposes.

Tweet Analysis

With my tidy dataset in place, I can now start the analysis and, as the previous week I can track various KPIs like the mentions by character Family and Name. To filter only current week data I created two parameters Start Date of Analysis and End Date of Analysis

Start and End Date

Using those parameters I can filter which days I want to include in my analysis. To apply the filter in the Workbook/Dashboard I created also a column Is Date of Analysis with the following formula

IIF(DATE([CreatedAt]) >= [Start Date of Analysis] 
AND DATE([CreatedAt]) <= [End Date of Analysis]
,'Yes','No')

I can now use the Is Date of Analysis column in my Workbooks and filter the Yes value to retain only the selected dates.

I built a dashboard containing few of the analysis mentioned in my previous blog post, in which I can see the overall scatterplot of characters by # of Tweets and Sentence Sentiment and click on one of them to check its details regarding the most common words used and sentence sentiment.

Overall Sentiment

From the scatterplot on top we can see a change of leadership in the # of Tweets with Daenerys overtaking Jon by a good margin, saving him and in the meantime loosing one of the three dragons was a touching moment in the episode. When clicking on Daenerys we can see that the world WHITE is driving also the positive sentiment.

Daenerys

The Night King keep its leadership on the Sentiment positive side. Also in this case the WHITE word being the most used with positive sentiment. On the other side Arya overtook Sansa as character with most negative mentions. When going in detail on The positive/negative words, we can clearly see that STARK (mentioned in previous episode), KILL, WRONG and DEATH are driving the negative sentiment. Interesting is also the word WEAR with negative sentiment (from Google dictionary "damage, erode, or destroy by friction or use.").

Arya Positive Negative

A cut down version of the workbook with a limited dataset, visible in the image below, is available in Tableau Public.

Overall Dashboard

Game of Couples

Couples

This comparison is all what I promised towards the end of my first post, so I could easily stop here. However as curious person and #GoT fan myself I wanted to know more about the dataset and in particular analyse how character interaction affect sentiment. To do so I had somehow to join characters together if they were mentioned in the same tweet, luckily enough my dataset contained the character mentioned and the list of words of each Tweet. I can reuse the list of words on a left join with the list of characters keys. In this way I have a record for each couple of characters mentioned in a Tweet.

Left Join

I can then start analysing the Tweets mentioning any couple of characters, with the # of Tweets driving the gradient. As you can see I removed the values where the column and row is equal (e.g. Arya and Arya). The result, as expected, is a symmetric matrix since the # of Tweets mentioning Arya and Sansa is the same as the ones mentioning Sansa and Arya.

Colour tweets by couple

We can clearly see that Jon and Daenerys are the most mentioned couple with Sansa and Arya following and in third place Whitewalkers and Bran. This view and the insights we took from it could be problematic to get in cases when the reader is colour blind or has troubles when defining intensity. For those cases a view like the below provides the same information (by only switching the # of Tweets column from Color to Size), however it has the drawback that small squares are hard to see.

Characters Squares

The next step in my "couple analysis" is understand sentiment, and how a second character mentioned in the same tweet affects the positive/negative score of a character. The first step I did is showing the same scatterplot as before, but filtered for a single character, in this case Arya.

Arya Scatterplot

The graph shows Arya's original position, and how the Sentiment and the # of Tweets change the position when another character is included in the Tweet. We can see that, when mentioned with Daenerys the sentiment is much more positive, while when mentioned with Bran or Littlefinger the sentiment remains almost the same.

This graph it's very easy to read, however it has the limitation of being able to display only one character behaviour at time (in this case Arya). What I wanted is to show the same pattern across all characters in a similar way as when analysing the # of Tweets per couple. To do so I went back to a matrix stile of visualization, setting the colour based on positive (green) or negative (red) sentiment.

Overall Character Sentiment

As before the matrix is symmetric, and provides us a new set of insights. For example, when analysing Jorah Mormont, we can see that a mention together with Cercei is negative which we can somehow expect due to the nature of the queen. What's strange is that also when Jorah is mentioned with Samwell Tarly there is a negative feeling. Looking deeply in the data we can see that it's due to a unique tweet containing both names with a negative sentiment score.

Jorah Same

What's missing in the above visualization is an indication on how "strong" is the relationship between two character based on the # of Tweets where they are mentioned together. We can add this by including the # of Tweets as position of the sentiment square. The more the square is moved towards the right the higher is the # of Tweets mentioning the two characters together.

Dan-Jon

We can see as before that Jorah and Sam have a negative feeling when mentioned together, but it's not statistically significant because the # of Tweets is very limited (square position completely on the left). Another example is Daenerys and Jon which have a lot of mentions together with a neutral sentiment. As we saw before also the couple Arya and Bran when mentioned together have a negative feeling, with a limited number Tweets mentioning them together. However Bran mentioned with WhiteWalkers has a strong positive sentiment.

It's worth mentioning that the positioning of the dot is based on a uniform scale across the whole matrix. This means that if, like in our case, there is a dominant couple (Daenerys and Jon) mentioned by a different order of magnitude of # of Tweets compared to all other couples, the difference in positioning of all the others dots will be minimal. This could however be solved using a logarithmic scale.

Web Scraping

Warning: all the analysis done in the article including this chapter are performed with automated tools. Due to the nature of the subject (a TV series plenty of deaths, battles and thrilling scenes) the words used to describe a sentence could be automatically classified as positive/negative. This doesn't automatically mean that the opinion of the writer is either positive or negative about the scene/episode/series.

The last part of the analysis I had in mind was about comparing the Tweets sentiment, with the same coming from the episode reviews that I could find online. This latter part relies a lot on the usage of R to scrape the relevant bits from the web-pages, the whole process was:

  • Search on Google for Beyond the Wall Reviews
  • Take the top N results
  • Scrape the review from the webpage
  • Tokenize the review in sentences
  • Assign the sentence score using the same method as in Tableau
  • Tokenize the sentence in words
  • Upload the data into BigQuery for further analysis

Few bits on the solution I've used to accomplish this since the reviews are coming from different websites with different tags, classes and Ids, I wasn't able to write a general scraper for all websites. However each review webpage I found had the main text divided in multiple <p> tags under a main <div> tag which had an unique Id or class. The R code simply listed the <div> elements, found the one mentioning the correct Id or class and took all the data contained inside the <p> elements. A unique function is called with three parameters: website, Id or class to look for, and SourceName (e.g. Telegraph). The call to the function is like

sentence_df <- scrapedata("http://www.ign.com/articles/2017/08/21/game-of-thrones-beyond-the-wall-review",'Ign',"article-content")

It will return a dataframe containing one row per <p> tag, together with a mention of the source (Ign in this case).

The rest of the R code tokenizes the strings and the words using the tokenizers package and assigns the related sentiment score with the syuzhet package used in my previous blog post. Finally it creates a JSON file (New Line Delimited) which is one of the input formats accepted by BigQuery.
When the data is in BigQuery, the analysis follows the same approach as before with Tableau connecting directly to BigQuery and using again R for word sentiment scoring.

The overall result in Tableau includes a global Episode sentiment score by Source, the usual scatterplot by character and the same by Source. Each of the visualizations can act as filter for the others.

Press Reviews

We can clearly see that AVClub and Indiewire had opposite feelings about the episode. Jon Snow is the most mentioned character with Arya and Sansa overtaking Daenerys.

The AVClub vs Indiewire scoring can be explained by the sencence sentiment categorization. Indiewire had most negative sentences (negative evaluations) while the distribution of AVClub has its peak on the 1 (positive) value.

Sentiment by Source

Checking the words used in the two Sources we can notice as expected a majority of positive for AVClub while Indiewire has the overall counts almost equal.

Count of positive and negative words

Going in detail on the words, we can see the positive sentiment of AVClub being driven by ACTION, SENSE, REUNION while Indiewire negative one due to ENEMY, BATTLE, HORROR.

Bubbles

This is the automated overall sentiment analysis, if we read the two articles from Indiewire and AVClub in detail we can see that the overall opinion is not far from the automated score:

From AVClub

On the level of spectacle, “Beyond The Wall” is another series high point, with stellar work ....

From IdieWire

Add to the list “Beyond the Wall,” an episode that didn’t have quite the notable body count that some of those other installments did

To be fair we also need to say that IdieWire article is focused on the war happening and the thrilling scene with the Whitewalkers where words like ENEMY, COLD, BATTLE, DEATH which have a negative sentiment are actually only used to describe the scene and not the feelings related to it.

Character and Review Source Analysis

The last piece of analysis is related to single characters. As mentioned before part of the dashboard built in Tableau included the Character scatterplot and the Source scatterplot. By clicking on a single Character I can easily filter the Source scatterplot, like in this case for Daenerys.

Daenerys-By-Source

We can see how different Sources have different average sentiment score for the same character, in this case with Mashable being positive while Pastemagazine negative.

Danerys Mentions

Checking the words mentioned we can clearly see a positive sentiment related to PRESENT, AGREED and RIDER for Mashable while the negative sentiment of Pastemagazine is driven by FIGHT, DANGER, LOOSING. As said before just few words of difference describing the same scene can make the difference.

Finally, one last sentence for the very positive sentiment score for Clegor Clegaine: it is partially due to the reference to his nickname, the Mountain, which is used as Key to find references. The mountain is contained in a series of sentences as reference to the place where the group of people guided by Jon Snow are heading in order to find the Whitewalkers. We could easily remove MOUNTAIN from the Keywords to eliminate the mismatch.

Mountain

We are at the end of the second post about Game of Thrones analysis with Tableau, BigQuery and Kafka. Hope you didn't get bored...see you next week for the final episode of the series! And please avoid waking up with blue eyes!

via GIPHY