This year, Rittman Mead were the Analytics Sponsor for the UKOUG Tech13 conference in Manchester, and those who visited the UKOUG stand during their time there will have noticed the Rittman Mead-sponsored Analytics Dashboard on display. In this blog post I will cover how it was put together, the "Blue Peter Way"!
For those of you not familiar with Blue Peter, it is a long-running children's TV show here in the UK that started way back in 1958. Possibly its most infamous moment came in 1969 with Lulu the defecating elephant. As a child, the highlight of the show for me was the "How to make" section, where they would always use the phrase "sticky-backed plastic" instead of "Sellotape" due to a policy against using commercial terms on air. The presenters would create something from scratch with bits and bobs you could find around the house: cereal boxes, egg cartons, washing-up bottles, Sellotape etc. (sorry, "sticky-backed plastic"). That's exactly what I'm going to be doing here, but instead of making a sock monster I'm going to show you how the analytics dashboard was created from scratch with bits you can find on the internet. So kids, without further delay, let's begin...
(remember to ask your parents' permission before downloading any of the following items)
You will need :
- A Linux server.
- A Web server
- The Redis key/value store
- A DataSift account
- The Webdis HTTP server
- The tagcanvas.js jQuery plugin
- The vTicker jQuery plugin
- The flot plotting library
- Some Sticky-backed-Plastic (jQuery)
- Lots of coffee
The Linux server I'm using is Red Hat Enterprise Linux Server 6.4, and the very first thing we'll need to do is install a web server.
Done. The web server is now installed, configured to start on server boot, and up and running ready to service our HTTP requests.
Next up is the Redis key/value datastore. I'll be using Redis to store all incoming tweets from our datasource ( more on that in a bit )
Now that we have Redis up and running, let's perform a couple of tests. First, a benchmark using the redis-benchmark command.
This command throws out a lot more output than this, but it is the LRANGE command we are particularly interested in, as we'll be using it to retrieve our tweets later on. 3593 requests per second seems reasonable to me. There were around 1000 registrations for the UKOUG Tech13 conference, so even if every attendee made 3 dashboard requests within the same second that would be 3000 requests, still under the benchmark figure, and the likelihood of that happening is slim to say the least. Regardless, I'm willing to live with the risk. Now for a quick SET/GET test using the redis-cli command.
OK, so that's our datastore sorted, but what about our datasource? Clearly we will be using Twitter as our primary data source, but mere mortals like myself don't have access to Twitter's entire data stream; for that we need to turn to specialist companies that do. DataSift, amongst other companies like Gnip and Topsy (which interestingly was bought by Apple earlier this week), can offer this service and will add extra features into the mix, such as sentiment analysis and stream subscription services that push data to you. I'll be using DataSift, as here at Rittman Mead we already have an account with them. The service charges you by DPU (Data Processing Unit), the cost of which depends upon the computational complexity of the stream you're running. Suffice to say that running the stream to feed the analytics dashboard for a few days was pretty cheap.
To set up a stream you simply express what you want included from the Twitter firehose using their own Curated Stream Definition Language, CSDL. The stream used for the dashboard was very simple and the CSDL looked like this :-
This is simply searching for tweets containing the string "ukoug" from the Twitter stream. DataSift supports all kinds of other social data, for a full list head over to their website.
Now that we have our data stream set up, how do we actually populate our Redis data store with it? Simple - get DataSift to push it to us. Using their own PUSH API you can set up a subscription in which you can specify the following output parameters: "output_type", "host", "port", "delivery_frequency" and "list", amongst many others. The output type was set to "Redis", the host to our hostname, and the port to the port on which Redis is listening, by default 6379. The delivery_frequency was set to 60 seconds, and the list is the name of the Redis list key you want the data pushed to. With all that set up and the subscription active, tweets will automagically start arriving in our Redis datastore in JSON format - happy days!
The next step is to get the data from Redis to the browser. I could install PHP and configure it to work with Apache, then install one of the several Redis-PHP libraries and write some backend code to serve data up to the browser, but time is limited and I don't want to faff around with all that. Besides, this is Blue Peter, not a BBC4 documentary. Here's where we use the next cool piece of open source software, Webdis. Webdis acts as an HTTP interface to Redis, which means I can make HTTP calls directly from the browser like this :-
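To make the subscription settings above a little more concrete, here is a minimal sketch that simply gathers them into one object. Only the parameter names come from the description above; the hostname, the list key `ukoug_tweets`, the lowercase `"redis"` value and the helper `buildRedisSubscription` are hypothetical, and this is not meant to be the literal request format of the PUSH API.

```javascript
// Hypothetical helper: collects the output parameters named in the text.
// The values are placeholders, not the ones used for the real dashboard.
function buildRedisSubscription(host, listKey) {
  return {
    output_type: "redis",     // assumed casing of the output type
    host: host,               // the server running Redis
    port: 6379,               // Redis default port
    delivery_frequency: 60,   // seconds between deliveries
    list: listKey             // Redis list key the tweets are pushed to
  };
}

const subscription = buildRedisSubscription("dashboard.example.com", "ukoug_tweets");
console.log(JSON.stringify(subscription, null, 2));
```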
Yep, you guessed it, we've just set a key in Redis directly from the browser. To retrieve it we simply do this.
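As a rough sketch of the URL shape involved: Webdis maps a path of the form `/COMMAND/arg1/arg2` onto the matching Redis command and replies with JSON keyed by the command name. The helper `webdisUrl`, the hostname and the key `hello` below are placeholders for illustration, not the values used on the dashboard.

```javascript
// Build a Webdis URL of the form http://host:port/COMMAND/arg1/arg2.
// The host and key names used below are placeholders.
function webdisUrl(baseUrl, command, ...args) {
  const parts = [
    command.toUpperCase(),
    ...args.map((a) => encodeURIComponent(String(a)))
  ];
  // Strip any trailing slash from the base before joining the path.
  return baseUrl.replace(/\/+$/, "") + "/" + parts.join("/");
}

const base = "http://dashboard.example.com:7379"; // 7379 is the Webdis default port
const setUrl = webdisUrl(base, "set", "hello", "world");
const getUrl = webdisUrl(base, "get", "hello");

console.log(setUrl);
console.log(getUrl);
```

Requesting the second URL after the first should return something along the lines of `{"GET":"world"}`.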
Security can be handled in the Webdis config file by setting IP based rules to stop any Tom, Dick or Harry from modifying your Redis keys. So to install Webdis we do the following :-
And that's it, we now have tweet to browser in 60 seconds and without writing a single line of code! Here's an overview of the architecture we have :-
Would I do this in a production environment with a larger data set? Probably not, as it wouldn't scale. I'd instead write server-side code to handle the data processing, test performance and push only the required data to the client. The right tools for the right job.
The function getData() is called within the setInterval() function. The second parameter to setInterval() sets the frequency of the call, in this case every 30 seconds (30000 milliseconds). The getData function performs an Ajax GET request on the URL that points to our Webdis server listening on port 7379, which then performs an LRANGE command on the Redis data store; it's simply asking Redis to return all list items between 0 and 100000, each item being a single tweet. Once the Ajax request has successfully completed, the "success" callback is called, and within this callback I am pushing each tweet into an array so we end up with an array of tweet objects. We now have all the data in a format we can manipulate to our heart's content.
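The polling described above can be sketched roughly as follows. This is not the original dashboard code: it assumes Webdis wraps the reply as `{"LRANGE": [...]}` with each list item holding one tweet as a JSON string, the hostname and list key are placeholders, and the fetch function is passed in so the sketch runs without a browser or jQuery.

```javascript
// Turn a Webdis LRANGE reply into an array of tweet objects.
// Assumes a reply like {"LRANGE": ["{...tweet json...}", "{...}"]}.
function parseTweets(reply) {
  const items = (reply && reply.LRANGE) || [];
  const tweets = [];
  for (const item of items) {
    try {
      tweets.push(JSON.parse(item));
    } catch (err) {
      // Skip any list entry that is not valid JSON.
    }
  }
  return tweets;
}

// fetchJson is injected; in the page it could be something like
// (url) => $.getJSON(url). It must return a promise-like object.
function getData(fetchJson, url, onTweets) {
  return fetchJson(url).then((reply) => onTweets(parseTweets(reply)));
}

// Placeholder host and list key.
const url = "http://dashboard.example.com:7379/LRANGE/ukoug_tweets/0/100000";

// Stand-in for the real Ajax call so the sketch can be run anywhere.
const fakeFetch = () =>
  Promise.resolve({
    LRANGE: ['{"interaction":{"content":"Hello #ukoug"}}', "not json"]
  });

getData(fakeFetch, url, (tweets) => console.log(tweets.length + " tweet(s) loaded"));

// In the browser the call would be repeated on a timer, for example:
// setInterval(() => getData(realFetch, url, render), 30000);
```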
Now onto the Graphs and Visualisations.
The Tweet Ticker
The tweet ticker was built by grabbing the latest 30 tweets from the data array and assigning them to html <li> tags, the vTicker jQuery plugin was then applied to the containing <ul> html tag. Various plugin options allow you to control things like the delay and scroll speed.
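A minimal sketch of the first half of that step, turning the latest tweets into list items, might look like this. It assumes the data array is ordered oldest first and that the tweet text lives in `interaction.content`; both are assumptions, as is the helper name `tickerItems`. Applying the vTicker plugin to the containing list is left out.

```javascript
// Escape text before placing it inside HTML.
function escapeHtml(text) {
  return String(text)
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Return <li> markup for the latest `count` tweets, newest first.
// Assumes tweets are ordered oldest first and keep their text in
// interaction.content (an assumed field name).
function tickerItems(tweets, count) {
  return tweets
    .slice(-count)
    .reverse()
    .map((t) => "<li>" + escapeHtml(t.interaction.content) + "</li>");
}

const sample = [
  { interaction: { content: "First tweet" } },
  { interaction: { content: "Second <b>tweet</b>" } },
  { interaction: { content: "Third tweet" } }
];

console.log(tickerItems(sample, 2).join("\n"));
```

On the dashboard the count would be 30 rather than 2, and the resulting markup would be placed inside the ticker's list element.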
Top 10 Speakers by Twitter Mention
This one proved quite popular - these speaker types are a competitive bunch! Having obtained a list of speakers from the UKOUG Tech13 website, I was able to search all the Twitter content for each speaker and aggregate up to get the total Twitter mentions for each speaker. Again, the graph was rendered using the flot plotting library. As the graph updated during each day, speakers were swapping places, with our own Mark Rittman tweeting out "The Scores on the Doors" at regular intervals. When the stats can be manipulated, though, there's always someone willing to take advantage!
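The aggregation could be sketched along these lines. The matching rule here (a case-insensitive substring test on the tweet text) and the field name `interaction.content` are assumptions rather than the method actually used, and the speaker names are made up.

```javascript
// Count how many tweets mention each speaker, then keep the top `limit`.
// Matching is a case-insensitive substring test, which is an assumption.
function topSpeakers(tweets, speakers, limit) {
  const counts = speakers.map((name) => {
    const needle = name.toLowerCase();
    const mentions = tweets.filter((t) =>
      String(t.interaction.content).toLowerCase().includes(needle)
    ).length;
    return { name, mentions };
  });
  return counts.sort((a, b) => b.mentions - a.mentions).slice(0, limit);
}

// Made-up speakers and tweets for illustration.
const speakers = ["Speaker A", "Speaker B", "Speaker C"];
const tweets = [
  { interaction: { content: "Great talk by Speaker B" } },
  { interaction: { content: "speaker b on stage again" } },
  { interaction: { content: "Speaker A is up next" } },
  { interaction: { content: "Coffee queue is long" } }
];

const top = topSpeakers(tweets, speakers, 2);

// flot takes series data as [x, y] pairs and axis ticks as [value, label].
const barData = top.map((s, i) => [i, s.mentions]);
const ticks = top.map((s, i) => [i, s.name]);

console.log(top, barData, ticks);
```

A substring test like this also counts retweets and a speaker tweeting about themselves, which is one way the scores can be nudged upwards.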
tut, tut Mark.
The Twitter Avatars used the tagcanvas.js library, but instead of populating it with words from the Twitter content, the tweet avatars were used. A few changes to the plugin options were made to display the results as a horizontally scrolling cylinder instead of a globe.
The Twitter sentiment graph again used flot. Tweet sentiment was graphed over time for the duration of the conference. The sentiment score was provided by DataSift as part of the Twitter payload. The scores we received across the 3500 tweets ranged between -15 and 15, with the sign of each score marking the tweet as positive or negative. Asking a computer to infer a human emotion from 140 characters of text is a tough ask. Having looked at the data in detail, a fair few of the tweets that received negative scores weren't negative in nature; for example, tweets with the content "RAC attack" and "Dangerous anti patterns" generated a cluster of negative scores. As we know, computers are not as clever as humans: how can they be aware of the context of a tweet, detect sarcasm or differentiate between banter and plain old insults? Not all the negatively scored tweets were false negatives; some were genuine, and a few regarding the food seemed to ring true.
Perhaps the data you're analyzing needs to be taken in context as a whole. You'd expect 1000 techies running around a tech conference to be happy, and the sentiment analysis seemed to do a better job of ranking how positive tweets were than how negative they were. Perhaps from a visualisation point of view, a logarithmic scale along with a multiplier to raise the lowest score would have worked better in this case to reflect how positive the event was overall. One thing is clear though: further statistical analysis would be needed over a much larger data set to really gain insight into how positive or negative your data set is.
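One way to turn scored tweets into a time series for flot is to average the sentiment per hour, as in the sketch below. The hourly averaging is my own choice for illustration, and the field names `salience.content.sentiment` and `interaction.created_at` are assumptions about the payload rather than something confirmed above.

```javascript
// Average sentiment per hour as [timestampMs, averageScore] pairs,
// the [x, y] shape flot expects for a time series.
// Field names are assumptions about the tweet payload.
function hourlySentiment(tweets) {
  const HOUR_MS = 60 * 60 * 1000;
  const buckets = new Map();
  for (const t of tweets) {
    const when = Date.parse(t.interaction && t.interaction.created_at);
    const score = t.salience && t.salience.content && t.salience.content.sentiment;
    // Ignore tweets without a usable timestamp or score.
    if (Number.isNaN(when) || typeof score !== "number") continue;
    const hour = Math.floor(when / HOUR_MS) * HOUR_MS;
    const bucket = buckets.get(hour) || { sum: 0, n: 0 };
    bucket.sum += score;
    bucket.n += 1;
    buckets.set(hour, bucket);
  }
  return [...buckets.entries()]
    .sort((a, b) => a[0] - b[0])
    .map(([hour, b]) => [hour, b.sum / b.n]);
}

// Made-up tweets for illustration.
const scored = [
  { interaction: { created_at: "2013-12-02T09:15:00Z" }, salience: { content: { sentiment: 4 } } },
  { interaction: { created_at: "2013-12-02T09:45:00Z" }, salience: { content: { sentiment: -2 } } },
  { interaction: { created_at: "2013-12-02T10:05:00Z" }, salience: { content: { sentiment: 6 } } },
  { interaction: { created_at: "2013-12-02T10:30:00Z" } } // no score, skipped
];

const series = hourlySentiment(scored);
console.log(series);
```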
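As one possible reading of that idea, the sketch below shifts every score so that the lowest one lands on 1 and then takes a base-10 logarithm. Note that this uses an additive offset rather than a multiplier, because a logarithm is undefined for zero and negative scores; it is an illustration only, not something that was tried on the dashboard.

```javascript
// Shift scores so the minimum maps to 1, then apply log10.
// The lowest score becomes 0 and every other score is positive.
function shiftedLog(scores) {
  if (scores.length === 0) return [];
  const min = Math.min(...scores);
  const offset = 1 - min; // lowest score maps to log10(1) = 0
  return scores.map((s) => Math.log10(s + offset));
}

// The extremes and midpoint of the -15 to 15 range seen in the data.
const rescaled = shiftedLog([-15, 0, 15]);
console.log(rescaled);
```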
So that's it kids, I hope you've enjoyed this episode of Blue Peter, until next time….
Edit - 6-Dec-2013 19.07:
Mark has asked me the following question over twitter:
"any chance you could let us know why these particular tools / components were chosen?".
I thought I'd give my answer here.
All the tools were also open source, easy to install/configure and are well documented, I had also used them all previously - again, a time saver. Redis was chosen for 2 reasons, it was supported as a subscription destination by DataSift ( along with many others ) and I'm currently using it in another development project I'm working on so I was up to speed with it. Although in this solution I'm not really taking advantage of Redis's access speed it worked well as somewhere to persist the data.
If I were coding a more permanent solution with more development time, I'd add a relational database into the mix and use it to perform all of the data aggregation and analysis, which would make the solution more scalable. I'd then either feed the browser via Ajax calls to backend code that queries the database, or populate a cache layer from the database and make Ajax calls directly on the cache, similar to what I did in this solution. It would have been nice to use D3 to develop a more elaborate visualisation instead of the canned flot charts, but again this would have taken more time to develop.