Analyzing Twitter Data using Datasift, MongoDB, Hive and ODI12c

Last week I posted an article on the blog around analysing Twitter data using Datasift, MongoDB and Pig, where I used the Datasift service to stream tweets about Rittman Mead into a MongoDB NoSQL database, and then queried the dataset using Pig. The context for this is the idea of a “data reservoir”, where we supplement the more traditional file and relational datasets we find in data warehouses with other data, typically machine generated, unstructured or very low-level, to add context to the numbers in our reporting system. In the example I quoted in the article, it’d be very interesting to take the activity we record against our blog and website and correlate that with the “conversation” that happens about it in the social media world; for example, were the hits for a particular article due to it been mentioned in a tweet, and did a spike in activity correspond to a particularly influential Twitter user retweeting something we’d tweeted?

NewImage

In that previous article I’d used Pig to access and analyse the data, in part because I saw a match between the nested datasets in a typical DataSift Twitter message and the relations, tuples and bags you get in a Pig schema. For example, if you look at the Tweet from Borkur in the screenshot below from RoboMongo, a Mac OS X client for MongoDB that I’ve found useful, you can see the author details nested inside the interaction details, and the Type attribute having many values under the Trends parent attribute - these map well onto Pig tuples and bags respectively.

NewImage

What I’d really like to do with this dataset, though, is to take certain elements of it and use that to supplement the data I’m loading using ODI12c. Whilst ODI can run arbitrary R, Pig and shell scripts using the ODI Procedure feature (as I did here to make use of Sqoop, before Oracle added Sqoop KMs to ODI12.1.3), it gets the best out of Hadoop when it can access data using Hive, the SQL layer over Hadoop that represents HDFS data as rows and columns, and allows us to SELECT and INSERT data using SQL commands - or to be precise, a dialect of SQL called HiveQL. But how will Hive cope with the nested and repeating data structures in a DataSift Twitter message, and allow us to get just the data out that we’re interested in?

In fact, the MongoDB connector for Hadoop that I used for Pig the other day also comes with Hive connectivity, in the form of a SerDe that lets Hive report against data in a MongoDB database (David Allen blogged about another MongoDB Hive storage handler a while ago, in an article about MongoDB and ODI). What’s more, this Hive connector for MongoDB is actually easier to work with that the Pig connector, as instead of worrying about Tuples and Bags you can just pick out the nested attributes that you’re interested in using a dot notation. For example, if I’m only interested in the InteractionID, username, tweet content and number of followers within a particular Twitter dataset, I can create a table that looks like this in Hive:

CREATE TABLE tweet_data(
  interactionId string,
  username string,
  content string,
  author_followers int)
ROW FORMAT SERDE 
  'com.mongodb.hadoop.hive.BSONSerDe' 
STORED BY 
  'com.mongodb.hadoop.hive.MongoStorageHandler' 
WITH SERDEPROPERTIES ( 
  'mongo.columns.mapping'='{"interactionId":"interactionId",
  "username":"interaction.interaction.author.username",
  "content":\"interaction.interaction.content",
  "author_followers_count":"interaction.twitter.user.followers_count"}'
  )
TBLPROPERTIES (
  'mongo.uri'='mongodb://cdh51-node1:27017/datasiftmongodb.rm_tweets'
  )

And at that point, it’s pretty easy to bring the dataset into ODI12c, through the IKM Hive to Hive Control Append knowledge module, and join up the Twitter dataset with the website log data that’s coming in via Flume. ODI can connect to Hive via JDBC drivers supplied with CDH4/5, and once you register the Hive connection and reverse-engineer the Hive metastore metadata into ODI’s repository, the complexity of the underlying Hive storage is hidden and you’re just presented with tables and columns, just like any other datastore type.

NewImage

Starting with the Twitter data first, I create a Hive table outside of ODI that returns the precise set of tweet attributes that I’m interested in, and then filter that dataset down to just the tweets that link to content on our website, by filtering on the tweet link’s URL matching the start of our website address.

NewImage

Then I load-up the hits from the Rittman Mead website, previously landed into Hadoop using Flume and exposed to ODI as another Hive table, filter out all the non-blog page accesses and keep just the URL part of the Apache Weblog request field, removing the transport mechanism and other bits around it.

NewImage

Then, I use a final ODI mapping to join the two datasets together, using ODI’s ability to apply HiveQL expressions to the incoming datasets so that’ve got the same format - trailing ‘/‘ at the end of the URL, no ampersand and query text at the end of the URL, and so on. Both this and the previous transformation are great examples of where ODI can help with this sort of work, making it pretty easy to munge and correct your data so that you’re then able to match-up the two different sources.

NewImage

Then it’s just a case of creating a package or load plan to sequence the mappings, and then run them using the local or standalone agent. You can see the individual KM steps running on the left-hand side, with ODI generating HiveQL queries which in turn are translated into MapReduce and run in parallel across the Hadoop cluster.

NewImage

And then, at the end of the process, I’ve got a Hive table of all of our blog articles that have been mentioned on Twitter (since we started consuming the tweet feed, a day or so ago), with the number of page requests and the number of times that page got mentioned in tweets.

NewImage

Obviously there’s a lot more we can do with this; we can access the number of followers each twitter user has, along with their location, gender and the sentiment (positive, negative, neutral) of the tweet. From that we can work out some impact from the twitter activity, and we can also add to it data from other sources such as Facebook, LinkedIn and so on to get a fuller picture of the activity around our site. Then, the data we’re gathering in can either be left in MongoDB, or I can use these ODI mappings to either archive it in Hive tables, or export the highlights out to Oracle Database using Sqoop or Oracle Loader for Hadoop.