OBIEE SampleApp in The Cloud: Importing VirtualBox Machines to AWS EC2

September 10th, 2014

Virtualisation has revolutionised how we work as developers. A decade ago, using new software would mean trying to find room on a real tin server to install it, hoping it worked, and if it didn’t, picking apart the pieces and probably leaving the server in a worse state than it was to begin with. Nowadays, we can just launch a virtual machine to get a clean environment, and if it doesn’t work – trash it and start again.
The sting in the tail of virtualisation is that full-blown VMs are heavy – for disk we need several GB just for a blank OS, and dozens of GB if you’re talking about a software stack such as Fusion Middleware (FMW), and the host machine needs the RAM and CPU to support it all too. Technologies such as Linux Containers go some way to making things lighter by abstracting out a chunk of the OS, but this isn’t something that’s reached the common desktop yet.

So whilst VMs are awesome, it’s not always practical to maintain a library of all of them on your local laptop (even 1TB drives fill up pretty quickly), nor will your laptop have the grunt to run more than one or two VMs at most. VMs like this are also local to your laptop or server – but wouldn’t it be neat if you could duplicate that VM and make a server based on it instantly available to anyone in the world with an internet connection? And that’s where The Cloud comes in, because it enables us to store as much data as we can eat (and pay for), and provision “hardware” at the click of a button for just as long as we need it, accessible from anywhere.

Here at Rittman Mead we make extensive use of Amazon Web Services (AWS) and their Elastic Computing Cloud (EC2) offering. Our website runs on it, our training servers run on it, and it scales just as we need it to. A class of 3 students is as easy to provision for as a class of 24 – no hunting around for spare servers or laptops, no hardware sat idle in a cupboard as spare capacity “just in case”.

One of the challenges that we’ve faced up until now is that all servers have had to be built from scratch in the cloud. Obviously we work with development VMs on local machines too, so wouldn’t it be nice if we could build VMs locally and then push them to the cloud? Well, now we can. Amazon offer a route to import virtual machines, and in this article I’m going to show how that works. I’ll use the superb SampleApp v406 VM that Oracle provide, because this is a great real-life example of a VM that is extremely useful, but one that many developers find too memory-intensive to run on their local machines all the time.

This tutorial is based on exporting a Linux guest VM from a Linux host server. A Windows guest probably behaves differently, but a Mac or Windows host should work fine since VirtualBox is supported on both. The specifics are based on SampleApp, but the process should be broadly the same for all VMs. 

Obtain the VM

We’re going to use SampleApp, which can be downloaded from Oracle.

  1. Download the six-part archive from http://www.oracle.com/technetwork/middleware/bi-foundation/obiee-samples-167534.html
  2. Verify the md5 checksums against those published on the download page.
  3. Unpack the archive using 7zip — the instructions for SampleApp are very clear that you must use 7zip, and not another archive tool such as WinZip.
  4. Because we need to change a couple of things on the VM first (see below), we’ll have to import the VM to VirtualBox so that we can boot it up and make these changes. You can import using the VirtualBox GUI, or as I prefer, the VBoxManage command line interface. I like to time all these things (just because, numbers), so stick a time command on the front – see the sketch after this list.

    This took 12 minutes or so, but that was on a high-spec system, so YMMV.
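A minimal sketch of that import command – the .ovf filename is an assumption, so use whatever the SampleApp archive actually unpacked to:

```bash
# Import the unpacked appliance into VirtualBox, timing the operation
time VBoxManage import SampleAppv406.ovf
```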

Preparing the VM

Importing Linux VMs to Amazon EC2 will only work if the kernel is supported, which according to an AWS blog post includes Red Hat Enterprise Linux 5.1 – 6.5. Whilst SampleApp v406 is built on Oracle Linux 6.5 (which isn’t listed by AWS as supported), we have the option of telling the VM to use a kernel that is Red Hat Enterprise Linux compatible (instead of the default Unbreakable Enterprise Kernel – UEK). There are some other pre-requisites that you need to check if you’re trying this with your own VM, including a network adaptor configured to use DHCP. The aforementioned blog post has details.

  1. Boot the VirtualBox VM, which should land you straight in the desktop environment, logged in as the oracle user.
  2. We need to modify a file as root (superuser). Here’s how to do it graphically, or use vi if you’re a real programmer:
    1. Open a Terminal window from the toolbar at the top of the screen
    2. Enter the command that opens the GRUB configuration file as root in a text editor (see the sketch after this list)

      The sudo bit is important, because it tells Linux to run the command as root. (I’m on an xkcd roll here.)

    3. In the text editor that opens, you will see a header to the file and then a set of repeating sections beginning with title. These are the available kernels that the machine can run under. The default is 3, which is zero-based, so it’s the fourth title section. Note that the kernel version details include uek which stands for Unbreakable Enterprise Kernel – and is not going to work on EC2.
    4. Change the default to 0, so that we’ll instead boot to a Red Hat Compatible Kernel, which will work on EC2
    5. Save the file
  3. Optional steps:
    1. Whilst you’ve got the server running, add your SSH key to the image so that you can connect to it easily once it is up on EC2. For more information about SSH keys, see my previous blog post here, and a step-by-step for doing it on SampleApp here.
    2. Disable non-SSH key logins (in /etc/ssh/sshd_config, set PasswordAuthentication no and PubkeyAuthentication yes), so that your server once on EC2 is less vulnerable to attack. Particularly important if you’re using the stock image with Admin123 as the root password.
    3. Set up screen, and set up OBIEE and the database as Linux services – both covered in my article here.
  4. Shut down the instance from a Terminal window, as shown in the sketch below.
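As a sketch of steps 2 and 4 above (the GRUB config path shown is the usual one for Oracle Linux 6, so check it on your own VM):

```bash
# Step 2: open the GRUB configuration as root (use vi instead of gedit if you prefer)
sudo gedit /boot/grub/grub.conf
# ...then change the line "default=3" to "default=0" so the
# Red Hat Compatible Kernel is booted instead of the UEK one, and save.

# Step 4: shut the VM down cleanly once you're finished
sudo shutdown -h now
```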

Export the VirtualBox VM to Amazon EC2

Now we’re ready to really get going. The first step is to export the VirtualBox VM to a format that Amazon EC2 can work with. Whilst they don’t explicitly support VMs from VirtualBox, they do support the VMDK format – which VirtualBox can create. You can do the export from the graphical interface, or as before, from the command line:
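For example, something along these lines (the VM and output names are illustrative):

```bash
# Export the VirtualBox VM to an OVF descriptor plus VMDK disk, timing the operation
time VBoxManage export "SampleAppv406" --output SampleAppv406-export.ovf
```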

If you compare the result of this to what we downloaded from Oracle it looks pretty similar – an OVF file and a VMDK file. The only difference is that the VMDK file is updated with the changes we made above, including the modified kernel settings which are crucial for the success of the next step.

We’re ready now to get all cloudy. For this, you’ll need:

  1. An AWS account
    1. You’ll also need your AWS account’s Access Key and Secret Key
  2. AWS EC2 command line tools installed, along with a Java Runtime Environment (JRE) 1.7 or greater (see the sketch after this list)

  3. Your AWS credentials set as environment variables (also covered in the sketch below)
  4. Ideally a nice fat pipe to upload the VM file over, because at 30GB it is not trivial (not in 2014, anyway)
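A sketch of items 2 and 3 – the install paths are assumptions, and the key values are obviously your own:

```bash
# Assumes the EC2 API tools zip has been unpacked to /opt/ec2-api-tools
export JAVA_HOME=/usr/lib/jvm/jre          # adjust to wherever your JRE 1.7+ lives
export EC2_HOME=/opt/ec2-api-tools
export PATH=$PATH:$EC2_HOME/bin

# Credentials picked up by the ec2-* commands
export AWS_ACCESS_KEY=AKIAXXXXXXXXXXXXXXXX
export AWS_SECRET_KEY=XXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXXX
```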

What’s going to happen now is we use an EC2 command line tool to upload our VMDK (virtual disk) file to Amazon S3 (a storage platform), from where it gets converted into an EBS volume (Elastic Block Store, i.e. an EC2 virtual disk), and from there attached to a new EC2 instance (a “server”/”VM”).

Before we can do the upload we need an S3 “bucket” to put the disk image in that we’re uploading. You can create one from https://console.aws.amazon.com/s3/. In this example, I’ve got one called rmc-vms – but you’ll need your own.

Once the bucket has been created, we build the command line upload statement using ec2-import-instance:
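A sketch of what that command looks like – the VMDK filename is an assumption based on the export step above, and the bucket, instance type and region are the ones discussed below:

```bash
# Upload the exported disk image to S3 and kick off the import/conversion
time ec2-import-instance SampleAppv406-export-disk1.vmdk \
  -f VMDK -t m3.large -a x86_64 -p Linux \
  -b rmc-vms --region eu-west-1 \
  -o $AWS_ACCESS_KEY -w $AWS_SECRET_KEY
```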

Points to note:

  • m3.large is the spec for the VM. You can see the available list here. In the AWS blog post it suggests only a subset will work with the import method, but I’ve not hit this limitation yet.
  • region is the AWS Region in which the EBS volume and EC2 instance will be built. I’m using eu-west-1 (Ireland), and it makes sense to use the one geographically closest to where you or your users are located. Still waiting for uk-yorks-1.
  • architecture and platform relate to the type of VM you’re importing.

The upload process took just over 45 minutes for me, and that’s from a data centre with a decent upload speed.

Once the upload has finished Amazon automatically converts the VMDK (now residing on S3) into an EBS volume, and then attaches it to a new EC2 instance (i.e. a VM). You can monitor the status of this task using ec2-describe-conversion-tasks, optionally filtered on the TaskId returned by the import command above:
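For example (the TaskId shown here is illustrative):

```bash
# Check on a specific import task...
ec2-describe-conversion-tasks --region eu-west-1 import-i-ffvko9js

# ...or poll all conversion tasks every 60 seconds (see the note on watch below)
watch -n 60 ec2-describe-conversion-tasks --region eu-west-1
```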

Now is an ideal time to mention, as a side note, the Linux utility watch, which simply re-issues a command for you every x seconds (2 by default). This way you can leave a window open and keep an eye on the progress of what is going to be a long-running job.

And whilst we’re at it, if you’re using a remote server to do this (as I am, to take advantage of the large bandwidth), you will find screen invaluable for keeping tasks running and being able to reconnect at will. You can read more about screen and watch here.

So back to our EC2 import job. To start with, the task will show as Pending (NB: unlike lots of CLI tools, you read the output of this one left-to-right, rather than as columns with headings).

After a few moments it gets underway, and you can see a Progress percentage indicator towards the right-hand end of the task output.

Note that at this point you’ll also see an Instance in the EC2 list, but it won’t launch (no attached disk – because it’s still being imported!)

If something goes wrong you’ll see the Status as cancelled, for example when the kernel in the VM is not a supported one (such as the UEK kernel, which isn’t supported by Amazon).

After an hour or so, the task should complete.

At this point you can remove the VMDK from S3 (and should do, else you’ll continue to be charged for it), following the instructions for ec2-delete-disk-image.

Booting the new server on EC2

Go to your EC2 control panel, where you should see an instance (EC2 term for “server”) in Stopped state and with no name.

Select the instance, and click Start on the Actions menu. After a few moments a Public IP will be shown in the details pane. But, we’re not home free quite yet…read on.

Firewalls

So this is where it gets a bit tricky. By default, the instance will have launched with Amazon’s Firewall (known as a Security Group) in place which – unless you have an existing AWS account and have modified the default security group’s configuration – is only open on port 22, which is for ssh traffic.

You need to head over to the Security Group configuration page, which can be accessed in several ways; the easiest is clicking on the security group name from the instance details pane.

Click on the Inbound tab and then Edit, and add “Custom TCP Rule” for the following ports:

  • 7780 (OBIEE front end)
  • 7001 (WLS Console / EM)
  • 5902 (oracle VNC)

You can make things more secure by allowing access to the WLS admin (7001) and VNC port (5902) to a specific IP address or range only.

Whilst we’re talking about security, your server is now open to the internet and all the nefarious persons out there, so you’ll want to harden your server, not least by resetting all the passwords to ones that aren’t publicly documented in the SampleApp user documentation!

Once you’ve updated your Security Group, you can connect to your server! If you installed the OBIEE and database auto start scripts (and if not, why not??) you should find OBIEE running just nicely on http://[your ip]:7780/analytics – note that the port is 7780, not 9704.


If you didn’t install the script, you will need to start the services manually per the SampleApp documentation. To connect to the server you can ssh (using Terminal, PuTTY, etc) to the server or connect on VNC (Admin123 is the password). For VNC clients try Screen Share on Macs (installed by default), or RealVNC on Windows.

Caveats & Disclaimers

  • Running a server on AWS EC2 costs real money, so watch out. Once you’ve put your credit card details in, Amazon will continue to charge your card whilst there are chargeable items on your account (EBS volumes, instances – running or not – and so on). You can get an idea of the scale of charges here.
  • As mentioned above, a server on the open internet is a lot more vulnerable than one virtualised on your local machine. You will get poked and probed, usually by automated scripts looking for open ports, weak passwords, and so on. SampleApp is designed to open the toybox of a pimped-out OBIEE deployment to you; it is not “hardened”, and you risk learning the hard way about the need for hardening if you’re not careful.

Cloning

Amazon EC2 supports taking a snapshot of a server, either for backup/rollback purposes or spinning up as a clone, using an Amazon Machine Image (AMI). From the Instances page, simply select “Create an Image” to build your AMI. You can then build another instance (or ten) from this AMI as needed, exact replicas of the server as it was at the point that you created the image.

Lather, Rinse, and Repeat

There’s a whole host of VirtualBox “appliances” out there, and some of them, such as the developer-tools-focused ones, only really make sense as local VMs. But there are plenty that would benefit from a bit of “Cloud-isation”, where they’re too big or heavy to keep on your laptop all the time, but are handy to be able to spin up at will. A prime example of this for me is the EBS Vision demo database that we use for our BI Apps training. Oracle used to provide a pre-built Amazon image (known as an AMI) of this, but has since withdrawn it. However, Oracle do publish Oracle VM VirtualBox templates for EBS 12.1.3 and 12.2.3 (related blog), so from this, with a bit of leg-work and a big upload pipe, it’s a simple matter to brew your own AWS version of it — ready to run whenever you need it.

Sunday Times Tech Track 100

September 9th, 2014

Over the weekend, Rittman Mead was listed in the Sunday Times Tech Track 100. We are extremely proud to get recognition for the business as well as our technical capability and expertise.

A lot of the public face of Rittman Mead focuses on the tools and technologies we work with. Since day one we have had a core policy to share as much information as possible. Even before the advent of social media, we shared pretty much everything we knew through either our blog or by speaking at conferences, but we very rarely talk about the business itself. However, a lot of the journey we have gone through over the last 7 years has been about the growth and maintenance of a successful, sustainable, multi-national business. We have been able to talk about, educate and evangelise about the tools and technologies as a result of having the successful business to support this.

I remember during one interview we did several years ago the candidate asked (and I’m paraphrasing): “How do you guys make any money, all I see/read is people sitting in airports writing blog posts about leading edge technologies?”.

One massive benefit from this is that we often face the same problems (albeit on a different scale) to those that we talk about with customers, so we have been able to better understand the underlying drivers and proposed solutions for our clients.

From a personal point of view, this has meant spending a lot more time looking at contracts as opposed to code and reading business books/blogs as opposed to technical ones. However, it has been well worth it and I would like to say thanks to all of those both inside and outside of the company who have helped contribute to this success.

Analyzing Twitter Data using Datasift, MongoDB, Hive and ODI12c

September 8th, 2014

Last week I posted an article on the blog around analysing Twitter data using Datasift, MongoDB and Pig, where I used the Datasift service to stream tweets about Rittman Mead into a MongoDB NoSQL database, and then queried the dataset using Pig. The context for this is the idea of a “data reservoir”, where we supplement the more traditional file and relational datasets we find in data warehouses with other data, typically machine generated, unstructured or very low-level, to add context to the numbers in our reporting system. In the example I quoted in the article, it’d be very interesting to take the activity we record against our blog and website and correlate that with the “conversation” that happens about it in the social media world; for example, were the hits for a particular article due to it being mentioned in a tweet, and did a spike in activity correspond to a particularly influential Twitter user retweeting something we’d tweeted?


In that previous article I’d used Pig to access and analyse the data, in part because I saw a match between the nested datasets in a typical DataSift Twitter message and the relations, tuples and bags you get in a Pig schema. For example, if you look at the Tweet from Borkur in the screenshot below from RoboMongo, a Mac OS X client for MongoDB that I’ve found useful, you can see the author details nested inside the interaction details, and the Type attribute having many values under the Trends parent attribute – these map well onto Pig tuples and bags respectively.

(Screenshot: a DataSift tweet document viewed in RoboMongo, with the author details nested inside the interaction details.)

What I’d really like to do with this dataset, though, is to take certain elements of it and use that to supplement the data I’m loading using ODI12c. Whilst ODI can run arbitrary R, Pig and shell scripts using the ODI Procedure feature (as I did here to make use of Sqoop, before Oracle added Sqoop KMs to ODI12.1.3), it gets the best out of Hadoop when it can access data using Hive, the SQL layer over Hadoop that represents HDFS data as rows and columns, and allows us to SELECT and INSERT data using SQL commands – or to be precise, a dialect of SQL called HiveQL. But how will Hive cope with the nested and repeating data structures in a DataSift Twitter message, and allow us to get just the data out that we’re interested in?

In fact, the MongoDB connector for Hadoop that I used for Pig the other day also comes with Hive connectivity, in the form of a SerDe that lets Hive report against data in a MongoDB database (David Allen blogged about another MongoDB Hive storage handler a while ago, in an article about MongoDB and ODI). What’s more, this Hive connector for MongoDB is actually easier to work with than the Pig connector, as instead of worrying about Tuples and Bags you can just pick out the nested attributes that you’re interested in using a dot notation. For example, if I’m only interested in the InteractionID, username, tweet content and number of followers within a particular Twitter dataset, I can create a table that looks like this in Hive:
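As a sketch of the kind of DDL involved – the column names, the nested field paths in mongo.columns.mapping, the host and the database/collection name are all assumptions based on the DataSift document structure described above, so check them against your own collection:

```bash
cat > /tmp/rm_tweets.hql <<'EOF'
CREATE EXTERNAL TABLE rm_tweets (
  interaction_id   string,
  username         string,
  content          string,
  author_followers int
)
STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
WITH SERDEPROPERTIES('mongo.columns.mapping'='{"interaction_id":"interactionId","username":"interaction.interaction.author.username","content":"interaction.interaction.content","author_followers":"twitter.user.followers_count"}')
TBLPROPERTIES('mongo.uri'='mongodb://your-mongodb-host:27017/datasift.rm_tweets');
EOF
hive -f /tmp/rm_tweets.hql
```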

And at that point, it’s pretty easy to bring the dataset into ODI12c, through the IKM Hive to Hive Control Append knowledge module, and join up the Twitter dataset with the website log data that’s coming in via Flume. ODI can connect to Hive via JDBC drivers supplied with CDH4/5, and once you register the Hive connection and reverse-engineer the Hive metastore metadata into ODI’s repository, the complexity of the underlying Hive storage is hidden and you’re just presented with tables and columns, just like any other datastore type.


Starting with the Twitter data first, I create a Hive table outside of ODI that returns the precise set of tweet attributes that I’m interested in, and then filter that dataset down to just the tweets that link to content on our website, by filtering on the tweet link’s URL matching the start of our website address.


Then I load up the hits from the Rittman Mead website, previously landed into Hadoop using Flume and exposed to ODI as another Hive table, filter out all the non-blog page accesses and keep just the URL part of the Apache weblog request field, removing the transport mechanism and other bits around it.


Then, I use a final ODI mapping to join the two datasets together, using ODI’s ability to apply HiveQL expressions to the incoming datasets so that they’ve got the same format – trailing ‘/‘ at the end of the URL, no ampersand and query text at the end of the URL, and so on. Both this and the previous transformation are great examples of where ODI can help with this sort of work, making it pretty easy to munge and correct your data so that you’re then able to match up the two different sources.


Then it’s just a case of creating a package or load plan to sequence the mappings, and then run them using the local or standalone agent. You can see the individual KM steps running on the left-hand side, with ODI generating HiveQL queries which in turn are translated into MapReduce and run in parallel across the Hadoop cluster.


And then, at the end of the process, I’ve got a Hive table of all of our blog articles that have been mentioned on Twitter (since we started consuming the tweet feed, a day or so ago), with the number of page requests and the number of times that page got mentioned in tweets.


Obviously there’s a lot more we can do with this; we can access the number of followers each Twitter user has, along with their location, gender and the sentiment (positive, negative, neutral) of the tweet. From that we can work out some measure of the impact of the Twitter activity, and we can also add to it data from other sources such as Facebook, LinkedIn and so on to get a fuller picture of the activity around our site. Then, the data we’re gathering in can either be left in MongoDB, or I can use these ODI mappings to either archive it in Hive tables, or export the highlights out to Oracle Database using Sqoop or Oracle Loader for Hadoop.

Analyzing Twitter Data using Datasift, MongoDB and Pig

September 5th, 2014

If you followed our recent postings on the updated Oracle Information Management Reference Architecture, one of the key concepts we talk about is the “data reservoir”. This is a pool of additional data that you can add to your data warehouse, typically stored on Hadoop or NoSQL databases, where you store unstructured, semi-structured or unprocessed structured data in large volume and at low cost. Adding a data reservoir gives you the ability to leverage the types of data sources that were previously thought of as too “messy”, too high-volume to store for any serious amount of time, or that require processing or storing by tools that aren’t in the usual relational data warehouse toolset.


By formally including them in your overall information management architecture though, with common tools, security and data governance over the entire dataset, you give your users the ability to consider the whole “360-degree view” of their customers and their interactions with the market.

To take an example, a few weeks ago I posted a series of articles on the blog where I captured user activity on our website, http://www.rittmanmead.com, transported it to one of our Hadoop clusters using Apache Flume, and then analysed it using Hive, Pig and finally Spark. In one of the articles I used Pig and a geocoding API to determine the country that each website visitor came from, and then in a final five-part series I automated the whole process using ODI12c and then copied the final output tables to Oracle using Oracle Loader for Hadoop. This is quite a nice example of ETL-offloading into Hadoop, with an element of Hadoop-native event capture using Flume, but once the processing has finished the data moves out of Hadoop and into the Oracle database.


What would be interesting though would be to start adding data into Hadoop that’s permanent, not transitory as part of an ETL process, to start building out this concept of the “data reservoir”. Taking our website activity dataset, something that would really add context to the visits to our site would be corresponding activity on social networks, to see who’s linking to our posts, who’s discussing them, whether those discussions are positive or negative, and which wider networks those people belong to. Twitter is a good place to start with this as it’s the place we see our articles and activities most discussed, but it’d be good to build out this picture over time to add in activity on social networks such as Facebook, YouTube, LinkedIn and Google+; if we did this, we’d be able to consider a much broader and richer picture when looking at activity around Rittman Mead, potentially correlating activity and visits to our website with mentions of us in the press, comments made by our team and the wider picture of what’s going on in our world.


There are a number of ways you can bring Twitter data into your Hadoop cluster or data warehouse, but the most convenient way we’ve found is to use DataSift, a social media aggregation site and service that licenses raw feeds from the likes of Twitter, Facebook, WordPress and other social media platforms, enhances the data feeds with sentiment scores and other attributes, and then sells access to the feeds via a number of formats and APIs. Accessing Twitter data through DataSift costs money, particularly if you want to go back and look at historical activity vs. just filtering on a few keywords in new Twitter activity, but they’re very developer-friendly and able to provide greater volumes of firehose activity than the standard Twitter developer API allows.

So assuming you can get access to a stream of Twitter data on a particular topic – in our case, all mentions of our website, our team’s Twitter handles, retweets of our content etc – the question then becomes one of how to store the data. Looking at the Datasift Sample Output page, each of these streams delivers its payload as JSON documents: nested structures that group categories of tweet metadata within parent structures that make up the total tweet data and metadata dataset.


And there’s a good reason for this; individual tweets might not use every bit of possible tweet metadata, for example not including entries under “mentions” or “retweets” if those aren’t used in a particular message. Certain bits of metadata might be repeated any number of times – @ mentions, for example – and the JSON document might have a different structure altogether if a different JSON schema version is used for a particular tweet. Altogether not an easy type of data structure for a relational database to hold – though Oracle 12.1.0.2.0 has just introduced native JSON support to the core Oracle database – but NoSQL databases in contrast find these sorts of data structures easy, and one of the most popular for this type of work is MongoDB.

MongoDB is an open-source “document” database that’s probably best known to the Oracle world through this internet cartoon; what the video is getting at is NoSQL advocates recommending databases such as MongoDB for large-scale web work when something much more mainstream like mySQL would do the job better. But where NoSQL and document-style databases come into their own is in storing just this type of semi-structured, schema-on-read dataset. In fact, Datasift support MongoDB as an API end-point for their Twitter feed, so let’s go ahead and set up a MongoDB database, prepare it for the Twitter data, and then set up a Datasift feed into it.

MongoDB installation on Linux, for example to run alongside a Hadoop installation, is pretty straightforward and involves adding a YUM repository and then running “sudo yum install mongodb-org” (there’s an OS X installation too, but I wanted to run this server-side on my Hadoop cluster). Once you’ve installed the MongoDB software, you start the mongod service to enable the server element, and then log into the mongo command-shell to create a new database.
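A sketch of that install on a RHEL-ish Linux host (the repository definition shown is the 2014-era MongoDB repo, so verify it against the current MongoDB documentation):

```bash
# Add the MongoDB yum repository
sudo tee /etc/yum.repos.d/mongodb-org.repo <<'EOF'
[mongodb-org]
name=MongoDB Repository
baseurl=http://downloads-distro.mongodb.org/repo/redhat/os/x86_64/
gpgcheck=0
enabled=1
EOF

# Install the server and tools, start the server, then open the mongo shell
sudo yum install -y mongodb-org
sudo service mongod start
mongo
```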

MongoDB, being a schema-on-read database, doesn’t require you to set up a database schema up-front; instead, the schema comes from the data you load into it, with MongoDB’s equivalent of tables called “collections”, and with those collections made up of documents, analogous to rows in Oracle. Where it gets interesting though is that collections and databases only get created when you first start using them, and individual documents can have slightly, or even completely, different schema structures to each other – which makes them ideal for holding the sorts of datasets generated by Twitter, Facebook and DataSift.

Let’s create a couple of simple documents, and then add those to a collection. Note that the document becomes available just by declaring it, as does the collection when I add documents to it. Note also that the query language we’re using to work with MongoDB is Javascript, again making it particularly suited to JSON documents, and web-type environments.
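Something like the following, run from the mongo shell – the database, collection and document contents here are just illustrative:

```bash
mongo datasift <<'EOF'
// first document: a couple of simple fields
db.test_collection.insert({ "title" : "first document", "score" : 1 })

// second document: deliberately a different shape to the first
db.test_collection.insert({ "title" : "second document", "score" : 2, "tags" : [ "nosql", "mongodb" ] })

// the collection now exists, simply because we put documents into it
db.test_collection.find()
EOF
```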

And note also how the second entry (document) in the collection has a different schema to the entry above it – perfect for our semi-structured Twitter data, and something we could store as-is in MongoDB in this loose data format and then apply more formal structures and schemas to when we come to access the data – as we’ll do in a moment using Pig, and more formally using ODI and Hive in the next article in this series.

Setting up the Twitter feed from DataSift is a two-stage process, once you’ve got an account with them and an API key: first you define your search terms against a nested document model for the data source, then you activate the feed, in this case into my MongoDB database, and wait for the tweets to roll in. For my feed I selected tweets written by myself and some of the Rittman Mead team, tweets mentioning us, and tweets that included links to our blog in the main tweet contents (there’s also a graphical query designer, but I prefer to write the queries by hand using what DataSift call their “curated stream definition language”, or CSDL).


You can then preview the feed, live, or go back and sample historic data if you’re interested in loading old tweets, rather than incoming new ones. Once you’re ready you then need to activate the feed, in my case by calling a URL using CURL with a bunch of parameters (our API key and other sensitive data has been masked):
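Very much a sketch rather than a copy of the original command: the endpoint, header and output_params names below are my recollection of the 2014-era DataSift Push API, so treat every one of them as an assumption and check the DataSift documentation:

```bash
# Activate the DataSift push subscription, delivering matching tweets into MongoDB
curl -X POST 'https://api.datasift.com/v1/push/create' \
  -H 'Auth: your_datasift_username:your_api_key' \
  -d 'name=rm_tweets_to_mongodb' \
  -d 'hash=your_stream_hash' \
  -d 'output_type=mongodb' \
  -d 'output_params.host=your-mongodb-host' \
  -d 'output_params.port=27017' \
  -d 'output_params.db_name=datasift' \
  -d 'output_params.collection_name=rm_tweets'
```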

The “hash” in the parameter list is the specific feed to activate, and the output type is MongoDB. The collection name is new, and will be created by MongoDB when the first tweet comes in; let’s run the curl command now and sit back for a while, and wait for some twitter activity to arrive in MongoDB …

… and a couple of hours later, eight tweets have been captured by the DataSift filter, with the last of them being one from Michael Rainey about his trip tonight to the Seahawks game.

If you’ve not looked at Twitter metadata before, it’s surprising how much metadata accompanies what’s ostensibly a 140-character tweet. As well as details on the author, where the tweet was sent from, what Twitter client sent the tweet and details of the tweet itself, there are details and statistics on the sender, the number of followers they’ve got and where they’re located, a list of all other Twitter users mentioned in the tweet and any URLs and images referenced.

Not every tweet will use every element of metadata, and some tweets will repeat certain attributes – other Twitter users you’ve mentioned in the tweet, for example – as many times as there are mentions. Which makes Twitter data a prime candidate for analysis using Pig and Spark, which handle easily the concept of nested data structures, tuples (ordered lists of data, such as attribute sets for an entity such as “author”), and bags (sets of unordered attributes, such as the list of @ mentions in a tweet).

There’s a MongoDB connector for Hadoop on GitHub which allows MapReduce to connect to MongoDB databases, running MapReduce jobs on MongoDB storage rather than HDFS (or S3, or whatever). This gives us the ability to use languages such as Pig and Hive to filter and aggregate our MongoDB data rather than using MongoDB’s JavaScript API, which isn’t as fully-featured and scalable as MapReduce and has limitations in terms of the number of documents you can include in aggregations. Let’s start, then, by connecting Pig to our MongoDB database and reading in the documents with no Pig schema applied:
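A sketch of that first Pig session – the jar locations and the datasift.rm_tweets database/collection name are assumptions, so substitute your own:

```bash
cat > /tmp/tweets_raw.pig <<'EOF'
-- register the mongo-hadoop connector and MongoDB Java driver jars
REGISTER /usr/lib/mongo-hadoop/mongo-hadoop-core.jar;
REGISTER /usr/lib/mongo-hadoop/mongo-hadoop-pig.jar;
REGISTER /usr/lib/mongo-hadoop/mongo-java-driver.jar;

-- load the whole collection, with no Pig schema applied
tweets = LOAD 'mongodb://your-mongodb-host:27017/datasift.rm_tweets'
         USING com.mongodb.hadoop.pig.MongoLoader();

-- how many documents are in there?
tweets_grp = GROUP tweets ALL;
tweets_cnt = FOREACH tweets_grp GENERATE COUNT(tweets);
DUMP tweets_cnt;
EOF
pig /tmp/tweets_raw.pig
```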

So there are nine tweets in the MongoDB database now. Let’s take a look at one of the documents by creating a Pig alias containing just a single record.
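For example (same registration and connection assumptions as the previous sketch):

```bash
cat > /tmp/tweet_one.pig <<'EOF'
REGISTER /usr/lib/mongo-hadoop/mongo-hadoop-core.jar;
REGISTER /usr/lib/mongo-hadoop/mongo-hadoop-pig.jar;
REGISTER /usr/lib/mongo-hadoop/mongo-java-driver.jar;

tweets = LOAD 'mongodb://your-mongodb-host:27017/datasift.rm_tweets'
         USING com.mongodb.hadoop.pig.MongoLoader();

-- keep just one document and dump it to the console
tweets_limit = LIMIT tweets 1;
DUMP tweets_limit;
EOF
pig /tmp/tweet_one.pig
```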

And there’s Michael’s tweet again, with all the attributes from the MongoDB JSON document appended together into a single record. But in this format the data isn’t all that useful as we can’t easily access individual elements in the Twitter record; what would be better would be to apply a Pig schema definition to the LOAD statement, using the MongoDB document field listing that we saw when we displayed a single record from the MongoDB collection earlier.

I can start by referencing the document fields that become simple Pig datatypes; ID and interactionId, for example:
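Along these lines – the schema string passed to MongoLoader is an assumption about which top-level fields exist in the DataSift document:

```bash
cat > /tmp/tweet_fields.pig <<'EOF'
REGISTER /usr/lib/mongo-hadoop/mongo-hadoop-core.jar;
REGISTER /usr/lib/mongo-hadoop/mongo-hadoop-pig.jar;
REGISTER /usr/lib/mongo-hadoop/mongo-java-driver.jar;

-- supply a Pig schema for just the simple, top-level fields we want
tweets = LOAD 'mongodb://your-mongodb-host:27017/datasift.rm_tweets'
         USING com.mongodb.hadoop.pig.MongoLoader('id:chararray, interactionId:chararray');
DUMP tweets;
EOF
pig /tmp/tweet_fields.pig
```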

Where the MongoDB document has fields nested within other fields, you can reference these as a tuple if they’re a set of attributes under a common header, or a bag if they’re just a list of values for a single attribute. For example, the “username” field is contained within the author tuple, which in turn is contained within the interaction tuple, so to count tweets by author I’d need to first flatten the author tuple to turn its fields into scalar fields, then project out the username and other details; then I can group the relation in the normal way on those author details, and generate a count of tweets, like this:
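A sketch of that flatten-then-group pattern; the nesting declared in the schema string (interaction containing author, containing username) is an assumption based on the document structure shown earlier:

```bash
cat > /tmp/tweet_authors.pig <<'EOF'
REGISTER /usr/lib/mongo-hadoop/mongo-hadoop-core.jar;
REGISTER /usr/lib/mongo-hadoop/mongo-hadoop-pig.jar;
REGISTER /usr/lib/mongo-hadoop/mongo-java-driver.jar;

tweets = LOAD 'mongodb://your-mongodb-host:27017/datasift.rm_tweets'
         USING com.mongodb.hadoop.pig.MongoLoader(
           'id:chararray, interaction:tuple(author:tuple(username:chararray, name:chararray))');

-- flatten the nested author tuple into scalar fields
authors = FOREACH tweets GENERATE id, FLATTEN(interaction.author) AS (username, name);

-- group on the author details and count tweets per author
author_grp    = GROUP authors BY (username, name);
author_counts = FOREACH author_grp GENERATE FLATTEN(group) AS (username, name), COUNT(authors) AS tweet_count;
DUMP author_counts;
EOF
pig /tmp/tweet_authors.pig
```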

So there’s obviously a lot more we can do with the Twitter dataset as it stands, but where it’ll get really interesting is combining this with other social media interaction data – for example from Facebook, LinkedIn and so on – and then correlating that with our main site activity data. Check back in a few days when we’ll be covering this second stage in a further blog article, using ODI12c to orchestrate the process.

Oracle 12.1.0.2 and Data Warehouses

September 1st, 2014

If you follow Blogs and Tweets from the Oracle community you won’t have missed hearing about the recent release of the first patch-set for Oracle 12c. With this release there are some significant pieces of new functionality that will be of interest to Data Warehouse DBAs and architects. The headline feature that most Oracle followers will know of is the new in-memory option. In my opinion this is a game-changer for how we design reporting architectures; it gives us an effective way to build operational reporting over the reference data architecture described by Mark Rittman a few weeks ago. Of course, the database team here at Rittman Mead have been rolling up our sleeves and getting into in-memory technology for quite a while now, Mark even featured in the official launch presentation by Larry Ellison with the now famous “so easy it’s boring” quote. Last week Mark published the first of our Rittman Mead in-memory articles, with the promise of more in-memory articles to come including my article for the next edition of UKOUG’s “Oracle Scene”.

However, the in-memory option is not the only new feature that is going to be a benefit to us in the BI/DW world. One of the new features I am going to describe is Exadata only, but the first one I am going to mention is generally available in the 12.1.0.2 database.

Typically, data warehouse queries are different from those seen in the OLTP world – in DW we tend to access a large number of rows and probably aggregate things up to answer some business question. Often we are not using indexes; instead, scanning tables or table partitions is the norm. Usually, the data we need to aggregate is widely scattered across the table or partition. Data warehouse queries often look at records that share a set of common attributes; we look at the sales for the ‘ACME’ widget or the value of items shipped to Arizona. For us there can be great advantage if data we use together is stored together, and this is where Attribute Clustering can play a part.

Attribute Clustering is usually configured on the table at DDL time and in effect controls the ordering of data inserted by DIRECT PATH operations; Oracle does not enforce this ordering for conventional inserts. This may not be an issue in data warehouses, as bulk-batch operations typically use APPEND inserts (which are direct path inserts) or partition operations, but it may be more of an issue with some of the real-time, conventional path loading paradigms. In addition to Direct Path load operations, Attribute Clustering can also occur when you do ALTER TABLE MOVE type operations (this also includes operations such as PARTITION SPLIT). On the surface, Attribute Clustering sounds little different to using an ORDER BY on an append insert and hoping that Oracle actually stores the data where you expect it to. However, Attribute Clustering gives us two other possibilities in how we can order the data in the cluster.

Firstly, we can cluster on columns from JOINED dimension tables. For example, in a SALES DW we may have a sales fact with a product key at the SKU level, but we often join to the product dimension and report at the Product Category level. In this case we can cluster our sales fact table so that each product category appears in the same cluster. For example, suppose we have just opened a chain of supermarkets with a wide but uninspiring range of brands and products (see the tiny piece of our product dimension table below).

(Screenshot: a sample of the product dimension table.)

As you can see, our Product PK has no relationship at all to the type of product being sold. In our Kimball-style data warehouse we typically store the product key on the fact table and join to the product dimension to obtain all of the other product attributes and hierarchy members. This is essentially what we can do with join Attribute Clustering: in our example we can cluster our fact table on PRODUCT_CATEGORY so that all of the Laundry sales are physically close to each other in the fact table.
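A hedged sketch of the sort of DDL being described – the table and column names are illustrative rather than taken from the original example, and the product dimension needs a primary or unique key on the join column for the join clustering to be accepted:

```bash
sqlplus dw_user/dw_password@pdb1 <<'EOF'
CREATE TABLE sales (
  sales_date  DATE    NOT NULL,
  product_pk  NUMBER  NOT NULL REFERENCES products (product_pk),
  store_id    NUMBER  NOT NULL,
  quantity    NUMBER,
  amount      NUMBER
)
CLUSTERING sales JOIN products ON (sales.product_pk = products.product_pk)
BY LINEAR ORDER (sales_date, products.product_category, store_id)
YES ON LOAD YES ON DATA MOVEMENT;
EOF
```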

Notice we are clustering on a join to the product dimension table’s “product_category” column; we are also clustering on sales_date, which is especially important in the case of partitioned fact tables so that the benefits of clustering align to the partitioning strategy. We are also not restricted in our clustering to just one join: if we wanted to, we could also cluster our sales by store region, e.g. so that the Colorado laundry product sales are located in the same area of the sales table. To use join Attribute Clustering we need to define the PK/FK relationships between fact and dimension, but it is always good practice to have that in place anyway as it helps the CBO so much with query plan evaluation.

Secondly, notice the BY LINEAR ORDER clause in the table DDL. Of the two ordering options, Linear Order is the most basic form of clustering; in this case we have our data structured so that all the items for a sales day are clustered together, within that cluster we order by product category, and those categories are in turn ordered by store_id. The other way we can cluster is BY INTERLEAVED ORDER; here, Oracle maps a combination of dimensional values to a single cluster value using a z-order curve fitting approach. This sounds complex, but it ensures that items that are frequently queried together are co-located in the disk blocks in the storage.

Interleaved ordering is probably the best choice for data warehousing as it aligns well with how we access data in our queries. Although we could include all of the dimension keys in our ordering list, it is going to be of more benefit to include just a subset of dimensions; typically for retail I’d go with DATE (or something that correlates to the time-based partition key of the fact table), the product and the store. Of course we can again join to the dimension tables and cluster at higher hierarchy levels such as product category and store region. The Oracle 12c Data Warehousing guide gives some good advice, but you can’t go far wrong if you think about clustering items together that will be queried together.

Clustering data can give us some advantages in performance – better data compression and improved index range scans spring to mind – but to get the most benefit we should also look at another new feature, zone maps. Unlike Attribute Clustering, Zone Maps are Engineered Systems only. In a way they are similar to the storage indexes already found on Exadata, but they have some additional advantages; they are also somewhat different from the zone maps encountered in other DB vendors’ products such as Netezza.

In Exadata, a storage index can provide the maximum and minimum values encountered for a column in a storage cell. I say “can” as there is no guarantee that the range for a given column is held in the storage index. Zone Maps on the other hand will always provide maxima and minima for all of the columns specified at zone map creation. The zone map is orientated in terms of contiguous database blocks and is materialized, so that it is physically persisted in the database and thus survives DB startups. Like materialized views, materialized zone maps can become stale and need to be maintained.

We can define a zone map on one or more table columns, and just like Attribute Clustering we may also create zone maps on table joins. As a table can only have one zone map, it is important to include all of the columns you wish to track. Zone Maps are designed to work well with attribute clustering; in fact it is just a simple DDL statement to add a zone map to an attribute-clustered table so that the zone map tracks the same attributes as the clustering. This is where we get the major performance boost from attribute clustering: instead of looking at the whole table, the zone map tells us which ranges of database blocks contain data that matches our query predicates.
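For example, given the attribute-clustered sales table sketched earlier, you can either append WITH MATERIALIZED ZONEMAP to the CLUSTERING clause itself, or create the zone map separately – again, the names here are illustrative:

```bash
sqlplus dw_user/dw_password@pdb1 <<'EOF'
-- a standalone zone map tracking the same columns we clustered on
CREATE MATERIALIZED ZONEMAP sales_zmap ON sales (sales_date, product_pk, store_id);
EOF
```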

Zone Maps with Attribute Clustering give us another powerful tool to boost DW performance on Exadata – we can do star queries without resorting to bitmap indexes, and we minimise IO when scanning fact tables as we only need look where we know the data to be. Exciting times!
