Rittman Mead announces its course catalogue in Spanish and Portuguese.

November 12th, 2014



We are pleased to announce that Rittman Mead now offers its complete course catalogue in Spanish and Portuguese. This is great news for those living in Latin America, Spain and Portugal, who can now receive the best training, including the theory and practice materials, in their local language.

Courses are delivered both remotely, as live virtual classes over our web platform, and in person on site. Either way, each student receives the course materials electronically and has exclusive access to a virtual machine for the duration of the course, on which to complete the practical exercises.

We offer a wide variety of courses on Oracle Business Intelligence and Data Warehousing technologies: from intensive five-day bootcamps to sets of specific courses oriented to particular job roles. Browse the complete catalogue in Spanish, Portuguese or English.

Interested in our courses, or would you like to become one of our training partners? Don’t hesitate to contact us at training@rittmanmead.com.

Auditing OBIEE Presentation Catalog Activity with Custom Log Filters

November 10th, 2014

A question that I’ve noticed coming up a few times on the OBIEE OTN forums goes along the lines of “How can I find out who deleted a report from the Presentation Catalog?”. And whilst the BI Server’s Usage Tracking is superb for auditing who ran what report, we don’t by default have a way of seeing who deleted a report.

The Presentation Catalog (or “Web Catalog” as it was called in 10g) records who created an object and when it was last modified, accessible through both OBIEE’s Catalog view, and the dedicated Catalog Manager tool itself:

But if we want to find out who deleted an object, or maybe who modified it before the most recent person (that is, build up an audit trail of who modified an object) we have to dig a bit deeper.

Presentation Services Log Sources

Perusing the OBIEE product manuals, one will find additional Logging in Oracle BI Presentation Services options documented. This is more than just turning up the log level en masse, because it also includes additional log writers and filters. What this means is that you can have your standard Presentation Services logging, but then configure a separate file to capture more detailed information about just specific goings-on within Presentation Services.

Looking at a normal Presentation Services log (in $FMW_HOME/instances/instance1/diagnostics/logs/OracleBIPresentationServicesComponent/coreapplication_obips1/) you’ll see various messages by default – greater or fewer depending on the health of your system – but they all include the Location stack, such as this one here:

And it is the Location that is of interest to us here, because it’s what gives hints about the types of log messages that can be emitted and that we may want to filter. For example, the one quoted above is evidently something to do with the Presentation Catalog and SOAP, which I’d guess is a result of Catalog Manager (which uses web services/SOAP to access OBIEE).

To get a full listing of all the possible log sources, first set up the BI command line environment with bi-init:
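On a default install that looks something like this (a sketch; the instance path is an assumption and will vary with your environment):

```shell
# source the script so that the BI environment variables persist in this shell
cd $FMW_HOME/instances/instance1/bifoundation/OracleBIApplication/coreapplication/setup
. ./bi-init.sh
```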

and then run:
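The listing comes from the sawserver binary itself (a sketch — verify the exact flag against your own installation’s documentation):

```shell
sawserver -logsources
```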

(If you get an error, almost certainly you didn’t set up the command line environment properly with bi-init). You’ll get a list of over a thousand lines (which gives you an idea of quite how powerful this granular logging is). Assuming you’ll want to peruse it at your leisure, it makes sense to write it to disk, which if you’re running this on *nix you can simply do thus:
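For example (a sketch; adjust the output path to taste):

```shell
# capture both stdout and stderr, then see how big the listing is
sawserver -logsources > /tmp/sawserver_logsources.txt 2>&1
wc -l /tmp/sawserver_logsources.txt
```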

To find what you want on the list, you can just search through it. Looking for anything related to “catalog” and narrowing it down further, I came up with these interesting sources:
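Based on the events discussed later in this article, the relevant sources include entries like these (the writeObject source appears below; the deleteItem/moveItem names are inferred and should be verified against your own listing):

```
saw.catalog.local.deleteItem
saw.catalog.local.moveItem
saw.catalog.local.writeObject
```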

Configuring granular Presentation Services logging

Let us see how to go and set up this additional logging. Remember, this is not the same as just going to Enterprise Manager and bumping the log level to 11 globally – we’re going to retain the default logging level, but for just specific actions that occur within the tool, capture greater information. The documentation for this is here.

The configuration is found in the instanceconfig.xml file, so like all good sysadmins let’s take a backup first:
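Something along these lines (the path is the default instance layout; a sketch):

```shell
cd $FMW_HOME/instances/instance1/config/OracleBIPresentationServicesComponent/coreapplication_obips1
# timestamped backup, so repeated edits don't clobber earlier copies
cp instanceconfig.xml instanceconfig.xml.bak.$(date +%Y%m%d)
```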

Now depending on your poison, open the instanceconfig.xml directly in a text editor from the command line, or copy it to a desktop environment where you can open it in your favourite text editor there. Either way, these are the changes we’re going to make:

  1. Locate the <Logging> section. Note that within it there are three child entities – <Writers>, <WriterClassGroups> and <Filters>. We’re going to add an entry to each.

  2. Under <Writers>, add:
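A hypothetical Writer entry matching the description below – 100MB (102400KB) files, keeping ten – with attribute names as per the documented Writer syntax; the name and writerClassId values are my own choices, so check them against the docs:

```xml
<Writer implementation="FileLogWriter" name="RM Granular Log Writer"
        writerClassId="6" dir="{%ORACLE_BIPS_INSTANCE_LOGDIR%}"
        maxFileSizeKb="102400" filesN="10" fmtName="ODL-Text"/>
```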

    This defines a new writer that will write logs to disk (FileLogWriter), in 100MB files of which it’ll keep 10. If you’re defining additional Writers, make sure they have a unique writerClassId. See the docs for detailed syntax.

  3. Under <WriterClassGroups> add:
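A sketch, associating the new class group with writer class 6 as described below:

```xml
<WriterClassGroup name="RMLog">6</WriterClassGroup>
```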

    This defines the RMLog class group as being associated with writerClassId 6 (as defined above), and is used in the Filters section to direct logs. If you wanted, you could direct log entries to multiple logs (eg both file and console) this way.

  4. Under <Filters> add:
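Hypothetical FilterRecords for the two catalog events we care about – the deleteItem/moveItem source names are inferred from the delete and move behaviour shown later, so verify them against your own logsources listing:

```xml
<FilterRecord writerClassGroup="RMLog" path="saw.catalog.local.deleteItem"
              information="32" warning="32" error="32" trace="32" incident_error="32"/>
<FilterRecord writerClassGroup="RMLog" path="saw.catalog.local.moveItem"
              information="32" warning="32" error="32" trace="32" incident_error="32"/>
```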

    Here we’re defining two event filters, with levels turned up to max (32), directing the capture of any occurrences to the RMLog writerClassGroup.

After making the changes to instanceconfig.xml, restart Presentation Services:
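With opmnctl, for instance (instance and component names per a default install; adjust for your environment):

```shell
$FMW_HOME/instances/instance1/bin/opmnctl restartproc ias-component=coreapplication_obips1
```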

Here’s the completed instanceconfig.xml from the top of the file through to the end of the <Logging> section, with my changes overlaid on the defaults:

Granular logging in action

Having restarted Presentation Services after making the above change, I can see in my new log file whenever an item from the Presentation Catalog is deleted, by whom, and from what IP address:

And the same for when a file is moved/renamed:

Be careful with your logging

Just because you can log everything, don’t be tempted to actually log everything. Bear in mind that we’re crossing over from simple end-user logging here into the very depths of the sawserver (Presentation Services) code, accessing logging that is extremely diagnostic in nature. That is handy for our specific purpose of tracking when someone deletes an object from the Presentation Catalog. But as an example, if you enable saw.catalog.local.writeObject event logging, you may think that it will record who changed a report and when, and that might be useful. But look at what gets logged every time someone saves a report:

It’s the whole report definition! And this is a very very small report – real life reports can be page after page of XML. That is not a good level at which to be recording this information. If you want to retain this kind of control over who is saving what report, you should maybe be looking at authorisation groups for your users in terms of where they can save reports, and have trusted ‘gatekeepers’ for important areas.

As well as the verbose report capture with the writeObject event, you also get this background chatter:

volatileuserdata is presumably just that (user data that is volatile, constantly changing) and not something anyone would be interested in logging – but you can’t capture actual report writes without capturing this too. On a busy system you’re going to be unnecessarily thrashing the log files if you capture this event routinely – so don’t!


The detailed information is there for the taking in Presentation Services’ excellent granular log sources – just be careful what you capture lest you bite off more than you can chew.

Analytics with Kibana and Elasticsearch through Hadoop – part 3 – Visualising the data in Kibana

November 4th, 2014

In this post we will see how Kibana can be used to create visualisations over the various sets of data that we have combined together. Kibana is a graphical front end for Elasticsearch, which holds the data and provides the analytic capabilities. Previously we looked at where the data came from and exposing it through Hive, and then loading it into Elasticsearch. Here’s what we’ve built so far, the borders denoting what was covered in the previous two blog articles and what we’ll cover here:


Now that we’ve got all the data into Elasticsearch, via Hive, we can start putting some pictures around it. Kibana works by directly querying Elasticsearch, generating the same kind of queries that you can run yourself through the Elasticsearch REST API (similar to what we saw when defining the mappings in the previous article). In this sense there is a loose parallel between OBIEE’s Presentation Services and the BI Server – one does the fancy front end stuff, generating queries to the hard-working backend.

I’ve been looking at both the current release version of Kibana (3.x), and also the beta of Kibana 4 which brings with it a very smart visualiser that we’ll look at in detail. It looks like Kibana 4 is a ground-up rewrite rather than modifications to Kibana 3, which means that at the moment it is a long way from parity of functionality – which is why I’m flitting between the two. For a primer in Kibana 3 and its interface see my article on using it to monitor OBIEE.

Installing Kibana is pretty easy in Kibana 3, involving a simple config change to a web server of your choice that you need to provide (details in my previous blog), and has been made even easier in Kibana 4 which actually ships with its own web server so you literally just download it, unarchive it and run it.
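In sketch form, that amounts to something like this for Kibana 4 (the archive name is illustrative and will vary by version):

```shell
tar -xzf kibana-4.0.0-beta.tar.gz
cd kibana-4.0.0-beta
./bin/kibana    # starts Kibana's bundled web server
```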

So the starting point is the assumption we have all the data in a single Elasticsearch index all_blog, with three different mappings which Kibana refers to accurately as “types”: blog posts, blog visits, and blog tweets.

Kibana 3

Starting with a simple example first, and to illustrate the “analysed” vs “non-analysed” mapping configuration that I mentioned previously, let’s look at the Term visualisation in Kibana 3. This displays the results of an Elasticsearch analysis against a given field. If the field has been marked as “not analysed” we get a listing of the literal values, ranking by the number of times they repeat. This is useful, for example, to show who has blogged the most:

But it is less useful if we want to analyse the use of words in blog titles, since with a non-analysed field we just get a listing of the literal blog titles:

(there are indeed two blog posts entitled “Odds and Ends”, from quite a while ago)

Building the Term visualisation against the post title field that has been analysed gives us a more interesting, although hardly surprising, result:

Here I’ve weeded out the obvious words that will appear all the time (‘the’, ‘a’, etc), using the Exclude Term(s) option.

Term visualisations are really useful for displaying any kind of top/bottom ranked values, and also because they are interactive – if you click on the value it is applied as a filter to the data on the page. What that means is that we can take a simple dashboard using the two Term objects above, plus a histogram of posts made over time:

And by clicking on one of the terms (for example, my name in the authors list) it shows that I only started posting on the Rittman Mead blog three years ago, and that I write about OBIEE, performance, and exalytics.

Taking another tack, we can search for any term and add it in to the histogram. Here we can see when interest in 11g (the green line), as well as big data (red), started:

Note here we’re just analysing post titles, not content, so it’s not 100% representative. Maybe loading our post contents into Elasticsearch will be my next blog post. But that does then start to get a little bit meta…

Adding in a Table view gives us the ability to show the actual posts and links to them.

Let’s explore the data a bit. Clicking on an entry in the table gives us the option to filter down further:

Here we can see for a selected blog post, what its traffic was and when (if at all) it was tweeted:

Interesting in the profile of blog hits is a second peak that looks like it might correlate with tweets. Let’s drill further by drag-clicking (brushing) on the graph to select the range we want, and bring in details of those tweets:

So this is all pretty interesting, and importantly, very rapid in terms of both the user experience and the response time.

Kibana 4

Now let’s take a look at what Kibana 4 offers us. As well as a snazzier interface (think hipster data explorer vs hairy ops guy parsing logs), its new Visualiser builder is great. Kibana 3 dumped you on a dashboard in which you have to build rows and panels and so on. Kibana 4 has a nice big “Visualize” button. Let’s see what this does for us. To start with it’s a nice “guided” build process:

By default we get a single bar, counting all the ‘documents’ for the time period. We can use the Search option at the top to filter just the ‘type’ of document we want, which in this case is going to be tweets about our blog articles.

Obviously, a single bar on its own isn’t that interesting, so let’s improve it. We’ll click the “Add Aggregation” button (even though to my pedantic mind the data is already aggregated to total), and add an X-Axis of date:

The bucket size in the histogram defaults to automatic, and the axis label tells us it’s per three hours. At the volume of tweets we’re analysing, we’d see patterns better at a higher grain such as daily (the penultimate bar to the right of the graph shows a busy day of tweets that’s lost in the graph at 3-hour intervals):

NB at the moment in Kibana 4 intervals are fixed (in Kibana 3 they were freeform).

Let’s dig into the tweets a bit deeper. Adding a “Sub Aggregation” to split the bars based on top two tweet authors per day gives us this:

You can hover over the legend to highlight the relevant bar block too:

Now with a nifty function in the Visualizer we can change the order of this question. So instead of, “by day, who were the top two tweeters”, we can ask “who were the top two tweeters over the time period, and what was their tweet count by day” – all just by rearranging the buckets/aggregation with a single click:

Let’s take another angle on the data, looking not at time but at which blog links were most tweeted, and by whom. Turns out I’m a self-publicist, tweeting four times about my OOW article. Note that I’ve also included some filtering on my data to exclude automated tweets:

Broadening out the tweets to all those from accounts we were capturing during the sample we can see the most active tweeters, and also what proportion are original content vs retweets:

Turning our attention to the blog hits, it’s easy to break it down by top five articles in a period, accesses by day:

Having combined (dare I say, mashed up) post metadata with Apache logs, we can overlay information about which author gets the most hits. Unsurprisingly Mark Rittman gets the lion’s share, but interestingly Venkat, who has not blogged for quite a while, is still in the top three authors (based on blog page hits) in the time period analysed:

It’s in the lack of a table visualisation that Kibana 4 is currently limited (although one is planned), because this analysis here (of the top three authors, and what were their respective two most popular posts) just makes no sense as a graph:

but it would be nice and easy to read off a table. You can access a table view of sorts from the arrow at the bottom of the screen, but this feels more like a debug option than an equal method for presenting the data.

Whilst you can access the table on a dashboard, it doesn’t persist as the default option of the view, always showing the graph initially. As noted above, a table visualisation is planned and under development for Kibana 4.

Speaking of dashboards, Kibana 4 has a very nice dashboard builder with interactive resizing of objects both within rows and columns – quite a departure from Kibana 3 which has a rigid system of rows and panels:


Kibana 3 is great for properly analysing data and trends as you find them in the data, if you don’t mind working your way through the slightly rough interface. In contrast, Kibana 4 has a pretty slick UI but being an early beta is missing features like Term and Table from Kibana 3 that would enable tables of data as well as the pretty graphs. It’ll be great to see how it develops.

Putting the data in Elasticsearch makes it very fast to query. I’m doing this on the Big Data Lite VM, which admittedly is not very representative of a real-world Hadoop cluster, but the relative speeds are interesting – dozens of seconds for any kind of Hive query, subsecond for any kind of Kibana/Elasticsearch query. The latter is of course very attractive from a data exploration point of view, because you have not only the speed but also the visualisations, and the interactions with those visuals, to dig and drill further into the data.

Whilst Elasticsearch is extremely fast to query, I’ve not compared it to other options that are designed for speed (eg Impala) and which support a more standard interface, such as ODBC or JDBC so you can bring your own data visualisation tool (eg T-who-shall-not-be-named). In addition, there is the architectural consideration of Elasticsearch’s fit with the rest of the Hadoop stack. Whilst the elasticsearch-hadoop connector is two-way, I’m not sure if you would necessarily site your data in Elasticsearch alone, opting instead to duplicate all or part of it from somewhere like HDFS.

What would be interesting is to look at a similar analysis exercise using the updated Hue Search in CDH 5.2, which uses Apache Solr and is therefore based on the same project as Elasticsearch (Apache Lucene). Another angle on this is Oracle’s forthcoming Big Data Discovery tool, which also looks like it covers a similar purpose.

Analytics with Kibana and Elasticsearch through Hadoop – part 2 – Getting data into Elasticsearch

November 4th, 2014


In the first part of this series I described how I made several sets of data relating to the Rittman Mead blog, from various sources, available through Hive. This included blog hits from the Apache webserver log, tweets, and metadata from WordPress. Having got it into Hive I now need to get it into Elasticsearch as a pre-requisite for using Kibana, to see how it holds up as an analysis tool or as a “data discovery” option. Here’s a reminder of the high-level architecture, with the parts that I’ve divided it into covered across the three blog posts indicated:


In this article we will see how to go about doing that load into Elasticsearch, before getting into some hands-on work with Kibana in the final article of this series.

Loading data from Hive to Elasticsearch

We need to get the data into Elasticsearch itself since that is where Kibana requires it to be for generating the visualisations. Elasticsearch holds the data and provides the analytics engine, and Kibana provides the visualisation rendering and the generation of queries into Elasticsearch. Kibana and Elasticsearch are the ‘K’ and ‘E’ of the ELK stack, which I have written about previously (the ‘L’ being Logstash, but we’re not using that here).

Using the elasticsearch-hadoop connector we can load data exposed through Hive into Elasticsearch. It’s possible to load data directly from origin into Elasticsearch (using, for example, Logstash) but here we’re wanting to bring together several sets of data using Hadoop/Hive as the common point of integration.

Elasticsearch has a concept of an ‘index’ within which data is stored, held under a schema known as a ‘mapping’. Each index can have multiple mappings. It’s dead easy to run Elasticsearch – simply download it, unpack the archive, and then run it – it really is as easy as that:
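Something like this (the version number is illustrative – use whatever is current):

```shell
tar -xzf elasticsearch-1.3.4.tar.gz
cd elasticsearch-1.3.4
./bin/elasticsearch    # listens on port 9200 by default
```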

You can load data directly across into Elasticsearch from Hive without having to prepare anything on Elasticsearch – it will create the index and mapping for you. But, for it to work how we want, we do need to specify the mapping in advance because we want to tell Elasticsearch two important things:

  • To treat the date field as a date – crucial for Kibana to do its time series-based magic
  • Not to “analyze” certain fields. By default Elasticsearch will analyze each string field so that you can display the most common terms within it, etc. However if we want to report on things like blog title, breaking it down into individual words doesn’t make sense.

This means that the process is as follows:

  1. Define the Elasticsearch table in Hive
  2. Load a small sample of data into Elasticsearch from Hive
  3. Extract the mapping and amend the date field and mark required fields as non-analysed
  4. Load the new mapping definition to Elasticsearch
  5. Do a full load from Hive into Elasticsearch

Steps 2 and 3 can be sidestepped by crafting the mapping by hand from the outset but it’s typically quicker not to.

Before we can do anything in terms of shifting data around, we need to make elasticsearch-hadoop available to Hadoop. Download it from the github site, and copy the jar file to /usr/lib/hadoop and add it to HIVE_AUX_JARS_PATH in /usr/lib/hive/conf/hive-env.sh.
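A sketch of those steps (the jar version is illustrative; match it to whatever you downloaded):

```shell
# make the connector visible to Hadoop
cp elasticsearch-hadoop-2.0.2.jar /usr/lib/hadoop/

# and register it as an auxiliary jar for Hive
cat >> /usr/lib/hive/conf/hive-env.sh <<'EOF'
export HIVE_AUX_JARS_PATH=$HIVE_AUX_JARS_PATH:/usr/lib/hadoop/elasticsearch-hadoop-2.0.2.jar
EOF
```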

Defining the Hive table over Elasticsearch

The Hive definition for a table stored in Elasticsearch is pretty simple. Here’s a basic example of a table that’s going to hold a list of all blog posts made. Note the _es suffix, a convention I’m using to differentiate the Hive table from others with the same data and denoting that it’s in Elasticsearch (es). Also note the use of EXTERNAL as previously discussed, to stop Hive trashing the underlying data if you drop the Hive table:
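A minimal sketch of such a definition – the column names and the bigdatalite hostname are illustrative assumptions; the SerDe and StorageHandler classes are those shipped with elasticsearch-hadoop:

```sql
CREATE EXTERNAL TABLE all_blog_posts_es (
  ts_epoch      BIGINT,
  post_title    STRING,
  post_title_a  STRING,
  post_author   STRING,
  url           STRING,
  post_type     STRING)
ROW FORMAT SERDE 'org.elasticsearch.hadoop.hive.EsSerDe'
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES (
  'es.nodes'    = 'bigdatalite',
  'es.resource' = 'all_blog/posts');
```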

The ROW FORMAT and STORED BY are standard, but the TBLPROPERTIES values should be explained (you’ll find full details in the manual):

  1. es.nodes – this is the hostname of the Elasticsearch server. If you have multiple nodes it will discover the others from this.
  2. es.resource – this is the index and mapping where the data should be stored. We’ll see more about these later, because they’re important.

Time for a tangent …

The biggest issue I had getting data from Hive into Elasticsearch was timestamps. To cut a very long story (involving lots of random jiggling, hi Christian!) short, I found it was easiest to convert timestamps into Unix epoch (the number of seconds since Jan 1st 1970), rather than prat about with format strings (and prat about I did). For timestamps already matching the ISO8601 standard, such as those in my WordPress data, I could leverage the Hive function UNIX_TIMESTAMP, which returns exactly that:
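For example (a sketch – the posts table and post_date column are illustrative names):

```sql
-- post_date is in 'yyyy-MM-dd HH:mm:ss' format, which UNIX_TIMESTAMP parses by default
SELECT post_title, UNIX_TIMESTAMP(post_date) AS ts_epoch
FROM   posts;
```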

For others though, which included the month name as text, such as Wed, 17 Sep 2014 08:31:20 +0000, I had to write a very kludgy CASE statement to first switch the month names for numbers and then concatenate the whole lot into an ISO8601 string that could be converted to Unix epoch. This is why I also split the apache log SerDe so that it would bring in the timestamp components (time_dayDD, time_monthMMM, etc) individually, making the epoch conversion a little bit neater:
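A sketch of what that kind of kludge looks like (the time_* column names mirror the split SerDe fields mentioned above, but are illustrative):

```sql
SELECT UNIX_TIMESTAMP(
         CONCAT(time_yearyyyy, '-',
                CASE time_monthmmm
                  WHEN 'Jan' THEN '01' WHEN 'Feb' THEN '02' WHEN 'Mar' THEN '03'
                  WHEN 'Apr' THEN '04' WHEN 'May' THEN '05' WHEN 'Jun' THEN '06'
                  WHEN 'Jul' THEN '07' WHEN 'Aug' THEN '08' WHEN 'Sep' THEN '09'
                  WHEN 'Oct' THEN '10' WHEN 'Nov' THEN '11' WHEN 'Dec' THEN '12'
                END, '-',
                time_daydd, ' ', time_hourhh, ':', time_minmm, ':', time_secss)
       ) AS ts_epoch
FROM   apache_log;
```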

Because if you thought this was bad, check out what I had to do to the twitter timestamp:

As with a few things here, this was all more for experimentation than streamlined production usage, so it could probably be rewritten more efficiently or solved in a better way – suggestions welcome!

So the nett result of all of these is the timestamp as epoch in seconds – but note that Elasticsearch works with millisecond epoch, so they all need multiplying by 1000.
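As a sanity check of both conversions, GNU date does the same thing from the shell (using the example timestamp quoted above):

```shell
# epoch seconds for the example apache log timestamp
date -d 'Wed, 17 Sep 2014 08:31:20 +0000' +%s        # 1410942680
# Elasticsearch wants millisecond epoch, hence the multiplication by 1000
echo $(( $(date -d 'Wed, 17 Sep 2014 08:31:20 +0000' +%s) * 1000 ))
```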

As I’ve noted above, this feels more complex than it needed to be, and maybe with a bit more perseverance I could have got it to work without resorting to epoch. The issue I continued to hit with passing timestamps across as non-epoch values (i.e. as strings using the format option of the Elasticsearch mapping definition, or Hive Timestamp, and even specifying es.mapping.timestamp) was org.elasticsearch.hadoop.rest.EsHadoopInvalidRequest: TimestampParsingException, regardless of the careful format masks that I applied.

Back on track – loading a sample row into Elasticsearch

We want to send a sample row of data to Elasticsearch now for two reasons:

  1. As a canary to prove the “plumbing” – no point chucking thousands of rows across through MapReduce if it’s going to fall over for a simple problem (I learnt my lesson during the timestamp fiddling above).
  2. Automagically generate the Elasticsearch mapping, which we subsequently need to modify by hand and is easier if it’s been created for us first.

Since the table is defined in Hive, we can just run a straightforward INSERT to send some data across, making use of the LIMIT clause of HiveQL to just send a couple of rows:
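Something like this (table and column names illustrative, matching the sketched DDL earlier; the title is selected twice to feed both the analysed and non-analysed fields):

```sql
INSERT INTO TABLE all_blog_posts_es
SELECT ts_epoch, post_title, post_title, post_author, url, post_type
FROM   posts
LIMIT  2;
```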

Hive will generate a MapReduce job that pushes the resulting data over to Elasticsearch. You can see the log for the job – essential for troubleshooting – at /var/log/hive/hive-server2.log (by default). In this snippet you can see a successful completion:

But if you’ve a problem with your setup you’ll most likely see this generic error instead, passed back to the beeline prompt:

Meaning that you need to go to the Hive log file for the full diagnostics.

Amending the Elasticsearch mapping

So assuming the previous step worked (if you got the innocuous No rows affected from beeline then it did), you now have an index and mapping (and a couple of “documents” of data) in Elasticsearch. You can inspect the mapping in several ways, including with kopf, a GUI admin tool for Elasticsearch.

You can also interrogate Elasticsearch directly with its REST API, which is what we’re going to use to update the mapping, so let’s also use it to view it. I’m going to use curl to do the HTTP call, and then pipe it | straight to jq to prettify the resulting JSON that Elasticsearch sends back:
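A sketch of that call (the bigdatalite hostname is an assumption; 9200 is the Elasticsearch default port):

```shell
curl -s -XGET 'http://bigdatalite:9200/all_blog/_mapping' | jq '.'
```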

We can see from this that Elasticsearch has generated the mapping to match the data that we’ve sent across from Hive (note how it’s picked up the ts_epoch type as being numeric not string, per our Hive table DDL). But, as mentioned previously, there are two things we need to rectify here:

  1. ts_epoch needs to be a date type, not long. Without the correct type, Kibana won’t recognise it as a date field.
  2. Fields that we don’t want broken down for analysis need marking as such. We’ll see the real difference that this makes when we get on to Kibana later.

To amend the mapping we just take the JSON document, make the changes, and then push it back with curl again. You can use any editor with the JSON (I’ve found Atom on the Mac to be great for its syntax highlighting, brace matching, etc). To change the type of the date field just change long to date. To mark a field not for analysis add "index": "not_analyzed" to the column definition. After these changes, the amended fields in my mapping JSON look like this:
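After those edits, the relevant fragment of the mapping might look like this (field names per my data; yours will differ):

```json
"ts_epoch":     { "type": "date" },
"post_author":  { "type": "string", "index": "not_analyzed" },
"post_title":   { "type": "string", "index": "not_analyzed" },
"post_title_a": { "type": "string" }
```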

The particularly eagle-eyed of you will notice that I am loading post_title in twice. This is because I want to use the field both as a label but also to analyse it as a field itself, looking at which terms get used most. So in the updated mapping, only post_title is set to not_analyzed; the post_title_a is left alone.

To remove the existing mapping, use this API call:
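For example (a sketch; in Elasticsearch 1.x a type’s mapping can be deleted through the delete mapping API – the hostname and type name are assumptions):

```shell
curl -XDELETE 'http://bigdatalite:9200/all_blog/_mapping/post'
```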

and then the amended mapping put back. Note that the "all_blog" / "mappings" outer levels of the JSON have been removed from the JSON that we send back to Elasticsearch:
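Along these lines (the file name is illustrative and holds just the type-level mapping document):

```shell
curl -XPUT 'http://bigdatalite:9200/all_blog/_mapping/post' -d @amended_mapping.json
```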

Full load into Elasticsearch

Now we can go ahead and run a full INSERT from Hive, and this time the existing mapping will be used. Depending on how much data you’re loading, it might take a while but you can always tail the hive-server2.log file to monitor progress. So that we don’t duplicate the ‘canary’ data that we sent across, use the INSERT OVERWRITE statement:
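A sketch, mirroring the canary load but without the LIMIT and overwriting what is there (table names illustrative):

```sql
INSERT OVERWRITE TABLE all_blog_posts_es
SELECT ts_epoch, post_title, post_title, post_author, url, post_type
FROM   posts;
```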

To check the data’s made it across we can do a count from Hive:
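For example (table name illustrative):

```sql
SELECT COUNT(*) FROM all_blog_posts_es;
```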

But this requires a MapReduce job to run and is fairly slow. Much faster is direct from the horse’s mouth – from Elasticsearch itself where the data is. Just as we called a REST API to get and set the mapping, Elasticsearch can also give us statistics back this way too:
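For example, pulling just the document count out of the index stats (a sketch; the jq path follows the shape of the 1.x _stats response, so verify it against your own output):

```shell
curl -s 'http://bigdatalite:9200/all_blog/_stats' | jq '._all.primaries.docs.count'
```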

Here I’ve used a bit more jq to parse down the stats in JSON that Elasticsearch sends back. If you want to explore more of what jq can do, you’ll find https://jqplay.org/ useful.


For reference, here is the set of three curl/DDL/DML that I used:

  • Elasticsearch index mappings

  • Hive table DDL

  • Hive DML – load data to Elasticsearch


With the data loaded into Elasticsearch we’re now ready to start our analysis against it. Stay tuned for the final part in this short blog series to see how we use Kibana to do this.

Analytics with Kibana and Elasticsearch through Hadoop – part 1 – Introduction

November 3rd, 2014


I’ve recently started learning more about the tools and technologies that fall under the loose umbrella term of Big Data, following a lot of the blogs that Mark Rittman has written, including getting Apache log data into Hadoop, and bringing Twitter data into Hadoop via MongoDB.

What I wanted to do was visualise the data I’d brought in, looking for patterns and correlations. Obviously the de facto choice at our shop would be Oracle BI, which Mark previously demonstrated reporting on data in Hadoop through Hive and Impala. But, this was more at the “Data Discovery” phase that is discussed in the new Information Management and Big Data Reference Architecture that Rittman Mead helped write with Oracle. I basically wanted a quick and dirty way to start chucking around columns of data without yet being ready to impose the structure of the OBIEE metadata model on it. One of the tools I’ve worked with recently is a visualisation tool called Kibana which is part of the ELK stack (that I wrote about previously for use in building a monitoring solution for OBIEE). In this article we’ll take a look at making data available to Kibana and then the kind of analytics and visualisations you can do with it. In addition, we’ll see how loading the data into Elasticsearch has the benefit of extremely fast query times compared to through Hive alone.

The Data

I’ve got three sources of data I’m going to work with, all related to the Rittman Mead website:

  • Website logs, from Apache webserver
  • Tweets about Rittman Mead blog articles, via Datasift
  • Metadata about blog posts, extracted from the WordPress MySQL database

At the moment I’ve focussed on just getting the data in, so it’s mostly coming from static files, with the exception of the tweets, which are held in a NoSQL database (MongoDB).

The Tools

This is where ‘big data’ gets fun, because instead of “Acme DI” and “Acme Database” and “Acme BI”, we have much more interesting – if somewhat silly – product names, seemingly on the principle of the whackier the better. Here I’m using:

  • Kibana – data visualisation tool for Elasticsearch
  • Elasticsearch – data store & analytics / search engine
  • HDFS – Hadoop’s distributed file system
  • MongoDB – NoSQL database
  • Hive – enables querying data held in various places including HDFS (and Elasticsearch, and MongoDB) with a SQL-like query language
  • Beeline – Hive command line interface
  • Datasift – online service that streams tweets matching a given pattern to a nominated datastore (such as MongoDB)
  • mongo-hadoop – a connector for MongoDB to Hadoop including Hive
  • elasticsearch-hadoop – a connector for Elasticsearch to Hadoop including Hive

Kibana only queries data held in Elasticsearch, which acts as both the data store and the analytics engine. There are various ways to get data into Elasticsearch directly from source but I’ve opted not to do that here, instead bringing it all in via HDFS and Hive. I’ve done that because my – albeit fairly limited – experience is that Elasticsearch is great once you’ve settled on your data and schema, but in the same way I’m not building a full OBIEE metadata model (RPD) yet, nor did I want to design my Elasticsearch schema up front and have to reload from source if it changed. Options for reprocessing and wrangling data once in Elasticsearch seem limited and complex, and by making all my data available through Hive first I could supplement it and mash it up as I wanted, loading it into Elasticsearch only when I had a chunk of data to explore. Another approach that I haven’t tried but could be useful if the requirement fits it would be to load the individual data elements directly into their own Elasticsearch area and then using the elasticsearch-hadoop connector run the required mashups with other data through Hive, loading the results back into Elasticsearch. It all depends on where you’re coming from with the data.


Here’s a diagram of what I’m building:

I’ll explain it in steps as follows:

  1. Loading the data and making it accessible through Hive
  2. Loading data from Hive to Elasticsearch
  3. Visualising and analysing data in Kibana

Getting the data into Hive

Strictly speaking we’re not getting the data into Hive, so much as making it available through Hive. Hive simply enables you to define and query tables sitting on top of data held in places including HDFS. The beauty of the Hadoop ecosystem is that you can physicalise data in a bunch of tools and the components will most often support interoperability with each other. It’s only when you get started playing with it that you realise how powerful this is.

The Apache log files and WordPress metadata suit themselves fairly well to a traditional RDBMS format of [de]normalised tables, so we can store them in HDFS with simple RDBMS tables defined on top through Hive. But the twitter data comes in JSON format (like this), and if we were going to store the Twitter data in a traditional RDBMS we’d have to work out how to explode the document into a normalised schema, catering for varying structures depending on the type of tweet and data payload within it. At the moment we just want to collect all the data that looks useful, and then look at different ways to analyse it afterwards. Instead of having to compromise one way (force a structure over the variable JSON) or another (not put a relational schema over obviously relational data) we can do both, and decide at run-time how to best use it. From there, we can identify important bits of data and refactor our design as necessary. This “schema on read” approach is one of the real essences of Hadoop and ‘big data’ in general.

So with that said, let’s see how we get the data in. This bit is the easy part of the article to write, because a lot of it is pretty much what Mark Rittman has already written up in his articles, so I’ll refer to those rather than duplicate here.

Apache log data


I’ve used a variation on the standard Apache log SerDe that the interwebs offers, because I’m going to need to work with the timestamp quite closely (we’ll see why later) so I’ve burst it out into individual fields.

The DDL is:
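A sketch of such a DDL – the column names, the exact regex, and the LOCATION path here are illustrative and will need adapting to your own logs:

```sql
CREATE EXTERNAL TABLE apachelog (
  host        STRING,
  identity    STRING,
  remote_user STRING,
  time_day    STRING,
  time_month  STRING,
  time_year   STRING,
  time_hour   STRING,
  time_min    STRING,
  time_sec    STRING,
  request     STRING,
  status      STRING,
  size        STRING,
  referer     STRING,
  agent       STRING
)
ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  -- One capture group per column; the timestamp is burst out into its component fields
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) \\[(\\d{2})\\/(\\w{3})\\/(\\d{4}):(\\d{2}):(\\d{2}):(\\d{2}) [^\\]]*\\] \"([^\"]*)\" (\\d*) ([^ ]*) \"([^\"]*)\" \"([^\"]*)\""
)
STORED AS TEXTFILE
LOCATION '/user/oracle/apache_logs/';
```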

The EXTERNAL is important on the table definition as it stops Hive moving the HDFS files into its own area on HDFS. If Hive moves the files it is annoying if you want to also access them through another program (or Hive table), and downright destructive if you DROP the table, since it’ll delete the HDFS files too – unless it’s EXTERNAL. Note that the LOCATION must be an HDFS folder, even if it just holds one file.

For building and testing the SerDe regex Rubular is most excellent, but note that it’s Java regex you’re specifying in the SerDe which has its differences from Python or Ruby regex that Rubular (and most other online regex testers) support. For the final validation of Java regex I use the slightly ugly but still useful regexplanet, which also gives you the fully escaped version of your regex which you’ll need to use for the actual Hive DDL/DML.

A sample row from the apache log on disk looks like this:

and now in Hive:

Twitter data


The Twitter data we’ve got includes the Hive ARRAY datatype for the collections of hashtag(s) and referenced url(s) from within a tweet. A point to note here is that the author_followers data appears in a different location in the JSON document depending on whether or not it’s a retweet. I ended up with two variations of this table and a UNION on top.

The table is mapped on data held in MongoDB and as with the HDFS data above the EXTERNAL is crucial to ensure you don’t trash your data when you drop your table.
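The DDL for a MongoDB-backed Hive table uses the MongoStorageHandler from mongo-hadoop. Beyond the handler class and property names, everything here (table, columns, column mapping and the mongo.uri) is illustrative:

```sql
CREATE EXTERNAL TABLE tweets (
  id                 STRING,
  created_at         STRING,
  tweet_text         STRING,
  author_screen_name STRING,
  author_followers   INT,
  hashtags           ARRAY<STRING>,
  ref_urls           ARRAY<STRING>
)
STORED BY 'com.mongodb.hadoop.hive.MongoStorageHandler'
WITH SERDEPROPERTIES (
  -- Map Hive columns onto fields within the MongoDB documents
  'mongo.columns.mapping' = '{"id":"_id","tweet_text":"text","created_at":"created_at"}'
)
TBLPROPERTIES (
  -- Database and collection that Datasift is streaming the tweets into
  'mongo.uri' = 'mongodb://localhost:27017/datasift.tweets'
);
```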

The other point to note is that we’re now using mongo-hadoop for Hive to connect to MongoDB. I found that I had to first build the full set of jar files by running ./gradlew jar -PclusterVersion='cdh5', and also download the MongoDB java driver, before copying the whole lot into /usr/lib/hadoop/lib. This is what I had by the end of it:

After all that, the data as it appears in Hive looks like this:

For reference, without the mongo-hadoop connectors I was getting the error

and with them installed but without the MongoDB java driver I got:

WordPress metadata

WordPress holds its metadata in a MySQL database, so it’s easy to extract out:

  1. Run a query in MySQL to generate the CSV export files, such as:

  2. Copy the CSV file to your Hadoop machine, and copy it onto HDFS. Make sure each type of data goes in its own HDFS folder:

  3. Define the Hive table on top of it:
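Sketched out, steps 1 and 3 look something like this – the table and column names are illustrative rather than the exact WordPress schema:

```sql
-- Step 1: in MySQL, export the published posts to CSV
SELECT post_id, post_title, post_author, post_date
FROM   wp_posts
WHERE  post_status = 'publish'
INTO OUTFILE '/tmp/posts.csv'
FIELDS TERMINATED BY ','
LINES TERMINATED BY '\n';

-- Step 3: in Hive, define a table over the HDFS folder holding the file
CREATE EXTERNAL TABLE posts (
  post_id     INT,
  post_title  STRING,
  post_author STRING,
  post_date   STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION '/user/oracle/wordpress/posts/';
```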

Rinse & repeat for the category data, and post->category relationships.

The data once modelled in Hive looks like this:

The WordPress metadata quite obviously joins together, as it comes from the relational schema in which it was held in MySQL. Here is an example of where “schema on read” comes into play: you could look at the above three tables (posts / post_cats / categories) and conclude it was redundant to export all three from WordPress, and that a single query listing posts and their respective category would be sufficient. But some posts have more than one category, which then leads to a design/requirements decision. Either we retain one row per post – and collapse down the categories, but in doing so lose the ability to easily treat categories as individual data – or we keep one row per post/category, and end up with multiple rows per post, which complicates matters if we’re doing a simple count of posts. So we bring it all in raw from source, and then decide how we’re going to use it afterwards.

Bringing the data together

At this point I have six tables in Hive that I can query (albeit slowly) with HiveQL – a close relation to SQL, with a few interesting differences – through the Hive client Beeline. The data is tweets, website visits, and details about the blog posts themselves.

As well as time, the other common element running throughout all the data is the blog article URL, whether it is a post, a visit to the website, or a tweet about it. But to join on it is not quite as simple as you’d hope, because all the following are examples of recorded instances of the data for the same blog post:

    http://www.rittmanmead.com/2014/01/automated-regression-testing-for-obiee
    /2014/01/automated-regression-testing-for-obiee
    /2014/01/automated-regression-testing-for-obiee/
    /2014/01/automated-regression-testing-for-obiee/feed

So whether it’s querying the data within Hive, or loading it joined together to another platform, we need to be able to unify the values of this field.

Tangent: RegEx

And now it’s time, if you’d not already done so for your SerDe against the Apache log file, to really immerse yourself in Regular Expressions (RegEx). Part of the “schema on read” approach is that it can get messy. You need to juggle and wrangle and munge data in ways that it really might not want to go, and RegEx is an essential tool with which to do this. Regex isn’t specific to Hadoop – it’s used throughout the computing world.

My journey with regex over quite a few years in computing has gone in stages something like this:

  1. To be a fully rounded geek, I should learn regex. Looks up regex. Hmm, looks complicated….Squirrel!
    1. To be a fully round (geddit?!) geek, I should keep eating these big breakfasts
  2. I’ve got a problem, I’ve got a feeling regex will help me. But my word it looks complicated … I’ll just do it by hand.
  3. I’ve got another problem, I need to find this text in a file but with certain patterns around it. Here’s a regex I found on google. Neat!
  4. Hmmm another text matching problem, maybe I should really learn regex instead of googling it to death each time
  5. Mastered the basic concepts of regex
  6. Still a long way to go…

If you think you’ll nail RegEx overnight, you won’t (or at least, you’re a better geek than me). It’s one of those techniques, maybe a bit like SQL, that takes a period of exposure and gradually increasing usage to fully grok, before you have an “ah hah!” moment. There’s a great site explaining regex here: www.regular-expressions.info. My best advice is to take some real example text that you want to work with (match on, replace bits of, etc), stick it in one of the online parsers mentioned above (Rubular, regexplanet), and experiment.

Oh and finally, watch out for variations in regex – what works in a Java-based program (most of the Hadoop world) may not work in Python, and vice versa. The same goes for PHP, Ruby, and so on – they all have different regex engines that may or may not behave as you’d expect.

Back on track : joining data on non-matching columns

So to recap, we want to be able to analyse our blog data across tweets, site hits and postings, using the common field of the post URL, which from the various sources can look like any of the following (and more):

    http://www.rittmanmead.com/2014/01/automated-regression-testing-for-obiee
    /2014/01/automated-regression-testing-for-obiee
    /2014/01/automated-regression-testing-for-obiee/
    /2014/01/automated-regression-testing-for-obiee/feed

So out comes the RegEx. First off, we’ll do the easy one – strip the http:// and server bit. Using the Hive function REGEXP_REPLACE we can use this in the query:
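A sketch of the query – I’m assuming the tweets sit in a Hive table called tweets, with the referring URL in a ref_url column:

```sql
SELECT REGEXP_REPLACE(ref_url, 'http:\\/\\/www.rittmanmead.com', '') AS ref_url
FROM   tweets;
```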

This means: take the ref_url column, and if you find http://www.rittmanmead.com then replace it with nothing, i.e. delete it. The two backslashes before each forward slash simply escape them, since a forward slash on its own has a special meaning in regex. Just to keep you on your toes – Java regex requires double backslash escaping, but all other regex (including the online parsers I link to below) uses a single one.

So now our list of possible join candidates has shrunk by one to look like this:

The variation, as you can see, is whether there is a trailing forward slash (/) after the post ‘slug’, and whether there is additional cruft after that too (feed, foobar+foorbar, etc). So let’s build it up a piece at a time. For each one, I’ve linked to an online parser that you can use to see it in action.

  1. We’ll match on the year and month (/2014/01/) because they’re a fixed pattern, so using \d to match on digits and {x} to match x repetitions, the regex is \/\d{4}\/\d{2}\/ (see example on Rubular.com)

    This will match /2014/01/.

  2. Now we need to match the slug, but we’re going to ditch the forward slash suffix if there is one. This is done with two steps.

    First, we define a “match anything except x” group, which is what the square brackets (group) and the caret ^ (negate) do, and in this case x is the forward slash character, escaped: [^\/]

    Secondly, the plus symbol + tells regex to match at least one repetition of the preceding group – i.e. any character that is not a forward slash – giving [^\/]+ (example)

    Combined with the regex from the first step, \/\d{4}\/\d{2}\/[^\/]+ will now match /2014/01/automated-regression-testing-for-obiee.

  3. The final step is to turn the previous REGEXP_REPLACE on its head: instead of replacing content from the string that we don’t want, we’ll extract the content that we do want, using a regex capture group, which is defined by round brackets (parentheses, just like these). We’ve now brought in a couple of extra bits to make it hang together, seen in the completed regex \S*(\/\d{4}\/\d{2}\/[^\/]+).* :

    1. The \S* at the beginning means match any non-whitespace characters, which replaces the previous regex replace we were doing to strip out the http://www.rittmanmead.com
    2. After the capture group – the content from steps one and two above, surrounded by parentheses: (\/\d{4}\/\d{2}\/[^\/]+) – there is a final .* to match anything else that might be present (e.g. trailing forward slash, feed, foobar, etc)

    Now all we need to do is escape it for Java regex, and stick it in the Hive REGEXP_EXTRACT function, specifying 1 as the capture group number to extract: (example)
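With the same assumed tweets table and ref_url column as before, that gives:

```sql
SELECT REGEXP_EXTRACT(ref_url, '\\S*(\\/\\d{4}\\/\\d{2}\\/[^\\/]+).*', 1) AS post_url
FROM   tweets;
```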

So now all our URLs will look like this, regardless of whether they’re from tweet data, website hits, or WordPress:

    /2014/01/automated-regression-testing-for-obiee

Which is nice, because it means we can use it as the common join in our queries. For example, to look up the title of the blog post that someone has tweeted about, and who wrote the post:
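A sketch of such a query – the table and column names (tweets with its ref_urls ARRAY, posts with title and author columns, and post_url holding the post’s URL) are assumptions to adapt to your own schema:

```sql
SELECT p.post_title,
       p.post_author,
       tw.tweet_text
FROM (
        -- One row per referenced URL per tweet, conformed with the regex
        SELECT t.tweet_text,
               REGEXP_EXTRACT(r.ref_url,
                              '\\S*(\\/\\d{4}\\/\\d{2}\\/[^\\/]+).*', 1) AS post_url
        FROM   tweets t
               LATERAL VIEW EXPLODE(t.ref_urls) r AS ref_url
     ) tw
JOIN   posts p
  ON   tw.post_url = p.post_url;
```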

Note here also the use of LATERAL VIEW EXPLODE () as a way of denormalising out the Hive ARRAY of referenced url(s) in the tweet so there is one row returned per value.


We’ve got our three sources of data available to us in Hive, and can query across them. Next we’ll take a look at loading the data into Elasticsearch, taking advantage of our conformed url column to join data that we load. Stay tuned!
