Over the years, I’ve blogged quite a bit about integration between Oracle Data Integrator and GoldenGate, and how to make it all work with the Oracle Reference Architecture. With the release of the 12c versions of ODI and GoldenGate last October, and a soon-to-be-updated reference architecture, it’s time to write a few posts on the subject again.
Getting Started with 12c
First, let me describe the new Journalizing Knowledge Module (JKM) that has been introduced in ODI 12c for integration with GoldenGate: JKM Oracle to Oracle Consistent (OGG Online). This JKM now allows GoldenGate to be setup in an “online” mode, meaning the GoldenGate parameter files and process groups will be configured and installed on the source and target GoldenGate servers. ODI communicates with the GoldenGate JAgent to perform the installation. The “offline” mode still exists as it did in the 11g version of the JKM, in which the parameter files, etc. are created in a temporary location, and then manually moved to the source and target. I’ll use the “online” JKM throughout this series of posts.
Another change to how Journalizing is implemented for GoldenGate in ODI is the Model to which the JKM is applied. In ODI 11g, the GoldenGate JKM was always applied to the Model containing the target tables, leaving the source table metadata completely out of the picture. This made sense, as GoldenGate handled everything on the source side. Now in ODI 12c, the source tables are reverse engineered and the JKM applied to the source Model. This allows the source table to be used in a single mapping for both the initial load and incremental load of the performance layer through the use of ODI 12c deployment specifications. The target, or fully replicated table, is no longer necessary in the metadata. We’ll talk through this concept in more detail later on.
Reference Architecture Update
Finally, before we get into the details, let’s go over the latest (yet to be released, I might add) version of the Oracle Information Management Reference Architecture. It was first presented by Stewart Bryson and Andrew Bond (Oracle) at the Rittman Mead BI Forum in Brighton (which is why I was given the OK to mention it pre-release!).
This latest reference architecture is not much different than previous versions. The main difference that pertains to this blog post is that the Staging Layer has been renamed the Raw Data Reservoir. If you look through the presentation by Stewart and Andrew, you’ll see that the many of the principles remain the same. They also describe more about an Agile approach to implementing the reference architecture, using GoldenGate, OBIEE against transactional schemas, and eventually ETL development using ODI, a methodology we here at Rittman Mead use with our clients, and call ExtremeBI. Look for the official release of the latest Information Management Reference Architecture in the next couple of months.
In this blog series, we’ll look at how to load the Raw Data Reservoir and Foundation Layer using GoldenGate, and subsequently load the Access and Performance Layer with Oracle Data Integrator Mappings. If you recall from my 11g posts on the subject, we don’t need to load these layers in sequence. GoldenGate will allow the extract of source data once and replication to multiple targets in parallel.
Now that we’ve gone through some of the updated concepts for 12c and the Reference Architecture, let’s look at the high-level steps that must be taken in order to implement GoldenGate and ODI integration.
Install GoldenGate on the source and target servers – including JAgent configuration
Edit the “JKM Oracle to Oracle Consistent OGG (Online)” Knowledge Module (enable Foundation Layer load)
Setup ODI Topology (database schema and GoldenGate connections)
Setup and start Journalizing on the source Model
Develop Mappings for initial load and incremental load
Perform initial load and start replication
There is quite a lot of detail to add to these steps, so let’s get right into it.
GoldenGate 12c Installation and JAgent Configuration
The install of GoldenGate 12c is pretty straight-forward, so I don’t plan on going into much detail here. A step-by-step guide can be found on the DBASolved blog, with the installation setting up and starting the manager process and creating the necessary subdirectories. We then need to configure the JAgent on both the source and target GoldenGate installations, enabling ODI to communicate with GoldenGate during the “online” JKM start journalizing process, which will automatically configure and start the GoldenGate process groups. Setting up the JAgent for ODI integration is essentially the same as if you were setting up Oracle Enterprise Manager integration with GoldenGate.
First, you’ll notice that the file jagent.prm exists in the dirprm directory after installation completes. This parameter file will be used by the jagent process once started in GGSCI.
Next, we need to enable monitoring of GoldenGate by adding an entry, ENABLEMONITORING, to the GLOBALS file. Create the GLOBALS file (with no extension) in the GoldenGate home directory and open it in your favorite text editor. Simply add the line to enable monitoring, close and save the file.
To allow secure communication between ODI and GoldenGate, we must create an Oracle Wallet with a password for JAgent. From the GoldenGate install directory, run the password agent. This will create the cwallet.sso file in the dirwlt subdirectory.
We’re almost there! Now we need to make a slight change to the Config.properties file in the cfg directory under the GoldenGate home. Edit this file and make the following changes:
Set the agent type to Oracle Enterprise Manager. If left as the default OGGMON, the JAgent will attempt to register with the Monitor server, which most likely is not installed.
Under the JMX Username, add the line to signify that SSL is not being used by JAgent (unless SSL is actually being used!).
Finally, ensure the JAgent port is unique across all installations, and on the server in general. In my example, I’m using a single Virtual Machine to host both the source and target GoldenGate installations, so I need to be careful about which ports are in use.
Before starting the JAgent, go ahead and stop, then start the Manager process to ensure all changes have been initialized. Then, start the JAgent and check to ensure both Manager and JAgent are running. That completes the GoldenGate installation and JAgent configuration.
If you want to just skip all of this installation work and get right to it, you can always download the Prebuilt Machine for Oracle Data Integrator 12c. It’s a VirtualBox VM with ODI 12c and GoldenGate 12c already installed and configured to work with the Getting Started Guide. The JAgent is already configured on the source and target GoldenGate installations, making it easy to get up and running. This is a great resource that the Oracle Data Integration product team has provided, and it sounds like they plan to continue adding to it in the future.
In Part 2 of the blog post series, we’ll edit the JKM to enable parallel load of the Raw Data Reservoir and Foundation Layer, as well as begin setup of the ODI Topology and metadata.
Earlier this year I wrote a couple of blogs posts (here and here) discussing the topic of automated Regression Testing and OBIEE. One of the points that I was keen make was that OBIEE is a stack of elements and depending on the change being tested, it may be sensible to focus on certain elements in the stack instead of all of it. For example, if you are changing the RPD, there is little value in doing a web-based test when you can actually test for the vast majority of regressions using the nqcmd tool alone.
I also argued that testing the front end of OBIEE using tools such as Selenium is difficult to do comprehensively, it can be inflexible, time-consuming and in some cases just not a sensible use of effort. These tools work around the idea of parsing the web page that is served up and checking for presence (or absence) of a particular piece of text or an element on a web page. So for example, you could run a test and tell it to fail if it finds the text “Error” on the page, or you could say only pass the test if some known-content is present, such as a report title or data figure. This type of testing is prone to a great deal of false-negatives, because to efficiently build any kind of test case you must focus on something to check for in the page, but you cannot code for every possible error or failure. It is also usually based heavily on the internal IDs of elements on the page in locating the ‘something’ to check for. As the OBIEE Document Object Model (DOM) is undocumented code, Oracle are at presumably at liberty to change it whenever they feel like it, and thus any tests written based on it may fail. Finally, OBIEE 11g still defaults to serving up graphs as Flash objects, which Selenium et al just cannot handle, and so cannot be tested.
So, what do we do about regression testing the OBIEE front end?
What do we need to test in the front end?
There is still a strong case for regression testing the OBIEE front end. Analyses get changed, Dashboards break, permissions are updated – all these things can cause errors or problems for the end user, but which are something that testing further down the OBIEE stack (using something like nqcmd) will not cover.
Consider a simple dashboard:
If one of the dashboard pages that are linked to in the central section get moved in the Presentation Catalog, then this happens:
OK, so Invalid Link Path: is pretty easy to code in as an error check into Selenium. But, what about if the permissions on an analysis used in the dashboard get changed and the user can no longer access it when running the dashboard?
This is a different problem altogether. We need to check for the absence of something. There’s no error, there just isn’t the analysis that ought to be present. One way around this would be to code for the presence of the analysis title text or content – but that is not going to scale nor be maintainable to do for every dashboard being tested.
Another thing that is important to check in the front end is that authorisations are enforced as they should be. That is, a user can see the dashboards that they should be able to, and that they cannot see the ones they’re not. Changes made in the LDAP directory holding users and their groups, or a configuration change in the Application Roles, could easily mean that a user can no longer see the dashboards they should be able to. We could code for this specific issue using something like Web Services to programatically check each and every actual permission – but that could well be overkill.
What I would like to introduce here is the idea of testing OBIEE for regressions visually - but automated, of course.
Visual Regression Testing
Driven by the huge number of applications that are accessed solely on the web (sorry, “Cloud”), a new set of tools have been developed to support the idea of testing web pages for regressions visually. Instead of ‘explaining’ to the computer specifically what to look for in a page (no error text, etc), visual regression testing uses a process to compare images of a web page, comparing a baseline to a sample taken afterwards. This means that the number of false-negatives (missing genuine errors because the test didn’t detect them) drops drastically because instead of relying on coding a test program to parse the Document Object Model (DOM) of an OBIEE web page (which is extremely complex), instead it is simply considering if two snapshots of the resulting rendered page look the same.
The second real advantage of this method is that typically the tools (including the one I have been working with and will demonstrate below, PhantomCSS) are based on the actual engine that drives the web browsers in use by real end-users. So it’s not a case of parsing the HTML and CSS that the web server sends us and trying to determine if there’s a problem or not – it is actually rendering it the same as Chrome etc and taking a snapshot of it. PhantomCSS uses PhantomJS, which uses the engine that Safari is built on, WebKit.
Let’s see this scripting in action. Some clarification of the programs we’re going to use first:
CasperJS provides a set of APIs on top of PhantomJS that make working with web page forms, navigation etc much easier
PhantomCSS provides the regression testing bit, taking snapshots and running code to compare them and report differences.
We’ll consider a simple CasperJS example first, and come on to PhantomCSS after. Because PhantomCSS uses CasperJS for its core interactions, it makes sense to start with the basics.
Here is a bare-bones script. It loads the login page for OBIEE, echoes the page title to the console, takes a snapshot, and exits:
The only bit of the script above that isn’t self-explanatory is where I am referencing elements. The references are as CSS3 selectors and are easily found using something like Chrome Developer Tools. Where the click on Dashboards is simulated, there is a waitUntilVisible function, which is crucial for making sure that the page has rendered fully. For a user clicking the menu, they’d obviously wait until it appears but computers work much faster so functions like this are important for reining them back.
To round off the CasperJS script, let’s add to the above navigating to a Dashboard, snapshotting it (with graphs!), and then logging out.
Again, there’s a couple of waitWhileVisible functions in there, necessary to get CasperJS to wait until the dashboard has rendered properly. The dashboard rendered is captured thus:
So now let’s see how we can use the above CasperJS code in conjunction with PhantomCSS to generate a viable regression test scenario for OBIEE.
The script remains pretty much the same, except CasperJS’s capture gets replaced with a phantomcss.screenshot based on an element (html for the whole page), and there’s some extra code “footer” to include that executes the actual test.
So let’s see how the proposed test method holds up to the examples above – broken links and disappearing reports.
First, we run the baseline capture, the “known good”. The console output shows that this is the first time it’s been run, because there are no existing images against which to compare:
In the screenshots folder is the ‘baseline’ image for each of the defined snapshots:
Now let’s break something! First off I’ll rename the target page for one of the links in the central pane of the dashboard, which will cause the ‘Invalid Link Path’ message to display.
Now I run the same PhantomCSS test again, and this time it tells me there’s a problem:
When an image is found to differ, a composite of the two highlighting the differences is created:
OK, so first test passed (or rather, failed), but arguably this could have been picked up simply by parsing the page returned from the OBIEE server for known error strings. But what about a disappearing analysis – that’s more difficult to ascertain from the page source alone.
Again, PhantomCSS picks up the difference, and highlights it nice and clearly in the generated image:
For the baseline image that you capture it would be against a “gold” version of a dashboard – no point including ad-hoc reports or dashboards under development. You’d also want to work with data that was unchanging, so where available a time filter fixed at a point in the past, rather than ‘current day’ which will be changing frequently.
Belts and Braces?
So visual regression testing is a great thing, but I think a hybrid approach, of parsing the page contents for text too, is worthwhile. CasperJS provides its own test APIs (which PhantomCSS uses), and we can write simple tests such as the following:
this.test.assertTextDoesntExist('Invalid Link Path','Check for error text on page');
this.test.assertTextDoesntExist('View Display Error','Check for error text on page');
So check for a couple of well-known errors, and then snapshot the page too for subsequent automatic comparison. If an assertion is failed, it shows in the console:
This means that what is already be being done in Selenium (or for which Selenium is an assumed default tool) could even be brought into the same single test rig based around CasperJS/PhantomCSS.
Frame of Reference
The eagle-eyed of you will have noticed that the snapshots generated by PhantomCSS above are not the entire OBIEE webpage, whereas the ones from CasperJS earlier in this article are. That is because PhantomCSS deliberately wants to focus on an area of the page to test, identified using a CSS3 selector. So if you are testing a dashboard, then considering the toolbar is irrelevant and can only lead to false-positives.
Similarly, considering the available dashboard list (to validate enforced authorisations) just needs to look at the list itself, not the rest of the page. (and yes, that does say “Protals” – even developers have fat fingers sometimes ;-) )
Using this functionality means that the generated snapshots used for comparison can be done to exclude things like the alerts bar (which may appear or disappear between tests).
The Devil’s in the Detail
I am in no doubt that the method described above has definitely got its place in the regression testing arsenal for OBIEE. What I am yet to be fully convinced of is quite to what extent. My beef with Selenium et al is the level of detail one has to get in to when writing tests – identifying strings to test for, their location in the DOM, and so on. Yet above in my CasperJS/PhantomCSS examples, I have DOM selectors too, so is this just the same problem? At the moment, I don’t think so. For Selenium, to build a comprehensive test, you have to dissect the DOM for every single test you want to build. Whereas with CasperJS/PhantomCSS I think there is the need to write a basic framework for OBIEE (the basics of which are provided in this post; you’re welcome), which can then be parameterised based on dashboard name and page only. Sure, additional types of tests may need new code, but it would be more reusable.
Given that OBIEE doesn’t come with an out of the box test rig, whatever we build to test it is going to be bespoke, whether its nqcmd, Selenium, JMeter, LoadRunner, OATS, QTP, etc etc — the smart money is picking the option that will be the most flexible, more scalable, easiest to maintain, and take the least effort to develop. There is no one “program to rule them all” – an accurate, comprehensive, and flexible test suite is invariably going to utilise multiple components focussing on different areas.
In the case of regression testing – what is the aim of the testing? What are you looking to validate hasn’t broken after what kind of change? If all that’s changed in the system is the DBAs adding some indexes or partitioning to the data, I really would not be going anywhere near the front end of OBIEE. However, more complex changes affecting the Presentation Catalog and the RPD can be well covered by this technique in conjunction with nqcmd. Visual regression testing will give you a pass/fail, but then it’s up to you to decipher the images, whereas nqcmd will give you a pass/fail but also an actual set of data to show what has changed.
Don’t forget that other great tool — you! Or rather, you and your minions, who can sit at OBIEE for 5 minutes and spot certain regressions that would take magnitudes of order greater in time to build a test to locate. Things like testing for UI/UX changes between OBIEE versions is something that is realistically handled manually. The testing of the dashboards can be automated, but faster than I can even type the requirement, let alone build a test to validate it – does clicking on the save icon bring up the save box? Well go click for yourself – done? Next test.
I have just scratched the surface of what is possible with headless browser scripting for testing OBIEE. Being able to automate and capture the results of browser interactions as we’ve seen above is hugely powerful. You can find the CasperJS API reference here if you want to find out more about how it is possible to interact with the web page as a “user”.
I’ve put the complete PhantomCSS script online here. Let me know in the comments section or via twitter if you do try it out!
The other day I posted an article on the blog around using Flume to transport Apache web log entries from our website into Hadoop, with the final destination for the entries being an HDFS file – with the HDFS file essentially mirroring the contents of the webserver log file. Once you’ve set this transport mechanism up, you could create a Hive table over the HDFS files, for example, or further transform the data using Pig, Spark or some other mechanism.
When you load data into HDFS files though, there are a couple of things you need to be aware of; HDFS is optimised for large, streaming reads of files stored in very large disk blocks, with the classic use-case being MapReduce transformations that crunch large sets of incoming data and hand-off the results to another process. What it’s not good at is random retrievals of single file records, something you’ll notice if you try and return a single row from a Hive table request. Moreover, HDFS files are write-once, no updates or overwrites, which is why Hive only supports SELECTS and not UPDATES or DELETES. Altogether, whilst HDFS is great for landing and then processing large chunks of data, if you’re looking for more granular, database-type storage on Hadoop, you’ll need to think of something else.
And within the context of Cloudera Hadoop, that other thing is HBase, a “NoSQL” database that’s also open-source and runs on the Hadoop framework. Whilst you can work with HBase in similar ways to how you work with relational databases – you can create columns, load data into it, insert and update data and so forth – HBase and NoSQL are in lots of ways the complete opposite of relational databases like Oracle Database, as they trade-off things we normally take for granted but that have performance and scalability impacts – ACID transactions, the ability to support complex table relationships, very rich query languages and application support – for extreme scalability and flexibility. If you’re scared of losing your data then HBase is one of the better NoSQL databases, with strong (rather than “eventual”) consistency, automatic shading and lots of high-availability features, but it’s not designed for running your payroll (yet).
One reason we might want to land data in HBase or another NoSQL database, rather than in regular HDFS files, is if we then want to do fast individual record lookups within the landed data. Another reason would be HBase’s support for complex record types, making it easy to store for example nested XML datasets, and its ability – like the Endeca Server – to hold completely different sets of “columns” for each row in the database, and even version those rows giving us almost a “multi-dimensional” database. Internally, HBase stores data as key-value pairs giving it the ability to hold completely different data in each database row, and under the covers HBase data is in turn stored in indexed “StoreFiles” within HDFS, giving it HDFS’s scalability and access to the Hadoop framework, but adding fast random access to individual records.
Where HBase (and most NoSQL databases) get complicated though is that there’s no SQL*Developer or TOAD to create tables, and no SQL or PL/SQL to load and manipulate them – it’s all done through Java and custom code – this article by Lars George who gave the Hadoop Masterclass as last week’s BI Forum goes into a bit more detail, along with his HBase slides and his book, “HBase: The Definitive Guide”.
So let’s look at a simple example of loading Apache CombinedLogFormat log file entries into HBase, using Flume to transport and ingest the data from our webserver into Hadoop and put together again by Nelio Guimaraes from the RM team. We’ll start by defining the HBase table, which like regular relational tables has rows but which has the concept of column families and column qualifiers rather than just columns. In practice, a column family + qualifier name makes what we’d normally think of as a column, but crucially under the covers column within families are stored together on disk, like column-store relational databases, making them fast to query and randomly access. Like a spreadsheet or OLAP database each combination of row and column family/qualifier is called a “cell”, and moreover only populated cells are stored on disk, with the added bonus of cell entries being timestamped, giving us the ability to retrieve previous versions of cell entries, like the temporal query feature in Oracle Database 12c.
For more details on how HBase stores data, and how HBase schemas are defined, the white paper “Introduction to HBase Schema Design” by Cloudera’s Amandeep Khurana is a good reference point and introduction. So let’s go into the HBase shell and create a table to contain our log data; we’ll define as containing three column families (“common”,”http” and “misc”), with the actual column qualifiers defined at the point we load data into the table – one of the key features of HBase, and most NoSQL databases, is that you can introduce new columns into a store at the point of loading, just by declaring them, with each row potentially containing its own unique selection of columns – which is where Endeca Server gets its ability to store “jagged datasets” with potentially different attribute sets held for groups of rows.
HBase Shell;enter'help<RETURN>'forlist of supported commands.
Type"exit<RETURN>"toleave the HBase Shell
In this example, the way we’re going to populate the HBase table is to use Flume; like the Flume and HDFS example the other day, we’ll use a “sink”, in this case a HBase sink, to take the incoming Flume activity off the channel and load it into the HBase table. Flume actually has two HBase sinks; one called HBaseSink which writes synchronously (more straightforward but slower) and another called AysncHBaseSink which writes asynchronously, potentially with higher overall throughput than synchronous writes and with full consistency even if there’s a failure (based on replaying the channel data), but with a slightly more complex serialisation approach. We’ll use the asynchronous sink in this example, and assuming you’ve already got the source configuration file set-up (see the previous blog post on Flume and HDFS for an example), the target Flume conf file in our case looked like this:
The collector.sinks.HbaseOut.type setting determines the sink type we’ll use, in this case org.apache.flume.sink.hbase.AsyncHBaseSink
collector.sinks.HbaseOut.table sets the HBase table name we’ll load, “apache_access_log”
collector.sinks.HbaseOut.serializer.columns actually defines the column qualifiers, in this case mapping incoming serialised log file rows into a set of HBase column families and qualifiers
collector.sinks.HbaseOut.serializer is the most important bit – and tells HBase how to turn the incoming Flume data into HBase loads, through a Java program called the “serializer”.
And its this serializer, the Java program that does the actual loading of the HBase table, that’s the final piece of the jigsaw. There are standard templates to use when writing this piece of code, and in our case the serializer looked like this:
HBase, rather than supporting the regular SELECT and INSERTS we’re used to with Oracle, instead uses “get” and “put” methods to retrieve, and store, data – along with “delete” and “scan”. The regular synchronous HBase sync uses these methods directly, taking data off the Flume channel and inserting it into the HBase table (or indeed, updating existing rows based on the row key), whilst the asychnronous method uses a layer in-between the incoming data and the write, allowing data (or “events”) to continue streaming in even if all the downstream data hasn’t get been committed. It’s this code though that maps each incoming bit of data – in this case, a parsed log file – to column families and qualifiers in the HBase table, and you’d need to write new code like this, or amend the exiting one, if you wanted to load other HBase tables in your Hadoop cluster – a long way from the point-and-click ETL approach we get with ODI, but a lot more flexible too (if that’s what you want).
Then it’s a case of compiling the Java code, like this:
This is all great, and a good starting point if you plan to process your data with other Java programs as the next step. But what if you want to view the data in a more convenient way, perhaps as a regular table? To do that you can use Hive again, this time using Hive’s HBase integration features to tell it the data is stored in HBase format, and to let it know how to display the various HBase column families and qualifiers. In our case, the DDL to create the corresponding Hive table looks like this:
giving us the ability, either from the Hive shell like this, or from tools like OBIEE and ODI, to query the NoSQL database and brings its data into more regular, relational data stores.
hive>select *from hive_apache_access_log;
140062856033120/May/2014:15:28:06+000089.154.89.101--GET/HTTP/1.12009054Mozilla/5.0(Macintosh;Intel Mac OSX10_9_2)AppleWebKit/537.75.14(KHTML,like Gecko)Version/7.0.3Safari/537.75.14-
140062856034420/May/2014:15:28:06+000089.154.89.101--GET/wp-content/plugins/crayon-syntax-highlighter/css/min/crayon.min.css?ver=2.5.0HTTP/1.1304-Mozilla/5.0(Macintosh;Intel Mac OSX10_9_2)AppleWebKit/537.75.14(KHTML,like Gecko)Version/7.0.3Safari/537.75.14http://www.rittmanmead.com/
140062856034520/May/2014:15:28:06+000089.154.89.101--GET/wp-content/plugins/jetpack/modules/widgets/widgets.css?ver=20121003HTTP/1.304-Mozilla/5.0(Macintosh;Intel Mac OSX10_9_2)AppleWebKit/537.75.14(KHTML,like Gecko)Version/7.0.3Safari/537.75.14http://www.rittmanmead.com/
We’ll be covering more on HBase, and Oracle’s NoSQL Database, in future articles on the blog.
Well we’re back in Brighton, UK now after the second successful week of the Rittman Mead BI Forum 2014. In Week 1, we went to the Seattle Hotel in Brighton (read the recap here), and then straight-after we flew over to Atlanta, GA, to run the second week – and it was possibly even better than Brighton ;-)
Congratulations to Omri Traub, winner of the Atlanta Best Speaker award – the first person from Oracle to win, in fact – and to all of the other presenters who helped put together an excellent event. If you’re interested, I’ve uploaded a bunch of photos from both Brighton and Atlanta to Flickr, and you can view the BI Forum 2014 photo set here.
As usual, where we’ve got permission (or the PDF) from the presenter, we’re making all of the presentations available for download to both attendees and non-attendees – not everyone can make it to the event, but we don’t want you to miss-out. We’re also very grateful to Lars George and Cloudera for making their Hadoop Masterclass slides available too – thanks everyone.
We’ve published a number of “getting started with Hadoop” articles over the past few months, but these aren’t always easy to find on the blog. I’ve therefore compiled a list of the more recent ones, which you’ll find below: