Australian March Training Offer

February 28th, 2015 by

Autumn is almost upon us here in Australia, so why not hold off going into hibernation and head into the classroom instead?

For March and April only, Rittman Mead courses in Australia* are being offered at significantly discounted prices.

Heading up this promotion is the popular TRN202 OBIEE 11g Bootcamp course which will be held in Melbourne, Australia* on March 16th-20th 2015.

This is not a cut-down version of the regular course but the entire 5-day content.

To enrol for this specially priced course, visit the Rittman Mead website training page. Registration is only open between March 1st and March 9th 2015, so register quickly to secure a spot.

Further specially priced courses will be advertised in the coming weeks.

*This offer is only available for courses run in Australia.
Registration Period: 01/03/2015 12:00am – 09/03/2015 11:59:59pm
Further Terms and Conditions can be found during registration

Introducing Oracle Big Data Discovery Part 3: Data Exploration and Visualization

February 27th, 2015 by

In the first two posts in this series, we looked at what Oracle Big Data Discovery is and how you can use it to sample, cleanse and then catalog data in your Hadoop-based data reservoir. At the end of that second post we’d loaded some webserver log data into BDD, and then uploaded some additional reference data that we then joined to the log file dataset to provide descriptive attributes to add to the base log activity. Once you’ve loaded the datasets into BDD you can do some basic searching and graphing of your data directly from the “Explore” part of the interface, selecting and locating attribute values from the search bar and displaying individual attributes in the “Scratchpad” area.

NewImage

With Big Data Discovery though you can go one step further and build complete applications to search and analyse your data, using the “Discover” part of the application. Using this feature you can add one or more charts to a dashboard page that go much further than the simple data visualisations you get on the Explore part of the application, based on the chart types and UI interactions that you first saw in Oracle Endeca Information Discovery Studio.

NewImage

Components you can add include thematic maps, summary bars (like OBIEE’s performance tiles, but for multiple measures), various bar, line and bubble charts, all of which can then be faceted-searched using an OEID-like search component.

NewImage

Each visualisation component is tied to a particular “view” that points to one or more underlying BDD datasets – samples of the full dataset held in the Hadoop cluster stored in the Endeca Server-based DGraph engine. For example, the thematic map above was created against the post comments dataset, with the theme colours defined using the number of comments metric and each country defined by a country name attribute derived from the calling host IP address.

NewImage

Views are auto-generated by BDD when you import a dataset, or when you join two or more datasets together. You can also use the Endeca EQL language to define your own views using a SQL-type language, and then define which columns represent attributes, which ones are metrics (measures) and how those metrics are aggregated.

NewImage

Like OEID before it, Big Data Discovery isn’t a substitute for a regular BI tool like OBIEE – beyond simple charts and visualizations it’s tricky to create more complex data selections, drill-paths in hierarchies, subtotals and so forth, and users will need to understand the concept of multiple views and datatypes, when to drop into EQL and so on – but for non-technical users working in an organization’s big data team it’s a great way to put a visual front-end onto the data in the data reservoir without having to understand tools like R Studio.

So that’s it for this three-part overview of Oracle Big Data Discovery and how it works with the Hadoop-based data reservoir. Keep an eye on the blog over the next few weeks as we get to grips with this new tool, and we’ll be covering it as part of the optional masterclass at the Brighton and Atlanta Rittman Mead BI Forum 2015 events this May.

Why and How to use Oracle Metadata Management 12c. Part 2: Importing and Relating Metadata

February 26th, 2015 by

In the first post of this series we saw how to install and configure OEMM to start working with it. In this post we are going to see how to import metadata into OEMM from different sources such as Oracle Database, OBIEE, ODI and OWB, and then relate all of them inside OEMM.

oemm_main_page

After we have installed and configured OEMM, we need to start adding all the metadata from the different sources and applications that we use. In this example the sources will be some Oracle schemas and our applications will be ODI, OWB and OBIEE. To import the metadata for all of them we need to create one model in OEMM for each. A model in OEMM holds all the connection details for a specific source or metadata provider (e.g. a database schema, an ODI repository, etc.), and is also the container for the metadata of that specific source after the import process. So one model connects to one specific source or application.

First, for organisational purposes, we will create a folder to contain the future models. You can also create your models first, then create the folders that you want and move the models into the corresponding folders. In addition, you can create folders within another folder.

To create a folder, right-click on the Repository entry under the Repository panel in the OEMM main page. Select New > Folder in the pop-up menu, enter a name and press the Create button.

 

oemm_folder_comb

The next step is to create the models and import the metadata from the different sources. The import, or reverse-engineering, process is called harvesting in OEMM. We will start with the model for the Oracle Database. In this particular example I used Oracle 12c.

To create a model, right-click on the folder or the repository entry and select New > Model. In the Create Model window that appears, enter a name for the new model and select the type of source that you want to import or, to be more precise, the Import Bridge that this model will use.

The Import Bridge is part of the Meta Integration® Model Bridge (MIMB) software and is the way that OEMM connects to the sources and applications to reverse-engineer the metadata. You will find import bridges for a wide range of technologies, such as different databases, Business Intelligence and Data Integration products from different vendors, Big Data stores, etc.

oemm_model_comb

For this first example we will select the Oracle Database (via JDBC) import bridge, and in the Import Setup tab we will add all the usual connection details: host, port, service, and the user and password to connect to the database. This user should have at least the CONNECT privilege and the SELECT_CATALOG_ROLE role. We can also restrict this model to specific schemas, either using the magnifying glass to choose from the schemas shown or just typing the schema names (in uppercase) separated by “;”. We can also decide whether or not stored procedures are included in the imported metadata.
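As a small illustration of the schema-list format described above (uppercase names separated by “;”), here is a minimal Python sketch that builds such a string. The function name and the sample schema names are hypothetical and are not part of OEMM.

```python
def format_schema_list(schemas):
    """Return schema names uppercased, de-duplicated and joined with ';'."""
    cleaned_names = []
    for name in schemas:
        cleaned = name.strip().upper()
        # Skip empty entries and names that were already added.
        if cleaned and cleaned not in cleaned_names:
            cleaned_names.append(cleaned)
    return ";".join(cleaned_names)


if __name__ == "__main__":
    # Hypothetical schema names, for illustration only.
    print(format_schema_list(["staging", "dw_sales", "Staging"]))
```

The resulting string can then be pasted into the schema field of the Import Setup tab.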

oemm_model_conn_db

After all the connection details have been set, we test the connection and wait until we receive the Connection Successful message, and finally press the Create button. A message window will appear asking if we want to “Import a new version now?” Press Yes to start the harvesting process. A log window will show you the progress of the import process, which can take several minutes. After the process is finished, a new message window asks if we want to open the model.

oemm_imp_succ_open_model

Choose Yes to see all the objects that were imported for this model, as shown in the figure below.

ORCL_training_model

We need to repeat the process explained above to create the models for the rest of the sources and applications that we are going to use in this example. The process is the same for all of them, but of course there are some differences in the connection details required after we choose the specific Import Bridge for each one.

In the next screenshot you will find the connection details for the ODI model after you choose the Oracle Data Integrator (ODI) Import Bridge. In the Import Setup tab, you need to select the appropriate driver to connect to the database that holds the ODI Repository (which could be Oracle, SQL Server, DB2, etc.), the ODI Home folder, the URL to connect to the database, the schema and password for the Master Repository, the user and password for the ODI user (SUPERVISOR, for example), the name of the Work Repository from which we want to select the ODI objects, and the Context.

oemm_odi_import_setup

We need to select the Scope for this model from two options: Projects, which includes packages and mappings (or interfaces for versions before 12c), or Load Plans and Scenarios, which includes the Load Plans and Scenarios.

After we choose the Scope we can also filter its Content by pressing the magnifying glass icon and selecting the specific objects that we want for this model.

After you press the Create button to start the harvesting process, open the model created and it will look similar to this if you chose Projects as the Scope.

oemm_odi_projects

For the connection details to create the OWB model, you need to take a couple of things into account. First, consider the version of OWB from which you want to import the metadata. If it is 11.2.0.3 or later, you will need to complete these two steps first:

  1. Copy the following .JAR files from: MetaIntegrationInstallationDir\java\ to %OWB_HOME%\owb\lib\ext\
  • jsr173_1.0_api.jar
  • MIR.jar
  • MIRModelBridge.jar
  • MIROracleWarehouseBuilderOmb.jar
  • MIRUtil.jar
  • stax-1.1.1-dev.jar
  2. Copy the MetaIntegrationInstallationDir\bin\mimbexec.bat file into the same OWB directory.
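If you prefer to script the two copy steps above, a minimal Python sketch could look like the following. The function name is hypothetical, the two installation directories are passed in as parameters rather than guessed, and it assumes that “the same OWB directory” means the owb\lib\ext folder used in step 1.

```python
import shutil
from pathlib import Path

# JAR files listed in step 1 of the post.
JAR_FILES = [
    "jsr173_1.0_api.jar",
    "MIR.jar",
    "MIRModelBridge.jar",
    "MIROracleWarehouseBuilderOmb.jar",
    "MIRUtil.jar",
    "stax-1.1.1-dev.jar",
]


def copy_bridge_files(mimb_home, owb_home):
    """Copy the bridge JARs and mimbexec.bat into <owb_home>/owb/lib/ext.

    Assumption: mimbexec.bat goes to the same lib/ext folder as the JARs.
    """
    mimb_home = Path(mimb_home)
    target_dir = Path(owb_home) / "owb" / "lib" / "ext"
    sources = [mimb_home / "java" / jar for jar in JAR_FILES]
    sources.append(mimb_home / "bin" / "mimbexec.bat")

    # Fail before copying anything if a source file is missing.
    missing = [str(path) for path in sources if not path.is_file()]
    if missing:
        raise FileNotFoundError("Missing files: " + ", ".join(missing))

    target_dir.mkdir(parents=True, exist_ok=True)
    return [Path(shutil.copy2(src, target_dir / src.name)) for src in sources]
```

Checking for missing files up front avoids leaving the OWB directory with only part of the required set.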

As the version that I have is 11.2.0.4, I copied the files detailed above, set the connection parameters as shown in the following image and tested the connection.

owb model config

When I started the import process, the following error message appeared in the log window:

error_owb2

 

After trying many things unsuccessfully, I asked David Allan for help and he sent me another mimbexec.bat, because apparently there were directory name changes between 11.2.0.3 and 11.2.0.4. This is a temporary fix and a proper one is being worked on.

I substituted the bat file and received another error message, as shown in the next screenshot.

error_importing_owb_externaltable

 

After a while, I realised that the issue that OEMM reported was because I was using an external table in one of the mappings. I changed it for a regular table and the import process worked well. This has been reported as a bug and a solution is being worked on. I really want to thank David for all his invaluable help on that.

This is how the OWB model looks after the import of the metadata.

owb_model

The last model that we need to create is the one based on OBIEE. There are different import bridges depending on the metadata that we need to import from OBIEE: Oracle Business Intelligence (OBI) Server, Oracle Business Intelligence (OBI) Enterprise Edition and Oracle Business Intelligence (OBI) Answers.

The OBI Server import bridge needs the OBI repository in XML format as a parameter, and the resulting model will contain all the objects defined in the three layers of the repository (Presentation, Business Model, Physical) as well as the repository connections, the variables and the initialisation blocks defined in the OBI repository.

oemm_biserver_comb

To use the OBI Enterprise Edition import bridge we need to set the login user and password to connect to OBIEE (usually weblogic or a user with admin privileges) and the repository file in XML format, and we can also filter the number of reports retrieved from the OBI Presentation Server.

There are a couple of interesting optional settings. One optimises the import of large models: if it is set to true, the import doesn’t return some objects such as joins, relationships, logical foreign keys, etc., in order to consume less memory at run time. The other sets whether we want an incremental import, which imports only the changes in the source, or a full import of everything each time.

oemm_obiee_comb

The last import bridge to use with OBI is OBI Answers, which imports the content of a particular analysis or KPI report. This bridge needs the specific analysis in XML format.

oemm_bi_analysis_comb

 

There are a couple of additional things to note about models. First, if you want to see the configuration details, right-click the model and choose the Settings option from the pop-up menu. If you want to open the model to see the objects that it contains, double-click on it.

Second, for every parameter in a model you will find very detailed help at the right of the Import Setup tab; and if you click on the name of the Import Bridge in the same tab, you get the documentation for that particular bridge, which I find very useful.

There are two more tabs in the folder and model definitions that we won’t use in this example but that we will talk about in future posts: one for security and another for executing scripts when an event happens to the object. Models also have an additional tab, Import Schedule, to create a schedule for the harvesting process.

Relate the models

Once we have defined our models we need to relate them and validate their relationships. The automated process of relating these models through validation is called stitching. In order to do that we must create a Configuration first. A configuration in OEMM is a collection of models and other objects such as mappings, glossaries, etc., that are related in some way.

According to the online documentation we need to consider a configuration as any of these options:

  • Repository workspace: a collection of Repository Objects to be analyzed together (search, browse, reports, etc.) as a technical scope, or business area under the same access permission scope.
  • Enterprise architecture – a collection of data store Models (ODS, data staging areas, data warehouses, data marts, etc.) and data process Models (ETL/DI, and BI) connected together through data flow stitching.
  • Design workflow – a collection of conceptual, logical and physical Models connected (semantically stitched) together through semantic mappings modeling the design process.

To create a Configuration, just right-click on a selected folder or the repository entry and choose New > Configuration. Enter a name for the configuration and press the Create button.

oemm_new_config

The configuration is opened and you need to drag the models that you want to be stitched into this configuration, as shown in the following screenshot.

oemm_models_config

As you drag and drop your models, you can see that some of them have a warning icon after you include them in the configuration, and that is because we need to connect that model with the appropriate source of data.

To do that, select the model in the configuration and press Edit Connection. Choose the corresponding store for each connection and press OK.

oemm_edit_conn_config

oemm_conn_edit1

After you finish with all the models, press the Validate button to start stitching (relating) them.

oemm_validate_config

In most cases, OEMM can assign the corresponding default schema for each of the connections in the model. In the cases where it cannot, as with OWB, you need to do it manually.

oemm_conn_config_owb

In the following image you will see all the models validated. For this example, I’ve created four database models: one that contains the source (transactional system), one for the staging schema, and another two that contain different data warehouses. There is also an ODI model, an OWB model and an OBIEE model.

oemm_models_validated

You can also see the relationships between the models that belong to a configuration in a graphical view, in the Architecture Diagram tab of the Configuration. If the diagram looks a little messy, you can press the Edit button and then Layout to tidy up the way the components are shown.

oemm_conf_arch_diag

In summary, we create and harvest (reverse-engineer) the models and then relate, or stitch, them so that we can analyse them together. In the next post, we will see some interesting things that we can do with configurations and models, such as tracing data lineage and data impact.

Introducing Oracle Big Data Discovery Part 2: Data Transformation, Wrangling and Exploration

February 25th, 2015 by

In yesterday’s post I looked at Oracle Big Data Discovery and how it brought the search and analytic capabilities of Endeca to Hadoop. We looked at how the Oracle Endeca Information Discovery Studio application works with a version of the Endeca Server engine to analyse and visualise sample sets of data from the Hadoop cluster, and how it uses Apache Spark to retrieve data from Hadoop and then transform that data to make it more suitable for data discovery and data analysis applications. Oracle Big Data Discovery is designed to work alongside ODI and GoldenGate for Big Data once you’ve decided on your main data flows, and Oracle Big Data SQL for BI tool and application access to the entire “data reservoir”. So how does Big Data Discovery work, and what role does it play in the overall big data project workflow?

The best way to think of Big Data Discovery, to my mind, is “Endeca on Hadoop”. Endeca Information Discovery had three main parts to it; the data loading part performed using Endeca Information Discovery Integrator and more recently, the personal data upload feature in Endeca Information Discovery Studio. Data was then ingested into the Endeca Server engine and stored in a key/value-store NoSQL database, indexed, parsed and enriched, and then analyzed using the graphical user interface provided by Studio. As I explained in more detail in my first post in the series yesterday, Big Data Discovery runs the Studio and DGraph (Endeca Server) elements on one or more dedicated nodes, and then reads data in from Hadoop and then writes it back in transformed states using Apache Spark, as shown in the diagram below:

NewImage

As the data discovery and analysis features in Big Data Discovery rely on getting data into the DGraph (Endeca Server) engine first of all, this implies two things; first, we’ll need to take a subset or sample of the entire Hadoop dataset and load just that into the DGraph engine, and second we’ll need some means of transforming and “massaging” that data so it works well as a data discovery set, and then writing those changes back to the full Hadoop dataset if we want to use it with some other tool – OBIEE or Big Data SQL, for example. To see how this process works, let’s use the same Rittman Mead Apache webserver logs that I’ve used in my previous examples, and bring that data and some additional reference data into Big Data Discovery.

The log data from the RM webserver is in Apache Combined Log Format and a sample of the rows looks like this:

NewImage

For data to be eligible to be ingested into Big Data Discovery, it has to be registered in the Hive Metastore and with the metadata available to use by external tools using the HCatalog service. This means that you already need to have created a Hive table over each datasource, either pointing this table to regular fixed-width or delimited files, or using a SerDe to translate another file format – say a compressed/column-store format like Parquet – into a format that Hive can understand. In our case I can use the RegEx SerDe that I first used in this blog post a while ago to create a Hive table over the log file and split out the various log file elements, with the resulting DDL looking like this:

If I then register the SerDe with Big Data Discovery I could ingest the table and file at this point, or I can use a Hive CTAS statement to remove the dependency on the SerDe and ingest into BDD without any further configuration.
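To make the role of the regular expression more concrete, here is a minimal Python sketch (not the Hive DDL itself) of how a regex can split an Apache Combined Log Format line into named fields, which is broadly what a regex-based SerDe does for each row. The pattern, the field names and the sample line are illustrative assumptions, not the ones used for the Rittman Mead logs.

```python
import re

# Illustrative pattern for Apache Combined Log Format; field names are assumptions.
COMBINED_LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<identity>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def parse_log_line(line):
    """Return a dict of log fields, or None if the line does not match."""
    match = COMBINED_LOG_PATTERN.match(line)
    return match.groupdict() if match else None


# Made-up sample line using a documentation-range IP address.
SAMPLE = (
    '203.0.113.7 - - [25/Feb/2015:10:15:32 +0000] '
    '"GET /blog/index.html HTTP/1.1" 200 5120 '
    '"http://www.example.com/" "Mozilla/5.0"'
)

if __name__ == "__main__":
    print(parse_log_line(SAMPLE))
```

Every field comes back as a string, so dates, status codes and sizes still need converting afterwards, which is the kind of clean-up discussed later in this post.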

At this point, if you’ve got the BDD Hive Table Detector running, it should pick up the presence of the new hive table and ingest it into BDD (you can whitelist table names, and restrict it to certain Hive databases if needed). Or, you can manually trigger the ingestion from the Data Processing CLI on the BDD node, like this:

The data processing process then creates an Apache Oozie job to sample a statistically relevant sample set of data into Apache Spark – with a 1% sample providing 95% sample accuracy – that is then profiled, enriched and loaded into the Big Data Discovery DGraph engine for further transformation, then exploration and analysis within Big Data Discovery Studio.

NewImage

The profiling step in this process scans the incoming data and helps BDD determine the datatype of each Hive table column, the distribution of values within the column and so on, whilst the enrichment part identifies key words and phrases and other key lexical facts about the dataset. A key concept here also is that BDD typically works with a representative sample of your Hive table contents, not the whole contents, as all the data you analyse has to fit within the memory space of the DGraph engine, just like it used to with Endeca Server. At some point it’s likely that the functionality of the DGraph engine will be unbundled from the Endeca Server and run natively across the actual Hadoop cluster, but for now you have to separately ingest data into the DGraph engine (which can run clustered on BDD nodes) and analyse it there – however the rules of sampling are that if you’ve got a sufficiently big sample – say, 1m rows – regardless of the actual main dataset size this sample set is considered sufficiently representative – 95% in this case – as to make loading a bigger sample set not really worth the effort. But bear in mind when working with a BDD dataset that you’re working with a sample, not the full set, so if a value you’re looking for is missing it might be because it’s not in this particular sample.
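The point that a big enough sample stays representative regardless of the full dataset size can be illustrated with the textbook sample-size formula for estimating a proportion (Cochran’s formula with a finite population correction). This is a general statistics sketch, not BDD’s actual sampling logic; reading the “95%” above as a confidence level, and the 1% margin of error, are assumptions made for the example.

```python
import math
from statistics import NormalDist


def required_sample_size(population, confidence=0.95,
                         margin_of_error=0.01, proportion=0.5):
    """Sample size needed to estimate a proportion within the given margin."""
    # Two-sided z-score for the requested confidence level.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Cochran's formula for an effectively infinite population.
    n0 = (z ** 2) * proportion * (1 - proportion) / margin_of_error ** 2
    # Finite population correction.
    n = n0 / (1 + (n0 - 1) / population)
    return math.ceil(n)


if __name__ == "__main__":
    for rows in (10**6, 10**8, 10**9):
        print(rows, required_sample_size(rows))
```

Under these assumptions the required sample grows very little as the dataset grows, which is why loading a much bigger sample adds little.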

Once you’ve ingested the new dataset into BDD, you see it listed amongst the others that have previously been ingested, like this:

NewImage

At this point you can explore the dataset, to take an initial look at the patterns and values in the dataset in its raw form.

NewImage

Unfortunately, in this raw form the data in the access_logs table isn’t all that useful – details of the page request URL are mixed in with the HTTP protocol and method, for example; dates are in strings; details of the person accessing the site are in IP address format rather than a geographical location, and so on. In previous examples on this blog I’ve looked at various methods to cleanse, transform and enhance the data in log file tables like this, using tools and techniques such as Hive table transformations, Pig and Apache Spark scripts, and ODI mappings but all of these typically require some IT involvement whereas one of the hallmarks of recent versions of Endeca Information Discovery Studio was giving power-users the ability to transform and enrich data themselves. Big Data Discovery provides tools to cleanse, transform and enrich data, with menu items for common transformations and a Groovy script editor for more complex ones, including deriving sentiment values from textual data and stripping out HTML and formatting characters from text.

NewImage

Once you’ve finished transforming and enriching the dataset, you can either save (commit) the changes back to the sample dataset in the BDD DGraph engine, or you can use the transformation rules you’ve defined to apply those transformations to the entire Hive table contents back on Hadoop, with the transformation work being done using Apache Spark. Datasets are loaded into “projects” and each project can have its own transformed view of the raw data, with copies of the dataset being kept in the BDD DGraph engine to represent each team’s specific view onto the raw datasets.

NewImage

In practice I found this didn’t, at the current product state, completely replace the need for a Hadoop developer or R data analyst – you need to get your data files into Hive and HCatalog at the start which involves parsing and interpreting semi-structured data files, and I often did some transformations in BDD, then applied the transformations to the whole Hive dataset and then re-imported the results back into BDD to start from a simple known state. But it certainly made tasks such as turning IP addresses into countries and cities, splitting out URLs and removing HTML tags much easier and I got the data cleansing process done in a matter of hours compared to the days with manual Hive, Pig and Spark scripting.

Now the data in my log file dataset is much more usable and easy to understand, with URLs split out, status codes grouped into high-level descriptors, and other descriptive and formatting changes made.
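For readers who want to picture these transformations outside BDD, here is a minimal Python sketch of three of them: splitting the request string into method, URL and protocol, grouping status codes into high-level descriptors, and turning the date string into a proper timestamp. BDD itself does this through its transformation menus and Groovy editor; the function names and descriptor labels here are assumptions for illustration only.

```python
from datetime import datetime

# High-level descriptors keyed by the first digit of the HTTP status code.
STATUS_GROUPS = {
    1: "Informational",
    2: "Success",
    3: "Redirection",
    4: "Client Error",
    5: "Server Error",
}


def split_request(request):
    """Split 'GET /page HTTP/1.1' into its three parts, or return None."""
    parts = request.split()
    if len(parts) != 3:
        return None
    method, url, protocol = parts
    return {"method": method, "url": url, "protocol": protocol}


def describe_status(status_code):
    """Map a status code such as '404' to its high-level group."""
    return STATUS_GROUPS.get(int(status_code) // 100, "Unknown")


def parse_timestamp(text):
    """Parse an Apache log timestamp such as '25/Feb/2015:10:15:32 +0000'.

    Note: %b expects English month abbreviations (the default C locale).
    """
    return datetime.strptime(text, "%d/%b/%Y:%H:%M:%S %z")


if __name__ == "__main__":
    print(split_request("GET /blog/index.html HTTP/1.1"))
    print(describe_status("404"))
    print(parse_timestamp("25/Feb/2015:10:15:32 +0000"))
```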

NewImage

I can also at this point bring in additional datasets, either created manually outside of BDD and ingested into the DGraph from Hive, or manually uploaded using the Studio interface. These dataset uploads then live in the BDD DGraph engine, and are then written back to Hive for long-term persistence or for sharing with other tools and processes.

NewImage

These datasets can then be joined to the main dataset on matching dataset columns, giving you a table-join interface not unlike OBIEE’s physical model editor.
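Conceptually, joining a reference dataset to the main dataset on a matching column works like the following minimal Python sketch. It assumes left-join behaviour (every row of the main dataset is kept), which may not match how BDD treats unmatched rows, and the column names and sample rows are made up.

```python
def left_join(rows, reference_rows, key):
    """Add reference attributes to each row whose key value matches."""
    lookup = {ref[key]: ref for ref in reference_rows}
    joined = []
    for row in rows:
        merged = dict(row)
        # Copy the descriptive attributes from the matching reference row, if any.
        for name, value in lookup.get(row[key], {}).items():
            if name != key:
                merged[name] = value
        joined.append(merged)
    return joined


if __name__ == "__main__":
    # Hypothetical sample rows.
    activity = [
        {"post_id": 1, "url": "/blog/a"},
        {"post_id": 2, "url": "/blog/b"},
    ]
    posts = [{"post_id": 1, "title": "First post", "author": "Someone"}]
    for row in left_join(activity, posts, "post_id"):
        print(row)
```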

NewImage

So now we’re in a position where our datasets have been ingested into BDD, and we’ve cleansed, transformed and joined them into a combined web activity dataset. In tomorrow’s final post I’ll look at the data visualisation part of Big Data Discovery and see how it brings the capabilities of Endeca Information Discovery Studio to Hadoop.

Introducing Oracle Big Data Discovery Part 1: “The Visual Face of Hadoop”

February 24th, 2015 by

Oracle Big Data Discovery was released last week, the latest addition to Oracle’s big data tools suite that includes Oracle Big Data SQL, ODI and its Hadoop capabilities and Oracle GoldenGate for Big Data 12c. Introduced by Oracle as “the visual face of Hadoop”, Big Data Discovery combines the data discovery and visualisation elements of Oracle Endeca Information Discovery with data loading and transformation features built on Apache Spark to deliver a tool aimed at the “Discovery Lab” part of the Oracle Big Data and Information Management Reference Architecture.

NewImage

Most readers of this blog will probably be aware of Oracle Endeca Information Discovery, based on the Endeca Latitude product acquired as part of the Endeca acquisition. Oracle positioned Endeca Information Discovery (OEID) in two main ways; on the one hand as a data discovery tool for textual and unstructured data that complemented the more structured analysis capabilities of Oracle Business Intelligence, and on the other hand, as a fast click-and-refine data exploration tool similar to Qlikview and Tableau.

The problem for Oracle though was that data discovery against files and documents is a bit of a “solution looking for a problem” and doesn’t have a naturally huge market (especially considering the license cost of OEID Studio and the Endeca Server engine that stores and analyzes the data), whereas Qlikview and Tableau are significantly cheaper than OEID (at least at the start) and are more focused on BI-type tasks, making OEID a good tool but not one with a mass market. To address this, whilst OEID will continue as a standalone tool the data discovery and unstructured data analysis parts of OEID are making their way into this new product called Oracle Big Data Discovery, whilst the fast click-and-refine features will surface as part of Visual Analyzer in OBIEE12c.

More importantly, Big Data Discovery will run on Hadoop making it a solution for a real problem – how to catalog, explore, refine and visualise the data in the data reservoir, where data has been landed that might be in schema-on-read databases, might need further analysis and understanding, and users need large-scale tooling to extract the nuggets of information that in time make their way into the “Execution” part of the Big Data and Information Management Reference Architecture. As someone who’s admired the technology behind Endeca Information Discovery but sometimes struggled to find real-life use-cases or customers for it, I’m really pleased to see its core technology applied to a problem space that I’m encountering every day with Rittman Mead’s customers.

NewImage

In this first post, I’ll look at how Big Data Discovery is architected and how it works with Cloudera CDH5, the Hadoop distribution we use with our customers (Hortonworks HDP support is coming soon). In the next post I’ll look at how data is loaded into Big Data Discovery and then cataloged and transformed using the BDD front-end; then finally, we’ll take a look at exploring and analysing data using the visual capabilities of BDD evolved from the Studio tool within OEID. Oracle Big Data Discovery 1.0 is now GA (Generally Available) but as you’ll see in a moment you do need a fairly powerful setup to run it, at least until such time as Oracle release a compact install version running on VM.

To run Big Data Discovery you’ll need access to a Hadoop install, which in most cases will consist of 6 (minimum 3 or 4, but 6 is the minimum we use) to 18 or so Hadoop nodes running Cloudera CDH5.3. BDD generally runs on its own server nodes and itself can be clustered, but for our setup we ran 1 BDD node alongside 6 CDH5.3 Hadoop nodes looking like this:

NewImage

Oracle Big Data Discovery is made up of three component types highlighted in red in the above diagram, two of which typically run on their own dedicated BDD nodes and another which runs on each node in the Hadoop cluster (though there are various install types including all on one node, for demo purposes)

  • The Studio web user interface, which combines the faceted search and data discovery parts of Endeca Information Discovery Studio with a lightweight data transformation capability
  • The DGraph Gateway, which brings Endeca Server search/analytics capabilities to the world of Hadoop, and
  • The Data Processing component that runs on each of the Hadoop nodes, and uses Hive’s HCatalog feature to read Hive table metadata and Apache Spark to load and transform data in the cluster

The Studio component can run across several nodes for high-availability and load-balancing, while the DGraph element can run on a single node as I’ve set it up, or in a cluster with a single “leader” node and multiple “follower” nodes again for enhanced availability and throughput. The DGraph part then works alongside Apache Spark to run intensive search and analytics on subsets of the whole Hadoop dataset, with sample sets of data being moved into the DGraph engine and any resulting transformations then being applied to the whole Hadoop dataset using Apache Spark. All of this then runs as part of the wider Oracle Big Data product architecture, which uses Big Data Discovery and Oracle R for the discovery lab and Oracle Exadata, Oracle Big Data Appliance and Oracle Big Data SQL to take discovery lab innovations to the wider enterprise audience.

NewImage

So how does Oracle Big Data Discovery work in practice, and what’s a typical workflow? How does it give us the capability to make sense of structured, semi-structured and unstructured data in the Hadoop data reservoir, and how does it look from the perspective of an Oracle Endeca Information Discovery developer, or an OBIEE/ODI developer? Check back for the next parts in this three part series where I’ll first look at the data transformation and exploration capabilities of Big Data Discovery, and then look at how the Studio web interface brings data discovery and data visualisation to Hadoop.
