Creating an Oracle Endeca Information Discovery 2.3 Application Part 1: Scoping and Design

A couple of months ago I put together a five-part series on our blog introducing Oracle Endeca, a recent BI and e-commerce acquisition by Oracle aimed primarily at strengthening their unstructured data and search capabilities. Endeca is particularly interesting to us Oracle BI developers as it's a whole new BI platform, with some areas of overlap with Oracle BI (dashboards, graphs, charts and an ETL tool) and some which provide new capabilities (faceted search, text enrichment, and the hybrid search/analytic database engine called Endeca Server, previously MDEX). Towards the end of that series I put together a brief article on how Endeca Information Discovery applications were put together, and just afterwards Venkat wrote a more detailed article on using Endeca to analyse blog and Twitter feeds. Having worked with the tool for a while now, though, I wanted to look into Endeca application development in a bit more detail, to find out just how you go about creating a typical Endeca Information Discovery system.

So over the next three days, I'm going to go into a bit more detail on the various stages in building a simple Oracle Endeca Information Discovery application, using the recent Endeca Information Discovery 2.3 release that's currently available for download on http://edelivery.oracle.com, with links to the series postings below, updated as each article is posted.

OEID isn't available for download on the OTN website yet as parts of the product (mainly the Liferay Portal that's used as the dashboard framework) can't be distributed using the regular OTN route, so pop over to Edelivery and download either the Windows 64-bit version, or the Linux 64-bit version; both should offer the same functionality, and I went for the Windows version, installing it on the 180-day trial version of Windows Server 2008 64-bit that you can download from Microsoft's website here.

The quickest way to get up and running with OEID 2.3 is to download and install the "Oracle Endeca Information Discovery Quick Start (2.3) for Microsoft Windows x64 (64-bit)" (part no. V32109-01), a 382MB download containing a simple setup guide and an all-in-one installer that gives you the various server elements, and development tools, to put an OEID application together. It also comes with a demo system, also called "quick start", that gives you a sample dashboard, dataset and ETL project that we'll use in these articles to talk about the development process. Once you've installed all of the software, run the data loading routines and loaded up the dashboards, you'll have a demo system that shows off some of the capabilities of the Endeca Information Discovery 2.3 platform.

So how would you go about building an Endeca application like this, and more importantly, why would you choose Endeca Information Discovery to do this, rather than Oracle Business Intelligence Enterprise Edition (OBIEE)? What does Endeca provide that you can't do with OBIEE, and what sort of boundaries are there on an Endeca BI system that you should look for when scoping out an Endeca project? And, most importantly as a developer, where do you get the information you need to get started with the product?

Probably the most important thing to understand before you do anything with Endeca is what it is good at, and what sort of analysis it best supports. Oracle Endeca Information Discovery is positioned by Oracle as a "data discovery" platform, to complement the existing query, reporting, analysis and production reporting capabilities provided by BI Foundation Suite.

What this means in simple terms is that applications built on the Endeca platform are primarily search-led: using the search and guided navigation features on the dashboard, you narrow down the total set of data to the subset you're particularly interested in, then you analyse it using familiar BI objects such as graphs, tables and pivot tables, as well as text-centric visualisations such as word clouds. With query and reporting tools such as OBIEE and Essbase, the assumption is that you already know what data you're interested in, and that data has been heavily structured into a dimensional model with hierarchies to guide your navigation. With Endeca Information Discovery, the assumption is that you're more interested in searching through a wide, loosely-related data set, with no clear drill paths or routes through the data, and where the information you're interested in is just as likely to come from a set of Word documents or call-centre transcripts as a set of structured database tables. As a way of analysing data it's got more in common with tools like QlikTech's QlikView, which also stores data in-memory, links together disparate sets of data and then provides a user interface that's primarily driven by filtering on attribute values.
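
To make the "narrow down, then analyse" idea a bit more concrete, here's a minimal sketch of guided navigation in plain Python. It isn't Endeca's API or how Endeca Server works internally; the records, attribute names and the refine and facet_counts functions are all made up for illustration. The point is just that each selection or search term shrinks the record set, and the counts of the remaining attribute values become the next set of refinements offered to the user.

```python
from collections import Counter

# Hypothetical records: structured attributes plus a free-text comment.
RECORDS = [
    {"region": "EMEA", "product": "Bike", "channel": "Web", "comment": "fast delivery"},
    {"region": "EMEA", "product": "Helmet", "channel": "Store", "comment": "strap broke"},
    {"region": "APAC", "product": "Bike", "channel": "Web", "comment": "brakes squeak"},
    {"region": "APAC", "product": "Bike", "channel": "Store", "comment": "great bike"},
]


def refine(records, selections=None, search=None):
    """Keep records that match every selected attribute value and,
    if given, contain the search term in any of their values."""
    selections = selections or {}
    result = []
    for rec in records:
        # Drop the record if any selected attribute has a different value.
        if any(rec.get(attr) != value for attr, value in selections.items()):
            continue
        # Case-insensitive substring search across all values in the record.
        if search and not any(search.lower() in str(v).lower() for v in rec.values()):
            continue
        result.append(rec)
    return result


def facet_counts(records, attributes):
    """Count the values still present for each attribute; these are the
    refinements a guided-navigation panel would offer next."""
    return {
        attr: Counter(rec[attr] for rec in records if attr in rec)
        for attr in attributes
    }


# Narrow down to bikes, then see which regions and channels remain.
subset = refine(RECORDS, selections={"product": "Bike"})
counts = facet_counts(subset, ["region", "channel"])
```

Note that nothing here depends on a fixed hierarchy or drill path: any attribute can be used as a filter at any point, which is the main contrast with a dimensional model.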

One way of thinking of the difference between Endeca Information Discovery and tools such as OBIEE and Essbase is that Endeca provides "fast answers to new questions", compared to OBIEE and Essbase's "proven answers to known questions". Endeca Information Discovery applications make it easy to bring in new sources of information with minimal upfront time for data modelling, specialising in multiple data sources that may be difficult to relate and may change over time. Most famously, Endeca is able to handle unstructured and semi-structured data sources as well as more traditional structured ones, and stores all of its data in the Endeca Server, a type of database engine somewhere in-between a search engine and an OLAP engine, that holds data for analysis in-memory with a disk-based column-store database for data persistence.

Typically, Oracle Endeca Information Discovery is a good solution when the two following indicators are true:

  • The business can't be sure which questions will matter, because there are too many variables to work out all of the combinations, and/or the business environment is prone to changes that cannot be anticipated fully
  • IT can't be sure which data model will work, because the diversity of the source schemas makes conforming the data to a single model difficult and time-consuming, and/or the sources include unstructured data, and/or the schemas change frequently, requiring costly and time-consuming re-work

So given this, how do we go about putting together an Oracle Endeca Information Discovery application? At a high level, the tasks and steps go something like this:

  1. Source data from one or more data sources, and load them using Oracle Endeca Information Discovery Integrator into an Endeca Server datastore (equivalent to a database)
  2. Refine and fine-tune the datastore by applying business-term labels to attributes, grouping attributes into sets, defining search interfaces and creating views (abstraction layers) to aid analysis
  3. Optionally, use additional Endeca Information Discovery tools to deduce meaning and sentiment from your unstructured/semi-structured data, and bring in additional data from web feeds, Twitter etc
  4. Create your user interface, adding search and guided navigation, and visualisations such as charts, tables, pivot tables and word clouds

The diagram below shows the key components and flows of data in a typical Oracle Endeca Information Discovery application, simplified somewhat in that there are actually server and client parts to the Integrator and Studio elements:

OEID dataflow

Although the headline feature with Endeca is "unstructured data", in fact most of the data that goes into the Endeca Server is structured in some way or another, even if it's just a case of free-form text being tagged with a customer or product ID. In fact, most of the use-cases around OEID have a structured data set (for example, sales figures) enhanced by some semi-structured data, for example comments from Twitter or from a Facebook page. If you're wondering whether a particular set of data is suitable for an Endeca application, this is the advised "sweet spot" in terms of size, diversity and numbers of data sources:

  • Between 1 and 10 data sources, as the complexity of the ETL gets unwieldy beyond a certain point
  • At least one structured source and one un/semi-structured source, such as documents, web feeds, comments and so on
  • 25 or more attributes to use for navigation and search - fewer is fine, but the search/analysis paradigm of the tool works best when there are lots of interesting dimensions and attributes
  • Between 1 and 50 metrics, with up to another 25 derived (calculated) metrics, with the upper limit more to keep the system from becoming unwieldy
  • Between 5m and 50m Endeca Server records - roughly equivalent to source system transactions
  • Data is updated and incremented typically once a day, or even three or four times a day - as Endeca Server data isn't pre-aggregated, there's not the overhead of maintaining aggregates during the data load
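
If you want to sanity-check a candidate project against these guidelines, they can be written down as a simple checklist. The sketch below is only that, a sketch: the ProjectProfile class, its field names and the scoping_warnings function are hypothetical names of my own, the thresholds are just the figures from the list above, and a warning means "think about it" rather than "don't do it".

```python
from dataclasses import dataclass


@dataclass
class ProjectProfile:
    """Hypothetical summary of a candidate project's data."""
    data_sources: int
    structured_sources: int
    unstructured_sources: int  # includes semi-structured sources
    attributes: int
    metrics: int
    derived_metrics: int
    records: int
    loads_per_day: int


def scoping_warnings(p: ProjectProfile) -> list[str]:
    """Return one warning per guideline the profile falls outside of."""
    warnings = []
    if not 1 <= p.data_sources <= 10:
        warnings.append("expected between 1 and 10 data sources")
    if p.structured_sources < 1 or p.unstructured_sources < 1:
        warnings.append("expected at least one structured and one un/semi-structured source")
    if p.attributes < 25:
        warnings.append("fewer than 25 attributes; search and navigation may feel thin")
    if not 1 <= p.metrics <= 50:
        warnings.append("expected between 1 and 50 metrics")
    if p.derived_metrics > 25:
        warnings.append("more than 25 derived metrics")
    if not 5_000_000 <= p.records <= 50_000_000:
        warnings.append("expected between 5m and 50m records")
    if p.loads_per_day > 4:
        warnings.append("more than three or four loads a day")
    return warnings


# A sales data set enhanced with comments from a social media feed.
candidate = ProjectProfile(
    data_sources=3, structured_sources=2, unstructured_sources=1,
    attributes=40, metrics=12, derived_metrics=5,
    records=20_000_000, loads_per_day=1,
)
issues = scoping_warnings(candidate)
```

An empty list back from scoping_warnings means the project sits inside the advised sweet spot on every count.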

The other thing to bear in mind with Oracle Endeca Information Discovery applications is that the reports tend to be pre-defined by the system builder, and then shared and used by the user community. Each report is highly flexible and allows searching and analysis across any attribute that's in scope, but it's not a solution for customers looking to create lots of individual, ad-hoc reports. As we'll see later in the week, developing Endeca Information Discovery Studio applications is quite an interactive, prototyping-type experience, and the line between report developers and end-users is more blurred, but if you're looking to create catalogs of reports, dashboards and other BI objects as opposed to free-form, data exploration-type dashboards, then you're probably best off using OBIEE.

So now we've gone through what Endeca Information Discovery applications are best at - data discovery across disparate sources that can be anything from a set of sales data through to customer comments about products you're offering - how do we go about building an OEID application? Well, the Quickstart application that comes with the Quickstart Installer for OEID 2.3 is a good place to start, and there's also a set of Getting Started with Oracle Endeca Information Discovery YouTube videos, just posted online, that take you through parts of the build process for the Quickstart application so you can get a feel for how the development tools work.

So what we're going to do for the rest of this week is go through some of the stages, referring to the Quickstart application and the YouTube videos for more information, with the aim of covering in a bit more detail what's involved in creating an Oracle Endeca Information Discovery 2.3 application. We'll start tomorrow with probably the most involved step - creating a new Endeca Server datastore and loading data into it using Endeca's ETL tool, Oracle Endeca Information Discovery Integrator.