Day One of an OBIEE Project: Data Modeling

Yesterday I mentioned that I was working this week on an OBIEE proof of concept, and I said at the time that I'd make some notes on how the project went and the methodology I used to pull the system together. As I obviously can't write about the actual client or system I built, I'll use a university BI system as the example, and you can apply this back to your own particular situation.

The work I'm doing was originally planned as two phases:

  1. An initial phase to put the OBIEE data model together, and
  2. A phase where the reports and dashboards were generated.

As there was an expectation that the data might not be in quite the right format for reporting on day one, doing the work in two phases gives the client a week or so between my visits to create additional database tables or anything else that'll make the reporting easier. The plan is for OBIEE to run directly against the source application, which in this case handles applications for the courses that the university offers.

I turned up at 9am and had the initial meeting with the client; after that, I was given the following:

  1. A set of database diagrams showing the main areas of the application schema.
  2. A login to the database, and SQL Developer to query the tables, view the schema and so on.
  3. Some example reports that they currently produce using Excel.
  4. Access to the business analyst and application designers to answer my questions and add a bit of context.
  5. OBIEE installed and configured and ready to work with.

So, as projects go, this was not a bad start. Just to double-check that everything was set up correctly, I logged in to OBIEE BI Administrator with the default Administrator/Administrator login and checked that it all looked OK. I then created an ODBC connection from the server through to the source database, used BI Administrator to read in the table metadata, ran some row counts and generally made sure everything connected up OK.
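Just as a sketch of the sort of sanity checks I run at this point, the queries below use the Oracle data dictionary; the APPLICATIONS table name is purely illustrative rather than the client's actual schema:

```sql
-- Rough row counts after importing metadata. NUM_ROWS is only as fresh
-- as the last statistics gather, so treat it as a guide, not gospel.
SELECT table_name, num_rows
FROM   user_tables
ORDER  BY table_name;

-- Exact count for a single table (table name is illustrative):
SELECT COUNT(*) AS row_count
FROM   applications;
```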

Now that I was confident the software was set up correctly, I took some time out to discuss with the client the area we'd be reporting on (for the purposes of the example, the student registry at a university), and tried to identify the main business processes that made up their part of the organization. OBIEE needs the logical (or business) data model it works on to be a dimensional star schema, with fact tables surrounded by dimension tables. Through these discussions I was therefore trying to establish the facts, which are generally based on business processes (students applying, offers being made, students being admitted, grades being published, courses being attended and so on), and the dimensions, the "actors" and "products" in the business process (in this case students, courses, faculties, sites, periods and so on), with the dimensions then being used to analyze the fact tables. All of this is basic dimensional modeling rather than anything specific to OBIEE, and the deliverable out of this part of the work is a conceptual dimensional model that shows the flow of data from one end of the business process to the other.

Once I'd carried out the first interviews, I logged on to SQL Developer and exported a list of the tables in the source schema into a text file, which I then cut and pasted into Excel. I then went through the table list and categorized each table as a candidate fact table or a candidate dimension table. As a general guide, candidate fact tables contained transactional information (student admissions, the qualifications that students stated on their applications, the offers that were made, the various documents that were sent out during the application process), whilst dimensions contained the lookup or reference information that generally joined to the transaction tables. In most cases, information on students, say, was split over several tables (student_details, student_type, student_address and so on); when we come to use these tables in OBIEE we'll need to denormalize them into a single logical dimension table, as the sketch below shows.
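In OBIEE you'd typically do this denormalization in the logical layer, mapping several physical tables onto one logical dimension table, but a database view makes the idea concrete. Here's a minimal sketch using the three student tables mentioned above; the column and key names are my assumptions, not the client's actual schema:

```sql
-- Denormalize three normalized student tables into one dimension view.
-- Column and key names are illustrative assumptions.
CREATE OR REPLACE VIEW student_dim AS
SELECT sd.student_id,
       sd.first_name,
       sd.last_name,
       st.student_type_desc,
       sa.postcode
FROM   student_details sd
JOIN   student_type    st ON st.student_type_id = sd.student_type_id
JOIN   student_address sa ON sa.student_id      = sd.student_id;
```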

At the end of this exercise, I had a list of candidate fact and dimension tables, which I then matched back to the notes I'd made when I spoke to the business analyst. This gave me my initial fact and dimension tables to build in OBIEE BI Administrator, some of which would map one-to-one with the source tables, and some of which would be a combination (denormalization) of several source tables. This, in a way, is the conceptual, theoretical dimensional model that represents the business and its business processes, and could be used along with OBIEE to provide general ad-hoc reporting for the client's application. To take one example in particular, the candidate "applications" fact table contained a count of applications and the grade score of the applicant, and links through to a number of dimension tables including application status, which could be set to received, in processing, rejected, offer sent, in clearing and so on. A diagram of this fact and its dimensions would look like this:

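To make the star concrete, this is the sort of query the model would support; every table and column name below is an illustrative assumption rather than the client's actual schema:

```sql
-- Illustrative star query: applications by status and faculty.
SELECT st.status_desc,
       fc.faculty_name,
       SUM(app.application_count) AS applications,
       AVG(app.grade_score)       AS avg_grade_score
FROM   applications_fact      app
JOIN   application_status_dim st ON st.status_key  = app.status_key
JOIN   faculty_dim            fc ON fc.faculty_key = app.faculty_key
GROUP  BY st.status_desc, fc.faculty_name;
```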
As I said, so far, so good. The next stage was to take a look at the reports that the customer needed, and this was the interesting part: the first report, the most important report that they needed, wasn't a listing of student types by faculty with application status, say, as a page item. It was a very long tabular report with a number of measures across the top, reporting on applications at various stages and over varying time periods, something like this:

Now if you've tried to produce this report with the sort of dimensional model I've got above, you'll probably have spotted the problem. The columns across the report represent applications at different stages, and you can obtain any one measure by selecting the correct value from the application status dimension and totalling up the application count. In this report, though, you've got the same measure displayed over and over again, not for all application statuses but for an arbitrary selection of them. Moreover, some of the measures are for one week, some are for a quarter and some are year-on-year comparisons. With an OLAP server and OLAP query tool such as Essbase and Web Analyzer, making these selections of measures by arbitrary application statuses and time periods is straightforward, but for relational query tools such as OBIEE it's more of a problem, especially if you've got more than one measure and each measure has an arbitrary selection of statuses and time periods.

The way I've got around this on other OBIEE projects is, firstly, to break out each measure by application status and list each breakout in the logical fact table, and secondly, to create additional measures that take your base measures and calculate them over specific time periods, such as month-to-date, year-to-date, year-on-year comparison, and so on. This has the drawback of "exploding" your fact table model from one or two measures to one per status and time period, but it gives you a particular column you can select for each combination of measure, period and status, with the complexity of populating them handled by SQL DECODEs and the Time Series (AGO and TODATE) calculations you now get with OBIEE 10.1.3.2 onwards. What this all leads to is a logical model that looks something like this:

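As a rough sketch of the DECODE part, here's how a single application count might be broken out into one column per status; the status codes and table name are my assumptions:

```sql
-- One column per application status. DECODE returns 1 when the row's
-- status matches and 0 otherwise, so each SUM becomes a per-status count.
SELECT application_date,
       SUM(DECODE(status_code, 'RECEIVED',   1, 0)) AS apps_received,
       SUM(DECODE(status_code, 'OFFER_SENT', 1, 0)) AS offers_sent,
       SUM(DECODE(status_code, 'REJECTED',   1, 0)) AS apps_rejected
FROM   applications_fact
GROUP  BY application_date;
```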
So breaking the measures out like this gives me the ability to list applications by status across the report page, and once I've got these base measures I can use the time series calculation abilities of OBIEE to calculate totals over different time periods. And that's where I stopped for the day. Tomorrow I'll finish off the model and produce the first iteration of the reports, which is interesting, as the original plan was to create the data model in the first week and the reports in the second; in reality, though, you can't really produce the data model to any sort of final degree until you've seen how the data is used in reports. It's because of issues like this that I try to encourage clients to build their initial OBIEE data model against source data in place, to drive out all these reporting "oddities", before committing the data model to a data warehouse and doing all the work around ETL. It's very easy to change the model around at this stage as it's only "virtual"; if you've just spent six months populating a data warehouse of this design, though, and then you find you need to break out the measures in this way, it can be a bit annoying.
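For the time-period versions of each measure, OBIEE's time series functions do the work as logical column expressions in BI Administrator rather than as physical SQL; the logical table, column and level names below are illustrative:

```sql
-- OBIEE logical column expressions, not physical SQL.
-- The same measure, shifted back one year (for year-on-year comparisons):
AGO("Fact Applications"."Apps Received", "Time Dim"."Year", 1)

-- Year-to-date running total of the same measure:
TODATE("Fact Applications"."Apps Received", "Time Dim"."Year")
```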

Anyway, as I said, that's it for today. Tomorrow I'll be putting in place all the DECODEs, time-series queries and so on, and producing the first reports. After that I'll be building out the logical model, adding some additional fact tables and rounding off this stage of the logical model build.