Oracle 12.1.0.2 and Data Warehouses

September 1st, 2014 by

If you follow blogs and tweets from the Oracle community you won’t have missed hearing about the recent release of the first patch-set for Oracle 12c. With this release there are some significant pieces of new functionality that will be of interest to Data Warehouse DBAs and architects. The headline feature that most Oracle followers will know of is the new in-memory option. In my opinion this is a game-changer for how we design reporting architectures; it gives us an effective way to build operational reporting over the reference data architecture described by Mark Rittman a few weeks ago. Of course, the database team here at Rittman Mead have been rolling up our sleeves and getting into in-memory technology for quite a while now; Mark even featured in the official launch presentation by Larry Ellison with the now famous “so easy it’s boring” quote. Last week Mark published the first of our Rittman Mead in-memory articles, with the promise of more to come, including my article for the next edition of UKOUG’s “Oracle Scene”.

However, the in-memory option is not the only new feature that is going to be a benefit to us in the BI/DW world. One of the new features I am going to describe is Exadata only, but the first one I am going to mention is generally available in the 12.1.0.2 database.

Typically, data warehouse queries are different from those seen in the OLTP world: in DW we tend to access a large number of rows and probably aggregate things up to answer some business question. Often we are not using indexes; instead, scanning tables or table partitions is the norm. Usually, the data we need to aggregate is widely scattered across the table or partition. Data Warehouse queries often look at records that share a set of common attributes; we look at the sales for the ‘ACME’ widget or the value of items shipped to Arizona. For us there can be a great advantage if data we use together is stored together, and this is where Attribute Clustering can play a part.

Attribute Clustering is usually configured on the table at DDL time and, in effect, controls the ordering of data inserted by direct path operations. Oracle does not enforce this ordering for conventional inserts. This may not be an issue in data warehouses, as bulk batch operations typically use APPEND inserts (which are direct path inserts) or partition operations, but it may be more of an issue with some of the real-time conventional path loading paradigms. In addition to direct path load operations, Attribute Clustering can also occur when you do ALTER TABLE MOVE type operations (this also includes operations such as PARTITION SPLIT). On the surface, Attribute Clustering sounds little different to using an ORDER BY on an append insert and hoping that Oracle actually stores the data where you expect it to. However, Attribute Clustering gives us two other possibilities in how we can order the data in the cluster.

Firstly, we can cluster on columns from JOINED dimension tables. For example, in a SALES DW we may have a sales fact with a product key at the SKU level, but we often join to the product dimension and report at the Product Category level. In this case we can cluster our sales fact table so that each product category appears in the same cluster. For example, we have just opened a chain of supermarkets with a wide but uninspiring range of brands and products (see the tiny piece of our product dimension table below).

[Image: a small extract of the product dimension table]

As you can see, our Product PK has no relationship at all to the type of product being sold. In our Kimball-style data warehouse we typically store the product key on the fact table and join to the product dimension to obtain all of the other product attributes and hierarchy members. This is essentially what we can do with join Attribute Clustering. In our example we can cluster our fact table on PRODUCT_CATEGORY so that all of the Laundry sales are physically close to each other in the fact table.

Notice we are clustering on a join to the product dimension table’s “product_category” column. We are also clustering on sales_date; this is especially important in the case of partitioned fact tables so that the benefits of clustering align to the partitioning strategy. We are also not restricted in our clustering to just one join; if we wanted to, we could also cluster our sales by store region, e.g. so that the Colorado laundry product sales are located in the same area of the sales table. To use Join Attribute Clustering we need to define the PK / FK relationships between fact and dimension; however, it is always good practice to have that in place as it helps the CBO so much with query plan evaluation.

Secondly, notice the BY LINEAR ORDER clause in the table DDL. Of the two ordering options, Linear Order is the most basic form of clustering. In this case we have our data structured so that all the items for a sales day are clustered together, and within that cluster we order by product category, and those categories are in turn ordered by store_id. The other way we can cluster is BY INTERLEAVED ORDER; here, Oracle maps a combination of dimensional values to a single cluster value using a z-order curve fitting approach. This sounds complex but it ensures that items that are frequently queried together are co-located in the disk blocks in the storage.
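To make the difference between the two orderings a little more concrete, here is a small Python sketch. It is only an illustration of the general z-order (Morton code) idea, not Oracle's actual internal mapping, which I am not describing here. The function name interleave_bits and the toy (date_id, category_id) rows are my own assumptions for the example.

```python
def interleave_bits(x: int, y: int, bits: int = 8) -> int:
    """Interleave the low `bits` bits of x and y into one z-order (Morton) key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)       # bits of x go to even positions
        key |= ((y >> i) & 1) << (2 * i + 1)   # bits of y go to odd positions
    return key


# Toy fact rows as (date_id, category_id) pairs; hypothetical data.
rows = [(d, c) for d in range(4) for c in range(4)]

# Linear order: sort by date first, then by category within each date.
linear = sorted(rows)

# Interleaved order: sort by the single z-order key built from both columns.
interleaved = sorted(rows, key=lambda r: interleave_bits(r[0], r[1]))

# Pretend that four rows fit in one "block" and compare the first block.
print("linear first block:     ", linear[:4])
print("interleaved first block:", interleaved[:4])
```

In the linear case the first block holds a single date but every category, so it only helps queries that filter on date. In the interleaved case the first block holds a narrow range of both dates and categories, so a filter on either column (or both) can skip more of the table.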

Interleaved ordering is probably the best choice for data warehousing as it aligns well with how we access data in our queries. Although we could include all of the dimension keys in our ordering list, it is going to be of more benefit to just include a subset of dimensions; typically for retail I’d go with DATE (or something that correlates to the time-based partition key of the fact table), the product and the store. Of course we can again join to the dimension tables and cluster at higher hierarchy levels such as product category and store region. The Oracle 12c Data Warehousing guide gives some good advice, but you can’t go far wrong if you think about clustering items together that will be queried together.

Clustering data can give us some advantages in performance. Better data compression and improved index range scans spring to mind, but to get the most benefit we should also look at another new feature, Zone Maps. Unlike Attribute Clustering, Zone Maps are Engineered Systems only. In a way they are similar to the storage indexes already found on Exadata, but they have some additional advantages. They are also somewhat different from the zone maps encountered in other DB vendors’ products such as Netezza.

In Exadata, a storage index can provide the maximum and minimum values encountered for a column in a storage cell. I say “can” as there is no guarantee that the range for a given column is held in the storage index. Zone Maps, on the other hand, will always provide maxima and minima for all of the columns specified at zone map creation. The zone map is orientated in terms of contiguous database blocks and is materialized, so that it is physically persisted in the database and thus survives DB startups. Like materialized views, materialized zone maps can become stale and need to be maintained.

We can define a zone map on one or more table columns and, just like Attribute Clustering, we may also create zone maps on table joins. As a table can only have one zone map it is important to include all of the columns you wish to track. Zone Maps are designed to work well with Attribute Clustering; in fact it is just a simple DDL statement to add a zone map to an Attribute Clustered table so that the zone map tracks the same attributes as the clustering. This is where we get the major performance boost from Attribute Clustering: instead of looking at the whole table, the zone map tells us which ranges of database blocks contain data that matches our query predicates.
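The way clustering and zone maps reinforce each other can be shown with a toy Python model. This is a conceptual sketch only: the Zone class, build_zone_map and zones_to_scan are names I have made up for illustration, a "zone" here is simply a run of consecutive list entries standing in for a range of contiguous database blocks, and none of it reflects how Oracle implements the feature internally.

```python
from dataclasses import dataclass


@dataclass
class Zone:
    first_row: int  # position of the first row in the zone
    last_row: int   # position of the last row in the zone
    lo: int         # minimum column value seen in the zone
    hi: int         # maximum column value seen in the zone


def build_zone_map(values, zone_size):
    """Record the min and max of the column for each run of `zone_size` rows."""
    zones = []
    for start in range(0, len(values), zone_size):
        chunk = values[start:start + zone_size]
        zones.append(Zone(start, start + len(chunk) - 1, min(chunk), max(chunk)))
    return zones


def zones_to_scan(zone_map, lo, hi):
    """Keep only zones whose [min, max] range overlaps the predicate range."""
    return [z for z in zone_map if z.hi >= lo and z.lo <= hi]


# The same twelve column values, stored unordered and then clustered (sorted).
unclustered = [3, 9, 1, 7, 2, 8, 0, 6, 5, 4, 11, 10]
clustered = sorted(unclustered)

zm_unclustered = build_zone_map(unclustered, 4)
zm_clustered = build_zone_map(clustered, 4)

# Predicate: column value BETWEEN 5 AND 6
print(len(zones_to_scan(zm_unclustered, 5, 6)), "zones to scan when unclustered")
print(len(zones_to_scan(zm_clustered, 5, 6)), "zones to scan when clustered")
```

With the unordered data every zone has a wide min/max range, so nothing can be skipped; once the data is clustered the ranges barely overlap and only one zone needs to be read.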

Zone Maps with Attribute Clustering give us another powerful tool to boost DW performance on Exadata: we can do star queries without resorting to bitmap indexes, and we minimise IO when scanning fact tables as we only need to look where we know the data to be. Exciting times!


Comments

  1. Dustin Says:

    I don’t come from the Oracle world, but I like your article. However, did you know that these concepts of Attribute Clustering and Zone Maps have been in use for years by Netezza/IBM PureData System for Analytics? Believe it or not, the terminology is very much the same (DISTRIBUTE ON and ZONE MAPS), and the technology proved to be the cornerstone of why Netezza outperformed Oracle in our head-to-head presales testing with user-provided data.

    I just wonder how the work of executing a query will get subdivided across the hardware that serves as the back end for an Oracle implementation. Better yet, what sort of maintenance will be required to support the Attribute Clustering and Zone Mapping in Oracle?

    I mean you can copy the terminology and even some of the technology of Netezza, but the amount of query acceleration that is a direct result of the hardware layer being purposefully integrated with the software is something I don’t think you can do as easily, nor can it be repeated and require such little effort to maintain.

    Netezza caused a tidal shift in the way RDBMSs are packaged and how they deal with data. That tide seems to be flowing in the opposite direction now that the competition has adopted some of Netezza’s secret sauce (the opposite is true as well).

    Anyway, a response is appreciated. Thanks.

  2. Peter Scott Says:

    @Dustin
    I did mention Netezza in the blog.
    Before I joined Rittman Mead in 2007, I used to work for a major global SI, where we partnered with Netezza on a Teradata conversion, so I am well aware of their pioneering work on data distribution and zone maps to access data smarter.

    Netezza and Oracle take radically different approaches to how storage relates to processing and how to scale systems; the whole architecture is different and not just the parts with similar names. For example, you ask how a query gets sub-divided across hardware: that is more of a shared-nothing architecture issue than one for Oracle with its parallel processing, shared storage architecture. Storage indexes and zone maps go some way towards telling the query engine where NOT to look, but they are just part of the picture.

    Yes, there could be the need for attribute clustering and zone map maintenance in Oracle: the clustering applies to direct path inserts only, so if any updates or row-based inserts occur there is a possibility that the row is not in the correct cluster location. Likewise, adding new data to a zone is likely to affect the zone maxima and minima; this will make the zone map stale and thus not used. However, just as with their structural cousin the materialized view, we can partition a zone map so that only subsections of the zone map can be stale. It is a simple DB call to refresh a zone map.
    As always with such features you need to decide if they are right for you, and then implement a sound design to exploit them.

    Knowing what features do and their limitations is key on any platform (including Netezza).
