April 29th, 2014 by Mark Rittman
It’s just over a week until the first of the two Rittman Mead BI Forum 2014 events, which take place in Brighton and Atlanta, and one of the highlights for me will be the Hadoop Masterclass given by Cloudera’s Lars George. Hadoop and Big Data are becoming increasingly important to the worlds of Oracle BI and data warehousing, so this is an excellent opportunity to learn some basics, get some tips from an expert in the field, and then use the rest of the week to relate it all back to core OBIEE, ODI and Oracle Database.
Lars’ masterclass runs on the Wednesday before each event: on May 7th at the Brighton Hotel Seattle and then the week after, on Wednesday 14th May, at the Renaissance Atlanta Midtown Hotel. Attendance for the masterclass is just £275 + VAT for the Brighton event, or $500 for the Atlanta event, but you can only book it as part of the overall BI Forum, so book now while there are still places left! In the meantime, here are the details of what’s in the masterclass:
Session 1 – Introduction to Hadoop
The first session of the day sets the stage for the following ones. We will look into the history of Hadoop, where it comes from, and how it made its way into the open-source world. This overview also covers the basic building blocks of Hadoop: the file system, HDFS, and the batch processing system, MapReduce.
Then we will look into the flourishing ecosystem that is now the larger Hadoop project, with its various components and how they help form a complete data processing stack. We will also briefly look at how today’s Hadoop-based distributions tie the various components together in a reliable manner, based on predictable release cycles.
Finally we will pick up the topic of cluster design and planning, talking about the major decision points when provisioning a Hadoop cluster. This includes the hardware considerations depending on specific use-cases as well as how to deploy the framework on the cluster once it is operational.
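For readers who have not met the MapReduce model mentioned above, here is a minimal sketch of the idea in plain Python. It is not Hadoop code and is not taken from the masterclass material; the function names (`map_phase`, `shuffle`, `reduce_phase`, `word_count`) are made up for illustration, and everything runs in a single process rather than across a cluster.

```python
from collections import defaultdict


def map_phase(line):
    """Emit a (word, 1) pair for every word in one input line."""
    for word in line.lower().split():
        yield word, 1


def shuffle(pairs):
    """Group all emitted values by key, as the framework does between phases."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(word, counts):
    """Collapse all the values collected for one key into a single result."""
    return word, sum(counts)


def word_count(lines):
    """Run map, shuffle and reduce over an in-memory list of lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(word, counts) for word, counts in shuffle(pairs).items())


if __name__ == "__main__":
    print(word_count(["big data", "big cluster"]))
```

In a real cluster the map and reduce steps run in parallel on many nodes over blocks of files stored in HDFS, and the framework handles the grouping step in between.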
Session 2 – Ingress and Egress
The second session dives into the use of the platform as part of an Enterprise Data Hub, i.e. the central storage and processing location for all of the data in a company (large, medium, or small). We will discuss how data is acquired into the data hub and provisioned for further access and processing. Various tools allow data to be imported from sources ranging from single-event-based systems to relational database management systems.
As data is stored, the user has to decide how to store it for further processing, since that choice can affect performance considerably. State-of-the-art data processing pipelines usually take a hybrid approach that combines lightweight LT (no E for “extract” needed), i.e. transformations, with optimised data formats as the final location for fast subsequent processing. Continuous and reliable data collection is vital for productionising the initial proof-of-concept pipelines.
Towards the end we will also look at the lower level APIs available for data consumption, rounding off the set of available tools for a Hadoop practitioner.
Session 3 – NoSQL and Hadoop
For certain use-cases there is an inherent need for something more “database”-like than the features offered by the original Hadoop components, i.e. file system and batch processing. Especially for slowly changing dimensions, and entities in general, there is a need to update previously stored data as time progresses. This is where HBase, the Hadoop Database, comes in: it allows random reads and writes to existing rows of data, or entities, in a table.
We will dive into the architecture of HBase to derive the need for proper schema design, one of the key tasks when implementing an HBase-backed solution. As with the file formats from session 2, HBase lets you design table layouts freely, and a poor choice can lead to suboptimal performance. This session will introduce the major access patterns observed in practice and explain how they can play to HBase’s strengths.
Finally, a set of real-world examples will show how fellow HBase users (e.g. Facebook) have gone through various modifications of their schema design before arriving at their current setup. Available open-source projects show further schema designs that will help in coming to terms with this involved topic.
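To give a flavour of why schema design matters, here is a small sketch of one commonly described row key pattern. HBase keeps rows sorted by the bytes of their row key, so a key made of an entity id followed by a reversed timestamp keeps all rows for one entity together, newest first. This is an illustrative assumption rather than a design from the masterclass: `row_key`, `MAX_TS` and the zero-byte separator are hypothetical choices, and the code only builds the key bytes without talking to HBase.

```python
import struct

# Largest signed 64-bit value, used to reverse the timestamp (assumed convention).
MAX_TS = 2**63 - 1


def row_key(entity_id: str, timestamp_ms: int) -> bytes:
    """Build entity id + separator + reversed big-endian timestamp."""
    reversed_ts = MAX_TS - timestamp_ms
    # Big-endian packing makes byte-wise order match numeric order
    # for the non-negative values produced here.
    return entity_id.encode("utf-8") + b"\x00" + struct.pack(">q", reversed_ts)


if __name__ == "__main__":
    older = row_key("user42", 1_398_000_000_000)
    newer = row_key("user42", 1_399_000_000_000)
    # Sorting the raw bytes puts the newer event first for the same entity.
    print(sorted([older, newer]) == [newer, older])
```

With keys like these, a scan that starts at the entity id prefix would meet the most recent rows first, which suits "latest N events for this entity" access patterns.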
Session 4 – Analysing Big Data
The final session of the day tackles the processing of data, since so far we have learned mostly about the storage and preparation of data for subsequent handling. We will look into the existing frameworks atop Hadoop and how they offer distinct (but sometimes also overlapping) functionality. There are frameworks that run as separate instances, but also higher-level abstractions on top of those, which help developers and data wranglers of all kinds find their weapon of choice.
Using everything learned so far, the user will then see how the various tools can be combined to build the promised reliable data processing pipelines, landing data in the Enterprise Data Hub and using automation to start the subsequent processing without any human intervention. The session closes with a look at the external interfaces, such as JDBC/ODBC, that enable the computed and analysed data to be visualised in appealing UI-based tools.
- Session 1 – Introduction to Hadoop
  - Introduction to Hadoop
    - Explain pedigree, history
    - Explain and show HDFS, MapReduce, Cloudera Manager
  - The Hadoop Ecosystem
    - Show need for other projects within Hadoop
    - Ingress, egress, random access, security
  - Cluster Design and Planning
    - Introduce concepts on how to scope out a cluster
    - Typical hardware configuration
    - Deployment methods
- Session 2 – Ingress and Egress
  - Explain Flume, Sqoop to load data record based or in bulk
  - Data formats and serialisation
    - SequenceFile, Avro, Parquet
  - Continuous data collection methods
  - Interfaces for data retrieval (lower level)
- Session 3 – NoSQL and Hadoop
  - HBase Introduction
  - Schema design
  - Access patterns
  - Use-case examples
- Session 4 – Analysing Big Data
  - Processing frameworks
    - Explain and show MapReduce, YARN, Spark, Solr
  - High level abstractions
    - Hive, Pig, Crunch, Impala, Search
  - Data pipelines in Hadoop
    - Explain Oozie, Crunch
  - Access to data for existing systems