Going Beyond MapReduce for Hadoop ETL Pt.3 : Introducing Apache Spark

Mark Rittman

Dec 9, 2014 • 7 min read

In the first two posts in this three part series on going beyond MapReduce for Hadoop ETL, we looked at why MapReduce and Hadoop 1.0 was only really suitable for batch processing, and how the new Apache Tez framework enabled by Apache YARN on the Hadoop 2.0 platform can be swapped-in for MapReduce to improve the performance of existing Pig and Hive scripts. Today though in the final post I want to take a look at Apache Spark, the next-gen compute framework that Cloudera are backing as the long-term successor to MapReduce.

Like Tez, Apache Spark supports DAGs that describe the entire dataflow process, not just individual map and reduce jobs. Like Pig, it has a concept of datasets (Pig’s aliases and relations), but crucially these datasets (RDDs, or “resilient distributed datasets”) can be cached in-memory, fail-back gracefully to disk and can be rebuilt using a graph that says how to reconstruct. With Tez, individual jobs in the DAG can now hand-off their output to the next job in-memory rather than having to stage in HDFS, but Spark uses memory for the actual datasets and is a much better choice for the types of iterative, machine-learning tasks that you tend to do on Hadoop systems. Moreover, Spark has arguably a richer API and when used with Scala, a functional programming-orientated language that uses Java libraries and whose collections framework maps well on to the types of operations you’d want to make use of with dataflow-type applications on a cluster.

Spark can run standalone, on YARN or on other cluster management platforms, and comes with a handy command-line interpreter that you can use to interactively load, filter, analyse and work with RDDs. Cloudera CDH5.2 comes with Spark 1.0.1 and can either be configured standalone or to run on YARN, with Spark as a service added to nodes in the cluster using Cloudera Manager.

So looking back at the Pig example, we create the dataflow using a number of aliases in that case, that we progressively filter, transform, join together and then aggregate to get to the final top ten set of pages from the website logs. Translating that dataflow to Spark we end up with a similar set of RDDs that take our initial set of logs, apply transformations and join the datasets to store the final aggregated output back on HDFS.

Spark supports in-memory sharing of data within a single DAG (i.e. RDD to RDD), but also between DAGs running in the same Spark instance. As such, Spark becomes a great framework for doing iterative and cyclic data analysis, and can make much better use of the memory on cluster servers whilst still using disk for overflow data and persistence.

Moreover, Spark powers a number of higher-level tools build on the core Spark engine to provide features like real-time loading and analysis (Spark Streaming), SQL access and integration with Hive (Spark SQL), machine learning (MLib) and so forth. In fact, as well as Hive and Pig being reworked to run on Tez there’s also projects underway to port them both to Spark, though to be honest they’re both at early stages compared to Tez integration and most probably you’ll be using Scala, Java or Python to work with Spark now.

So taking the Pig script we had earlier and translating that to the same dataflow in Spark and Scala, we end up with something like this:

package com.cloudera.analyzeblog
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.sql.SQLContext
case class accessLogRow(host: String, identity: String, user: String, time: String, request: String, status: String, size: String, referer: String, agent: String)
case class pageRow(host: String, request_page: String, status: String, agent: String)
case class postRow(post_id: String, title: String, post_date: String, post_type: String, author: String, url: String, generated_url: String)
object analyzeBlog {
        def getRequestUrl(s: String): String = {
        try {
                s.split(' ')(1)
        } catch {
                case e: ArrayIndexOutOfBoundsException => { "N/A" }
        }
}
        def main(args: Array[String]) {
val sc = new SparkContext(new SparkConf().setAppName("analyzeBlog"))
val sqlContext = new SQLContext(sc)
import sqlContext._
val raw_logs = "/user/mrittman/rm_logs"
//val rowRegex = """^([0-9.]+)\s([\w.-]+)\s([\w.-]+)\s(\[[^\[\]]+\])\s"((?:[^"]|\")+)"\s(\d{3})\s(\d+|-)\s"((?:[^"]|\")+)"\s"((?:[^"]|\")+)"$""".r
val rowRegex = """^([\d.]+) (\S+) (\S+) \[([\w\d:/]+\s[+\-]\d{4})\] "(.+?)" (\d{3}) ([\d\-]+) "([^"]+)" "([^"]+)".*""".r

val logs_base = sc.textFile(raw_logs) flatMap {
                        case rowRegex(host, identity, user, time, request, status, size, referer, agent) =>
                                Seq(accessLogRow(host, identity, user, time, request, status, size, referer, agent))
                        case _ => Nil
                                }
val logs_base_nobots = logs_base.filter( r => ! r.request.matches(".*(spider|robot|bot|slurp|bot|monitis|Baiduspider|AhrefsBot|EasouSpider|HTTrack|Uptime|FeedFetcher|dummy).*"))

val logs_base_page = logs_base_nobots.map { r =>
  val request = getRequestUrl(r.request)
  val request_formatted = if (request.charAt(request.length-1).toString == "/") request else request.concat("/")
  (r.host, request_formatted, r.status, r.agent)
}

val logs_base_page_schemaRDD = logs_base_page.map(p => pageRow(p._1, p._2, p._3, p._4))

logs_base_page_schemaRDD.registerAsTable("logs_base_page")

val page_count = sql("SELECT request_page, count(*) as hits FROM logs_base_page GROUP BY request_page").registerAsTable("page_count")

val postsLocation = "/user/mrittman/posts.psv"

val posts = sc.textFile(postsLocation).map{ line =>
        val cols=line.split('|')

        postRow(cols(0),cols(1),cols(2),cols(3),cols(4),cols(5),cols(6).concat("/"))
}

posts.registerAsTable("posts")

val pages_and_posts_details = sql("SELECT p.request_page, p.hits, ps.title, ps.author FROM page_count p JOIN posts ps ON p.request_page = ps.generated_url ORDER BY hits DESC LIMIT 10")

pages_and_posts_details.saveAsTextFile("/user/mrittman/top_10_pages_and_author4")

        }
}

I’ll do a code-walkthrough for this Spark application in a future post, but for now note the map and flatMap Scala collection functions used to transform RDDs, and the sql(“…”) function that allows us to register RDDs as tables and then manipulate the contents using SQL, including joining to other RDDs registered as tables. For now though, let’s run the application on the CDH5.2 using YARN and see how long it takes to process the same set of log files (remember, the Pig script on this CDH5.2 cluster took around 5 minutes to run, and the Pig on Tez version on the Hortonworks cluster was around 2.5 minutes:

[mrittman@bdanode1 analyzeBlog]$ spark-submit --class com.cloudera.analyzeblog.analyzeBlog --master yarn target/analyzeblog-0.0.1-SNAPSHOT.jar 
SLF4J: Class path contains multiple SLF4J bindings.
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/jars/spark-assembly-1.1.0-cdh5.2.0-hadoop2.5.0-cdh5.2.0.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: Found binding in [jar:file:/opt/cloudera/parcels/CDH-5.2.0-1.cdh5.2.0.p0.36/jars/slf4j-log4j12-1.7.5.jar!/org/slf4j/impl/StaticLoggerBinder.class]
SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
14/12/06 19:18:25 INFO SecurityManager: Changing view acls to: mrittman
14/12/06 19:18:25 INFO SecurityManager: Changing modify acls to: mrittman
...
14/12/06 19:19:41 INFO DAGScheduler: Stage 0 (takeOrdered at basicOperators.scala:171) finished in 3.585 s
14/12/06 19:19:41 INFO SparkContext: Job finished: takeOrdered at basicOperators.scala:171, took 53.591560036 s
14/12/06 19:19:41 INFO SparkContext: Starting job: saveAsTextFile at analyzeBlog.scala:56
14/12/06 19:19:41 INFO DAGScheduler: Got job 1 (saveAsTextFile at analyzeBlog.scala:56) with 1 output partitions (allowLocal=false)
14/12/06 19:19:41 INFO DAGScheduler: Final stage: Stage 3(saveAsTextFile at analyzeBlog.scala:56)
14/12/06 19:19:41 INFO DAGScheduler: Parents of final stage: List()
14/12/06 19:19:41 INFO DAGScheduler: Missing parents: List()
14/12/06 19:19:41 INFO DAGScheduler: Submitting Stage 3 (MappedRDD[15] at saveAsTextFile at analyzeBlog.scala:56), which has no missing parents
14/12/06 19:19:42 INFO MemoryStore: ensureFreeSpace(64080) called with curMem=407084, maxMem=278302556
14/12/06 19:19:42 INFO MemoryStore: Block broadcast_5 stored as values in memory (estimated size 62.6 KB, free 265.0 MB)
14/12/06 19:19:42 INFO MemoryStore: ensureFreeSpace(22386) called with curMem=471164, maxMem=278302556
14/12/06 19:19:42 INFO MemoryStore: Block broadcast_5_piece0 stored as bytes in memory (estimated size 21.9 KB, free 264.9 MB)
14/12/06 19:19:42 INFO BlockManagerInfo: Added broadcast_5_piece0 in memory on bdanode1.rittmandev.com:44486 (size: 21.9 KB, free: 265.3 MB)
14/12/06 19:19:42 INFO BlockManagerMaster: Updated info of block broadcast_5_piece0
14/12/06 19:19:42 INFO DAGScheduler: Submitting 1 missing tasks from Stage 3 (MappedRDD[15] at saveAsTextFile at analyzeBlog.scala:56)
14/12/06 19:19:42 INFO YarnClientClusterScheduler: Adding task set 3.0 with 1 tasks
14/12/06 19:19:42 INFO TaskSetManager: Starting task 0.0 in stage 3.0 (TID 215, bdanode5.rittmandev.com, PROCESS_LOCAL, 3331 bytes)
14/12/06 19:19:42 INFO BlockManagerInfo: Added broadcast_5_piece0 in memory on bdanode5.rittmandev.com:13962 (size: 21.9 KB, free: 530.2 MB)
14/12/06 19:19:42 INFO TaskSetManager: Finished task 0.0 in stage 3.0 (TID 215) in 311 ms on bdanode5.rittmandev.com (1/1)
14/12/06 19:19:42 INFO YarnClientClusterScheduler: Removed TaskSet 3.0, whose tasks have all completed, from pool 
14/12/06 19:19:42 INFO DAGScheduler: Stage 3 (saveAsTextFile at analyzeBlog.scala:56) finished in 0.312 s
14/12/06 19:19:42 INFO SparkContext: Job finished: saveAsTextFile at analyzeBlog.scala:56, took 0.373096676 s

It ran in just over a minute in the end, and most of that was around submitting the job to YARN - not bad. We’ll be covering more of Spark on the blog over the next few weeks including streaming and machine learning examples, and connecting it to ODI and OBIEE via Hive on Spark, and Spark SQL’s own Hive-compatible Thrift server. I’ll also be taking a look at Pig on Spark (or “Spork”…) to see how well that works, and most interestingly how Pig and Hive on Spark compares to running them on Tez - watch this space as they say.

Sign up for more like this.