Wednesday, September 9, 2015

Breakdown of Spark 1.5 Improvements


Spark 1.5 was released today. Of the 1,516 Jira tickets that comprise the 1.5 release, I have highlighted a few important ones below, broken down by major Spark component.

Spark Core

  • Project Tungsten
    The first major phase of Project Tungsten lands in this release (aside from a small portion that went into 1.4); see the configuration sketch after this list.
    SPARK-7075
  • Data locality for reduce
    Prior to the Spark 1.2 release, I jumped the gun and announced that this had made it into Spark 1.2. In reality, it didn't land until Spark 1.5.
    SPARK-2774
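
Tungsten is enabled by default in 1.5. As a minimal sketch, assuming the spark.sql.tungsten.enabled flag described in the 1.5 release notes, it can be switched off to compare the old and new code paths:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    // Assumption: spark.sql.tungsten.enabled defaults to true in 1.5;
    // flipping it to false reverts to the pre-Tungsten code paths,
    // which is handy for before/after benchmarking.
    val conf = new SparkConf()
      .setAppName("TungstenToggle")
      .set("spark.sql.tungsten.enabled", "false")

    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)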

MLlib

The developers behind Spark are focusing their efforts on the spark.ml package, which supports pipelines, and are in the long process of transferring all the spark.mllib functionality over to spark.ml. For the Spark 1.5 release, the big exception to this rule is the large set of improvements to LDA in spark.mllib.

spark.ml
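
The spark.ml work in 1.5 builds on the pipeline concept, so as a refresher, here is a minimal sketch of a pipeline, modeled on the standard Tokenizer/HashingTF/LogisticRegression example from the Spark ML guide (the toy data and column names are placeholders):

    import org.apache.spark.ml.Pipeline
    import org.apache.spark.ml.classification.LogisticRegression
    import org.apache.spark.ml.feature.{HashingTF, Tokenizer}

    // Toy training data; sqlContext is the one provided by spark-shell.
    val training = sqlContext.createDataFrame(Seq(
      (0L, "spark is great", 1.0),
      (1L, "who needs hadoop", 0.0)
    )).toDF("id", "text", "label")

    // A pipeline chains feature transformers and an estimator into a single unit.
    val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
    val hashingTF = new HashingTF().setInputCol("words").setOutputCol("features")
    val lr = new LogisticRegression().setMaxIter(10)

    val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

    // Fitting runs the whole pipeline: tokenize, hash, then train.
    val model = pipeline.fit(training)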

spark.mllib

LDA received a ton of upgrades in this release.
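
A minimal sketch of training LDA in spark.mllib; the toy corpus below is a placeholder, and the "online" variational optimizer shown is one of the areas that received attention:

    import org.apache.spark.mllib.clustering.LDA
    import org.apache.spark.mllib.linalg.Vectors

    // Toy corpus: an RDD of (documentId, termCountVector) pairs.
    // sc is the SparkContext provided by spark-shell; the vectors are placeholders.
    val corpus = sc.parallelize(Seq(
      (0L, Vectors.dense(1.0, 2.0, 0.0)),
      (1L, Vectors.dense(0.0, 3.0, 1.0))
    ))

    // "online" selects the online variational optimizer.
    val ldaModel = new LDA()
      .setK(2)
      .setOptimizer("online")
      .run(corpus)

    // Topics are returned as a (vocabSize x k) matrix of term weights.
    println(ldaModel.topicsMatrix)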

GraphX

Other than bug fixes and minor improvements, GraphX did not get upgraded in Spark 1.5. However, one of the spark.mllib functions can now accept a Graph as input.

Spark Streaming

  • New scheduling mechanism
    For example, a job no longer fails if a receiver fails four times, and it is now possible to schedule all of the receivers to run on a particular node (e.g. for rack locality with a Kafka cluster); see the receiver sketch after this list.
    SPARK-8882 and SPARK-7988
  • Dynamic Rate Controller
    While it was previously possible to set a static rate limit in Spark Streaming, Spark 1.5 introduces a dynamic and automatic rate limiter. There is no API; it's just automatic (see the configuration sketch after this list). An API to provide configuration and flexibility did not make it into 1.5.
    SPARK-8834
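
For the receiver placement mentioned above, a custom receiver can publish a location hint by overriding preferredLocation, which the new scheduler takes into account. A minimal sketch (the host name is hypothetical):

    import org.apache.spark.storage.StorageLevel
    import org.apache.spark.streaming.receiver.Receiver

    class PinnedReceiver extends Receiver[String](StorageLevel.MEMORY_AND_DISK_2) {
      // "kafka-edge-01" is a made-up node co-located with the Kafka cluster.
      override def preferredLocation: Option[String] = Some("kafka-edge-01")

      def onStart(): Unit = {
        // Start a background thread that reads data and calls store(...).
      }

      def onStop(): Unit = {
        // Release any resources acquired in onStart().
      }
    }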
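
And for the dynamic rate limiter: since there is no API, turning it on is a single configuration flag (spark.streaming.backpressure.enabled, introduced with SPARK-8834):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Backpressure is off by default in 1.5; this flag enables the dynamic
    // rate controller, which matches ingestion rate to processing speed.
    val conf = new SparkConf()
      .setAppName("BackpressureDemo")
      .set("spark.streaming.backpressure.enabled", "true")

    val ssc = new StreamingContext(conf, Seconds(1))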

Spark SQL

Half the Spark 1.5 Jira tickets concerned Spark SQL, but almost all of them were miscellaneous bug fixes and performance improvements. Two notable exceptions are:
  • Project Tungsten
    Already described above under Spark Core, Project Tungsten is aimed primarily at Spark SQL. The developers behind Spark are targeting their performance improvements first at Spark SQL's DataFrame and only secondarily at plain old RDDs. Ever since the Spark 1.3 release, they have been positioning DataFrame as an eventual replacement of sorts for RDD, for several reasons (a side-by-side sketch follows this list):
    1. Because a DataFrame can be populated by a query, Spark can create a database-like query plan which it can optimize.
    2. Spark SQL allows queries to be written in SQL, which for many may be easier than Scala (especially from the spark-shell REPL).
    3. SQL is more compact than Scala.

  • Data Source API improvement
    The Data Source API allows Spark SQL to connect to external data sources. Nineteen Jira tickets under the SPARK-5180 umbrella constitute the Spark 1.5 improvements, with five more slated for Spark 1.6 under SPARK-9932; a read-side sketch follows this list.
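
To make the DataFrame-versus-SQL contrast above concrete, here is the same query expressed both ways (a minimal sketch; people.json and the column names are placeholders):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc) // sc: an existing SparkContext
    val df = sqlContext.read.json("people.json") // placeholder input file

    // DataFrame API: these calls build a logical plan that Spark can optimize.
    df.filter(df("age") > 21).select("name").show()

    // The same query in SQL: more compact, and convenient from the spark-shell.
    df.registerTempTable("people")
    sqlContext.sql("SELECT name FROM people WHERE age > 21").show()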
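
From the user's side, the Data Source API surfaces as the format/load methods on DataFrameReader. A minimal read-side sketch using the built-in jdbc source (URL, table, and credentials are placeholders):

    import org.apache.spark.sql.SQLContext

    val sqlContext = new SQLContext(sc) // sc: an existing SparkContext

    // The built-in "jdbc" format is one implementation of the Data Source API.
    val jdbcDF = sqlContext.read
      .format("jdbc")
      .option("url", "jdbc:postgresql://dbhost:5432/mydb") // placeholder
      .option("dbtable", "public.people")                  // placeholder
      .option("user", "spark")
      .option("password", "secret")
      .load()

    jdbcDF.show()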
