Spark 1.5 was released today. Of the 1,516 Jira tickets that comprise the 1.5 release, I have highlighted a few important ones below, broken down by major Spark component.
Spark Core
Project Tungsten
Spark 1.5 delivers the first major phase of Project Tungsten (aside from a small portion that went into 1.4).
Data locality for reduce
Prior to the Spark 1.2 release, I jumped the gun and announced this made it into Spark 1.2. In reality, it didn't make it in until Spark 1.5.
MLlib
The developers behind Spark are focusing their efforts on the spark.ml package, which supports pipelines, and are in the long process of porting all of the spark.mllib functionality over to spark.ml. For the Spark 1.5 release, the big exception to this rule is the large set of improvements to LDA in spark.mllib.
MultilayerPerceptronClassifier (Neural Networks!)
Not deep learning, but Spark finally gets native Artificial Neural Networks. For deep learning on Spark, there is DeepLearning4j (with GPU support no less!)
SPARK-9471 -- MultilayerPerceptronClassifier API
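A multilayer perceptron is just stacked affine transforms, each followed by a nonlinearity. A minimal forward pass in plain Python (a sketch of the concept, not the Spark API; the toy weights are hand-picked for illustration):

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mlp_forward(x, layers):
    """Forward pass: each layer is (weight_matrix, bias_vector)."""
    a = x
    for W, b in layers:
        # affine transform followed by sigmoid activation
        a = [sigmoid(sum(w * v for w, v in zip(row, a)) + bi)
             for row, bi in zip(W, b)]
    return a

# Toy 2-2-1 network with hand-picked weights (illustrative only)
layers = [
    ([[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0]),  # hidden layer: 2 units
    ([[1.0, 1.0]], [-1.0]),                    # output layer: 1 unit
]
out = mlp_forward([0.5, 0.25], layers)
```

Training (backpropagation) is the hard part that the new classifier handles for you; the forward pass above is only the prediction step.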
New ml.feature utilities
DCT (Discrete Cosine Transform)
Useful for signal-like data such as sound, images, and video.
SPARK-8471 -- DCT API
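The DCT-II can be written directly from its definition. A naive O(N²) sketch in plain Python (not the Spark API, and ignoring the orthonormal scaling factors a production implementation would apply):

```python
import math

def dct2(x):
    """Naive, unscaled DCT-II: X_k = sum_n x_n * cos(pi/N * (n + 0.5) * k)."""
    N = len(x)
    return [sum(x[n] * math.cos(math.pi / N * (n + 0.5) * k) for n in range(N))
            for k in range(N)]

# A constant signal puts all of its energy in the 0th (DC) coefficient
coeffs = dct2([1.0, 1.0, 1.0, 1.0])
```

This is why the DCT is useful for sound and images: smooth signals compress into a few low-frequency coefficients.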
NGram
In Natural Language Processing, for grouping letters or words into groups of size N.
SPARK-8455 -- NGram API
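The idea is simply a sliding window over a token sequence; a one-function sketch in plain Python (not the Spark transformer):

```python
def ngrams(tokens, n):
    """Slide a window of size n over the token list, joining each window."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

bigrams = ngrams(["spark", "one", "point", "five"], 2)
# ["spark one", "one point", "point five"]
```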
RFormula
Whereas SparkR allows you to use MLlib from R, RFormula allows you to use R-style model formulas from MLlib.
SPARK-9201 -- RFormula API
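The essence of an R formula like "y ~ a + b" is naming a label column and the feature columns to assemble. A toy parser in plain Python conveys the idea (illustrative only; it handles none of R's richer formula operators and is not Spark's implementation):

```python
def parse_formula(formula):
    """Split 'label ~ f1 + f2' into (label, [feature names])."""
    label, rhs = (s.strip() for s in formula.split("~"))
    features = [f.strip() for f in rhs.split("+")]
    return label, features

def assemble(rows, formula):
    """Build (feature_vector, label) pairs from dict rows,
    analogous to RFormula's features/label output columns."""
    label, feats = parse_formula(formula)
    return [([row[f] for f in feats], row[label]) for row in rows]

rows = [{"y": 1.0, "a": 2.0, "b": 3.0}]
pairs = assemble(rows, "y ~ a + b")
# [([2.0, 3.0], 1.0)]
```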
StopWordsRemover
For Natural Language Processing: removes "a", "the", etc. from text.
SPARK-8168 -- StopWordsRemover API
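Conceptually this is a set-membership filter over tokens; a plain-Python sketch (the stop-word list here is a tiny illustrative stand-in, not Spark's default list):

```python
STOP_WORDS = {"a", "an", "the", "of", "and"}  # tiny illustrative list

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only tokens that are not stop words (case-insensitive)."""
    return [t for t in tokens if t.lower() not in stop_words]

cleaned = remove_stop_words(["The", "release", "of", "Spark"])
# ["release", "Spark"]
```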
PCA
New to spark.ml (formerly only in spark.mllib)
-- PCA API
IsotonicRegression
Isotonic regression is non-linear regression with the constraint that the fitted curve is monotonically increasing.
SPARK-8671 -- IsotonicRegression API
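The classic algorithm behind isotonic regression is Pool Adjacent Violators (PAVA): walk the sequence and average together any neighboring blocks that would break monotonicity. A compact plain-Python sketch (of the general algorithm, not Spark's distributed implementation):

```python
def isotonic_fit(y):
    """Pool Adjacent Violators: merge neighboring blocks whose means
    would otherwise decrease, yielding a non-decreasing fit."""
    blocks = [[v, 1] for v in y]  # each block holds [mean, size]
    i = 0
    while i < len(blocks) - 1:
        if blocks[i][0] > blocks[i + 1][0]:  # violator: pool the two blocks
            total = blocks[i][0] * blocks[i][1] + blocks[i + 1][0] * blocks[i + 1][1]
            size = blocks[i][1] + blocks[i + 1][1]
            blocks[i:i + 2] = [[total / size, size]]
            i = max(i - 1, 0)  # pooling may create a new violator to the left
        else:
            i += 1
    # expand each block back out to one fitted value per input point
    return [mean for mean, size in blocks for _ in range(size)]

fit = isotonic_fit([1.0, 3.0, 2.0, 4.0])
# [1.0, 2.5, 2.5, 4.0] -- the 3, 2 violation is pooled to 2.5
```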
SPARK-8600 -- NaiveBayes
Word2Vec
Word2Vec was previously in both spark.ml and spark.mllib, but the spark.ml Word2Vec was missing the important findSynonyms() function.
SPARK-8874 -- spark.ml.feature.Word2VecModel API
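Under the hood, findSynonyms() is a nearest-neighbor search by cosine similarity over the learned word vectors. A plain-Python sketch of that idea (the 2-d "embeddings" below are hand-set toys; real Word2Vec vectors are learned from a corpus):

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def find_synonyms(word, vectors, num):
    """Rank every other word by cosine similarity to the query word."""
    query = vectors[word]
    scored = [(other, cosine(query, vec))
              for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: -pair[1])[:num]

# Toy 2-d "embeddings" (illustrative only)
vectors = {"king": [1.0, 0.9], "queen": [0.9, 1.0], "apple": [-1.0, 0.1]}
top = find_synonyms("king", vectors, 1)
# "queen" points in nearly the same direction as "king", so it ranks first
```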
A powerful new function to determine the most important features in the random forest was added to the spark.ml version of Random Forest, but not to the spark.mllib version.
SPARK-5133 -- RandomForestClassificationModel API
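Tree-based feature importance boils down to summing, per feature, how much each split on that feature reduces impurity. A plain-Python sketch of that impurity-decrease quantity for a single split (illustrative of the general technique, not Spark's code):

```python
def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((labels.count(c) / n) ** 2 for c in set(labels))

def split_gain(rows, labels, feature, threshold):
    """Impurity decrease of one split: parent Gini minus the
    size-weighted Gini of the two children. Summing these gains per
    feature (and normalizing) yields a feature-importance score."""
    left = [l for r, l in zip(rows, labels) if r[feature] <= threshold]
    right = [l for r, l in zip(rows, labels) if r[feature] > threshold]
    n = len(labels)
    return (gini(labels)
            - (len(left) / n) * gini(left)
            - (len(right) / n) * gini(right))

rows = [{"x": 0.0}, {"x": 1.0}, {"x": 2.0}, {"x": 3.0}]
labels = [0, 0, 1, 1]
gain = split_gain(rows, labels, "x", 1.5)
# a perfect split: the gain equals the full parent impurity, 0.5
```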
spark.mllib
LDA received a ton of upgrades:
SPARK-8936 -- OnlineLDAOptimizer API, new getOptimizeDocConcentration() function (known as "alpha" in the LDA literature)
SPARK-8536 -- LDA API, an overloaded setAlpha() now takes Vector instead of Double
SPARK-6793 -- LocalLDAModel API, logPerplexity()
SPARK-5567 -- LocalLDAModel API, topicDistributions()
SPARK-5989 -- LDAModel API
GraphX
Other than bug fixes and minor improvements, GraphX did not get upgraded in Spark 1.5. However, one of the spark.mllib functions can now accept a Graph as input:
SPARK-7254 -- PowerIterationClustering API, an overloaded run() now takes Graph
Spark Streaming
New scheduling mechanism
For example, a job no longer fails if a receiver fails 4 times, and it is now possible to schedule all receivers to run on a particular node (e.g. for rack locality with a Kafka cluster).
SPARK-8882 and SPARK-7988
Dynamic Rate Controller
While it was previously possible to set a rate limit in Spark Streaming, Spark 1.5 introduces a dynamic and automatic rate limiter. There is no API; it is simply automatic. An API to provide configuration and flexibility did not make it into 1.5.
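Whatever the internals, the core idea of a dynamic rate limiter is a feedback loop: after each batch, compare the configured ingestion limit against the rate the system actually sustained, and nudge the limit toward it. A deliberately simplified proportional-only sketch in plain Python (not Spark's actual estimator):

```python
def update_rate(current_limit, batch_records, processing_delay_s):
    """One feedback step: measure the records/sec the last batch actually
    achieved, then move the ingestion limit halfway toward that rate.
    (A simplified proportional controller for illustration.)"""
    processing_rate = batch_records / processing_delay_s  # records/sec achieved
    error = current_limit - processing_rate
    return max(current_limit - 0.5 * error, 1.0)  # never drop below 1 rec/sec

# If batches of `limit` records keep taking 2 seconds to process,
# the limit decays toward a sustainable rate instead of letting work queue up.
limit = 1000.0
for _ in range(5):
    limit = update_rate(limit, batch_records=limit, processing_delay_s=2.0)
```

The payoff is stability: ingestion backs off automatically when processing falls behind, rather than accumulating an ever-growing backlog.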
Spark SQL
Half of the Spark 1.5 Jira tickets concerned Spark SQL, but almost all of them were miscellaneous bug fixes and performance improvements. Two notable exceptions are:
Project Tungsten
Although already described above under Spark Core, Project Tungsten is aimed primarily at Spark SQL. The developers behind Spark are targeting their performance improvements primarily at DataFrames from Spark SQL, and only secondarily at plain old RDDs. Ever since the Spark 1.3 release, they have been positioning DataFrame as an eventual kind of replacement for RDD, for several reasons:
- Because a DataFrame can be populated by a query, Spark can create a database-like query plan which it can optimize.
- Spark SQL allows queries to be written in SQL, which may be easier for many (especially from the REPL Spark Shell) than Scala.
- SQL is more compact than Scala.
Data Source API improvement
The Data Source API allows Spark SQL to connect to external data sources. 19 Jira tickets constitute the Spark 1.5 improvements (SPARK-5180), with 5 more slated for Spark 1.6 (SPARK-9932).