Tuesday, September 29, 2015

39 Machine Learning Libraries for Spark, Categorized

Apache Spark itself

1. MLlib

AMPLab

Spark originally came out of Berkeley's AMPLab, and even today AMPLab projects, though they are not part of the Apache Software Foundation, enjoy a status a bit above your everyday GitHub project.

ML Base

Spark's own MLlib forms the bottom layer of the three-layer ML Base, with MLI being the middle layer and ML Optimizer being the most abstract layer.

2. MLI

3. ML Optimizer (aka Ghostface)
Ghostface was described in 2014 but never released. Of the 39 machine learning libraries, this is the only one that is vaporware; it is included only due to its AMPLab and ML Base status.

Other than ML Base

4. Splash
A recent project from June, 2015, this set of stochastic learning algorithms claims 25x - 75x faster performance than Spark MLlib on Stochastic Gradient Descent (SGD). Plus it's an AMPLab project that begins with the letters "sp", so it's worth watching.

5. Keystone ML
Brought machine learning pipelines to Spark, but pipelines have matured in recent versions of Spark. Also promises some computer vision capability, but there are limitations I previously blogged about.

6. Velox
A server to manage a large collection of machine learning models.

7. CoCoA
Faster machine learning on Spark by optimizing communication patterns and shuffles, as described in the paper "Communication-Efficient Distributed Dual Coordinate Ascent".

Frameworks

GPU-based

8. DeepLearning4j
I previously blogged about this in "DeepLearning4j Adds Spark GPU Support".

9. Elephas
Brand new and frankly why I started this list for this blog post. Provides an interface to Keras.

Non-GPU-based

10. DistML
Parameter server for model-parallel rather than data-parallel (as Spark's MLlib is).

11. Aerosolve
From Airbnb, used in their automated pricing

12. Zen
Logistic regression, LDA, Factorization machines, Neural Network, Restricted Boltzmann Machines

13. Distributed Data Frame
Similar to Spark DataFrames, but agnostic to engine (i.e. will run on engines other than Spark in the future). Includes cross-validation and interfaces to external machine learning libraries.

Interfaces to other Machine Learning systems

14. spark-corenlp
Wraps Stanford CoreNLP.

15. Sparkit-learn
Interface to Python's Scikit-learn

16. Sparkling Water
Interface to H₂O

17. hivemall-spark
Wraps Hivemall, machine learning in Hive

18. spark-pmml-exporter-validator
Export PMML, an industry standard XML format for transporting machine learning models.

Add-ons that enhance MLlib's existing algorithms

19. MLlib-dropout
Adds dropout capability to Spark MLlib, based on the paper "Dropout: A simple way to prevent neural networks from overfitting".

20. generalized-kmeans-clustering
Adds arbitrary distance functions to K-Means

21. spark-ml-streaming
Visualize the Streaming Machine Learning algorithms built into Spark MLlib

Algorithms

Supervised learning

22. spark-libFM
Factorization Machines

23. ScalaNetwork
Recursive Neural Networks (RNNs)

24. dissolve-struct
SVM based on the performant Spark communication framework CoCoA listed above.

25. Sparkling Ferns
Based on Image Classification using Random Forests and Ferns

26. streaming-matrix-factorization
Matrix Factorization Recommendation System

Unsupervised learning

27. PatchWork
40x faster clustering than Spark MLlib K-Means

28. Bisecting K-Means Clustering
K-Means that produces more uniformly-sized clusters, based on A Comparison of Document Clustering Techniques

29. spark-knn-graphs
Build graphs using k-nearest-neighbors and locality sensitive hashing (LSH)

30. TopicModeling
Online Latent Dirichlet Allocation (LDA), Gibbs Sampling LDA, Online Hierarchical Dirichlet Process (HDP)

Algorithm building blocks

31. sparkboost
Adaboost and MP-Boost

32. spark-tfocs
Port to Spark of TFOCS: Templates for First-Order Conic Solvers. If your machine learning cost function happens to be convex, then TFOCS can solve it.

33. lazy-linalg
Linear algebra operators to work with Spark MLlib's linalg package

Feature extractors

34. spark-infotheoretic-feature-selection
Information-theoretic basis for feature selection, based on Conditional likelihood maximisation: a unifying framework for information theoretic feature selection

35. spark-MDLP-discretization
Given labeled data, "discretize" one of the continuous numeric dimensions such that each bin is relatively homogeneous in terms of data classes. This is a foundational idea behind the CART and ID3 algorithms for generating decision trees. Based on "Multi-interval discretization of continuous-valued attributes for classification learning".

36. spark-tsne
Distributed t-Distributed Stochastic Neighbor Embedding (t-SNE) for dimensionality reduction.

37. modelmatrix
Sparse feature vectors
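
To make the discretization idea behind spark-MDLP-discretization (item 35 above) a bit more concrete, here is a rough plain-Scala sketch of its core step: picking the single cut point on one numeric dimension that leaves the two resulting bins as class-homogeneous as possible, measured by size-weighted entropy. This is not code from that package. The object and method names are my own, and the full MDLP method also applies the cut recursively with a stopping criterion, which I have left out.

```scala
object EntropyCut {
  private def log2(x: Double): Double = math.log(x) / math.log(2)

  // Shannon entropy (in bits) of a collection of class labels.
  def entropy(labels: Seq[String]): Double = {
    val n = labels.size.toDouble
    labels.groupBy(identity).values.map { group =>
      val p = group.size / n
      -p * log2(p)
    }.sum
  }

  // Given (value, label) pairs, return the threshold that minimizes the
  // size-weighted entropy of the two resulting bins, or None if the values
  // cannot be split (fewer than two distinct values).
  def bestCut(data: Seq[(Double, String)]): Option[Double] = {
    val sorted = data.sortBy(_._1)
    val n = sorted.size.toDouble
    // Candidate thresholds are midpoints between adjacent distinct values.
    val candidates = sorted.map(_._1).distinct.sliding(2).collect {
      case Seq(a, b) => (a + b) / 2
    }.toList
    if (candidates.isEmpty) None
    else Some(candidates.minBy { t =>
      val (left, right) = sorted.partition(_._1 <= t)
      (left.size / n) * entropy(left.map(_._2)) +
        (right.size / n) * entropy(right.map(_._2))
    })
  }
}
```

For example, EntropyCut.bestCut(Seq((1.0, "a"), (2.0, "a"), (3.0, "b"), (4.0, "b"))) picks the midpoint between 2.0 and 3.0, since that split leaves each bin with a single class.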

Domain-specific

38. Spatial and time-series data
K-Means, Regression, and Statistics

39. Twitter data

UPDATE 2015-09-30: Although it was a reddit.com post regarding the Spark deep learning framework Elephas that kicked me off compiling this list, most of the rest comes from AMPLab and spark-packages.org, plus a couple came from memory. Check AMPLab and spark-packages.org for future updates (since this blog post is a static list). And for tips on how to keep up in general on the fast-moving Spark Ecosystem, see my 10-minute presentation from February, 2015 (scroll down to the second presentation of that mini-conference).

Tuesday, September 22, 2015

Four ways to retrieve Scala Option value

Suppose you have the Scala value assignment below, and wish to append " World", but only if the value is not None (from Option[]).

val s = Some("Hello")

There are four different Option[] idioms to accomplish this:

The Java-like way

if (s.isDefined) s.get + " World" else "No hello"

The Pattern Matching way

s match { case Some(s) => s + " World"; case _ => "No hello" }

The Classic Scala way

s.map(_ + " World").getOrElse("No hello")

The Scala 2.10 way

s.fold("No hello")(_ + " World")
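
If you want to try the four idioms side by side, including the None case, here is a small self-contained sketch. The object and method names are just ones I made up for illustration, and each method takes an Option[String] parameter so that both Some and None can be passed in.

```scala
object OptionIdioms {
  // The Java-like way: test for a value, then unwrap it.
  def javaLike(s: Option[String]): String =
    if (s.isDefined) s.get + " World" else "No hello"

  // The pattern matching way.
  def patternMatch(s: Option[String]): String = s match {
    case Some(v) => v + " World"
    case None    => "No hello"
  }

  // The classic Scala way: transform with map, then supply a default.
  def mapGetOrElse(s: Option[String]): String =
    s.map(_ + " World").getOrElse("No hello")

  // The Scala 2.10 way: fold takes the None result first, then the function.
  def viaFold(s: Option[String]): String =
    s.fold("No hello")(_ + " World")

  def main(args: Array[String]): Unit = {
    val idioms = Seq[Option[String] => String](javaLike, patternMatch, mapGetOrElse, viaFold)
    // Run every idiom against both a defined and an empty Option.
    for (input <- Seq(Some("Hello"), None); f <- idioms) println(f(input))
  }
}
```

All four return the same result for a given input. One thing to note about fold is its argument order: the default for None comes first, which is the reverse of how map(...).getOrElse(...) reads.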

Wednesday, September 9, 2015

Breakdown of Spark 1.5 Improvements

Spark 1.5 was released today. Of the 1,516 Jira tickets that comprise the 1.5 release, I have highlighted a few important ones below, broken down by major Spark component.

Spark Core

Project Tungsten
The first major phase of Project Tungsten (aside from a small portion that went into 1.4)
SPARK-7075
Data locality for reduce
Prior to the Spark 1.2 release, I jumped the gun and announced this made it into Spark 1.2. In reality, it didn't make it in until Spark 1.5.
SPARK-2774

MLlib

The developers behind Spark are focusing their efforts on the spark.ml package, which supports pipelines, and are in the long process of transferring all the spark.mllib functionality over to spark.ml. For the Spark 1.5 release, the big exception to this rule is the large set of improvements to LDA in spark.mllib.

spark.ml

MultilayerPerceptronClassifier (Neural Networks!)
Not deep learning, but Spark finally gets native Artificial Neural Networks. For deep learning on Spark, there is DeepLearning4j (with GPU support no less!)
SPARK-9471 -- MultilayerPerceptronClassifier API
New ml.feature feature utilities
- DCT (Discrete Cosine Transform)
  Useful for those other types of data (sound, images, and video).
  SPARK-8471 -- DCT API
- NGram
  In Natural Language Processing, for grouping letters or words into groups of size N.
  SPARK-8455 -- NGram API
- RFormula
  Whereas SparkR allows you to use MLlib from R, RFormula allows you to use R's model formula syntax from MLlib.
  SPARK-9201 -- RFormula API
- StopWordsRemover
  For Natural Language Processing, remove "a", "the", etc. from text.
  SPARK-8168 -- StopWordsRemover API
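
For anyone unfamiliar with the two text utilities in the list above, here is a plain-Scala sketch of what n-gram generation and stop word removal do to a tokenized sentence. It deliberately does not use the Spark API. The object name, method names, and the tiny stop word list are my own assumptions for illustration only.

```scala
object TextFeatures {
  // A tiny, hypothetical stop word list; real lists are much longer.
  val stopWords: Set[String] = Set("a", "an", "the", "of", "and")

  // Drop stop words (compared case-insensitively) from a tokenized sentence.
  def removeStopWords(tokens: Seq[String]): Seq[String] =
    tokens.filterNot(t => stopWords.contains(t.toLowerCase))

  // Slide a window of size n over the tokens and join each window with a space.
  // An input with fewer than n tokens yields no n-grams.
  def ngrams(tokens: Seq[String], n: Int): Seq[String] = {
    require(n >= 1, "n must be at least 1")
    if (tokens.length < n) Seq.empty
    else tokens.sliding(n).map(_.mkString(" ")).toList
  }
}
```

For example, TextFeatures.ngrams(Seq("the", "quick", "brown", "fox"), 2) gives the three bigrams "the quick", "quick brown", and "brown fox", while removeStopWords on the same tokens drops "the".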
New to spark.ml (formerly only in spark.mllib)
- PCA
  -- PCA API
- Isotonic Regression
  Isotonic regression is non-linear regression, with the constraint that the fitted curve is monotonically increasing.
  SPARK-8671 -- IsotonicRegression API
- NaiveBayes
  SPARK-8600 -- NaiveBayes
- Word2Vec findSynonyms()
  Word2Vec was previously in both spark.ml and spark.mllib, but the spark.ml Word2Vec was missing the important function findSynonyms().
  SPARK-8874 -- spark.ml.feature.Word2VecModel API
- RandomForest
  A powerful new function to determine the most important features in the random forest was added to the spark.ml version of Random Forest, but not to the spark.mllib version.
  SPARK-5133 -- RandomForestClassificationModel API
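
To illustrate the monotonic constraint mentioned for Isotonic Regression in the list above, here is a minimal plain-Scala sketch of the textbook pool-adjacent-violators approach. It is not Spark's implementation. It assumes unweighted points already sorted by x, and the object and method names are hypothetical.

```scala
object IsotonicSketch {
  // Fit a monotonically non-decreasing sequence to ys (assumed already sorted
  // by their x values), minimizing squared error, by pooling adjacent violators.
  def fit(ys: Seq[Double]): Seq[Double] = {
    // Each block is (sum of values, count). The list is used as a stack,
    // with the most recent block at the head.
    var blocks = List.empty[(Double, Int)]
    for (y <- ys) {
      var cur = (y, 1)
      // Merge backwards while the previous block's mean exceeds the current mean.
      while (blocks.nonEmpty && blocks.head._1 / blocks.head._2 > cur._1 / cur._2) {
        cur = (blocks.head._1 + cur._1, blocks.head._2 + cur._2)
        blocks = blocks.tail
      }
      blocks = cur :: blocks
    }
    // Expand each block back into its mean, repeated once per pooled point.
    blocks.reverse.flatMap { case (sum, n) => List.fill(n)(sum / n) }
  }
}
```

For example, IsotonicSketch.fit(Seq(1.0, 3.0, 2.0, 4.0)) pools the out-of-order pair 3.0 and 2.0 into their mean, giving 1.0, 2.5, 2.5, 4.0.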

spark.mllib

LDA received a ton of upgrades:

Hyperparameter estimation
SPARK-8936 -- OnlineLDAOptimizer API, new getOptimizeDocConcentration() function (known as "alpha" in the LDA literature)
Asymmetric priors
SPARK-8536 -- LDA API, an overloaded setAlpha() now takes Vector instead of Double
Perplexity
SPARK-6793 -- LocalLDAModel API, logPerplexity()
Prediction Methods
SPARK-5567 -- LocalLDAModel API, topicDistributions()
Model import/export
SPARK-5989 -- LDAModel API

GraphX

Other than bug fixes and minor improvements, GraphX did not get upgraded in Spark 1.5. However, one of the spark.mllib functions can now accept a Graph as input:

PowerIterationClustering
SPARK-7254 -- PowerIterationClustering API, an overloaded run() now takes Graph

Spark Streaming

New scheduling mechanism
E.g. a job no longer fails if a receiver fails 4 times, and it is now possible to schedule all receivers to run on a particular node (e.g. for rack locality to a Kafka cluster)
SPARK-8882 and SPARK-7988
Dynamic Rate Controller
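While it was previously possible to set a static rate limit in Spark Streaming, Spark 1.5 introduces a dynamic and automatic rate limiter (backpressure), switched on by setting the spark.streaming.backpressure.enabled configuration property to true. Beyond that there is no API; it's just automatic. An API to provide configuration and flexibility did not make it into 1.5.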
SPARK-8834

Spark SQL

Half the Spark 1.5 Jira tickets concerned Spark SQL, but almost all of them were miscellaneous bug fixes and performance improvements. Two notable exceptions are:

Project Tungsten
Already described above for Spark Core, Project Tungsten is really aimed primarily at Spark SQL. The developers behind Spark are aiming their performance improvements primarily at DataFrame from Spark SQL and only secondarily at plain old RDDs. Ever since the Spark 1.3 release, they have been positioning DataFrame as an eventual kind of replacement for RDD, for several reasons:
1. Because a DataFrame can be populated by a query, Spark can create a database-like query plan which it can optimize.
2. Spark SQL allows queries to be written in SQL, which may be easier for many (especially from the REPL Spark Shell) than Scala.
3. SQL is more compact than Scala.
Data Source API improvement
The Data Source API allows Spark SQL to connect to external data sources. 19 Jira tickets constitute the Spark 1.5 improvements (SPARK-5180), with 5 more slated for Spark 1.6 (SPARK-9932).