## Apache Spark itself

**1. MLlib**

## AMPLab

Spark originally came out of Berkeley's AMPLab, and even today AMPLab projects, though they are not part of Apache Spark under the Apache Software Foundation, enjoy a status a bit above your everyday GitHub project.

#### ML Base

Spark's own MLlib forms the bottom layer of the three-layer ML Base, with MLI being the middle layer and ML Optimizer being the most abstract layer.

**2. MLI**

**3. ML Optimizer (aka Ghostface)**
Ghostface was described in 2014 but never released. Of the 39 machine learning libraries, this is the only one that is vaporware, and it is included only due to its AMPLab and ML Base status.

#### Other than ML Base

**4. Splash**
A recent project from June 2015, this set of stochastic learning algorithms claims 25x to 75x faster performance than Spark MLlib on Stochastic Gradient Descent (SGD). Plus it's an AMPLab project that begins with the letters "sp", so it's worth watching.

**5. Keystone ML**
Brought machine learning pipelines to Spark, but pipelines have matured in recent versions of Spark. Also promises some computer vision capability, but there are limitations I previously blogged about.

**6. Velox**
A server to manage a large collection of machine learning models.

**7. CoCoA**
Faster machine learning on Spark by optimizing communication patterns and shuffles, as described in the paper "Communication-Efficient Distributed Dual Coordinate Ascent".

## Frameworks

#### GPU-based

**8. DeepLearning4j**
I previously blogged about it: "DeepLearning4j Adds Spark GPU Support".

**9. Elephas**
Brand new, and frankly the reason I started this list for this blog post. Provides an interface to Keras.

#### Non-GPU-based

**10. DistML**
Parameter server for model-parallel machine learning, as opposed to the data-parallel approach taken by Spark's MLlib.

**11. Aerosolve**
From Airbnb, used in their automated pricing

**12. Zen**
Logistic regression, LDA, Factorization machines, Neural Network, Restricted Boltzmann Machines

**13. Distributed Data Frame**
Similar to Spark DataFrames, but agnostic to engine (i.e. will run on engines other than Spark in the future). Includes cross-validation and interfaces to external machine learning libraries.

## Interfaces to other Machine Learning systems

**14. spark-corenlp**
Wraps Stanford CoreNLP.

**15. Sparkit-learn**
Interface to Python's Scikit-learn

**16. Sparkling Water**
Interface to H2O

**17. hivemall-spark**
Wraps Hivemall, machine learning in Hive

**18. spark-pmml-exporter-validator**
Export PMML, an industry standard XML format for transporting machine learning models.

## Add-ons that enhance MLlib's existing algorithms

**19. MLlib-dropout**
Adds dropout capability to Spark MLlib, based on the paper "Dropout: A simple way to prevent neural networks from overfitting".

**20. generalized-kmeans-clustering**
Adds arbitrary distance functions to K-Means
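
To make "arbitrary distance functions" concrete, here is a minimal single-machine sketch in plain Python of K-Means with a pluggable distance. It is not the API of generalized-kmeans-clustering; every name below (`kmeans`, `assign`, `mean_update`, `generalized_kl`) is made up for illustration. The sketch assumes the distance is a Bregman divergence called as `distance(point, center)`, because for that family the ordinary mean is still the right center update. Other distances (Manhattan, for example) would need a different update step.

```python
import math
from typing import Callable, List, Sequence, Tuple

Point = Sequence[float]
Distance = Callable[[Point, Point], float]


def squared_euclidean(p: Point, c: Point) -> float:
    """The distance used by standard K-Means."""
    return sum((x - y) ** 2 for x, y in zip(p, c))


def generalized_kl(p: Point, c: Point) -> float:
    """Generalized KL divergence; assumes strictly positive coordinates."""
    return sum(x * math.log(x / y) - x + y for x, y in zip(p, c))


def assign(points: List[Point], centers: List[Point], distance: Distance) -> List[int]:
    """Index of the closest center for each point under the given distance."""
    return [
        min(range(len(centers)), key=lambda j: distance(p, centers[j]))
        for p in points
    ]


def mean_update(points: List[Point], labels: List[int], centers: List[Point]) -> List[List[float]]:
    """Move each center to the mean of its members; empty clusters stay put."""
    updated = []
    for j, old in enumerate(centers):
        members = [p for p, label in zip(points, labels) if label == j]
        if members:
            updated.append([sum(coord) / len(members) for coord in zip(*members)])
        else:
            updated.append(list(old))
    return updated


def kmeans(points: List[Point], centers: List[Point], distance: Distance,
           iterations: int = 10) -> Tuple[List[List[float]], List[int]]:
    """Alternate assignment and mean update until labels stop changing."""
    labels = assign(points, centers, distance)
    for _ in range(iterations):
        centers = mean_update(points, labels, centers)
        new_labels = assign(points, centers, distance)
        if new_labels == labels:
            break
        labels = new_labels
    return centers, labels
```

Swapping the distance is then a one-argument change, e.g. `kmeans(points, initial_centers, generalized_kl)` instead of `kmeans(points, initial_centers, squared_euclidean)`.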

**21. spark-ml-streaming**
Visualize the Streaming Machine Learning algorithms built into Spark MLlib

## Algorithms

#### Supervised learning

**22. spark-libFM**
Factorization Machines
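
For readers who have not met Factorization Machines, here is a rough sketch of how a second-order model scores one example. This is plain Python, not spark-libFM code, and the names (`fm_predict`, `fm_predict_naive`, `w0`, `w`, `v`) are my own. The score is a bias, plus a linear term, plus one interaction per pair of features, weighted by the dot product of their latent vectors. The first function uses the standard rewrite that avoids looping over all pairs; the second is the direct definition, kept for comparison.

```python
from typing import List, Sequence


def fm_predict(x: Sequence[float], w0: float, w: Sequence[float],
               v: List[Sequence[float]]) -> float:
    """Second-order FM score, computed in O(k * n) per example."""
    linear = w0 + sum(wi * xi for wi, xi in zip(w, x))
    k = len(v[0])  # number of latent factors per feature
    pairwise = 0.0
    for f in range(k):
        s = sum(v[i][f] * x[i] for i in range(len(x)))
        s_sq = sum((v[i][f] * x[i]) ** 2 for i in range(len(x)))
        pairwise += 0.5 * (s * s - s_sq)
    return linear + pairwise


def fm_predict_naive(x: Sequence[float], w0: float, w: Sequence[float],
                     v: List[Sequence[float]]) -> float:
    """Same score via the direct definition: one term per feature pair."""
    score = w0 + sum(wi * xi for wi, xi in zip(w, x))
    n = len(x)
    for i in range(n):
        for j in range(i + 1, n):
            dot = sum(a * b for a, b in zip(v[i], v[j]))
            score += dot * x[i] * x[j]
    return score
```

The latent vectors are what let the model estimate an interaction weight for feature pairs that rarely or never co-occur in the training data, which is why the technique shows up in recommendation-style problems with sparse inputs.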

**23. ScalaNetwork**
Recursive Neural Networks (RNNs)

**24. dissolve-struct**
SVM based on the performant Spark communication framework CoCoA listed above.

**25. Sparkling Ferns**
Based on "Image Classification using Random Forests and Ferns"

**26. streaming-matrix-factorization**
Matrix Factorization Recommendation System

#### Unsupervised learning

**27. PatchWork**
40x faster clustering than Spark MLlib K-Means

**28. Bisecting K-Means Clustering**
K-Means that produces more uniformly-sized clusters, based on "A Comparison of Document Clustering Techniques"

**29. spark-knn-graphs**
Build graphs using k-nearest-neighbors and locality sensitive hashing (LSH)

**30. TopicModeling**
Online Latent Dirichlet Allocation (LDA), Gibbs Sampling LDA, Online Hierarchical Dirichlet Process (HDP)

#### Algorithm building blocks

**31. sparkboost**
AdaBoost and MP-Boost

**32. spark-tfocs**
Port to Spark of TFOCS: Templates for First-Order Conic Solvers. If your machine learning cost function happens to be convex, then TFOCS can solve it.

**33. lazy-linalg**
Linear algebra operators to work with Spark MLlib's linalg package

## Feature extractors

**34. spark-infotheoretic-feature-selection**
Information-theoretic basis for feature selection, based on "Conditional likelihood maximisation: a unifying framework for information theoretic feature selection"

**35. spark-MDLP-discretization**
Given labeled data, "discretize" one of the continuous numeric dimensions such that each bin is relatively homogeneous in terms of data classes. This is a foundational idea behind the CART and ID3 algorithms for generating decision trees. Based on "Multi-interval discretization of continuous-valued attributes for classification learning".
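
The "homogeneous bins" idea can be sketched in a few lines of plain Python. This is only the core step, choosing the single cut point that minimizes the weighted class entropy of the two resulting bins. It is not spark-MDLP-discretization code, the function names are hypothetical, and it deliberately leaves out the recursion into each bin and the MDL-based stopping rule that the paper adds on top.

```python
import math
from collections import Counter
from typing import Hashable, Optional, Sequence, Tuple


def entropy(labels: Sequence[Hashable]) -> float:
    """Class entropy in bits; 0 for a pure (single-class) bin."""
    n = len(labels)
    return sum(-(c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_cut(values: Sequence[float],
             labels: Sequence[Hashable]) -> Optional[Tuple[float, float]]:
    """Return (weighted_entropy, cut_point) for the best binary split,
    or None when all values are identical and no split exists."""
    pairs = sorted(zip(values, labels), key=lambda pair: pair[0])
    n = len(pairs)
    best = None
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # cannot cut between two equal values
        left = [label for _, label in pairs[:i]]
        right = [label for _, label in pairs[i:]]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if best is None or weighted < best[0]:
            best = (weighted, (lo + hi) / 2)  # cut at the midpoint
    return best
```

A full discretizer would call something like `best_cut` again on each side of the chosen cut and stop when a further split no longer pays for itself.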

**36. spark-tsne**
Distributed t-Distributed Stochastic Neighbor Embedding (t-SNE) for dimensionality reduction.

**37. modelmatrix**
Sparse feature vectors

## Domain-specific

**38. Spatial and time-series data**
K-Means, Regression, and Statistics

**39. Twitter data**

**UPDATE 2015-09-30:** Although it was a reddit.com post regarding the Spark deep learning framework Elephas that kicked me off compiling this list, most of the rest comes from AMPLab and spark-packages.org, plus a couple that came from memory. Check AMPLab and spark-packages.org for future updates (since this blog post is a static list). And for tips on how to keep up in general with the fast-moving Spark ecosystem, see my 10-minute presentation from February 2015 (scroll down to the second presentation of that mini-conference).