Thursday, December 17, 2015

GPU off Apache Spark roadmap: Deeplearning4j best bet for Spark GPU

Last night, Reynold Xin took SPARK-3785 "Support off-loading computations to a GPU" off the Apache Spark road map, marking it "Closed" with a resolution of "Later". This is a little different than when GPU was mentioned at Spark Summit in June, 2015 as a possibility for Project Tungsten for 1.6 and beyond.
So for now, the best bet for using GPUs on Spark is Deeplearning4j, from which their architecture diagram above came. As I've blogged previously, the DL4J folks are waiting until they have solid benchmarks before advertising them. Nevertheless, today, you can do deep learning on GPU-powered Spark.

Tuesday, December 1, 2015

Free book excerpt: Semi-Supervised Learning With GraphX

Manning Publications has made available for free an excerpt from my book Spark GraphX In Action. The excerpt is entitled Poor Man’s Training Data: Graph-Based Semi-Supervised Learning and shows how to:
  • Construct a graph from a collection of points using a K-Nearest Neighbors Graph Construction algorithm (not to be confused with KNN machine learning prediction, which actually gets used below)
  • Do the above in a way optimized for distributed computing.
  • Propagate labels to unlabeled nodes to achieve semi-supervised learning.
  • Make predictions from the trained model (using conventional KNN machine learning prediction)
And as part of Manning's site-wide MEAP sale for Cyber Monday week, the MEAP is 50% off today using the code dotd120115.
My co-author, Robin East, and I just finished the second draft this past weekend, so the print version should be available in 2016Q1.

Wednesday, November 11, 2015

Spark Streaming 1.6: Stop Using updateStateByKey()

Last night, Tathagata Das resolved SPARK-11290, "Implement trackStateByKey for improved state management", which will bring a 7x performance improvement to Spark Streaming when Spark 1.6 is released in December, 2015.
trackStateByKey() offers three benefits over updateStateByKey(), which has served as the workhorse of Spark Streaming since its inception in 2012:
  1. Internally, the performance improvement is achieved by looking at only the key/state pairs for which there is new data. The chart above, which comes from Tathagata's design document, illustrates a typical use case, where 4 million keys are being tracked (for example, 4 million concurrent users on a website or app, or streaming audio or video) but only 10,000 had some activity during the past micro-batch (of, say, two seconds duration). With updateStateByKey(), all 4 million key/state pairs would have had to have been touched due to updateStateByKey()'s internal use of cogroup.
  2. Capability to time out states is built in as a first class option. You no longer have to cobble together your own timeout mechanism. The downside is that if you use this option, you lose the performance improvement mentioned above because the timeout mechanism requires examining every key/state pair.
  3. Ability to return/emit values other than just state as a result of having examined the state. Returning to the example of tracking a web or app user, trackStateByKey() could be maintaining a running logged-in time and emit that together with some metadata for the purposes of populating a dashboard. Not only does one avoid dual-purposing the state per key for two different purposes, but the performance benefit of touching only the modified keys is also realized.

Wednesday, October 28, 2015

Spark 1.6: "Datasets" best of RDDs and Dataframes

If RDDs weren't killed by Dataframes in Spark 1.3 (as covered by my January, 2015 blog post Spark 1.3: Stop Using RDDs), surely they will be by Spark 1.6, which introduces Datasets.
As covered in Michael Armbrust's presentation today at Spark Summit Europe 2015Spark Dataframes: Simple and Fast Analysis of Structured Data, specifically in his last slide, Datasets (as originally proposed in the umbrella Jira ticket SPARK-9999) combine the best of both worlds of RDDs and Dataframes.
Datasets provide an RDD-like API, but with all the performance advantages given by Catalyst and Tungsten. groupBy(), for example, which was always a performance no-no on RDDs, can be done efficiently with Datasets without worrying for example about some groups being too large for a single node because Catalyst can spill large groups. And it's all strongly typed and type-safe.
Datasets will be previewed in Spark 1.6, due out in December.

Saturday, October 3, 2015

Minimal Scala pom.xml for Maven

You would think it would be easy to find an example pom.xml for Scala somewhere on the web. It's not. And the example one at doesn't work because its <sourcedir>src/main/java</sourcedir> excludes all your Scala files from src/main/scala!

Without further ado, below is a minimal pom.xml for Scala.
<project xmlns=""

Tuesday, September 29, 2015

39 Machine Learning Libraries for Spark, Categorized

Apache Spark itself

1. MLlib


Spark originally came out of Berkeley AMPLab and even today AMPLab projects, even though they are not in Apache Spark Foundation, enjoy a status a bit over your everyday github project.

ML Base

Spark's own MLLib forms the bottom layer of the three-layer ML Base, with MLI being the middle layer and ML Optimizer being the most abstract layer.

2. MLI

3. ML Optimizer (aka Ghostface)
Ghostware was described in 2014 but never released. Of the 39 machine learning libraries, this is the only one that is vaporware, and is included only due to its AMPLab and ML Base status.

Other than ML Base

4. Splash
A recent project from June, 2015, this set of stochastic learning algorithms claims 25x - 75x faster performance than Spark MLlib on Stochastic Gradient Descent (SGD). Plus it's an AMPLab project that begins with the letters "sp", so it's worth watching.

5. Keystone ML
Brought machine learning pipelines to Spark, but pipelines have matured in recent versions of Spark. Also promises some computer vision capability, but there are limitations I previously blogged about.

6. Velox
A server to manage a large collection of machine learning models.

7. CoCoA
Faster machine learning on Spark by optimizing communication patterns and shuffles, as described in the paper Communication-Efficient Distributed Dual Coordinate Ascent



8. DeepLearning4j
I previously blogged DeepLearning4j Adds Spark GPU Support

9. Elephas
Brand new and frankly why I started this list for this blog post. Provides an interface to Keras.


10. DistML
Parameter server for model-parallel rather than data-parallel (as Spark's MLlib is).

11. Aerosolve
From Airbnb, used in their automated pricing

12. Zen
Logistic regression, LDA, Factorization machines, Neural Network, Restricted Boltzmann Machines

13. Distributed Data Frame
Similar to Spark DataFrames, but agnostic to engine (i.e. will run on engines other than Spark in the future). Includes cross-validation and interfaces to external machine learning libraries.

Interfaces to other Machine Learning systems

14. spark-corenlp
Wraps Stanford CoreNLP.

15. Sparkit-learn
Interface to Python's Scikit-learn

16. Sparkling Water
Interface to H2O

17. hivemall-spark
Wraps Hivemall, machine learning in Hive

18. spark-pmml-exporter-validator
Export PMML, an industry standard XML format for transporting machine learning models.

Add-ons that enhance MLlib's existing algorithms

19. MLlib-dropout
Adds dropout capability to Spark MLLib, based on the paper Dropout: A simple way to prevent neural networks from overfitting.

20. generalized-kmeans-clustering
Adds arbitrary distance functions to K-Means

21. spark-ml-streaming
Visualize the Streaming Machine Learning algorithms built into Spark MLlib


Supervised learning

22. spark-libFM
Factorization Machines

23. ScalaNetwork
Recursive Neural Networks (RNNs)

24. dissolve-struct
SVM based on the performant Spark communication framework CoCoA listed above.

25. Sparkling Ferns
Based on Image Classification using Random Forests and Ferns

26. streaming-matrix-factorization
Matrix Factorization Recommendation System

Unsupervised learning

27. PatchWork
40x faster clustering than Spark MLlib K-Means

28. Bisecting K-Meams Clustering
K-Means that produces more uniformly-sized clusters, based on A Comparison of Document Clustering Techniques

29. spark-knn-graphs
Build graphs using k-nearest-neighbors and locality sensitive hashing (LSH)

30. TopicModeling
Online Latent Dirichlet Allocation (LDA), Gibbs Sampling LDA, Online Hierarchical Dirichlet Process (HDP)

Algorithm building blocks

31. sparkboost
Adaboost and MP-Boost

32. spark-tfocs
Port to Spark of TFOCS: Templates for First-Order Conic Solvers. If your machine learning cost function happens to be convex, then TFOCS can solve it.

33. lazy-linalg
Linear algebra operators to work with Spark MLlib's linalg package

Feature extractors

34. spark-infotheoretic-feature-selection
Information-theoretic basis for feature selection, based on Conditional likelihood maximisation: a unifying framework for information theoretic feature selection

35. spark-MDLP-discretization
Given labeled data, "discretize" one of the continuous numeric dimensions such that each bin is relatively homogenous in terms of data classes. This is a foundational idea CART and ID3 algorithms to generate decision trees. Based on Multi-interval discretization of continuous-valued attributes for classification learning.

36. spark-tsne
Distributed t-Distributed Stochastic Neighbor Embedding (t-SNE) for dimensionality reduction.

37. modelmatrix
Sparse feature vectors


38. Spatial and time-series data
K-Means, Regression, and Statistics

39. Twitter data

UPDATE 2015-09-30: Although it was a post regarding the Spark deep learning framework Elephas that kicked me off compiling this list, most of the rest comes from AMPLab and, plus a couple came from memory. Check AMPLab and for future updates (since this blog post is a static list). And for tips on how to keep up in general on the fast-moving Spark Ecosystem, see my 10-minute presentation from February, 2015 (scroll down to the second presentation of that mini-conference).

Tuesday, September 22, 2015

Four ways to retrieve Scala Option value

Suppose you have the Scala value assignment below, and wish to append " World", but only if the value is not None (from Option[]).

val s = Some("Hello")

There are four different Option[] idioms to accomplish this:
  1. The Java-like way

    if (s.isDefined) s.get + " World" else "No hello"
  2. The Pattern Matching way

    s match { case Some(s) => s + " World"; case _ => "No hello" }
  3. The Classic Scala way + " World").getOrElse("No hello")
  4. The Scala 2.10 way

    s.fold("No hello")(_ + " World")

Wednesday, September 9, 2015

Breakdown of Spark 1.5 Improvements

Spark 1.5 was released today. Of the 1,516 Jira tickets that comprise the 1.5 release, I have highlighted a few important ones below, broken down by major Spark component.

Spark Core

  • Project Tungsten
    The first major phase of Project Tungsten (aside from a small portion that went into 1.4)
  • Data locality for reduce
    Prior to the Spark 1.2 release, I jumped the gun and announced this made it into Spark 1.2. In reality, it didn't make it in until Spark 1.5.


The developers behind Spark are focusing their efforts on the package, which supports pipelines, and are in the long process of transferring all the spark.mllib functionality over to For the Spark 1.5 release, the big exception to this rule is the large set of improvements to LDA in spark.mllib.


LDA received a ton of upgrades:


Other than bug fixes and minor improvements, GraphX did not get upgraded in Spark 1.5. However, one of the spark.mllib functions can now accept a Graph as input:

Spark Streaming

  • New scheduling mechanism
    E.g. a job no longer fails if a receiver fails 4 times, and it is now possible to schedule all receivers to run on a particular node (e.g. for rack locality to a Kafka cluster)
    SPARK-8882 and SPARK-7988
  • Dynamic Rate Controller
    While it was previously possible to set a rate limit in Spark Streaming, Spark 1.5 introduces a dynamic and automatic rate limiter. There is no API; it's just automatic. An API to provide configuration and flexiblity did not make it into 1.5.

Spark SQL

Half the Spark 1.5 Jira tickets concerned Spark SQL, but almost all of them were miscellaneous bug fixes and performance improvements. Two notable exceptions are:
  • Project Tungsten
    Already described above for Spark Core, Project Tungsten is really aimed at Spark SQL primarily. The developers behind Spark are aiming their performance improvements primarily at DataFrame from Spark SQL and only secondarily to plain old RDDs. Ever since the Spark 1.3 release, they have been positioning DataFrame as an eventual kind of replacement for RDD, for several reasons:
    1. Because a DataFrame can be populated by a query, Spark can create a database-like query plan which it can optimize.
    2. Spark SQL allows queries to be written in SQL, which may be easier for many (especially from the REPL Spark Shell) than Scala.
    3. SQL is more compact than Scala.

  • Data Source API improvement
    The Data Source API allows Spark SQL to connect to external data sources. 19 jira tickets constitute the Spark 1.5 improvements SPARK-5180, with 5 more slated for Spark 1.6 SPARK-9932.

Wednesday, June 24, 2015

My Spark Summit Presentation On Word2Vec and Semi-Supervised Learning


MLLib Word2Vec is an unsupervised learning technique that can generate vectors of features that can then be clustered. But the weakness of unsupervised learning is that although it can say an apple is close to a banana, it can’t put the label of “fruit” on that group. We show how MLLib Word2Vec can be combined with the human-created data of YAGO2 (which is derived from the crowd-sourced Wikipedia metadata), along with the NLP metrics Levenshtein and Jaccard, to properly label categories. As an alternative to GraphX even though YAGO2 is a graph, we make use of Ankur Dave’s powerful IndexedRDD, which is slated for inclusion in Spark 1.3 or 1.4. IndexedRDD is also used in a second way: to further parallelize MLLib Word2Vec. The use case is labeling columns of unlabeled data uploaded to the Oracle Data Enrichment Cloud Service (ODECS) cloud app, which processes big data in the cloud.



Wednesday, June 17, 2015

Spark 1.4 for Data Scientists; Spark 1.5 & 1.6 for core improvements

The theme at Spark Summit 2015 this week can be boiled down to "Spark 1.4 is for data scientists". The "first new supported language in over a year" is the highlight of Spark 1.4: SparkR, originally an AMPLab project, is now part of the Apache Spark distribution. Another data science improvement is Spark ML (which has the ML pipelines, and also which may eventually replace Spark MLLib) is now out of beta alpha. On the commercial side, the Databricks Cloud offering is now GA, with its ability to spin up an arbitrarily large Spark cluster at the touch of a button -- and give you a notebook-style interface to Spark. Amazon also announced turn-key Spark spin-up on AWS (but no notebook).
On the one hand, Spark is moving really fast. Half of the 8000+ Jira tickets have been entered in 2015 alone. On the other hand, there is so much people want in it that in some respects it seems it's not moving fast enough. With all that went in to Spark 1.4 for data science, improvements to Spark Core are to come in Spark 1.5 and 1.6. Although some of Project Tungsten made it into Spark 1.4, most of it is targeted for Spark 1.5, and the most interesting part -- on-the-fly compilation to Intel SIMD -- is slated for Spark 1.6. That will lay the groundwork to on-the-fly compilation to GPU, which will presumably come even later.
Another contentious issue with Spark developers (i.e. not data scientists) is the inability to launch, track, and control Spark YARN jobs via a Java API. The Bash shell is the only official way to submit Spark YARN jobs, making it impossible for Spark to, for example, serve as the back-end to a web app that expects to tightly control, monitor, and launch Spark jobs. This issue was raised again at the Bay Area Spark Meetup that was held on-site during Spark Summit, and the Databricks panel members reluctantly predicted that the upcoming "Launcher" mechanism would be in Spark 1.5 or Spark 1.6.

Monday, May 11, 2015

Yes, Neural Networks Have Grandmother Cells

The neural network portions of the above image come from Wikipedia.

The age-old debate about neural networks (both artificial and biological) is whether they have a grandmother cell, a neuron cell/node somewhere in the net that is activated when one's grandmother is viewed (assuming a biological vision scenario or computer vision application).

For biological neural networks, the jury is out, and the answer is leaning toward the "no" side. For artificial neural networks, if you Google for the answer, you'll almost always come across the admonishment to avoid grandmother cells in your neural networks. But to beginners to neural networks, this advice can be easily misunderstood.

More precisely, grandmother cells are to be avoided in the internal nodes. The reason is that internal nodes are supposed to be for latent variables, i.e., intermediate properties like "big eyes" or "big teeth". If an internal node is already recognizing grandma, then that is an indication of overfitting and that perhaps the neural network was created too large for the amount of training data or search space.

But all artificial neural networks whose job it is to classify images where one of the classes is grandma will have a grandmother cell, namely in the output layer. That's simply how you get your output from a classifier artificial neural network.

Thursday, May 7, 2015

RDF vs. Property Graphs: Excerpt From "Spark GraphX In Action"

The above is an animated interpretation of one of the (static) images from chapter 3 of my upcoming book Spark GraphX In Action. Chapter 3 is now available for download for those who have purchased the MEAP. Today, May 7, 2015, the MEAP is available for a 50% discount using the discount code dotd050715au.

Thursday, March 12, 2015

My new book: Spark GraphX In Action

My new book became available on MEAP on March 6, 2015, and the print version is expected later in 2015.

Spark and Graphs are two tools rapidly being adopted by Data Scientists. Combine them and you get the GraphX component of Spark. Written in a way that assumes minimal Scala, the book quickly brings readers up to speed on all four: Scala, Spark, definitions from graph theory, and GraphX itself. An appendix provides three quick ways to get Spark running, and the chapters provide explicit examples from the most basic up through machine learning.

Monday, March 2, 2015

Spark Ecosystem

Presentation I gave to the Data Science Association on February 28, 2015:

  1. How fast Spark is moving
  2. Ways to keep up
  3. Current status of the ever-expanding Spark ecosystem

Original video

Re-recorded later as a screencast

Tuesday, February 24, 2015

State of the Stream 2015Q1

The big news this week -- and for the past couple of months, really -- is of course eBay open sourcing theirPulsar real-time analytics framework, which allows SQL queries into a real-time data stream. It's built on top of their Jetstream data streaming framework, also open sourced this month, which, from its list of capabilities seems to be a layer akin to Storm and Spark Streaming, with sinks and sources, Kafka connectivity, and REST interfaces for cluster monitoring.
With all the breathless news items out there about Pulsar, they neglect to mention one important fact: it is GPL, which precludes its use at a lot of organizations.
This is in stark contrast to some other big news this past week: Druid just switched to an Apache license from its use of the GPL.
Streaming is important because as I've previously blogged, every Big Data project eventually becomes a data streaming project, because once insights are found in Big Data, the consumers of those insights request it be re-run with ever more up-to-date data, until eventually they request real-time updates.
I've also previously blogged, over a year ago now, on the comparison of streaming frameworks, especially comparing Storm with Spark Streaming. At the time, I relayed how at the February, 2013 Strata, several speakers spoke of how Trident, the layer that gives Storm exactly once semantics, makes Storm non-performant. I also wrote at that time how Spark Streaming, even though it had exactly-once semantics, lacked things like graceful shutdown and transactions.
Well, there is recent news about about both Spark Streaming and Storm. First, Spark Streaming gained high availability in 1.2.0, released in December, 2014. And although it has exactly once semantics, doing so with Kafka is a bit tricky since Kafka itself does not guarantee exactly once when a node in a Kafka cluster goes down. However, Spark 1.3.0 will have Kafka exactly once semantics. Spark 1.3.0 release candidates are being voted on now, and will probably be released in the first half of March, 2015.
As for performance, a University of Toronto grad student recently benchmarkedStorm vs. Spark Streaming:

"Storm was around 40% faster than Spark, processing tuples of small size (around the size of a tweet). However, as the tuple's size increased, Spark had better performance maintaining the processing times."