Saturday, May 14, 2016

Structured Streaming for Lambda Architecture in Spark But Have To Wait For It

Some have the misconception that Lambda Architecture just means you have separate paths for batch and realtime. They miss a key part of Lambda Architecture: the ability to query a unified view of both batch and realtime.
Structured Streaming, also known as Structured Dataframes, will provide a critical piece: the ability to stream directly into a Dataframe, which can then of course be queried with SQL.
To provide the unified view, it will probably be possible to join such a Streaming Dataframe containing the realtime data with an ORC-backed Dataframe containing the historical data. However, as of today (May 14, 2016), the only two data sources available to populate a Streaming Dataframe are memory and file. Notably absent are streaming sources such as Apache Kafka, and last week Michael Armbrust indicated support for non-file data sources might come after Spark 2.0. And then this week Reynold Xin advised:
stay tuned to this blog for more details on Structured Streaming in Spark 2.0, including details on what is possible in this release and what is on the roadmap for the near future
There are still key adds in Spark 2.0: full SQL support including subqueries, and yet another 10x performanceimprovement due to "Tungsten 2.0" (on top of the 2x-10x improvement Tungsten brought over Spark 1.4, 1.5, and 1.6). Currently, Druid is still the reigning champ when it comes to Lambda in a Box. But Spark will likely take that crown before the end of this year.

Thursday, April 7, 2016

Declarative Machine Learning

SQL is commonly referred to as a 4GL, or fourth-generation programming language, as opposed to all of the 3GL's like Java, C++, Python, Scala, etc. SQL is referred to as a declarative language as opposed to an imperativelanguage like the 3GL's. You tell SQL what to do, not how to do it.
Well, TuPAQ is the SQL for machine learning. You give it a high-level goal, and it figures out which machine learning algorithm to use, and tunes the hyperparameters for you. Example code for speech-to-text translation from Evan Sparks et al:
SELECT vm.sender, vm.arrived,
GIVEN LabeledVoiceMails
FROM VoiceMails vm
WHERE vm.user = 'Bob' AND vm.listened is NULL
ORDER BY vm.arrived
When will you be able to use this in production? Hopefully, it's not too far away -- maybe a year, as a wild guess. At Spark Summit in June, 2015, Evan Sparks indicated KeystoneML would "soon" integrate with TuPAQas both KeystoneML and TuPAQ are AMPLab projects.
Although I gave KeystoneML a tepid review when it first came out, the new 0.3 version announced last weekshows the impressive direction they're headed in. Although not quite as declarative as TuPAQ, it is still declarative. An example of declaring a machine learning pipeline in KeystoneML:
val trainData = NewsGroupsDataLoader(sc, trainingDir)

val predictor = Trim andThen
    LowerCase() andThen
    Tokenizer() andThen
    NGramsFeaturizer(1 to conf.nGrams) andThen
    TermFrequency(x => 1) andThen
    (<CommonSparseFeatures(conf.commonFeatures), andThen
    (NaiveBayesEstimator(numClasses),, trainData.labels) andThen
Sure, the package from Spark MLlib is also pipeline-centric, but whereas simply relies on DataFrames/Catalyst/Tungsten to optimize each stage of the pipeline, KeystoneML analyzes and optimizes the pipeline as a whole. It "inspects the pipeline DAG and automatically decides where to cache intermediate output using a greedy algorithm."
Are there other declarative machine learning systems out there? Apache SystemML claims to be declarative, but it is only in that automatically plans deployment to a cluster based on data locality, available memory, etc.
SystemML claims that the high-level languages it provides, DML and PyDML, are "declarative", but they are not. They are still imperative languages. Their purpose is to allow non-Spark developers to write machine learning programs in languages they are comfortable in (like Python), yet be able to compile down to Spark Scala when the time comes to deploy to production. Thus, these are high-level languages as SystemML claims, but they are still imperative and not declarative. The ability of SystemML to plan optimal deployment to a cluster, however, is declarative.

Tuesday, March 29, 2016

Table of XX2Vec Algorithms

XX2VecEmbedInSup/UnsupAlgorithms used
Char2VecCharacterSentenceUnsupervisedCNN -> LSTM
Doc2VecParagraph VectorDocumentSupervisedANN -> Logistic Regression
Image2VecImage ElementsImageUnsupervisedDNN
Video2VecVideo ElementsVideoSupervisedCNN -> MLP
The powerful word2vec algorithm has inspired a host of other algorithms listed in the table above. (For a description of word2vec, see my Spark Summit 2015 presentation.) word2vec is a convenient way to assign vectors to words, and of course vectors are the currency of machine learning. Once you've vectorized your data, you are then free to apply any number of machine learning algorithms.
word2vec is able to come up with vectors by leveraging the concept of embedding. In a corpus, a word appears in the context of surrounding words, and word2vec uses those co-occurrences to infer relationships between those words.
All of the XX2Vec algorithms listed in the table above assign vectors to X's, where those X's are embedded in some larger context Y.
But the similarities end there. Each XX2Vec algorithm not only goes about it through means suited for its domain, but their use cases aren't even analagous. Doc2Vec, for example, is supervised learning whereas most of the others are unsupervised learning. The goal of Doc2Vec is to be able to apply labels to documents, whereas the goal of word2Vec and most of the other XX2Vec algorithms is simply to spit out vectors that you can then go and do other machine learning and analyses on (such as analogy detection).
Here is a brief description of each XX2Vec:


Like word2vec but because it operates at the character level, it is much more tolerant of misspellings and thus better for analysis of tweets, user product reviews, etc.


Described above. But one more note: it's one of those unreasonably effective algorithms -- a kind of getting lucky, if you will.


Instead of just getting lucky, there have been a number of efforts to ground the idea of word embeddings in something more mathematical than just pulling weights out of a neural network and hoping they work. GloVe is the current standard-bearer in this regard. Its model is designed from the ground up to support finding analogies, instead of just getting them by chance in word2vec.


Actually, Doc2Vec uses Word2Vec as a first pass. It then comes up with a composite vector for each sentence or paragraph from the contributing Word2Vec word vectors. This composite gives some kind of overall context to the sentence or paragraph, and then this composite vector is plopped down into the beginning of the sentence or pargraph as an "extra word". The paragraph vectors togeher with the word vectors are used to train a supervised-learning classifier using human labels of the documents.


Whereas word2vec intentionally uses a shallow neural network, Image2Vec uses a deep neural network and composes the resultant vectors from the weights from multiple layers of the network. Image elements that might be represented by these weights include image fragments (grass, bird, fence, etc.) or overall image qualities like color.


If machine learning on images involves high dimensions, videos involve even higher dimensions. Video2Vec does some initial dimension reduction by doing a first pass with convolutional neural networks.

Monday, March 21, 2016

DataFrame/DataSet swap places in Spark 2.0

In Spark 1.6, the developers behind Spark created DataSets by copying and pasting the code from DataFrames (and then added genericization and type safety). But in Spark 2.0, the tables are turned. Last week, Reynold Xin resolved SPARK-13880 "Rename DataFrame.scala as DataSet.scala. So what happens to DataFrames in Spark 2.0? Reduced to a single line of code:

type DataFrame = Dataset[Row]

So whereas it could be said in Spark 1.6 that DataSets are a derivation of DataFrames, it is specifically the case in Spark 2.0 that DataFrames are a derivation of DataSets.

Friday, March 11, 2016

Symmetric Difference in GraphX

A question was posed over at the online forums for my book about how to implement symmetry difference in GraphX. The answer is the code below.

import org.apache.spark.graphx._
val g = Graph(sc.makeRDD(Array((1L,""),(2L,""),(3L,""))),
val ids =
Graph(g.vertices, ids.cartesian(ids).filter(x => x._1 < x._2)
       .map(x => Edge(x._1,x._2,0)))
     (_,_,u) => u.get.toSet)
  .mapTriplets(et => ((et.srcAttr | et.dstAttr) &~
                      (et.srcAttr & et.dstAttr)).size)
  .map(et => (et.srcId, et.dstId, et.attr))

This short piece of code pulls a number of tricks. First is the overall strategy. The goal is to identify the symmetric difference size for every possible pair of vertices in the graph. This suggests that we need to do a Cartesian product to obtain all possible pairs of vertices. But rather than just getting the Cartesian product and doing an RDD map() directly off that, we instead create a whole new Graph where the edges are that Cartesian product. The reason is so that we can leverage outerJoinVertices() and glom on the set of nearest neighbors using collectNeighborIds()(which returns a VertexRDD, suitable for outerJoinVertices()).

And collectNeighborIds() itself is a powerful function that didn't get covered in my book. It's a convenient way to, for each vertex, gather the vertex Ids of all the neighbor vertices.

Finally, to compute the symmetry difference we use Scala Set operations, as the symmetry difference is defined as:

A Δ B = (A ∪ B) - (A ∩ B)

Note in Scala the set difference operator is &~ rather than -.

Saturday, March 5, 2016

Beyond GraphX in graphs for Spark

This week Databricks announced GraphFrames, a library posted to that is based on Spark SQL Dataframes rather than RDDs (as GraphX is). GraphFrames is still a work in progress -- it is currently at the 0.1 version -- so it provides interoperability with GraphX (graphs can be converted back and forth).
GraphFrames provides the graph querying capability that GraphX always had trouble with. GraphFrames, because it uses DataFrames from Spark SQL, allows you to query graphs using SQL. Plus GraphFrames sports a subset of Cypher, the query language from Neo4j.
I describe GraphFrames and provide some interesting examples in chapter 10 of my book. Chapter 10 was just released to the MEAP (Manning Early Access Program) for my book this week.
GraphFrames is also performant due to the two optimization layers built-in to Spark SQL: Catalyst and Tungsten. Catalyst is an RDBMS-style query plan optimizer, and Tungsten leverages the sun.misc.unsafe API to do direct OS memory access, bypassing the JVM (as well as avoiding garbage collection). Tungsten also performs code generation, generating JVM bytecode on the fly to access Tungsten-laid-out memory structures in a maximally efficient manner. One of the examples in my book shows an 8x speedup compared to the GraphX version.
And, in a hat tip to Andy Petrella, author of Spark Notebook, GraphFrames is not the only new library published on There are also:
  • Spark Centrality - Library for computing centrality for graph nodes
  • spark-beetweenness - k Betweenness Centrality algorithm for Spark using GraphX
  • sparkling-graph - Large scale, distributed graph processing made easy! Load your graph from multiple formats and compute measures (but not only)

Thursday, December 17, 2015

GPU off Apache Spark roadmap: Deeplearning4j best bet for Spark GPU

Last night, Reynold Xin took SPARK-3785 "Support off-loading computations to a GPU" off the Apache Spark road map, marking it "Closed" with a resolution of "Later". This is a little different than when GPU was mentioned at Spark Summit in June, 2015 as a possibility for Project Tungsten for 1.6 and beyond.
So for now, the best bet for using GPUs on Spark is Deeplearning4j, from which their architecture diagram above came. As I've blogged previously, the DL4J folks are waiting until they have solid benchmarks before advertising them. Nevertheless, today, you can do deep learning on GPU-powered Spark.