Friday, October 27, 2017

The Spark GraphX of actual combat

Earlier this year, my book was translated into a Chinese edition. It actually has sold extremely well. I just noticed that Amazon has a product page for it, and they've given it the title The Spark GraphX of actual combat (Chinese Edition).

My hunch is that's what one gets if one translates "Spark GraphX in Action" into Chinese and then back into English.

Friday, October 20, 2017

Neo4j's query language Cypher coming to Spark

In my 2016 Spark Summit presentation Finding Graph Isomorphisms in GraphX and GraphFrames I reviewed the history of graphs in Spark, and how to query a graph in Spark GraphX required many more lines of code than an equivalent query in Neo4j using its Cypher language. Even Spark GraphFrames, which implements a tiny, tiny subset of Cypher requires more code than full Cypher.

Two years ago at the 2015 GraphConnect (an event sponsored by Neo4j), Ion Stoica of Databricks announced:
We look forward to bringing Cypher's graph pattern matching capabilities into the Spark stack, making graph querying more accessible to the masses.
Well, two years later, Neo4j announced yesterday:
Neo4j, a leader in connected data, announced that it has released the preview version of Cypher for Apache Spark (CAPS) language toolkit. 
[...] Until now, data scientists have been using Spark and query tools like GraphX to define extensions to their graphs. Once identified, they would then re-implement and deploy that work within their applications. Now, with Cypher for Apache Spark, these scientists can iterate easier and connect adjacent data sources to their graph applications much more quickly. 
[...] This announcement builds on Neo4j’s unveiling of openCypher in October 2015, as an effort to push the whole graph industry forward by tapping into the open source community and making Cypher’s evolution an open exercise while avoiding redundant research.

Wednesday, June 7, 2017

Spark Summit 2017 Review

Spark Summit 2017 was all about Deep Learning. Databricks, which has long offered deep learning with GPUs on its commercial cloud service, announced they are open sourcing a deep learning library Deep Learning Pipelines which seems to lack GPU support. Similarly, Intel open sourced their own deep learning library, BigDL, also without GPU support, because Intel is pushing their FPGA-juiced Xeons for accelerated BLAS for machine learning (which I first blogged about three years ago).

For now, the leading contender for Spark GPU deep learning still seems to be DeepLearning4j, which is what I used in my Spark Summit 2017 presentation Neuro-Symbolic AI for Sentiment Analysis. (I will link video and slides once they are posted.)

The big announcement the second day (non-training) of the Summit was that Databricks created a serverless version of its commercial cloud service. This should, at least theoretically, significantly reduce the cost for companies making Spark available to their data scientists, thus (finally) offering a compelling use over trying to run Zeppelin, Jupyter, or Spark Shell on-premises.

A year out from Spark Summit 2016, I was surprised to hear about so many real-world uses of GraphX. The only thing I personally heard about GraphFrames was from a Databricks presentation. GraphFrames does still seem to be the future, but even that is not crystal clear, as Ion Stoica in the second day's Fireside Chat touted Tegra for (finally) mutable graphs, which is based on GraphX rather than GraphFrames. (I first blogged about Tegra in my review of last year's Spark Summit.)

There was more natural language processing (NLP) at the Summit than ever before. At the Fireside Chat, Ben Lorica pushed hard on Ion Stoica and Matei Zaharia to incorporate NLP into the Apache Spark distribution. My favorite keynote was by Riot Games on language-agnostic (English, Chinese, Japanese -- it didn't care) chat text messaging abusive language detection. And, of course, my own presentation was on NLP.

Finally, Structured Streaming finally got officially labeled as production-ready, meaning Spark Streaming will eventually destined for the deprecation graveyard. There was a demo of 10ms latency, to compete with Storm and Flink. No more micro-batches!

Tuesday, March 7, 2017

Zeppelin installation tips

If you need to run Apache Zeppelin either a) on a headless server or b) behind a proxy, see below.

Headless server

From your expanded zeppelin directory:

cp conf/zeppelin-site.xml.template conf/zeppelin-site.xml
nano conf/zeppelin-site.xml

And change zeppelin.server.addr to be either the IP address or the domain name of this server. This is to allow outside connections.


Zeppelin seems to need npm from node.js, which in turn needs to know your proxy settings. To get around this, install node.js yourself (instead of relying on what is built in to Zeppelin) and execute npm config to set its proxy settings. Below includes the instructions for installing node.js onto RedHat-type Linux distributions (CentOS, Oracle Linux, etc.). See for other OS's.

export http_proxy=<your http proxy>
export https_proxy=<your https proxy>

wget curl --silent --location
chmod 777 setup_6.x
sudo ./setup_6.x

sudo yum install -y nodejs

npm config set proxy <your http proxy>
npm config set https-proxy <your https proxy>

Wednesday, January 4, 2017

Spark Structured Streaming Supports Kafka Since November 2016

As I noted in my May 14, 2016 blog post, Spark Structured Streaming, which brings the ability to stream a data source into a DataFrame and query it with SQL in real-time, was announced with much fanfare (along with Spark 2.0) at Spark Summit 2016, but notably absent at the time was its support for Kafka.

Diagram from

Yes, Spark 2.1, released last week, now supports Kafka in Spark Structured Streaming. But so does Spark 2.0.2, quietly released on November 14, 2016.

So we no longer "have to wait for it" as I blogged last May.

Wednesday, October 26, 2016

Drizzle Brings Low-Latency Streaming to Spark; but RISE Lab is Just a Change in Funding

This morning at Spark Summit Europe 2016, Ion Stoica announced during his keynote the Drizzle project, which promises to reduce streaming data latency in Spark to be less than Flink and Storm. Ion announced this in the context of the new RISE Lab at UC Berkeley.

Drizzle is an exciting and important new technology. RISE Lab is simply a change in funding at Berkeley. In fact, Drizzle was announced at Spark Summit (West) this past summer in the context of amplab, not RISE Lab.

Stoica also repeated the common wisdom that Spark came out of amplab, but in fact Matei's first paper on Spark and RDDs came out in 2010 under RAD Lab, the funding model that preceded amplab.

These changes, from RAD Lab to amplab to RISE lab are just changes in funding. The important things -- the people and the projects -- stay throughout. And Drizzle is an important project. By making the streaming tasks long-lived on Spark workers -- as opposed to launching all-new fresh Spark jobs for every micro-batch as in today's Spark Streaming -- latency and resiliency are vastly improved. They are reported to be better than Flink, but keep in mind that the comparison there is between a research project vs. something that is available today to put into production. Flink might improve further by the time Drizzle is released (I don't think the code is even available to download yet to try out).

To watch Ion's keynote, go to about 1:15:00 at

For more meaty details on Drizzle, see the Spark Summit (West) 2016 presentation Low Latency Execution for Apache Spark.