Wednesday, June 7, 2017

Spark Summit 2017 Review


Spark Summit 2017 was all about Deep Learning. Databricks, which has long offered deep learning with GPUs on its commercial cloud service, announced that it is open sourcing a deep learning library, Deep Learning Pipelines, which seems to lack GPU support. Similarly, Intel open sourced its own deep learning library, BigDL, also without GPU support, because Intel is pushing its FPGA-juiced Xeons for accelerated BLAS for machine learning (which I first blogged about three years ago).

For now, the leading contender for Spark GPU deep learning still seems to be DeepLearning4j, which is what I used in my Spark Summit 2017 presentation Neuro-Symbolic AI for Sentiment Analysis. (I will link video and slides once they are posted.)

The big announcement on the second day (non-training) of the Summit was that Databricks created a serverless version of its commercial cloud service. This should, at least theoretically, significantly reduce the cost for companies making Spark available to their data scientists, thus (finally) offering a compelling reason to use it over trying to run Zeppelin, Jupyter, or Spark Shell on-premises.

A year out from Spark Summit 2016, I was surprised to hear about so many real-world uses of GraphX. The only thing I personally heard about GraphFrames was from a Databricks presentation. GraphFrames does still seem to be the future, but even that is not crystal clear, as Ion Stoica in the second day's Fireside Chat touted Tegra, which is based on GraphX rather than GraphFrames, for (finally) mutable graphs. (I first blogged about Tegra in my review of last year's Spark Summit.)

There was more natural language processing (NLP) at the Summit than ever before. At the Fireside Chat, Ben Lorica pushed hard on Ion Stoica and Matei Zaharia to incorporate NLP into the Apache Spark distribution. My favorite keynote was by Riot Games, on language-agnostic (English, Chinese, Japanese -- it didn't care) detection of abusive language in chat messages. And, of course, my own presentation was on NLP.

Finally, Structured Streaming was officially labeled production-ready, meaning Spark Streaming is eventually destined for the deprecation graveyard. There was a demo of 10ms latency, to compete with Storm and Flink. No more micro-batches!