Saturday, June 11, 2016

Spark Summit 2016 Review

Summit (West) 2016 took place this past week in San Francisco, with the big news of course being Spark 2.0 which among other things ushers in yet another 10x performance improvement through whole-stage code generation.
This is on top of the 2x performance improvement going from RDDs to 1.4 Dataframes, the 3.5x improvement going from 1.4 Dataframes to 1.5 Dataframes, and the miscellaneous improvements in Spark 1.6 includingautomatic cache vs. execution memory balancing. Overall, this is perhaps a 100x improvement from Spark 0.9 RDDs (July 2014) to Spark 2.0 Dataframes (July 2016).
And that 100x is on top of the improvement over Hadoop's disk-based MapReduce, which itself was another 100x speedup.So combined that's 10,000x speedup from disk-based Hadoop MapReduce to memory-based Spark 2.0 Dataframes.


During a panel at Spark Summit, a question was put to panelist Thomas Dinsmore as to whether Spark has been overhyped. The day before, I had actually met up with Thomas in the Speaker's lounge and he wanted to get the input from others as to the answer to this question. My response was: Spark might have been overhyped during the 1.x days, but with Spark 2.0 it's caught up to the hype generated during the 1.0 days.
The mantra with Spark has always been: it's in-memory so it's fast -- with an unstated implication that it'snot possible to go any faster than that. Well, as we've seen, it was possible to go faster -- 100x as fast. Spark 2.0 delivers on the Spark 1.0 buzz. Now, Spark 1.x was certianly useful, and it's not like there was anything faster at the time for clusters of commodity hardware, but Spark 1.x carried with it a buzz that didn't get fully realized until Spark 2.0.
In another sign of maturity, Spark 2.0 was not rushed out the door prior to the Summit. This is in contrast to Spark 1.0, which two years ago I criticized as not being stable enough for a 1.0 moniker. (I think Spark 1.2 was really the 1.0 of Spark.)


The next-most obvious thing at the Summit was the proliferation of graphs, including a keynote by Capital One(although they used an external graph database rather than GraphX or GraphFrames) and of course my own. But besides these, Ankur Dave gave two talks, one on Tegra and one on GraphFrames. Tegra is interesting because it introduces to Spark for the first time graphs that can grow. It's research work that was done built upon GraphX, but hopefully that technology will get added to GraphFrames.
Besides these four talks, there was yet another presentation, this one from Salesforce, on threat detection.


The third-most obvious thing to note about Spark Summit 2016 were the crowds. There were 2500 attendees this year compared to last year's 1500. I've attended some crowded events in my lifetime -- including the 1985 Beach Boys concert on the National Mall (lawn) attended by 800,000 and the 1993 papal Mass in Denver's Cherry Creek State Park attended by 500,000 -- but neither was as crowded as Spark Summit 2016. I'm just glad I was able to hide out in the Speaker lounge during mealtimes. And I feel sorry for the vendors, as the vendor expo this time had to be shunted off to a separate tower of the hotel conference center.
It seems they're going to need to Moscone this next time.


There were five parallel tracks this time compared to three tracks a year ago. I spent most of my time in the "research" track, which is one of the two new tracks. And there were a lot of interesting talks there:
  • Yggdrasil: Faster Decision Trees Using Column Partitioning in Spark
  • Low-latency Execution for Apache Spark - a modification to Spark whereby the number of communication round trips to the driver is minimized.
  • Re-architecting Spark for Performance Understandability - a rewrite of Spark such that each task consumes only one type of resource (CPU, disk, or network) so that performance and bottlenecks can be visualized as well as be reasoned about by a scheduler. Although Kay Ousterhout's previous scheduling work, Sparrow, never made it into Spark, my hunch is that this eventually will. My hunch is that at the time three years ago, there were so many other ways to improve performance (such as Tungsten which as noted above has given a 100x performance improvement through Spark 2.0) that the cost/benefit at the time for incorporating Sparrow wouldn't have been worth it. But now that other avenues have been maxed out, the time is ripe for scheduler improvements.

Corporate Support

The final message was just all the big name vendors jumping on board: Microsoft, IBM, Vertica. I remember back when even Cloudera refused to support Spark (in early 2013).