And that 100x is on top of the improvement over Hadoop's disk-based MapReduce, which itself was another 100x speedup.So combined that's 10,000x speedup from disk-based Hadoop MapReduce to memory-based Spark 2.0 Dataframes.
During a panel at Spark Summit, a question was put to panelist Thomas Dinsmore as to whether Spark has been overhyped. The day before, I had actually met up with Thomas in the Speaker's lounge and he wanted to get the input from others as to the answer to this question. My response was: Spark might have been overhyped during the 1.x days, but with Spark 2.0 it's caught up to the hype generated during the 1.0 days.
The mantra with Spark has always been: it's in-memory so it's fast -- with an unstated implication that it'snot possible to go any faster than that. Well, as we've seen, it was possible to go faster -- 100x as fast. Spark 2.0 delivers on the Spark 1.0 buzz. Now, Spark 1.x was certianly useful, and it's not like there was anything faster at the time for clusters of commodity hardware, but Spark 1.x carried with it a buzz that didn't get fully realized until Spark 2.0.
The next-most obvious thing at the Summit was the proliferation of graphs, including a keynote by Capital One(although they used an external graph database rather than GraphX or GraphFrames) and of course my own. But besides these, Ankur Dave gave two talks, one on Tegra and one on GraphFrames. Tegra is interesting because it introduces to Spark for the first time graphs that can grow. It's research work that was done built upon GraphX, but hopefully that technology will get added to GraphFrames.
Besides these four talks, there was yet another presentation, this one from Salesforce, on threat detection.
The third-most obvious thing to note about Spark Summit 2016 were the crowds. There were 2500 attendees this year compared to last year's 1500. I've attended some crowded events in my lifetime -- including the 1985 Beach Boys concert on the National Mall (lawn) attended by 800,000 and the 1993 papal Mass in Denver's Cherry Creek State Park attended by 500,000 -- but neither was as crowded as Spark Summit 2016. I'm just glad I was able to hide out in the Speaker lounge during mealtimes. And I feel sorry for the vendors, as the vendor expo this time had to be shunted off to a separate tower of the hotel conference center.
It seems they're going to need to Moscone this next time.
There were five parallel tracks this time compared to three tracks a year ago. I spent most of my time in the "research" track, which is one of the two new tracks. And there were a lot of interesting talks there:
Re-architecting Spark for Performance Understandability - a rewrite of Spark such that each task consumes only one type of resource (CPU, disk, or network) so that performance and bottlenecks can be visualized as well as be reasoned about by a scheduler. Although Kay Ousterhout's previous scheduling work, Sparrow, never made it into Spark, my hunch is that this eventually will. My hunch is that at the time three years ago, there were so many other ways to improve performance (such as Tungsten which as noted above has given a 100x performance improvement through Spark 2.0) that the cost/benefit at the time for incorporating Sparrow wouldn't have been worth it. But now that other avenues have been maxed out, the time is ripe for scheduler improvements.
The final message was just all the big name vendors jumping on board: Microsoft, IBM, Vertica. I remember back when even Cloudera refused to support Spark (in early 2013).