Wednesday, October 26, 2016

Drizzle Brings Low-Latency Streaming to Spark, but RISE Lab Is Just a Change in Funding


This morning at Spark Summit Europe 2016, Ion Stoica announced the Drizzle project during his keynote. Drizzle promises to reduce streaming latency in Spark to below that of Flink and Storm. Ion made the announcement in the context of the new RISE Lab at UC Berkeley.

Drizzle is an exciting and important new technology. RISE Lab is simply a change in funding at Berkeley. In fact, Drizzle was already announced at Spark Summit (West) this past summer in the context of AMPLab, not RISE Lab.

Stoica also repeated the common wisdom that Spark came out of AMPLab, but in fact Matei Zaharia's first paper on Spark and RDDs came out in 2010 under RAD Lab, the funding model that preceded AMPLab.

These changes, from RAD Lab to AMPLab to RISE Lab, are just changes in funding. The important things -- the people and the projects -- stay throughout. And Drizzle is an important project. By making the streaming tasks long-lived on Spark workers -- as opposed to launching fresh Spark jobs for every micro-batch, as today's Spark Streaming does -- latency and resiliency are vastly improved. Both are reported to be better than Flink's, but keep in mind that the comparison is between a research project and something that is available today to put into production. Flink might improve further by the time Drizzle is released (I don't think the code is even available to download and try out yet).
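To make the long-lived-task argument concrete, here is a toy cost model in Python. It is only a back-of-the-envelope sketch, not Drizzle's actual scheduler: the function names and every number in it are hypothetical, and it assumes a fixed scheduling cost that is either paid on every micro-batch or paid once up front.

```python
def per_batch_total_ms(num_batches, schedule_ms, process_ms):
    """Total time when every micro-batch launches a fresh job,
    so the scheduling cost is paid once per batch."""
    return num_batches * (schedule_ms + process_ms)


def long_lived_total_ms(num_batches, schedule_ms, process_ms):
    """Total time when tasks are scheduled once and then reused
    for every subsequent micro-batch."""
    return schedule_ms + num_batches * process_ms


if __name__ == "__main__":
    # Hypothetical numbers, chosen only for illustration.
    batches, schedule_ms, process_ms = 100, 50, 10

    fresh = per_batch_total_ms(batches, schedule_ms, process_ms)
    reused = long_lived_total_ms(batches, schedule_ms, process_ms)

    print("fresh job per micro-batch:", fresh / batches, "ms per batch")   # 60.0
    print("long-lived tasks:         ", reused / batches, "ms per batch")  # 10.5
```

Under these made-up numbers the per-batch cost is dominated by scheduling rather than by the actual work, which is the overhead that long-lived tasks are meant to amortize. Real gains would depend on the workload and on details this model ignores.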

To watch Ion's keynote, go to about 1:15:00 at http://livestream.com/fourstream/sparksummiteu16-tracka/videos/140168779

For meatier details on Drizzle, see the Spark Summit (West) 2016 presentation "Low Latency Execution for Apache Spark".