Tuesday, February 24, 2015

State of the Stream 2015Q1

The big news this week -- and for the past couple of months, really -- is of course eBay open sourcing theirPulsar real-time analytics framework, which allows SQL queries into a real-time data stream. It's built on top of their Jetstream data streaming framework, also open sourced this month, which, from its list of capabilities seems to be a layer akin to Storm and Spark Streaming, with sinks and sources, Kafka connectivity, and REST interfaces for cluster monitoring.
With all the breathless news items out there about Pulsar, they neglect to mention one important fact: it is GPL, which precludes its use at a lot of organizations.
This is in stark contrast to some other big news this past week: Druid just switched to an Apache license from its use of the GPL.
Streaming is important because as I've previously blogged, every Big Data project eventually becomes a data streaming project, because once insights are found in Big Data, the consumers of those insights request it be re-run with ever more up-to-date data, until eventually they request real-time updates.
I've also previously blogged, over a year ago now, on the comparison of streaming frameworks, especially comparing Storm with Spark Streaming. At the time, I relayed how at the February, 2013 Strata, several speakers spoke of how Trident, the layer that gives Storm exactly once semantics, makes Storm non-performant. I also wrote at that time how Spark Streaming, even though it had exactly-once semantics, lacked things like graceful shutdown and transactions.
Well, there is recent news about about both Spark Streaming and Storm. First, Spark Streaming gained high availability in 1.2.0, released in December, 2014. And although it has exactly once semantics, doing so with Kafka is a bit tricky since Kafka itself does not guarantee exactly once when a node in a Kafka cluster goes down. However, Spark 1.3.0 will have Kafka exactly once semantics. Spark 1.3.0 release candidates are being voted on now, and will probably be released in the first half of March, 2015.
As for performance, a University of Toronto grad student recently benchmarkedStorm vs. Spark Streaming:

"Storm was around 40% faster than Spark, processing tuples of small size (around the size of a tweet). However, as the tuple's size increased, Spark had better performance maintaining the processing times."