Wednesday, January 4, 2017

Spark Structured Streaming Supports Kafka Since November 2016

As I noted in my May 14, 2016 blog post, Spark Structured Streaming, which brings the ability to stream a data source into a DataFrame and query it with SQL in real-time, was announced with much fanfare (along with Spark 2.0) at Spark Summit 2016, but notably absent at the time was its support for Kafka.

Yes, Spark 2.1, released last week, now supports Kafka in Spark Structured Streaming. But so does Spark 2.0.2, quietly released on November 14, 2016.

So we no longer "have to wait for it" as I blogged last May.

Wednesday, October 26, 2016

Drizzle Brings Low-Latency Streaming to Spark; but RISE Lab is Just a Change in Funding

This morning at Spark Summit Europe 2016, Ion Stoica announced the Drizzle project during his keynote. Drizzle promises to reduce streaming latency in Spark to below that of Flink and Storm. Ion announced this in the context of the new RISE Lab at UC Berkeley.

Drizzle is an exciting and important new technology. RISE Lab is simply a change in funding at Berkeley. In fact, Drizzle was announced at Spark Summit (West) this past summer in the context of amplab, not RISE Lab.

Stoica also repeated the common wisdom that Spark came out of amplab, but in fact Matei's first paper on Spark and RDDs came out in 2010 under RAD Lab, the funding model that preceded amplab.

These changes, from RAD Lab to amplab to RISE Lab, are just changes in funding. The important things -- the people and the projects -- stay throughout. And Drizzle is an important project. Making the streaming tasks long-lived on Spark workers -- as opposed to launching fresh Spark jobs for every micro-batch as in today's Spark Streaming -- vastly improves latency and resiliency. They are reported to be better than Flink's, but keep in mind that the comparison is between a research project and something that is available today to put into production. Flink might improve further by the time Drizzle is released (I don't think the code is even available to download yet to try out).

To watch Ion's keynote, go to about 1:15:00 at

For more meaty details on Drizzle, see the Spark Summit (West) 2016 presentation Low Latency Execution for Apache Spark.

Saturday, August 27, 2016

Installation Quickstart: TensorFlow, Anaconda, Jupyter

What better way to start getting into TensorFlow than with a notebook technology like Jupyter, the successor to IPython Notebook? There are two little hurdles to achieve this:
  1. Choice of OS. Trying to use Windows with TensorFlow is as painful as trying to use Windows with Spark. But even within Linux, it turns out you need a recent version. CentOS 7 works a lot better than CentOS 6 because it has a more recent glibc.
  2. A step is missing from the TensorFlow installation page. From StackOverflow:
    conda install notebook ipykernel
Here then is the complete set of steps to achieve Hello World in TensorFlow on Jupyter via Anaconda:
  1. Use CentOS 7.2 (aka 1511), for example using VirtualBox if under Windows. This step may be unnecessary if you use OSX, but I just haven't tried it.
  2. Download and install Anaconda for Python 3.5.
  3. From the TensorFlow installation instructions:
    conda create -n tensorflow python=3.5
    source activate tensorflow
    conda install -c conda-forge tensorflow
  4. From StackOverflow:
    conda install notebook ipykernel
  5. Launch Jupyter:
    jupyter notebook
  6. Create a notebook and type in Hello World:
    import tensorflow as tf
    hello = tf.constant('Hello, TensorFlow!')
    sess = tf.Session()
    print(sess.run(hello))
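
Since the glibc version is what separates CentOS 6 from CentOS 7 in the first hurdle above, it can be handy to check it from the same notebook. The following is a minimal sketch using only the Python standard library; the helper name libc_info is hypothetical, and the post does not state a minimum glibc version for TensorFlow, so the snippet only reports what it finds.

```python
import platform


def libc_info():
    """Return (name, version) of the C library this Python is linked against.

    Both strings are empty when the platform module cannot determine them,
    which is typical on non-glibc systems.
    """
    lib, version = platform.libc_ver()
    return lib, version


lib, version = libc_info()
if lib:
    print("C library: {} {}".format(lib, version))
else:
    print("Could not determine the C library (probably not a glibc system)")
```

Comparing the value printed on a CentOS 6 machine with the one from CentOS 7 shows the difference the post refers to.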

Saturday, July 30, 2016

900TB, 4U, $60k

The impasse of Peak Hard Drive that I identified two years ago -- that era when 4TB hard drives had been the tops for three years straight -- has been breached, thanks to advances in both hard drives (HDDs) and Solid State Disks (SSDs). 10TB 3.5" HDDs can be ordered from Amazon now for less than $600, and the density of SSDs is skyrocketing as manufacturers promised two years ago. At $10,000, the 2.5" 15TB SSD is indeed dense, but at $670/TB it is twice as expensive per terabyte as enterprise 1TB SSDs.
Going forward, Toshiba is predicting 128TB SSDs by 2018 and 40TB HDDs by 2020. As shown in the chart above, SSDs would then firmly breach the long-term exponential growth line in density, while HDDs would merely continue their anemic density growth.

Rack chassis are getting more dense as well. SuperMicro now has a 4U chassis that can hold 90 3.5" HDDs.

Filled with the 10TB HDDs, that would be about $60k for 900TB storage in 4U of rack space. There is no similar solution I've found for 2.5" drives, so using the same chassis for the 15TB SSDs would be $900k for 1.4PB.
But of course that ignores the compute side of the equation. Such ultra-density in a rack would only be suitable for deep-freeze storage. For a Hadoop/Spark application to leverage data locality, a 1U chassis that accommodates 8 2.5" drives and dual processors from SuperMicro would be more appropriate. Populated with the 15TB SSDs, that would be 120TB/1U, or 960TB/8U. If each CPU socket were populated with 22-core Xeons, that would be a total of 352 cores in that 8U. Total cost would be about $750k for that 8U.
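
The arithmetic behind these figures can be reproduced with a short sketch. It uses only the numbers quoted in this post ($600 per 10TB HDD, $10,000 per 15TB SSD, 90 bays in the 4U chassis, 8 bays and 2 sockets per 1U server); the function name fill is hypothetical, and the totals cover drives only, so they come out below the rounded totals above wherever those include anything besides drives.

```python
# Figures quoted in the post (approximate 2016 prices).
HDD_TB, HDD_PRICE = 10, 600      # 10TB 3.5" HDD, "less than $600"
SSD_TB, SSD_PRICE = 15, 10000    # 15TB 2.5" SSD, about $10,000


def fill(bays, drive_tb, drive_price):
    """Return (total TB, total drive cost in dollars) for `bays` identical drives."""
    return bays * drive_tb, bays * drive_price


dense_hdd = fill(90, HDD_TB, HDD_PRICE)      # 4U chassis, 90 HDDs
dense_ssd = fill(90, SSD_TB, SSD_PRICE)      # same chassis, 90 SSDs
compute_8u = fill(8 * 8, SSD_TB, SSD_PRICE)  # eight 1U servers, 8 SSDs each
cores_8u = 8 * 2 * 22                        # 8 servers x 2 sockets x 22 cores
ssd_per_tb = round(SSD_PRICE / SSD_TB)       # dollars per terabyte

print(dense_hdd)    # (900, 54000)
print(dense_ssd)    # (1350, 900000)
print(compute_8u)   # (960, 640000)
print(cores_8u)     # 352
print(ssd_per_tb)   # 667
```

The drives-only cost of the 8U compute configuration is $640k, so the roughly $750k total quoted above leaves about $110k for everything else in those eight servers.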

Monday, July 18, 2016

Spark Sometimes Forgets to Put Distantly Scoped Variables into Your Closure

This has always been a gotcha with Spark and still is today, yet I rarely see a caution about it anywhere.

If your .map() needs access to a variable (or value), and that variable is not defined in the same immediate local scope, "sometimes" Spark will not include it in the closure, leading to erroneous results.

I've never been able to define "sometimes", and I've never been able to come up with a tiny example that demonstrates it. Nevertheless, below is a tiny bit of source code (which does work; that is, it does not demonstrate the problem) just to make clear what I'm talking about.

import org.apache.spark.SparkContext
import org.apache.spark.SparkConf

object closure {
  val a = Array(1,2)
  def main(args: Array[String]): Unit = {
    // Sometimes the line of code below is necessary (and change the
    // reference to a in the map() to a2 as well)
    // val a2 = a
    val sc = new SparkContext(new SparkConf().setMaster("local[2]").setAppName("closure"))
    println(sc.makeRDD(Array(3,4)).map(_ + a.sum).collect.mkString(";"))
    sc.stop()
  }
}

I've seen cases where a "sometimes" gets transmitted to the cluster as a zero-length array.

As background, functional languages like Scala compute closures, which means that when you pass a function as a parameter, it's not just the function that gets passed but all the variables and values that it requires along with it. Scala computes closures, but it does not serialize them for distributed computing. Spark has to compute and serialize its own closures, and sometimes it makes mistakes. Sometimes it's necessary to give it some help by moving the data you need into the same local scope so that it can pick it up.
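
The idea of copying a distantly scoped value into the immediate local scope can be illustrated outside Spark. The sketch below is only an analogy in plain Python, not Spark's closure handling or serializer, and the names make_adder_global, make_adder_local, captured_names and captured_values are hypothetical. It shows that a function referring to a module-level variable carries nothing in its closure, whereas copying the reference into a local first makes the value travel with the function.

```python
# Analogy only: plain Python, no Spark involved.
a = [1, 2]  # "distantly scoped": module level, like a field on a Scala object


def make_adder_global():
    # The lambda looks up the module-level `a` by name at call time;
    # nothing is captured in its closure.
    return lambda x: x + sum(a)


def make_adder_local():
    a2 = a  # copy the reference into the immediate local scope
    # The lambda now captures `a2` in a closure cell.
    return lambda x: x + sum(a2)


def captured_names(fn):
    """Names of the variables a function captures in its closure."""
    return list(fn.__code__.co_freevars)


def captured_values(fn):
    """Values currently held in the function's closure cells."""
    return [cell.cell_contents for cell in (fn.__closure__ or ())]


f_global = make_adder_global()
f_local = make_adder_local()

print(captured_names(f_global), captured_values(f_global))  # [] []
print(captured_names(f_local), captured_values(f_local))    # ['a2'] [[1, 2]]
print([f_local(x) for x in [3, 4]])                         # [6, 7]
```

Both functions give the same answer in a single process; the difference only matters once the function has to be shipped somewhere that the outer variable does not exist or has not been initialized.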

Saturday, June 11, 2016

Spark Summit 2016 Review

Spark Summit (West) 2016 took place this past week in San Francisco, with the big news of course being Spark 2.0, which among other things ushers in yet another 10x performance improvement through whole-stage code generation.
This is on top of the 2x performance improvement going from RDDs to 1.4 Dataframes, the 3.5x improvement going from 1.4 Dataframes to 1.5 Dataframes, and the miscellaneous improvements in Spark 1.6, including automatic cache vs. execution memory balancing. Overall, this is perhaps a 100x improvement from Spark 0.9 RDDs (July 2014) to Spark 2.0 Dataframes (July 2016).
And that 100x is on top of the improvement over Hadoop's disk-based MapReduce, which itself was another 100x speedup. So combined that's a 10,000x speedup from disk-based Hadoop MapReduce to memory-based Spark 2.0 Dataframes.
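
As a rough check on how these factors compound, here is a small sketch that multiplies the figures quoted above. The three named factors alone give 70x; the "perhaps 100x" is my rounded estimate that also folds in the unquantified Spark 1.6 improvements, so it is entered as a given rather than derived. The variable names are illustrative only.

```python
# Speedup factors quoted in the post.
named_factors = {
    "RDDs -> 1.4 Dataframes": 2.0,
    "1.4 -> 1.5 Dataframes": 3.5,
    "Spark 2.0 whole-stage code generation": 10.0,
}

product_of_named = 1.0
for factor in named_factors.values():
    product_of_named *= factor

# Rounded estimates from the post, taken as given.
spark_rdd_to_2_0 = 100          # "perhaps a 100x improvement"
mapreduce_to_spark_rdd = 100    # disk-based MapReduce to in-memory Spark

combined = spark_rdd_to_2_0 * mapreduce_to_spark_rdd

print(product_of_named)  # 70.0
print(combined)          # 10000
```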


During a panel at Spark Summit, a question was put to panelist Thomas Dinsmore as to whether Spark has been overhyped. The day before, I had actually met up with Thomas in the Speaker's lounge and he wanted to get the input from others as to the answer to this question. My response was: Spark might have been overhyped during the 1.x days, but with Spark 2.0 it's caught up to the hype generated during the 1.0 days.
The mantra with Spark has always been: it's in-memory so it's fast -- with an unstated implication that it's not possible to go any faster than that. Well, as we've seen, it was possible to go faster -- 100x as fast. Spark 2.0 delivers on the Spark 1.0 buzz. Now, Spark 1.x was certainly useful, and it's not like there was anything faster at the time for clusters of commodity hardware, but Spark 1.x carried with it a buzz that didn't get fully realized until Spark 2.0.
In another sign of maturity, Spark 2.0 was not rushed out the door prior to the Summit. This is in contrast to Spark 1.0, which two years ago I criticized as not being stable enough for a 1.0 moniker. (I think Spark 1.2 was really the 1.0 of Spark.)


The next-most obvious thing at the Summit was the proliferation of graphs, including a keynote by Capital One (although they used an external graph database rather than GraphX or GraphFrames) and of course my own. But besides these, Ankur Dave gave two talks, one on Tegra and one on GraphFrames. Tegra is interesting because it introduces to Spark for the first time graphs that can grow. It's research work built upon GraphX, but hopefully that technology will get added to GraphFrames.
Besides these four talks, there was yet another presentation, this one from Salesforce, on threat detection.


The third-most obvious thing to note about Spark Summit 2016 was the crowds. There were 2500 attendees this year compared to last year's 1500. I've attended some crowded events in my lifetime -- including the 1985 Beach Boys concert on the National Mall (lawn) attended by 800,000 and the 1993 papal Mass in Denver's Cherry Creek State Park attended by 500,000 -- but neither was as crowded as Spark Summit 2016. I'm just glad I was able to hide out in the Speaker's lounge during mealtimes. And I feel sorry for the vendors, as the vendor expo this time had to be shunted off to a separate tower of the hotel conference center.
It seems they're going to need Moscone next time.


There were five parallel tracks this time compared to three tracks a year ago. I spent most of my time in the "research" track, which is one of the two new tracks. And there were a lot of interesting talks there:
  • Yggdrasil: Faster Decision Trees Using Column Partitioning in Spark
  • Low-latency Execution for Apache Spark - a modification to Spark whereby the number of communication round trips to the driver is minimized.
  • Re-architecting Spark for Performance Understandability - a rewrite of Spark such that each task consumes only one type of resource (CPU, disk, or network) so that performance and bottlenecks can be visualized as well as be reasoned about by a scheduler. Although Kay Ousterhout's previous scheduling work, Sparrow, never made it into Spark, my hunch is that this eventually will. I suspect that three years ago there were so many other ways to improve performance (such as Tungsten, which as noted above has given a 100x performance improvement through Spark 2.0) that the cost/benefit of incorporating Sparrow wouldn't have been worth it. But now that other avenues have been maxed out, the time is ripe for scheduler improvements.

Corporate Support

The final message was just all the big name vendors jumping on board: Microsoft, IBM, Vertica. I remember back when even Cloudera refused to support Spark (in early 2013).