Saturday, December 28, 2013

Real-time Data Science

Looking back on 2013, the world of Hadoop emerged from the era of batch processing and into streaming processing. In the contextg of "crisp and actionable," actionable often comes with an expiration date. If you don't take action soon enough, it's too late.

As I learned at the February 2013 Strata conference, Electronic Arts instrumented their online games to continuously stream data into their Big Data system. And because much of their revenue now comes from in-game purchases (in games that are often free to play), they need to know immediately if, e.g., in-game purchases suddenly drop off for whatever reason. They need to take action immediately, not next week.

Here I look at four streaming technologies, and then conclude with some pointers to general streaming computation techniques.

Part of the Berkeley Data Analytics Stack, Spark Streaming is a layer on top of Spark. Spark is a distributed, redundant RAM-based data storage and computation system that was accepted as an Apache Incubator project in 2013. Whereas Hadoop distributes data across the hard drives in a cluster of computers, Spark distributes data across the RAMs in a cluster of computers.

Spark Streaming breaks incoming stream into batches of, for example, 2-second batches or 10-second batches, or however long you set the batch window to be. One criticism of Spark Streaming is that it is not "true real time" since it is up to a batch window size behind. Another limitation is that all processing for that batch must complete before the next batch comes in; otherwise, the behavior is undefined. That can leave a lot of compute capacity unused if the system is tuned to handle the worst-case batch, but normal-size batches leave the processors idle for much of the batch time window.

Storm, another Apache project originally from Twitter, is the other big name in streaming. Its biggest and most well-known shortcoming is that it doesn't guarantee that it won't duplicate events. It guarantees each event will delivered at minimum once. Since this can wreak havoc with real-time metric computation, an optional layer called Trident can de-dupe, but at a high performance penalty.

Storm has a great advantage over Spark Streaming in that it has an "ack" system built-in. If processing an event throws an exception or otherwise fails to "ack" its parent in the directed acyclic graph (DAG) of event flow processing, the event can be saved by the parent for later processing when the problem has been fixed.
Storm also has a web-based admin GUI for starting/stopping/monitoring running DAGs.

Since I don't have any first-hand experience with S4, I have to rely on others' evaluations. At a presentation by Ted Dunning of MapR, he indicated that S4 was designed for topologies and situations that are more complex than usually encountered in the real world. A Yahoo! researcher blogged an excellent comparison between S4 and Storm, and boils it down to guaranteed delivery (Storm) vs. state recovery in case of fault (S4).

Druid is different than the rest in that it's more than just streaming; it also provides a data store and querying system. It's an all-in-one solution whereas the others are usually married to Hadoop.

Druid breaks the storage into two parts, one for real-time (typically 30 minutes worth of data), and the other for permanent "deep storage". A single querying facade can query both stores transparently.

The bad news is that the querying is currently limited. With no SQL, queries must be formed using a stilted JSON syntax, and the ability to create complex queries is limited.
But querying is fast. In fact, it's Druid's hallmark. In response to the common problem of cubing data with high dimensionality, the Druid folks decided instead to just be able to query really fast by storing the data in a distributed RAM-based cluster. With fast querying, there's no need to pre-compute and pre-aggregate.

LinkedIn contributed Samza to Apache in September, 2013. It's new and I haven't tried it yet, but it appears to integrate Kafka and YARN more deeply than the others, and it maintains state similar to how Spark Streaming allows with its RDDs (resilient distributed dataset) and its updateStateByKey() method.

While all the others are for self-hosted infrastructure, Amazon Kinesis is strictly for the could. Again, since I haven't used it myself, I rely on another blogger who informs that Kinesis is restricted to a single procedure, and so isn't suited for complex processing. There is also no rollback capability.

Stream Computation Techniques

It's evident fairly quickly that the mean can be computed in an incremental manner. If mn is the mean of x1 ... xn, then mn+1 = (n*mn + xn+1)/(n+1).

Much less obvious is that it's possible to calculate the standard deviation incrementally. Donald Knuth credits a 1962 paper by Welford with a clever technique.

Even more clever is calculating cardinality incrementally, i.e. the number of distinct items in a set, i.e. SELECT COUNT DISTINCT(*) -- and doing so without having to keep the entire dataset in memory. The trick is to use a hash function that uniformly distributes to the range [0.0,1.0) and then keep track of the most number of leading zeroes. E.g. if the set as 100 items, probabilistically we would expect the smallest hash value to end up being around 0.01, or one leading zero. For a set of 1000 items, some hash values by chance would probably end up being less than that, and the smallest overall would probably end up being around 0.001, or two leading zeroes.

Finally, classic machine learning algorithms such as K-means and decision tree construction have been adapted to incremental versions for streaming data. Joao Gama, a Portuguese professor, has assembled them into an excellent book:


Basic metrics permit real-time monitoring for:

  • Fraud detection
  • Fault detection
  • Capacity overload detection

Streaming machine learning adds:

  • Intelligent automated adjustment of thresholds to the above
  • Ability to detect opportunities, not just problems. E.g. product affinity of noticing products long-time product A and brand-new product B could alert management to craft a marketing campaign for product B to those who have bought A in the past.


While "real-time data science" doesn't mean taking conventional, manual, deep data exploration and analysis and applying it to real-time data, what it does mean for data scientists is that they are increasingly feeling the pressure to have some results automated and real-time, and data scientists need to be comfortable with these streaming technologies and algorithms.

Saturday, December 21, 2013


No doubt you've encountered the image below from Gizmodo in some PowerPoint somewhere this year.

But that same PowerPoint likely didn't bother to answer the next logical question:

How to get to causality?

It's not an easy question to answer. Having really, really good correlation is definitely not the answer. First a couple of counterexamples.

Common ancestral cause

Putting aside spurious correlations such as the one above, the much more common scenario is that of a common cause, such as shown below. Finding the correlation of "street wet" and "hair wet" in some data set does not lead to the conclusion that one follows from the other.

Indirect cause

Well, the way to avoid coincidental correlation is apply the "gold standard" of statistics and scientific experimentation, the controlled randomized experiment, right? Consider the following experiment. We set up a bunch of plugged-in microwaves, each with its own cup of room temperature water with a tea bag inserted. For a randomized half of the set of microwaves, we push the "start" button on the microwave (the "treatment" to use the terminology from randomized experimentation), and on the other half we do not push the button.

The results are highly correlated. We've employed the gold standard of scientific experimentation. Can we say that finger pushing causes hot tea? In a sense, yes, but not in the common sense of the word. What happened?

Three things. First, the finger pushing is an indirect cause, as shown below.

Second, to use the terminology from the study of causation, finger pushing is a sufficient cause in the context of the experimental conditions, but it is not necessary. It is not necessary because we could have also arrived at hot tea by opening up the microwave and assaulting the cup with a blowtorch. Although there are a lot of common sense uses of the word "causation" that lack "necessity", the strongest types of causation are both necessary and sufficient.

Third, pushing the button on the microwave is really just a contributory cause, isn't it? Our experimental assumptions included that the microwave was plugged in, so it's getting it's energy from there, as shown below.

Contributory Cause

And so on... we could trace the right branch all the way back to the sun, the Big Bang, and the Aristotelian First Cause. But just that "Electricity generated from coal" makes a much better common sense "cause" than does finger pushing. It's because that is where the "motion" is coming from -- the turbine spinning at the power plant is causing the water to heat up in our microwave. The finger pushing is merely a contributory cause.

Scientific Method

Now the confession, and the heart of the matter. I've (mis)led you down this path to illustrate the complexities of determining causation. Causation is the meat of the philosophical giants around the world and throughout time.

But what about the scientific method? Isn't that supposed to allow us to establish causation through repeated experimentation, at least the "sufficiency" form of causation?
Let's review the process of the scientific method:
  1. Hypothesize
  2. Experiment
  3. Analyze
It's all scientific and straightforward, right? Pay closer attention to step #1. Hypothesize. What is happening there? That's where the magic of causation is happening, and not in step #3, analysis and statistics, where it is often presumed. Hypothesis forming comes from the advanced intellect of people. People form models of how things work in their minds and form hypothesis, and then try to verify or disprove those hypotheses.


Just to be clear, I am not referring to statistical models, but rather models like the atomic model or an engineering model of a car engine. These models can only come from the minds of people (or AI that mimics in some way the minds of people -- e.g. automated Bayesian network construction or automated semantic analysis). Models don't get spit out from a statistical analysis.

And recall the complexities of the philosophy of causation discussed at the beginning of this post. They are still there. So causation requires modeling and philosophy, both of which are hard and messy.

That can't fit on a single PowerPoint, so it's no wonder there's not a slide on it following the infamous IE/Murder correlation slide.

Wednesday, December 4, 2013

SC13 reports

Last month, I attended SC13, the annual International Supercomputing conference sponsored by IEEE and ACM. Below are links to my three reports from my attending there:

Game changer for HPC: IBM and NVidia novel architecture

GPU is common ground between HPC and Hadoop

SC13 wrap-up: Exotic computing technologies