Thursday, May 15, 2014

Apache Spark 1.0 almost here. Is it ready with 16 "unresolved blockers" in Jira?

Apache Spark 1.0 is to be released any day now; currently "release candidate 6 (rc6)" is being evaluated and will be voted upon imminently. But is it ready?

There are currently 16 issues marked as "unresolved blockers" in Jira for Spark, at least one of which is known to produce erroneous data results.

Then there is the state of the REPL, the interactive Spark Shell recently lauded for making Spark accessible to data scientists, as opposed to just hard-core software developers. Because the Spark Shell wraps every user-entered command and class to do its interactive magic, some basic Spark functions fail to operate, such as lookup() and anything that relies on equals() for a compound key (i.e., a custom Scala class rather than a plain String or Int key), which breaks groupByKey() and the other combineByKey() derivatives. It even affects map(), the most fundamental of all functional programming operations.

Even putting the REPL aside and writing full-fledged programs in Scala, Spark's native language, simple combinations such as map() followed by lookup() throw exceptions.

Don't get me wrong. Spark is a great platform, and is where it should be after two years of open source development. It's the "1.0" badge that I object to. It feels more like a 0.9.2 release.

Sunday, May 11, 2014

GeoSparkGrams: Tiny histograms on a map with IPython Notebook and d3.js


Daily variation of barometric pressure (maximum minus minimum for each day) in inches, for the past 12 months. For each of the hand-picked major cities, the 365 daily ranges for that city are histogrammed.

Here "spark" is in reference to sparklines, not Apache Spark. Last year I showed tiny histograms, which I coined as SparkGrams, inside an HTML5-based spreadsheet using the Yahoo! YUI3 Javascript library. At the end of the row or column, a tiny histogram inside a single spreadsheet cell showed at a glance the distribution of data within that row or column.

This time, I'm placing SparkGrams on a map of the United States, so I call these GeoSparkGrams, and I'm using IPython Notebook and d3.js. The notebook also automatically downloads the data from NOAA.

The motivation behind this analysis is to find the best place to live in the U.S. for those sensitive to barometric volatility.

The above notebook requires IPython Notebook 2.0 (released on April 1, 2014) for its new inline HTML capability and its ease of integrating d3.js.
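
As a minimal illustration of that capability (a toy sketch, not the GeoSparkGrams notebook itself), IPython's HTML display object renders raw markup inline in an output cell; the real notebook generates its SVG with d3.js through this same pathway:

from IPython.display import HTML

# Toy example: render a tiny static SVG "histogram" inline in a notebook cell.
# (The GeoSparkGrams notebook builds its SVG with d3.js via the same HTML display path.)
bars = [4, 9, 14, 7, 2]
rects = "".join('<rect x="%d" y="%d" width="8" height="%d"/>' % (i * 10, 20 - h, h)
                for i, h in enumerate(bars))
HTML('<svg width="60" height="20">' + rects + '</svg>')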

Sunday, May 4, 2014

Matplotlib histogram plot from Numpy histogram data

Of course Pandas provides a quick and easy histogram plot, but if you're fine-tuning your histogram data generation in NumPy, it may not be obvious how to plot the result. It takes just one line (plus the histogram computation itself):

import numpy, pandas
# numpy.histogram returns (counts, bin_edges); use the upper bin edges as the x labels
hist = numpy.histogram(df.ix[df["Gender"]=="Male","Population"],range=(50,90))
pandas.DataFrame({'x':hist[1][1:],'y':hist[0]}).plot(x='x',kind='bar')

Peak Hard Drive

This past week, Seagate finally announced a 6TB hard drive, three years after its 4TB hard drive. Of course, Hitachi announced its hermetically sealed, helium-filled 6TB drives in November 2013, but only to OEM and cloud customers, not for retail sale.

Hard drive capacity growth is slowing down, as shown in the chart below. To account for the shrinking form factors in the earlier part of the history, and for the exponential growth, I've scaled the vertical axis to the logarithm of kilobytes (1000 bytes) per liter of drive volume.
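
For concreteness, here is how that metric works out for a single drive, using approximate 3.5-inch form factor dimensions (illustrative figures, not the chart's underlying data):

import math

# Approximate 3.5-inch drive dimensions in decimeters (1 cubic decimeter = 1 liter)
volume_liters = 1.016 * 1.46 * 0.261
capacity_kb = 6e12 / 1000                        # a 6TB drive, in kilobytes (1000 bytes)
print(math.log10(capacity_kb / volume_liters))   # about 10.2 with these assumed dimensions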


This three-year drought in hard drive capacity increases shows up as the plateau between the last two blue dots in the graph, for 2011 and 2014. The red line extension to 2020 is based on Seagate's prediction that by then it will have 20TB drives using HAMR (heat-assisted magnetic recording) technology, which uses a laser to heat the platter so that bits can be written magnetically at much higher densities.

However, had the 2004-2011 trendline continued, linear extrapolation on this log scale would have put hard drives at 600TB by 2020.
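
The extrapolation itself is just a straight-line fit on the log scale, projected forward. Here is a sketch with placeholder capacity points (approximate largest retail drives by year, rather than the per-liter density data the chart actually plots, so the projected number will differ from the 600TB figure):

import numpy as np

# Placeholder points: approximate largest retail 3.5-inch drives by year, in TB
years = np.array([2004.0, 2007.0, 2011.0])
capacity_tb = np.array([0.4, 1.0, 4.0])

# Fit a straight line to log10(capacity) vs. year, then project it out to 2020
slope, intercept = np.polyfit(years, np.log10(capacity_tb), 1)
print(10 ** (slope * 2020 + intercept))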

This is not good news for users of Big Data. Data sizes (and the variety and number of sources) continue to grow, but hard drive sizes are leveling off. Horizontal scaling is no longer going to be optional; the days of the monolithic RDBMS are numbered. Worse, data center sizes and energy consumption will grow in proportion to data size rather than be tempered by advances in hard drive capacity, as we had become accustomed to.

We haven't reached an absolute peak in hard drive capacity, so the term "peak hard drive" is an exaggeration in absolute terms, but relative to corporate data set sizes, I'm guessing we did reach peak hard drive a couple of years ago.

QED: Controlling for Confounders

We see it all the time in scientific papers: "controlling for confounding variables." But how is it done? The term "quasi-experimental design" is unknown even to many who today call themselves "data scientists." College curricula exacerbate the matter by dictating that probability be learned before statistics, yet this simple concept from statistics requires no probability background and would help many people understand and produce scientific and data science results.

As discussed previously, a controlled randomized experiment from scratch is the "gold standard". The reason is that any confounding variables end up randomly distributed across the groups, so by the law of large numbers their effects cancel each other out.
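
A toy simulation (not from the post) makes that cancellation concrete: randomly assign a population in which 30% carry some confounding trait, and the trait turns up at nearly the same rate in both groups:

import random

random.seed(42)
# 100,000 people, about 30% of whom carry some confounding trait
population = [random.random() < 0.3 for _ in range(100000)]

treated, control = [], []
for has_trait in population:
    (treated if random.random() < 0.5 else control).append(has_trait)

# Both rates come out close to 0.30, i.e. the confounder is balanced across the groups
print(sum(treated) / float(len(treated)), sum(control) / float(len(control)))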

Most of the time, though, we do not have the budget or time to conduct a unique experiment for each question we want to investigate. Instead, more typically, we're handed a data set and asked to go and find "actionable insights".

This lands us in the realm of quasi-experimental design (QED). In QED, we can't randomly assign members of the population and then apply or withhold "treatments". (Even in data science, when analyzing e.g. server logs, the terminology from the hard sciences carries over: what we might call an "input variable" is instead called the "treatment" (as if medicine were being given to a patient), and what we might call an "output variable" is instead called the "outcome" (did the medicine work?).) In QED, stuff has already happened and all we have is the data.

In QED, to overcome the hurdle of non-random assignment, we perform "matching" as shown below. The first step is to segregate the entire population into "treated" and "untreated". In the example below, the question we are trying to answer is whether Elbonians are less likely to buy. So living in Elbonia (perhaps determined by a MaxMind reverse-IP lookup) is the "treatment", not living in Elbonia is "untreated", and whether or not a sale was made is the "outcome". We have two confounding variables, browser type and OS, and in QED that is what we match on.

In this way, we are simulating the question, "all else being equal, does living in Elbonia lead to a less likely sale?"

In this process, typically when a match is made between one member of the treated population and one member of the untreated population, both are set aside as a matched pair and removed from further matching, and then the next match is attempted. As you can imagine, there are all sorts of algorithms and approaches for what constitutes a match (how close a match is required?), the order in which matches are taken, and how the results are finally analyzed. For further study, take a look at the book Experimental and Quasi-Experimental Designs for Generalized Causal Inference.
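
To make the mechanics concrete, here is a sketch of the simplest variant: exact one-to-one matching on the two confounders, using a hypothetical pandas DataFrame (the column names are made up for illustration):

import pandas as pd

# Hypothetical columns: elbonian is the "treatment", browser and os are the
# confounders we match on, and sale is the "outcome"
df = pd.DataFrame({
    "elbonian": [True, True, True, False, False, False, False],
    "browser":  ["Chrome", "Firefox", "Chrome", "Chrome", "Firefox", "Chrome", "IE"],
    "os":       ["Windows", "Linux", "OSX", "Windows", "Linux", "OSX", "Windows"],
    "sale":     [0, 0, 1, 1, 0, 1, 1],
})

treated = df[df["elbonian"]]
untreated = df[~df["elbonian"]]

pairs = []
for t_idx, t_row in treated.iterrows():
    # Exact match on both confounders; fancier schemes use closeness or propensity scores
    candidates = untreated[(untreated["browser"] == t_row["browser"]) &
                           (untreated["os"] == t_row["os"])]
    if len(candidates) > 0:
        u_idx = candidates.index[0]
        pairs.append((t_idx, u_idx))
        untreated = untreated.drop(u_idx)   # each untreated member is matched at most once

# Compare outcomes over the matched pairs only
print(df.loc[[t for t, u in pairs], "sale"].mean())
print(df.loc[[u for t, u in pairs], "sale"].mean())

Comparing the two sale rates over just the matched pairs is what approximates the "all else being equal" comparison.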