Sunday, March 30, 2014

Quick Way to Play With Spark

If you're interested in a quick way to start playing with Apache Spark without having to pay for cloud resources, and without having to go through the trouble of installing Hadoop at home, you can leverage the pre-installed Hadoop VM that Cloudera makes freely available to download. Below are the steps.

  1. Because the VM is 64-bit, your computer must be configured to run 64-bit VM's. This is usually the default for computers made since 2012, but for computers made between 2006 and 2011, you will probably have to enable it in the BIOS settings.
  2. Install (I use VirtualBox since it's more free than VMWare Player.)
  3. Download and unzip the 2GB QuickStart VM for VirtualBox from Cloudera.
  4. Launch VirtualBox and from its drop-down menu select File->Import Appliance
  5. Click the Start icon to launch the VM.
  6. From the VM Window's drop-down menu, select Devices->Shared Clipboard->Bidirectional
  7. From the CentOS drop-down menu, select System->Shutdown->Restart. I have found this to be necessary to get HDFS to start working the first time on this particular VM.
  8. The VM comes with OpenJDK 1.6, but Spark and Scala need Oracle JDK 1.7, which is also supported by Cloudera 4.4. From within CentOS, launch Firefox and navigate to Click the radio button "Accept License Agreement" and click to download jdk-7u51-linux-x64.rpm (64-bit RPM), opting to "save" rather than "open" it. I.e., save it to ~/Downloads.
  9. From the CentOS drop-down menu, select Application->System Tools->Terminal and then:
    sudo rpm -Uivh ~/Downloads/jdk-7u51-linux-x64.rpm
    echo "export JAVA_HOME=/usr/java/latest" >>~/.bashrc
    echo "export PATH=\$JAVA_HOME/bin:\$PATH" >>~/.bashrc
    source ~/.bashrc
    tar xzvf spark-0.9.0-incubating.tgz
    cd spark-0.9.0-incubating
    SPARK_HADOOP_VERSION=2.2.0 sbt/sbt assembly

That sbt assembly command also has the nice side-effect of installing scala and sbt for you, so you can start writing scala code to use Spark instead of just using the Spark Shell.

Saturday, February 8, 2014

Visualization of How Real-time Cardinality Algorithms Work

In a previous blog post, Real-time data science, I textually described an algorithm that can be used, for example, in real-time data streaming applications to estimate the size (cardinality) of a set.

I don't think a description can convey how it works as well as a visualization, so I created an iPython Notebook. Below are the images from that Notebook, but you'll need to click on the Notebook link if you want to see how the images were generated and the detailed description.

As I described in the previous blog post, the motivation behind trying to count items in a data stream might be, for example, doing the equivalent of a SQL COUNT DISTINCT on a website's clickstream to get the number of unique visitors to that website.

Rather than maintaining in memory a list of unique ID's (say, IP addresses), which can consume a lot of memory for a popular website, instead we rely on a hash function and just keep track of the minimum hash value ever seen. That's right, instead of saving n IP addresses (where n might be a million for a very popular website), we just save one data value, the smallest hash value ever seen.

The scatter plot below shows the result. For the top dotted row, 10 random UUID's were hashed to floating point values in the range [0,1). For the middle row, 100 UUID's and for the bottom row, 1000 UUID's. Then we just count the number of leading zeroes of the smallest hash value by using the log10 function. To get the cardinality we just then raise 10 to the power of that number of leading zeros.

And zooming in to the left portion of the plot:

The larger the set, the greater the chance that a hash value will happen to end up closer to zero (left edge of the plot). To get the size of the set, we just count the number of leading zeros, and then raise 10 to the power of that number.

That leaves a lot to chance, of course, so there are ways to reduce the effects of randomness and improve the accuracy. One straightforward way, described in my iPython Notebook linked above, is to split the set into, say, 10 subsets, apply the algorithm to each subset independently, average the results somehow, and multiply that average by 10 to get the cardinality of the whole original set. So in terms of memory utilization, this approach requires the space of 10 floating point numbers instead of just 1.

In my iPython Notebook, I just do a simple straight average, and it does improve the accuracy significantly, but it's still off by a factor of 3 in the particular case I tried. A better way to average is a geometric mean. And best still is the HyperLogLog algorithm which employs something called the harmonic mean.

The bright folks over at Metamarkets, developers of Druid, ran some experiments two years ago and showed that HyperLogLog produces cardinalities that are 98% accurate.

Sunday, January 19, 2014

Corrgram: Multi-variate visualization

Corrgrams, invented and coined by Michael Friendly in his 2002 American Statistician paper are a powerful and rapid way to visualize a dozen or more dimensions simultaneously when in the exploratory phase of multi-variate analysis. (Note that Corrgrams are sometimes erroneously referred to as Correlograms, which are something completely different for time series analysis.)

The visualization below is an example generated by the R package corrgram on the Lending Club peer-to-peer lending data that was part of the homework assignment for the Coursera class I took a year ago, Data Analysis.

In the visualization above, brightness (more properly, saturation) of red indicates negative correlation and brightness (saturation) of blue indicates positive correlation, meaning weakly correlated dimensions appear grayish.

The bright red box that jumps out is that FICO score is strongly negatively correlated to interest rates. This highlights two unsurprising points: 1) The higher the FICO score, the lower the interest rate, and 2) FICO score has the strongest influence (by eyeball comparison to all the faded blue squares) on interest rate. We also see that things like loan length, debt-to-income ratio and number of open credit lines increase interest rate, with loan length being the strongest of those secondary influences.

But there's more we can pull out of this visualization. Notice that number of inquiries in the last six months (which means the number of inquiries on one's credit report with the FICO scoring agencies coming from all loan or credit applications, not just those from Lending Club) is a strong influence on Lending Club interest rate. But the correlation between number of inquiries in the last six months and FICO, while a negative correlation as expected, is only a weak negatively correlation from its very pale rose color. That suggests that perhaps Lending Club lenders more strongly dislike (and thus penalize) borrowers with a lot of credit inquiries than do conventional lenders. It suggests perhaps that Lending Club lenders dislike (disproportionately so relative to conventional lenders, or at least the FICO scoring system itself) being the "lender of last resort" and assign a higher risk and thus higher interest rate to such situations. This quilt of colors can't tell us all this for certain -- neither numerically in statistics nor certainly in terms of causality -- but it quickly points us onto paths of investigation that could lead to verifying such unanticipated insights.

Now, the 12 dimensions in the above visualization push the envelope of what is practical with corrgrams, whereas data sets in real life often have hundreds of dimensions. In multi-variate analysis, one way to reduce the number of dimensions is to perform a random forest followed by a variable importance plot. While random forest has a reputation of being opaque, one can still easily obtain the list of variables chosen as top nodes most often. From that list, simply pick the first dozen or so and plug them into a corrgram to visualize the interactions amongst the most important deciding variables. This can be improved further through iteration: if two variables, such as, hypothetically, "Average bank balance for past 3 months" and "Average bank balance for past 6 months" are shown to be strongly correlated, you can discard one of those in the corrgram and use that valuable corrgram slot for a different variable.

Saturday, January 11, 2014

Science Data Science

We commonly hear about "data science" in the context of mining marketing or business data, especially "Big Data", but of course scientists have been practicing statistics for centuries.

Timeline of Statistics.

But recall data science is more than just statistics, from Drew Conway's famous Venn Diagram:

Data Science Venn Diagram

For scientists to move out of "traditional research" and into data science, they need to add computer-based skills such as machine learning and big data.

Just data management has long been a problem in the scientific realm. We learned last month that 80% of scientific data from science conducted over the past 20 years is already lost due to poor data retention policies. In recognition of that, U.S. government funding grants now require a data retention plan to be included in funding proposals. And just two days ago, IEEE Spectrum posted a blog article on Gordon Moore's new law, that "big data will lead to big science," and his philanthropic efforts to support that.

But why is scientific data retention so poor? Having worked in the scientific software development field for half my career, I can speculate on several reasons:

  • Scientific data is not amenable to conventional (e.g. relational) databases. Scientific data sets are typically array-based (2D, 3D, 4D, and higher), where the array indices rather than relational metadata describe the data. This is a fancy way of saying a bunch of flat unannotated binary files, but there are reasons scientists use such files: they are compact relative to, say, XML and JSON; they are easy to write software to read and write; and they are not tied to a particular software vendor. With their ease of use, though, comes ease of deletion. Corporate and institutional cultures pay no heed when files get deleted, but try and delete a database and suddenly the resistance increases dramatically. Along with the convenience of binary files preferred by scientists comes the disrespect of files.
  • Until this focus over the past 3-5 years on data retention, funding proposals never included funding for prolonged data retention. Data retention is expensive. Data formats change, from 8" floppies, to 5.25" floppies, to Bernoulli drives, to 3.5" floppies, to Zip drives, to Jaz drives, to QIC tapes, to CD-ROM, to DVD-R, to LTO tapes, to USB drives, not to mention data on raw hard drives: MFM, RLL, SCSI, Ultra SCSI, IDE, SATA, SAS. It takes both labor and capital investment to continually propagate data from one format to another. Not only must data be format shifted, even within a single format it must be refreshed to guard against physical or magenetic decay. Properly maintained data also involves maintaining multiple backups, including at off-site locations.
  • Compounding the issue is that scientific data would retain its value even more than business data. Longitudinal studies of humans or civil engineering edifices can leverage data spanning a century or more. New scientific studies frequently make use of old data sets, applying new techniques or new insights, when such data is available (or, perhaps if it were available).
  • Scientists are not experts in computers, let alone IT "best practices". Scientists typically know enough about computers to get by, but have not yet generally added that third bubble from the Venn diagram.
  • Scientists often have to rely on proprietary and commercial software and systems for data collection. These systems are specialized, and have no open source counterparts due to the lack of economic forces that propel open source software in the realm of business software. Scientific software even often comes tethered to dongles, or works only on proprietary operating systems no longer available (such as MS-DOS). I have even blogged and presented on an alternative to all this, XML/XSL/HTML5 for reports instead of PDF, where I suggest visualization and presentation software programs be encoded in the form of open-source Javascript instead of closed-source proprietary and commercial binaries, but I know of no uptake outside of my singular implementation of the idea.

Data retention is the critical first step to expanding data science in the scientific realm. Without data retention, there can be no statistics and machine learning over data sets that include data from the past or data from other researchers. I can imagine scientists universally adopting tools like R, iPython Notebook, and Weka like a fish to water, but without data, there is no water.

Wednesday, January 1, 2014

Hypothesis formation

Somewhat of a sequel to my earlier blog post on causality, where do hypotheses come from?
The ideal hypothesis:
  • Has basis in a reasonable engineering, physical, or economic, etc. model.
  • Is as simple as can be in terms of number of variables. I.e. Occam's Razor has been applied.
  • Either has been vetted against a number of other hypotheses and selected as the most reasonable, or will be tested along with other reasonable hypotheses.
  • Will be tested in the gold standard, the randomized controlled experiment.
  • Is actionable.
Real life is not ideal, so below I discuss compromises and trade-offs involved in hypothesis formation.

Basis in a Model

As discussed in my causality blog entry, the only way to assign causality is to develop a rational model about how things really work, not just from the output of some multivariate correlation done in R. The best hypotheses are rooted in causation, though it is of course possible to hypothesize anything conjecture at all, including from statistical correlations discovered during data exploration. Discovery from data as a source of hypothesis is better than pulling from thin air, but hypothesizing from a model is best of all. Hypothesizing from data is called induction and hypothesizing from a model is called deduction.


The fewer the variables, the stronger the hypothesis and the more robust it is, by which it is meant the more likely it will hold up to a variety of conditions. E.g., suppose we induce a hypothesis from data exploration that teenage girls that use Twitter like Justin Beiber. A stronger hypothesis (if it turns out to be true) would be to get rid of the Twitter condition, not only because it broadens the potential market for Justin Beiber products, but also because it is more resilient in varied circumstances, such as perhaps a time when (assuming some sort of unlikely calamity befalls Twitter) Twitter is no longer popular and something takes its place.

Vetted Against Competing Hypotheses

When forming a hypothesis, it is important to brainstorm as many different plausable hypotheses as possible, from a variety of sources:
  • As with conventional brainstorming, ask fellow team members and associates for their creative hypotheses.
  • Formulate as complete a model as possible, and from that model identify explanations. E.g., when modeling a consumer:
    • What is the consumer's budget?
    • What is the pay schedule of the consumer?
    • Are there upcoming holidays that would either enhance purchases (in anticipation) or hinder them (due to store or bank closures)?
    • What products complement the products the consumer already owns?
    • What products would enhance the social standing of the consumer?
    • Does the consumer carry credit cards that are accepted?
    • Is the consumer a student?
    The model doesn't have to be complete and fully accurate -- just enough to spark brainstorming. I.e., it's not necessary to create a Bayesian Belief Network or Root-Cause Analysis Fishbone just to hypothesize.
  • Identify leading hypotheses and test them. This is easier said than done. "Identifying" is a nice way of saying "hunch," because the alternative, "test," is very expensive if done by the gold standard, the controlled randomized experiment.
And by so "identifying a leading hypothesis," one becomes subject to the cherry-picking I discussed in the Panel on "Resolved: Traditional Statistics is Dead". It's nice to pick the best hypothesis from a bunch, but to ensure you don't stumble into a spurious correlation, it's ideally necessary to test all similar hypotheses. In the example proctored in the forum, there turned out to be a correlation between Superbowls and presidential elections. Aside from the obvious modeling deficiencies, my response was whether correlations between MLB penants or NHL cups and presidential elections had also been tested. However, the alternative to picking good hypotheses is to leave it to chance, which is not productive. So pick good hypotheses, but beware of spurious correlations, especially if your hypothesis came from induction from the data rather than deduction from a model.

Controlled Randomized Experiment

Controlled randomized experiments are the gold standard, but they are expensive and time consuming. It is much more convenient and quicker to find and test correlations in existing data sets, but such correlations are fraught with problems: population not random throughout independent variables of the new hypothesis, limited data for train vs. test that effectively lead to test data becoming training data, experimental conditions being different, etc.
But from a practical standpoint, "quasi-experiments" (experiments from an existing data set) are the general rule encountered in practice and "experiments" are, realistically, the exception. Compensating for the shortcomings of quasi-experiments will be the subject of a future article.


You can have the most interesting, perhaps even insightful, hypothesis, but if there is no reasonable course of action to take once it is proven, it's a waste of time to prove it.


Good hypothesis formation:
  • Avoids wasting time testing bad hypotheses
  • Saves time that can be redirected toward testing the best hypotheses, including testing hypothesis adjacent to the leading hypotheses to avoid spurious correlations
  • Results in more resilient, more actionable insights.

Saturday, December 28, 2013

Real-time Data Science

Looking back on 2013, the world of Hadoop emerged from the era of batch processing and into streaming processing. In the contextg of "crisp and actionable," actionable often comes with an expiration date. If you don't take action soon enough, it's too late.

As I learned at the February 2013 Strata conference, Electronic Arts instrumented their online games to continuously stream data into their Big Data system. And because much of their revenue now comes from in-game purchases (in games that are often free to play), they need to know immediately if, e.g., in-game purchases suddenly drop off for whatever reason. They need to take action immediately, not next week.

Here I look at four streaming technologies, and then conclude with some pointers to general streaming computation techniques.

Part of the Berkeley Data Analytics Stack, Spark Streaming is a layer on top of Spark. Spark is a distributed, redundant RAM-based data storage and computation system that was accepted as an Apache Incubator project in 2013. Whereas Hadoop distributes data across the hard drives in a cluster of computers, Spark distributes data across the RAMs in a cluster of computers.

Spark Streaming breaks incoming stream into batches of, for example, 2-second batches or 10-second batches, or however long you set the batch window to be. One criticism of Spark Streaming is that it is not "true real time" since it is up to a batch window size behind. Another limitation is that all processing for that batch must complete before the next batch comes in; otherwise, the behavior is undefined. That can leave a lot of compute capacity unused if the system is tuned to handle the worst-case batch, but normal-size batches leave the processors idle for much of the batch time window.

Storm, another Apache project originally from Twitter, is the other big name in streaming. Its biggest and most well-known shortcoming is that it doesn't guarantee that it won't duplicate events. It guarantees each event will delivered at minimum once. Since this can wreak havoc with real-time metric computation, an optional layer called Trident can de-dupe, but at a high performance penalty.

Storm has a great advantage over Spark Streaming in that it has an "ack" system built-in. If processing an event throws an exception or otherwise fails to "ack" its parent in the directed acyclic graph (DAG) of event flow processing, the event can be saved by the parent for later processing when the problem has been fixed.
Storm also has a web-based admin GUI for starting/stopping/monitoring running DAGs.

Since I don't have any first-hand experience with S4, I have to rely on others' evaluations. At a presentation by Ted Dunning of MapR, he indicated that S4 was designed for topologies and situations that are more complex than usually encountered in the real world. A Yahoo! researcher blogged an excellent comparison between S4 and Storm, and boils it down to guaranteed delivery (Storm) vs. state recovery in case of fault (S4).

Druid is different than the rest in that it's more than just streaming; it also provides a data store and querying system. It's an all-in-one solution whereas the others are usually married to Hadoop.

Druid breaks the storage into two parts, one for real-time (typically 30 minutes worth of data), and the other for permanent "deep storage". A single querying facade can query both stores transparently.

The bad news is that the querying is currently limited. With no SQL, queries must be formed using a stilted JSON syntax, and the ability to create complex queries is limited.
But querying is fast. In fact, it's Druid's hallmark. In response to the common problem of cubing data with high dimensionality, the Druid folks decided instead to just be able to query really fast by storing the data in a distributed RAM-based cluster. With fast querying, there's no need to pre-compute and pre-aggregate.

LinkedIn contributed Samza to Apache in September, 2013. It's new and I haven't tried it yet, but it appears to integrate Kafka and YARN more deeply than the others, and it maintains state similar to how Spark Streaming allows with its RDDs (resilient distributed dataset) and its updateStateByKey() method.

While all the others are for self-hosted infrastructure, Amazon Kinesis is strictly for the could. Again, since I haven't used it myself, I rely on another blogger who informs that Kinesis is restricted to a single procedure, and so isn't suited for complex processing. There is also no rollback capability.

Stream Computation Techniques

It's evident fairly quickly that the mean can be computed in an incremental manner. If mn is the mean of x1 ... xn, then mn+1 = (n*mn + xn+1)/(n+1).

Much less obvious is that it's possible to calculate the standard deviation incrementally. Donald Knuth credits a 1962 paper by Welford with a clever technique.

Even more clever is calculating cardinality incrementally, i.e. the number of distinct items in a set, i.e. SELECT COUNT DISTINCT(*) -- and doing so without having to keep the entire dataset in memory. The trick is to use a hash function that uniformly distributes to the range [0.0,1.0) and then keep track of the most number of leading zeroes. E.g. if the set as 100 items, probabilistically we would expect the smallest hash value to end up being around 0.01, or one leading zero. For a set of 1000 items, some hash values by chance would probably end up being less than that, and the smallest overall would probably end up being around 0.001, or two leading zeroes.

Finally, classic machine learning algorithms such as K-means and decision tree construction have been adapted to incremental versions for streaming data. Joao Gama, a Portuguese professor, has assembled them into an excellent book:


Basic metrics permit real-time monitoring for:

  • Fraud detection
  • Fault detection
  • Capacity overload detection

Streaming machine learning adds:

  • Intelligent automated adjustment of thresholds to the above
  • Ability to detect opportunities, not just problems. E.g. product affinity of noticing products long-time product A and brand-new product B could alert management to craft a marketing campaign for product B to those who have bought A in the past.


While "real-time data science" doesn't mean taking conventional, manual, deep data exploration and analysis and applying it to real-time data, what it does mean for data scientists is that they are increasingly feeling the pressure to have some results automated and real-time, and data scientists need to be comfortable with these streaming technologies and algorithms.

Saturday, December 21, 2013


No doubt you've encountered the image below from Gizmodo in some PowerPoint somewhere this year.

But that same PowerPoint likely didn't bother to answer the next logical question:

How to get to causality?

It's not an easy question to answer. Having really, really good correlation is definitely not the answer. First a couple of counterexamples.

Common ancestral cause

Putting aside spurious correlations such as the one above, the much more common scenario is that of a common cause, such as shown below. Finding the correlation of "street wet" and "hair wet" in some data set does not lead to the conclusion that one follows from the other.

Indirect cause

Well, the way to avoid coincidental correlation is apply the "gold standard" of statistics and scientific experimentation, the controlled randomized experiment, right? Consider the following experiment. We set up a bunch of plugged-in microwaves, each with its own cup of room temperature water with a tea bag inserted. For a randomized half of the set of microwaves, we push the "start" button on the microwave (the "treatment" to use the terminology from randomized experimentation), and on the other half we do not push the button.

The results are highly correlated. We've employed the gold standard of scientific experimentation. Can we say that finger pushing causes hot tea? In a sense, yes, but not in the common sense of the word. What happened?

Three things. First, the finger pushing is an indirect cause, as shown below.

Second, to use the terminology from the study of causation, finger pushing is a sufficient cause in the context of the experimental conditions, but it is not necessary. It is not necessary because we could have also arrived at hot tea by opening up the microwave and assaulting the cup with a blowtorch. Although there are a lot of common sense uses of the word "causation" that lack "necessity", the strongest types of causation are both necessary and sufficient.

Third, pushing the button on the microwave is really just a contributory cause, isn't it? Our experimental assumptions included that the microwave was plugged in, so it's getting it's energy from there, as shown below.

Contributory Cause

And so on... we could trace the right branch all the way back to the sun, the Big Bang, and the Aristotelian First Cause. But just that "Electricity generated from coal" makes a much better common sense "cause" than does finger pushing. It's because that is where the "motion" is coming from -- the turbine spinning at the power plant is causing the water to heat up in our microwave. The finger pushing is merely a contributory cause.

Scientific Method

Now the confession, and the heart of the matter. I've (mis)led you down this path to illustrate the complexities of determining causation. Causation is the meat of the philosophical giants around the world and throughout time.

But what about the scientific method? Isn't that supposed to allow us to establish causation through repeated experimentation, at least the "sufficiency" form of causation?
Let's review the process of the scientific method:
  1. Hypothesize
  2. Experiment
  3. Analyze
It's all scientific and straightforward, right? Pay closer attention to step #1. Hypothesize. What is happening there? That's where the magic of causation is happening, and not in step #3, analysis and statistics, where it is often presumed. Hypothesis forming comes from the advanced intellect of people. People form models of how things work in their minds and form hypothesis, and then try to verify or disprove those hypotheses.


Just to be clear, I am not referring to statistical models, but rather models like the atomic model or an engineering model of a car engine. These models can only come from the minds of people (or AI that mimics in some way the minds of people -- e.g. automated Bayesian network construction or automated semantic analysis). Models don't get spit out from a statistical analysis.

And recall the complexities of the philosophy of causation discussed at the beginning of this post. They are still there. So causation requires modeling and philosophy, both of which are hard and messy.

That can't fit on a single PowerPoint, so it's no wonder there's not a slide on it following the infamous IE/Murder correlation slide.