Sunday, June 8, 2014

SSD to the rescue of Peak Hard Drive?

A couple of months ago, I blogged about Peak Hard Drive, that hard drive capacities were leveling off and how this would impact the footprints of data centers in the era of Big Data. Since then, there have been two major announcements about SSDs that indicate they may come to the rescue:
  1. SanDisk announced 4TB SSD "this year" and 16TB possibly next year. Given that such technologies are typically delayed by one calendar year from their press releases, in the above chart, I've indicated those as becoming available in 2015 and 2016, respectively.
  2. Japanese researches develoepd a technique to improve SSD performance by up to 300%
The 16TB in 2016 is phenomenal and would be four years sooner than the 20TB in 2020 predicted by Seagate. Much more than that, if the 16TB SSD will be in the same form factor as its announced 4TB little brother, then it will be just a 2.5" drive in contrast to the presumed 3.5" form factor for the 20TB Seagate HAMR drive. As you can see in the chart above, the 16TB puts us back on track of the dashed gray line, which represents the storage capacity steady growth we enjoyed from 2004 to 2011.

It is because of the varying form factors that in my blog post two months ago I adopted the novel "Bytes/Liter" metric, which is a volumetric measure in contrast to the more typical "aerial" metric that applies to spinning platters but not to SSDs. (Actually I changed the metric from log10(KB/Liter) from two months ago to log10(Bytes/Liter) now, reasoning that Bytes is a more fundamental unit than KB, that it eliminates the KB vs. KiB ambiguity, and that it makes the chart above easier to read where you can just pick out the MB, GB, TB, PB by factors of 3 of the exponent of 10.) This volumetric metric can handle everything from the 5.25" full-height hard drives of the 1980's to the varying heights of 2.5" hard drives and allow us to linearly extrapolate on the logarithm chart above.

The direct overlay of the SSD line over the HDD line for the years 1999-2014 came as a complete shock to me. SSDs and HDDs have vastly different performance, form factor, price and performance characterstics. Yet when it comes to this novel metric of volumetric density, they've been identical for the past 15 years!

Photo from comparing 9.5mm height 2.5" drive to 15mm

Now, the announced 4TB 2.5" SSD and presumably also the 16TB SSD are not of the typical notebook hard drive form factor. The typical notebook hard drive is 9.5mm tall, whereas these high-capacity SSDs are 15mm tall. They're intended for data center use, such as in the 2U rack below.

The configuration in the 2U chassis above is typical for 2.5" drives: just 24 drives, because they are all accessible from the front panel. I'm not aware of any high-density solutions for 2.5" drives such as those that exist for 3.5" drives, such as the one below that puts 45 drives into 4U.

In time, there should be some higher density rackmount solutions for 2.5" drives appearing, but for now, today's available solutions don't take full advantage of the compactness of 2.5" SSDs portrayed in the above chart, which measures volumtric density of the drive units themselves and not the chassis in which they reside.
Also not clear is whether the 16TB SSD will be MLC or TLC. The 4TB drive is MLC, which means two bits per cell. If the 16TB drive is TLC, then three bits are stored in each cell (eight different voltage levels detected per cell), which can reduce lifespan by a factor of 3 and for that reason are often not considered for enterprise data center use.

For the moment, we're stuck at the inflection point in the above chart at 2014, wondering which dotted line data centers will be able to take in the future.

Due to a combination of increased use of VMs in data centers and increased physical server density, projections were that we had reached peak physical square footage for data centers: that no more data centers would have to be built, ever (aside from technology improvements such as cooling and energy efficiency). The slide above is from SSE. My blog on Peak Hard Drive threatened to blow that away and require more data centers to be built due to plateauing hard drive density combined with exploding Big Data use. But now with the two SSD announcements, we might be -- just maybe -- back on track for no more net data center square footage.

Thursday, May 15, 2014

Apache Spark 1.0 almost here. Is it ready with 16 "unresolved blockers" in Jira?

Apache Spark 1.0 is to be released any day now; currently "release candidate 6 (rc6)" is being evaluated and will be voted upon imminently. But is it ready?

There are currently 16 issues marked as "unresolved blockers" in Jira for Spark, at least one of which is known to produce erroneous data results.

Then there is the state of the REPL, the interactive Spark Shell recently lauded for making Spark accessible to data scientists, as opposed to just hard-core software developers. Because the Spark Shell wraps every user-entered command and class to do its interactive magic, some basic Spark functions fail to operate, such as lookup() and anything requiring equals() on a compound key (i.e. custom Scala class as opposed to just using String or Int for a key) for groupByKey() and other combineByKey() derivatives. It even affects map(), the most fundamental of all functional programming operations.

Even putting the REPL aside and considering just writing full-fledged Scala programs, the native language of Spark, simple combinations such as map() and lookup() throw exceptions.

Don't get me wrong. Spark is a great platform, and is where it should be after two years of open source development. It's the "1.0" badge that I object to. It feels more like a 0.9.2 release.

Sunday, May 11, 2014

GeoSparkGrams: Tiny histograms on map with IPython Notebook and d3.js

Daily variation of barometric pressure (maximum minus minimum for each day) in inches, for the past 12 months. For each of the hand-picked major cities, the 365 daily ranges for that city are histogrammed.

Here "spark" is in reference to sparklines, not Apache Spark. Last year I showed tiny histograms, which I coined as SparkGrams, inside an HTML5-based spreadsheet using the Yahoo! YUI3 Javascript library. At the end of the row or column, a tiny histogram inside a single spreadsheet cell showed at a glance the distribution of data within that row or column.

This time, I'm placing SparkGrams on a map of the United States, so I call these GeoSparkGrams. This time I'm using IPython Notebook and d3.js. The notebook also automatically performs the data download from NOAA.

The motivation behind this analysis is to find the best place to live in the U.S. for those sensitive to barometric volatility.

The above notebook requires IPython Notebook 2.0, which was released on April 1, 2014, for its new inline HTML capability and ease of integrating d3.js.

Sunday, May 4, 2014

Matplotlib histogram plot from Numpy histogram data

Of course Pandas provides a quick and easy histogram plot, but if you're fine-tuning your histogram data generation in NumPy, it may not be obvious how to plot it. It can be done in one line:

hist = numpy.histogram(df.ix[df["Gender"]=="Male","Population"],range=(50,90))

Peak Hard Drive

This past week, Seagate finally announced a 6TB hard drive, which is three years after their 4TB hard drive. Of course, Hitachi announced their hermetically-sealed helium 6TB hard drives in November, 2013, but only to OEM and cloud customers, not for retail sale.

Hard drive capacities are slowing down as shown in the chart below. To account for the shrinking form factors in the earlier part of the history, and to account for exponential growth, I've scaled the vertical axis to be the logarithm of kilobytes (1000 bytes) per liter.

This three year drought on hard drive capacity increases is represented by the plateau between the last two blue dots in the graph, representing 2011 and 2014. The red line extension to 2020 is based on Seagate's prediction that by then they will have 20TB drives using HAMR technology, which uses a combination of laser and magnetism.

However, if the trendline from 2004-2011 had continued, by linear extrapolation on this log scale, hard drives would have been 600TB by 2020.

This is not good news for users of Big Data. Data sizes (and variety and number of sources) are continuing to grow, but hard drive sizes are leveling off. Horizontal scaling is no longer going to be an option; the days of the monolithic RDBMS are numbered. Worse, data center sizes and energy consumption will increase proportional to growth in data size rather than be tempered by advances in hard drive capacities as we had become accustomed to.

We haven't reached an absolute peak in hard drive capacity, so the term "peak hard drive" is an exaggeration in absolute terms, but relative to corporate data set sizes, I'm guessing we did reach peak hard drive a couple of years ago.

QED: Controlling for Confounders

We see it all the time when reading scientific papers, "controlling for confounding variables," but how do they do it? The term "quasi-experimental design" is unknown even to many who today call themselves "data scientists." College curricula exacerbate the matter by dictating that probability be learned before statistics, yet this simple concept from statistics requires no probability background, and would help many to understand and produce scientific and data science results.

As discussed previously, a controlled randomized experiment from scratch is the "gold standard". The reason is because if there are confounding variables, individual members of the population expressing those variables are randomly distributed and by the law of large numbers those effects cancel each other out.

Most of the time, though, we do not have the budget or time to conduct a unique experiment for each question we want to investigate. Instead, more typically, we're handed a data set and asked to go and find "actionable insights".

This lands us into the realm of quasi-experimental design (QED). In QED, we can't randomly assign members of the population and then apply or not apply "treatments". (Even in data science when analyzing e.g. server logs, the terminology from the hard sciences holds over: what we might call an "input variable" is instead called the "treatment" (as if medicine were being given to a patient) and what we might call an "output variable" is instead called the "outcome" (did the medicine work?).) In QED, stuff has already happened and all we have is the data.

In QED, to overcome the hurdle of non-random assignment, we perform "matching" as shown below. The first step is to segregate the entire population into "treated" and "untreated". In the example below, the question we are trying to answer is whether Elbonians are less likely to buy. So living in Elbonia (perhaps determined by a MaxMind reverse-IP lookup) is the "treatment", not living in Elbonia is "untreated", and whether or not a sale was made is the "outcome". We have two confounding variables, browser type and OS, and in QED that is what we match on.

In this way, we are simulating the question, "all else being equal, does living in Elbonia lead to a less likely sale?"

In this process, typically when a match is made between one member of the treated population and one member of the untreated population, both are thrown out, and then the next attempt at a match is made. Now as you can imagine, there are all sorts of algorithms and approaches for what constitutes match (how close a match is required?), the order in which matches are taken, and how the results are finally analyzed. For further study, take a look at the book Experimental and Quasi-Experimental Designs for Generalized Causal Inference.

Sunday, March 30, 2014

Quick Way to Play With Spark

If you're interested in a quick way to start playing with Apache Spark without having to pay for cloud resources, and without having to go through the trouble of installing Hadoop at home, you can leverage the pre-installed Hadoop VM that Cloudera makes freely available to download. Below are the steps.

  1. Because the VM is 64-bit, your computer must be configured to run 64-bit VM's. This is usually the default for computers made since 2012, but for computers made between 2006 and 2011, you will probably have to enable it in the BIOS settings.
  2. Install (I use VirtualBox since it's more free than VMWare Player.)
  3. Download and unzip the 2GB QuickStart VM for VirtualBox from Cloudera.
  4. Launch VirtualBox and from its drop-down menu select File->Import Appliance
  5. Click the Start icon to launch the VM.
  6. From the VM Window's drop-down menu, select Devices->Shared Clipboard->Bidirectional
  7. From the CentOS drop-down menu, select System->Shutdown->Restart. I have found this to be necessary to get HDFS to start working the first time on this particular VM.
  8. The VM comes with OpenJDK 1.6, but Spark and Scala need Oracle JDK 1.7, which is also supported by Cloudera 4.4. From within CentOS, launch Firefox and navigate to Click the radio button "Accept License Agreement" and click to download jdk-7u51-linux-x64.rpm (64-bit RPM), opting to "save" rather than "open" it. I.e., save it to ~/Downloads.
  9. From the CentOS drop-down menu, select Application->System Tools->Terminal and then:
    sudo rpm -Uivh ~/Downloads/jdk-7u51-linux-x64.rpm
    echo "export JAVA_HOME=/usr/java/latest" >>~/.bashrc
    echo "export PATH=\$JAVA_HOME/bin:\$PATH" >>~/.bashrc
    source ~/.bashrc
    tar xzvf spark-0.9.0-incubating.tgz
    cd spark-0.9.0-incubating
    SPARK_HADOOP_VERSION=2.2.0 sbt/sbt assembly

That sbt assembly command also has the nice side-effect of installing scala and sbt for you, so you can start writing scala code to use Spark instead of just using the Spark Shell.