Saturday, December 28, 2013

Real-time Data Science

Looking back on 2013, the world of Hadoop emerged from the era of batch processing and into streaming processing. In the contextg of "crisp and actionable," actionable often comes with an expiration date. If you don't take action soon enough, it's too late.

As I learned at the February 2013 Strata conference, Electronic Arts instrumented their online games to continuously stream data into their Big Data system. And because much of their revenue now comes from in-game purchases (in games that are often free to play), they need to know immediately if, e.g., in-game purchases suddenly drop off for whatever reason. They need to take action immediately, not next week.

Here I look at four streaming technologies, and then conclude with some pointers to general streaming computation techniques.

Part of the Berkeley Data Analytics Stack, Spark Streaming is a layer on top of Spark. Spark is a distributed, redundant RAM-based data storage and computation system that was accepted as an Apache Incubator project in 2013. Whereas Hadoop distributes data across the hard drives in a cluster of computers, Spark distributes data across the RAMs in a cluster of computers.

Spark Streaming breaks incoming stream into batches of, for example, 2-second batches or 10-second batches, or however long you set the batch window to be. One criticism of Spark Streaming is that it is not "true real time" since it is up to a batch window size behind. Another limitation is that all processing for that batch must complete before the next batch comes in; otherwise, the behavior is undefined. That can leave a lot of compute capacity unused if the system is tuned to handle the worst-case batch, but normal-size batches leave the processors idle for much of the batch time window.

Storm, another Apache project originally from Twitter, is the other big name in streaming. Its biggest and most well-known shortcoming is that it doesn't guarantee that it won't duplicate events. It guarantees each event will delivered at minimum once. Since this can wreak havoc with real-time metric computation, an optional layer called Trident can de-dupe, but at a high performance penalty.

Storm has a great advantage over Spark Streaming in that it has an "ack" system built-in. If processing an event throws an exception or otherwise fails to "ack" its parent in the directed acyclic graph (DAG) of event flow processing, the event can be saved by the parent for later processing when the problem has been fixed.
Storm also has a web-based admin GUI for starting/stopping/monitoring running DAGs.

Since I don't have any first-hand experience with S4, I have to rely on others' evaluations. At a presentation by Ted Dunning of MapR, he indicated that S4 was designed for topologies and situations that are more complex than usually encountered in the real world. A Yahoo! researcher blogged an excellent comparison between S4 and Storm, and boils it down to guaranteed delivery (Storm) vs. state recovery in case of fault (S4).

Druid is different than the rest in that it's more than just streaming; it also provides a data store and querying system. It's an all-in-one solution whereas the others are usually married to Hadoop.

Druid breaks the storage into two parts, one for real-time (typically 30 minutes worth of data), and the other for permanent "deep storage". A single querying facade can query both stores transparently.

The bad news is that the querying is currently limited. With no SQL, queries must be formed using a stilted JSON syntax, and the ability to create complex queries is limited.
But querying is fast. In fact, it's Druid's hallmark. In response to the common problem of cubing data with high dimensionality, the Druid folks decided instead to just be able to query really fast by storing the data in a distributed RAM-based cluster. With fast querying, there's no need to pre-compute and pre-aggregate.

LinkedIn contributed Samza to Apache in September, 2013. It's new and I haven't tried it yet, but it appears to integrate Kafka and YARN more deeply than the others, and it maintains state similar to how Spark Streaming allows with its RDDs (resilient distributed dataset) and its updateStateByKey() method.

While all the others are for self-hosted infrastructure, Amazon Kinesis is strictly for the could. Again, since I haven't used it myself, I rely on another blogger who informs that Kinesis is restricted to a single procedure, and so isn't suited for complex processing. There is also no rollback capability.

Stream Computation Techniques

It's evident fairly quickly that the mean can be computed in an incremental manner. If m_n is the mean of x₁ ... x_n, then m_n+1 = (n*m_n + x_n+1)/(n+1).

Much less obvious is that it's possible to calculate the standard deviation incrementally. Donald Knuth credits a 1962 paper by Welford with a clever technique.

Even more clever is calculating cardinality incrementally, i.e. the number of distinct items in a set, i.e. SELECT COUNT DISTINCT(*) -- and doing so without having to keep the entire dataset in memory. The trick is to use a hash function that uniformly distributes to the range [0.0,1.0) and then keep track of the most number of leading zeroes. E.g. if the set as 100 items, probabilistically we would expect the smallest hash value to end up being around 0.01, or one leading zero. For a set of 1000 items, some hash values by chance would probably end up being less than that, and the smallest overall would probably end up being around 0.001, or two leading zeroes.

Finally, classic machine learning algorithms such as K-means and decision tree construction have been adapted to incremental versions for streaming data. Joao Gama, a Portuguese professor, has assembled them into an excellent book:

Applications

Basic metrics permit real-time monitoring for:

Fraud detection
Fault detection
Capacity overload detection

Streaming machine learning adds:

Intelligent automated adjustment of thresholds to the above
Ability to detect opportunities, not just problems. E.g. product affinity of noticing products long-time product A and brand-new product B could alert management to craft a marketing campaign for product B to those who have bought A in the past.

Conclusion

While "real-time data science" doesn't mean taking conventional, manual, deep data exploration and analysis and applying it to real-time data, what it does mean for data scientists is that they are increasingly feeling the pressure to have some results automated and real-time, and data scientists need to be comfortable with these streaming technologies and algorithms.

Saturday, December 21, 2013

Causation

No doubt you've encountered the image below from Gizmodo in some PowerPoint somewhere this year.

But that same PowerPoint likely didn't bother to answer the next logical question:

How to get to causality?

It's not an easy question to answer. Having really, really good correlation is definitely not the answer. First a couple of counterexamples.

Common ancestral cause

Putting aside spurious correlations such as the one above, the much more common scenario is that of a common cause, such as shown below. Finding the correlation of "street wet" and "hair wet" in some data set does not lead to the conclusion that one follows from the other.

Indirect cause

Well, the way to avoid coincidental correlation is apply the "gold standard" of statistics and scientific experimentation, the controlled randomized experiment, right? Consider the following experiment. We set up a bunch of plugged-in microwaves, each with its own cup of room temperature water with a tea bag inserted. For a randomized half of the set of microwaves, we push the "start" button on the microwave (the "treatment" to use the terminology from randomized experimentation), and on the other half we do not push the button.

The results are highly correlated. We've employed the gold standard of scientific experimentation. Can we say that finger pushing causes hot tea? In a sense, yes, but not in the common sense of the word. What happened?

Three things. First, the finger pushing is an indirect cause, as shown below.

Second, to use the terminology from the study of causation, finger pushing is a sufficient cause in the context of the experimental conditions, but it is not necessary. It is not necessary because we could have also arrived at hot tea by opening up the microwave and assaulting the cup with a blowtorch. Although there are a lot of common sense uses of the word "causation" that lack "necessity", the strongest types of causation are both necessary and sufficient.

Third, pushing the button on the microwave is really just a contributory cause, isn't it? Our experimental assumptions included that the microwave was plugged in, so it's getting it's energy from there, as shown below.

Contributory Cause

And so on... we could trace the right branch all the way back to the sun, the Big Bang, and the Aristotelian First Cause. But just that "Electricity generated from coal" makes a much better common sense "cause" than does finger pushing. It's because that is where the "motion" is coming from -- the turbine spinning at the power plant is causing the water to heat up in our microwave. The finger pushing is merely a contributory cause.

Scientific Method

Now the confession, and the heart of the matter. I've (mis)led you down this path to illustrate the complexities of determining causation. Causation is the meat of the philosophical giants around the world and throughout time.

But what about the scientific method? Isn't that supposed to allow us to establish causation through repeated experimentation, at least the "sufficiency" form of causation?
Let's review the process of the scientific method:

Hypothesize
Experiment
Analyze

It's all scientific and straightforward, right? Pay closer attention to step #1. Hypothesize. What is happening there? That's where the magic of causation is happening, and not in step #3, analysis and statistics, where it is often presumed. Hypothesis forming comes from the advanced intellect of people. People form models of how things work in their minds and form hypothesis, and then try to verify or disprove those hypotheses.

Models

Just to be clear, I am not referring to statistical models, but rather models like the atomic model or an engineering model of a car engine. These models can only come from the minds of people (or AI that mimics in some way the minds of people -- e.g. automated Bayesian network construction or automated semantic analysis). Models don't get spit out from a statistical analysis.

And recall the complexities of the philosophy of causation discussed at the beginning of this post. They are still there. So causation requires modeling and philosophy, both of which are hard and messy.

That can't fit on a single PowerPoint, so it's no wonder there's not a slide on it following the infamous IE/Murder correlation slide.

Wednesday, December 4, 2013

SC13 reports

Last month, I attended SC13, the annual International Supercomputing conference sponsored by IEEE and ACM. Below are links to my three reports from my attending there:

Game changer for HPC: IBM and NVidia novel architecture

GPU is common ground between HPC and Hadoop

SC13 wrap-up: Exotic computing technologies

Thursday, September 12, 2013

Ctrl-C shutdown hook in Scala

If you need to intercept Ctrl-C, SIGINT, SIGHUP, etc. in Scala, and handle it on the main thread, it is necessary to join() back to the main thread from within the shutdown hook as below because, unlike C signal handlers, the process terminates at the end of the shutdown handler, even though there is no explicit call to System.exit().

object ctrlctest {
  @volatile var keepRunning = true

  def main(args: Array[String]) {
    val mainThread = Thread.currentThread();
    Runtime.getRuntime.addShutdownHook(new Thread() {override def run = {
      println("inside addShutDownHook handler")
      keepRunning = false
      mainThread.join()
    }})
    while (keepRunning) {println("in while loop")}
    // Graceful shutdown code goes here that needs to be on the main thread
    println("Exiting main")
  }
}

And the resulting output is:

in while loop
in while loop
in while loop
inside addShutDownHook handler
in while loop
Exiting main

Thursday, August 29, 2013

Selecting Pandas data with list comprehension

Given a Dataframe, the Pandas "ix" method allows you to "query" the Dataframe with a condition that resembles a SQL WHERE clause:

df=DataFrame({'rank':[70,21,1000000],
              'domain':['www.cnn.com','www.msn.com','s.down.bad']})
df

	domain	rank
0	www.cnn.com	70
1	www.msn.com	21
2	s.down.bad	1000000

df.ix[df['rank']<100,:]

	domain	rank
0	www.cnn.com	70
1	www.msn.com	21

But even though "df['rank']<100" on its surface resembles a SQL WHERE clause, recall that the .ix method, like the R data frame after which it was patterned, isn't really taking a WHERE clause as its first parameter. It's taking an array of booleans. The df['rank']<100 is returning an array of booleans due to NumPy's broadcasting rules.

However, NumPy doesn't support every possible operator and function. For example, Numpy does not have string functions, since it is, after all, a numeric library. For composing Pandas dataframe selections based on string functions, we can use Python "list comprehension" to generate a list of booleans (which the ix method will accept as its first indexing parameter). A Pandas equivalent of SQL

SELECT *
FROM   df
WHERE  domain LIKE '%s%'

might be

df.ix[['s' in x for x in df['domain']],:]

	domain	rank
1	www.msn.com	21
2	s.down.bad	1000000

But as of Pandas 0.8.1 (released in 2012), Pandas supports operations on vectors of strings, similar to NumPy via str. Using str, the above Python list comprehension can be eliminated and replaced with the more simple alternative below. The .str.contains returns the array of booleans that .ix needs.

df.ix[df['domain'].str.contains('s'),:]

So Python list comprehension is not needed for simple numeric conditions (due to NumPy's broadcasting) or simple string conditions (due to Pandas string vectorization). More complex conditions, though, may still require Python list comprehension. As an example, we can query rows from the above data frame where only those websites are currently up:

import urllib2
def isup(domain):
    try:
        con = urllib2.urlopen(urllib2.Request('http://'+domain))
        return con.getcode() == 200
    except:
        return False

df.ix[[isup(x) for x in df['domain']],:]

	domain	rank
0	www.cnn.com	70
1	www.msn.com	21

Wednesday, August 21, 2013

Unsquish Pandas/Matplotlib bar chart x labels

For a line plot, Matplotlib intelligently chooses x axis ticks and labels. But for bar charts, it blindly tries to print one for each bar, regardless of how many bars there are or how small they are. This can result in labels overprinting each other.

To see an example and the corresponding solution, see my IPython Notebook Solving x axis overprinting on Pandas/Matplotlib bar charts on GitHub.

Before:

After:

Friday, August 9, 2013

Added PNG support to ipyD3: better nbconvert compatibility

Following the suggestion of the original author of ipyD3, I added PNG capability to my fork of his ipyD3. Then, selecting "png" instead of "html" generates output that appears nearly identical. Example invocation:

d3.render(mode=('show','png'))

The disadvantage of PNG over the former HTML rendering is that it precludes any possibility of mouse interaction and animation. The advantage is that nbconvert will convert multiple PNG D3 renderings in the same Notebook, whereas with the HTML renderings nbconvert seemed to give up after the first one. This is using the last version of nbconvert before it was merged into the IPython project. I have not tried the beta versions of IPython 1.0; I'm waiting for the Anaconda release.

Saturday, August 3, 2013

BootIt Bare Metal for testing semi-embedded systems

I use the term "semi-embedded system" to refer to a PC loaded with hardware, such as data acquisition A/D devices, motion controllers, digital I/O lines, etc. When I create installers for semi-embedded software I write, I like to make them as turnkey as possible. That means ensuring, through testing, that the installs work on fresh (and other not-so-fresh, but controlled and known) copies of Windows. BootIt Bare Metal (BIBM) by Terabyte Unlimited is indispensable for testing installs that include device drivers and other software that are not so easily uninstalled (sometimes from including very specialized software as sub-installs, which often do not come with clean uninstallers).

BIBM is a multi-booter like Grub and the built-in Windows boot menu, but so much more. It's also a partition editor like PartitionMagic or GParted, and a backup facility like Ghost. Below is a screenshot of my boot menu.

As you can see, the three operating systems I can boot from are: Windows 7 main, Ubuntu, and Windows 7 test. Below is a list of the partitions I've configured with BIBM.

But wait, how can there be so many partitions? Aren't you limited to just four? BIBM supports its own type of extended partitions (information about which are stored in that special BootIt EMBRM partition) and swaps them in and out of the regular max-four-partition MBR on the fly. That not only allows being able to multi-boot a large number of different operating systems (e.g. XP, W7, W8, Ubuntu 12, Ubuntu 13, etc.), it also enables the testing of installs. As you can see from the list of partitions, I keep clean copies of Windows 7 at the end of the hard drive -- specifically, one that is just a super-fresh install from the DVD, and the other that has Windows Updates run on it.

To test an install, I just copy and paste one of those saved partitions to the remaining blank area of the hard drive (denoted by BIBM with the line of hyphens "---"). I can do this iteratively to develop and debug an install that includes driver installs as sub-installs, without fear that I get "only one shot" to test it on fresh computer.

Even if an install doesn't involve drivers, this technique is useful from a licensing perspective. Windows 7 requires separate licenses for virtual machines, because Microsoft explicitly considers virtual machines to be separate machines. But I have not found where Microsoft forbids making backup copies on extra hard-drive partitions on a single machine. I am not a lawyer, so do not consider this to be legal advice that it is permissible to do so with a single license. But for VMs it is well-known you must have separate licenses.

Windows 8 and UEFI

Sadly, this technique is threatened with the advent of UEFI SecureBoot and Windows 8. BIBM still works, but the BIOS and OS must support a "legacy mode". It is reasonable to expect that "legacy mode" will become more rare in the future. TeraByte Unlimited is silent about whether it will support or bypass UEFI in the future.

Tips on installing Linux under BIBM

Installing Linux under BIBM requires a specific set of steps. The most important thing to remember is to install Grub onto the Linux partition (e.g. dev/sda2) rather than the MBR (e.g. dev/sda).

Additionally, the automatic install of Grub can sometimes fail. In that case, it is necessary to follow Terabyte Unlimited's instructions on manually installing Grub. If that doesn't work, it may be necessary to live-boot with the help of a Linux DVD into the Linux installation on your hard drive, and then reinstall the grub reinstaller:

sudo apt-get install --reinstall grub-pc

Monday, July 29, 2013

Choropleth in D3.js and Pandas (IPython Notebook)

UPDATE 2014-06-08: This post is outdated as it is for IPython Notebook 1.0. Please see GeoSparkGrams: Tiny histograms on map with IPython Notebook and d3.js for IPython Notebook 2.0.

A lot of people have done a mash-up of D3 with IPython Notebook. Some efforts created a floating overlay over the Notebook rather than creating the output in the standard Notebook inline format. More recent efforts have utilized the Notebook's publish_html() to generate the output inline. One of the more advanced such efforts, ipyD3, however, works only on Windows. I've forked his gist and modified the couple of lines to make it Mac compatible. There is a small chance it's still Windows-compatible with my changes, but I haven't tested it. I'm almost certain the changes allow it to work on Linux too, but again, I've only tested it on a Mac.
I posted a notebook that generates the Choropleth below.

Besides demonstrating how to use D3 from IPython Notebook, it also demonstrates use of geographical maps in D3, itself not straightforward (or at least not built-in).
To transfer a Pandas Dataframe to ipyD3, I convert it to 2D Numpy array. In this particular example, I could have instead just converted the Dataframe to a dict and then passed a dict to ipyD3, since that is one of the data types it is able to marshal to the Javascript, but I wanted to show a more general approach of passing any Dataframe to ipyD3. Numpy arrays preserve column order, unlike quick examples I found on the web of converting Dataframes to JSON (which use non-order preserving dicts as an intermediate form), but at the expense of stripping out the column names. If your custom D3.js code needs column names, you'll have to pass that in as an additional Javascript variable.
The map shape data comes from Wikipedia, which has each state conveniently identified by its two-letter postal code for the SVG id and by the SVG class name of "state". The unemployment data is just something I found on GitHub.
Before executing this example, you'll need to download the ipyD3.py from my gist and put it in the same directory as where you launch IPython Notebook from.
Printing is a challenge. The "long paper" PDF technique below works, but only on the first inline ipyD3. The publish_html technique employed by ipyD3 is not foolproof; full-fledged D3.js support is not expected in IPython Notebook until version 2.0 (and version 1.0 isn't even out yet).
UPDATE 2013-07-30: Forgot to mention that you also need to install and download PhantomJS.
UPDATE 2013-08-06: I discovered it's possible to convert a Pandas Dataframe to a Numpy array directly with just array() and dropping the .to_records().tolist(). Doing so drops the first column, the row indexes, but those are not usually needed. If you modify my example to omit the .to_records().tolist(), you'll also need to reduce each of the hard-coded Javascript column indexes by 1.
UPDATE 2014-06-08: This post is outdated as it is for IPython Notebook 1.0. Please see GeoSparkGrams: Tiny histograms on map with IPython Notebook and d3.js for IPython Notebook 2.0.

Friday, July 19, 2013

On Mac, only Firefox can PDF without page breaks

An alternative to nbconvert for IPython Notebook is to specify a long custom page size, such as 8.5"x60". This will work for any web page, not just IPython Notebooks. But on a Mac, this option has been removed from Safari 6 and is not available on the current Chrome version. Firefox still lets you, however:

In the Firefox drop-down menu, select File->Page Setup->Paper Size->Manage Custom Sizes...
In the Custom Paper Sizes dialog, click the "+" button beneath the list box on the left to add a new custom paper size.
Click the newly created name to change the name to something meaningful.
Change the Paper Size Height to something like 60 inches and click OK.
As normal, from the Firefox menu select File->Print and then in the Print dialog click PDF->Save as PDF.

Thursday, July 18, 2013

SERDEPROPERTIES required for Avro in Hive 0.11

To specify an Avro-backed Hive table, the Apache Wiki and the Cloudera Avro documentation both prescribe specifying the Avro schema in TBLPROPERTIES. This is no longer supported in Hive 0.11. It is now necessary to use SERDEPROPERTIES:

CREATE EXTERNAL TABLE mytable
  ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  WITH SERDEPROPERTIES (
  'avro.schema.url'='hdfs:///user/cloudera/mytable.avsc')
  STORED as INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat';

Otherwise if TBLPROPERTIES is used to specify the location of the Avro schema, Hive 0.11 won't be able to find it and the following exception will be thrown:

java.io.IOException(org.apache.hadoop.hive.serde2.avro.AvroSerdeException: Neither avro.schema.literal nor avro.schema.url specified, can't determine table schema)

Monday, July 15, 2013

Quick link to download nbconvert

As a quick update to my nbconvert post, the nbconvert code has been deleted from its github tip and merged into the main IPython project. The problem is, IPython 1.0 isn't out yet. To use nbconvert with the current IPython 0.13 release, you can download the last stand-alone nbconvert .zip archive from https://github.com/ipython/nbconvert/archive/173bb08dd86d02a7485801969c94d4816913cd09.zip.

UPDATE 2013-07-17: Announcement today on the ipython-dev mailing list that IPython 1.0 won't be released until early August.

Wednesday, July 3, 2013

How to install VS2012 Update 3

Update 3 for Microsoft Visual Studio 2012 was released on June 26, 2013, but installing it is obvious only via Internet-based install. I refuse to perform Internet-based installs for development tools, so that I can retain the ability to recompile my source code 10-20 years in the future, long after update servers have been shut down. Although an ISO for Update 2 was well-advertised, the ISO for Update 3, at this moment, is hidden and difficult to install. Below are the steps.

Download the Update 3 ISO from the non-advertised link http://go.microsoft.com/fwlink/?LinkId=301705.
Before executing the install, it may be necessary to update your Microsoft root certificates. This can be done via KB931125. Note that it may be necessary to visit that link twice -- once to give permissions and a second time to actually save/execute the needed rootsupd.exe.

Monday, July 1, 2013

nbconvert: PDF from iPython Notebook

UPDATE 2013-07-19: For details on the alternative to nbconvert briefly mentioned below, long custom PDF page sizes, see On Mac, only Firefox can PDF without page breaks.

UPDATE 2013-07-15: See Quick link to download nbconvert for updated download location.

The official iPython Notebook documentation states that to convert a notebook to PDF, you should use your browser's "Print to PDF" capability. The problem is that that chops charts and graphs in half due to PDF pagination (unless you are able to configure a custom PDF paper size e.g. 60 inches long).

A command-line utility, nbconvert, which will eventually be merged into IPython but is not yet, nicely converts Notebooks to PDF. It even includes nice instructions on installing on a Mac, but the instructions are only 98% complete. Below are the missing steps:

Download the nbconvert package as a Zip and unzip to your home directory.
Follow the steps from https://github.com/ipython/nbconvert/blob/173bb08dd86d02a7485801969c94d4816913cd09/README.rst, specifically:
1. pip install jinja2
2. pip install markdown
3. curl http://docutils.svn.sourceforge.net/viewvc/docutils/trunk/docutils/?view=tar > docutils.tgz
4. pip install -U docutils.tgz
5. pip install pygments
6. sudo easy_install -U sphinx
7. Install PanDoc via the installer http://code.google.com/p/pandoc/downloads/list
8. Install MacTex via the .pkg http://www.tug.org/mactex/

Execute to convert your .ipynb to .pdf:

export PYTHONPATH=~/nbconvert-master/nbconvert/utils
python ~/nbconvert-master/nbconvert1/nbconvert.py --format=pdf MyNotebook.ipynb

Wednesday, June 26, 2013

Create empty DataFrame in Pandas

It seems like it should be a simple thing: create an empty DataFrame in the Pandas Python Data Analysis Library. But if you want to create a DataFrame that

is empty (has no records)
has datatypes
has columns in a specific order

...i.e. the equivalent of SQL's CREATE TABLE, then it's not obvious how to do it in Pandas, and I wasn't able to find any one web page that laid it all out. The trick is to use an empty Numpy ndarray in the DataFrame constructor:

df=DataFrame(np.zeros(0,dtype=[
('ProductID', 'i4'),
('ProductName', 'a50')]))

Then, to insert a single record:

df = df.append({'ProductID':1234, 'ProductName':'Widget'})

UPDATE 2013-07-18: Append is missing a parameter:

df = df.append({'ProductID':1234, 'ProductName':'Widget'},ignore_index=True)

Tuesday, June 18, 2013

Histogram thumbnails inside YUI3 data grids

Sparklines are tiny graphs added inline to data tables or lists. Sparklines have no units or gridlines; the purpose is just to quickly convey the shape of the data, which is usually presented adjacent.

A powerful variant on sparklines are sparkline histograms, or "sparkgrams" as I like to call them. Sparkgrams can quickly and compactly convey data distributions: outliers, right/left skew, normal vs. uniform distribution, etc. Because they are small, they can be included right inside a data table for every column or row. Such a presentation is useful either for completely known data when embarking upon data analysis, or for newly received data of a known data type in, for example, a manufacturing quality control scenario.

Below is an example of how it might look. The first column demonstrates an outlier, while the second column conveys a normal distribution.

Below is the YUI3 code to render the above.

<!DOCTYPE html>
<head>
 <script src="yui/build/yui/yui-min.js"></script>
 <script>
  YUI().use('datatable', 'charts', function(Y) {
   var fmt = function(o) {
    if (o.value.substring && o.value.substring(0,5) === "chart")
     o.cell.set("innerHTML", "<div id='" + o.value +
                "' style='width:75px;height:75px'></div>");
    else
     o.cell.set("text", o.value);
    return false;
   }

   var table = new Y.DataTable({columns: [{"key":"Indoors",  "nodeFormatter":fmt},
                                          {"key":"Outdoors", "nodeFormatter":fmt}],
                                data: [{"Indoors": 73.3, "Outdoors": 86.2},
                                       {"Indoors": 73.5, "Outdoors": 86.5},
                                       {"Indoors": 73.5, "Outdoors": 86.6},
                                       {"Indoors": 50.7, "Outdoors": 86.8},
                                       {"Indoors": "chart0", "Outdoors": "chart1"}],
                                render: "#table"});

   for (col = 0; col < table._displayColumns.length; col++) {
    var a = Array();
    for (row = 0; row < table.get("data").size()-1; row++)
     a.push(table.getCell([row,col]).get("text"));
    var NUM_BINS = 3;
    var amin = Math.min.apply(Math, a);
    var binwidth = 1.001 * (Math.max.apply(Math, a)-amin) / NUM_BINS;
    var data=Array();
    for (i=0; i<NUM_BINS; i++)
     data.push({category:amin+i*binwidth, values:0});
    for (i=0; i<a.length; i++)
     data[Math.floor((a[i]-amin)/binwidth)].values++;
    new Y.Chart({dataProvider: data, type:"column",
                 axes: {values: {position:"none"}, category: {position:"none"}},
                 render:"#chart"+col});
   }
  });
 </script>
</head>
<body>
 <div id="table" class="yui3-skin-sam"></div>
</body>
</html>

Thursday, June 13, 2013

Query Hive from iPython Notebook

iPython Notebook together with pandas comprise a data analysis system that is essentially a clone of R. The advantage over R is that Python code can be more easily converted into production code and executed, for example, on a web server.

One limitation is that the Apache-recommended Python interface to Hive requires installing Hive locally, which can be problematic or inconvenient from, say, a laptop used for analysis. A nice alternative is remotely executing Hive on the Hadoop cluster (or adjacent Linux server with access to the cluster). The alternative does require installing sshpass. At the end of the code below, the query results are stored in a pandas DataFrame with column names and automatic detection of data types.

from pandas import *
from StringIO import StringIO

s = "sshpass -f myfilewithpassword ssh myusername@myhostname \"hive -S -e \\\"" \
"set hive.cli.print.header=true;" \
"SELECT * from mytable;\\\"\""

t = !$s
df = read_csv(StringIO(t.n), sep='\t')

UPDATE 2013-08-01: It's a bit cleaner if one uses Python's triple-quoting mechanism as below. This allows one to copy and paste queries between iPython Notebook and Hive and Hue without having to reformat with quotes and backslashes.

from pandas import *
from StringIO import StringIO

s = """
sshpass -f myfilewithpassword ssh myusername@myhostname \"
hive -S -e \\\"
set hive.cli.print.header=true;
SELECT * from mytable;
\\\"\"
"""

t = !$s
df = read_csv(StringIO(t.n), sep='\t')

Tuesday, May 28, 2013

New in Hive 0.11: Windowing and analytics functions

It's not getting a lot of press, but Hive 0.11, released May 15, 2013, introduced a ton of new keywords. Hive 0.11 supports the SQL analytic functions similar to what popular RDBMS's introduced a decade ago with the SQL2003 standard, with the following new Hive keywords:

LEAD
LAG
FIRST_VALUE
LAST_VALUE
OVER
WINDOW
UNBOUNDED
PRECEDING
FOLLOWING
RANK
ROW_NUMBER
DENSE_RANK
CUME_DIST
PERCENT_RANK
NTILE

Sadly, the documentation is scanty, to say the least. For now, documentation and tutorials (e.g. sqlatoms.com) on equivalent keywords in Oracle and SQL Server, for example, will have to suffice.

Being able to window obviates some of the need for subqueries (specifically, pre-SQL2003 subqueries that just provided a limited GROUP BY in a separate SELECT column), which of course Hive does not support.

Monday, May 20, 2013

Arbitrary cell formatting in YUI3 DataTable

YUI is an underrated JavaScript framework, in my opinion. I suspect it isn't very popular because it had dependencies on Flash as late as 2011, and because there were major API changes between YUI2 and YUI3, resulting in a split community and a paucity of documentation and forum help for YUI3. But YUI3 is a rich and free collection of rich web front-end components.

An example of weak documentation and examples is custom formatting of DataTable cells. What was the "formatter" property in YUI2 has been split in YUI3 into "formatter" for changing the just cell text and "nodeFormatter" for changing (actually, supplying) the cell text and the cell property. Since when using a nodeFormatter, one must explicitly transfer the contents of the cell data to the cell display, I decided: why not just embed the formatting right into the cell data? Normally, it is expected that you would maintain a separate external array containing the formatting (e.g. an array of Booleans indicating which cells should be highlighted in red), but with the technique below, the formatting information is embedded right with the cell data in the form of a JavaScript/JSON object for each cell.


<!DOCTYPE html>
<head>
 <script src="yui/build/yui/yui-min.js"></script>
 <script>
  var fmt = function(o) {
      if (o.value) {
          if (o.value.value)
              o.cell.set('text', o.value.value);
          if (o.value.classname)
              o.td.setAttribute("class", o.value.classname);
      }
      return false;
  }

  var columns = [{"key":"Part #", "nodeFormatter":fmt},
                 {"key":"Part Name", "nodeFormatter":fmt}];
  var data = [{"Part #":{"value":1234},
               "Part Name":{"value":"Capacitor"}},
              {"Part #":{"value":5678},
               "Part Name":{"value":"Resistor", "classname":"redbackground"}}];

  YUI().use('datatable', function(Y) {
   new Y.DataTable({columns: columns, data: data, render: "#table"});
  });
 </script>
 <style>
  .redbackground { background: red; }
 </style>
</head>
<body>
 <div id="table" class="yui3-skin-sam"></div>
</body>
</html>

UPDATE 2013-05-23: Thanks to Luke Smith's post at the YUI forum, it turns out it is possible to use formatter, which is more performant than nodeFormatter. The trick, though, is using formatter requires a more specific CSS selector. Full code below.

<!DOCTYPE html>
<head>
 <script src="yui/build/yui/yui-min.js"></script>
 <script>
  var fmt = function(o) {
      if (o.value.classname)
          o.className = o.value.classname;
      return o.value.value;
  }

  var columns = [{"key":"Part #", "formatter":fmt},
                 {"key":"Part Name", "formatter":fmt}];
  var data = [{"Part #":{"value":1234},
               "Part Name":{"value":"Capacitor"}},
              {"Part #":{"value":5678},
               "Part Name":{"value":"Resistor", "classname":"redbackground"}}];

  YUI().use('datatable', function(Y) {
   new Y.DataTable({columns: columns, data: data, render: "#table"});
  });
 </script>
 <style>
  #table .redbackground { background: red; }
 </style>
</head>
<body>
 <div id="table" class="yui3-skin-sam"></div>
</body>
</html>

Monday, May 13, 2013

Escaping the HTML5 sandbox with Qt

HTML5 has a large number of powerful APIs that make it a candidate for cross-platform development of rich desktop-type apps. One last missing piece stands in the way: unfettered access to the local filesystem: local files, thumb drives, and network shares. Perhaps someday there will be a facility to "sign" HTML5 apps to allow them access to the local filesystem, but it is not the case today.

Qt is a stopgap measure to allow you to keep 98% of your app code in HTML5, where the other 2% is Qt C++ to access the local filesystem. This preserves cross-platform capacity and potential, including to the cloud. (In contrast, a conventional Qt app without HTML5 would not be able to be hosted on the cloud.)

Below are the slides from the presentation I gave on May 13, 2013 at the Denver HTML5 Meetup group. And below that is a YouTube with audio narration, recreating the presentation.

Escaping thesandboxdenverhtml5meetup20130514 from michaelmalak

Tuesday, April 30, 2013

Simulating theta joins in Hive

Hive famously only supports "equijoins" (equality of columns between two tables) and not "theta joins" (inequalities). Starting with Hive 0.10, a theta join can be simulated with a cross join. Instead of

-- Does not work!
SELECT a.x
FROM a
JOIN b
ON a.y <= b.y

One can do

-- Hive 0.10 and later
SELECT a.x
FROM a
CROSS JOIN b
WHERE a.y <= b.y

With CROSS JOIN being new to Hive 0.10, for earlier versions of Hive, the workaround is to create additional dummy columns in tables a and b called "one", populated all with the number 1. Then it becomes:

-- Hive 0.9 and earlier
SELECT a.x
FROM a
JOIN b
ON a.one=b.one
WHERE a.y <= b.y

Now, a cross join of course requires a lot of space, the product of the table sizes to be exact. It works fine if that product is in the millions, but if it's in the billions or more, YMMV.

Wednesday, April 24, 2013

Presentation on Spark

Last night I gave a presentation on Spark, the exciting replacement for Hadoop's Map/Reduce that's come out of Berkeley and not tied to any particular distribution of Hadoop. Slides and video are below (video is cued to start at my part at 12:25, and it goes for about 35 minutes).

Spark andshark20130417 from michaelmalak

Thursday, March 21, 2013

Kafka tips

Three tips on using Kafka:

Use buffer.remaining() instead of buffer.array().

Kafka's message.payload() returns a ByteBuffer, and if you need access to the raw bytes, it's tempting to just call message.payload().array(), but that would be the wrong thing to do. That returns the backing array to the ByteBuffer, which may (and in the case of Kafka usually is) larger than the data it represents, with offsets and limits maintained by ByteBuffer.

Sadly, there is no single-line solution. Below is the most concise way to get the raw bytes out of a Kafka message and into a plain vanilla byte[] array.
```
ByteBuffer buffer = message.payload();
byte[] bytes = new byte[buffer.remaining()];
buffer.get(bytes);
```
If sending Avro over Kafka, then serialize using EncoderFactory instead of DataFileWriter with ByteArrayOutputStream, for two reasons:
- DataFileWriter writes out an entire Avro file, complete with schema. Sending the schema with every Kafka message is a waste.
- If you plan to consume Kafka messages with Camus, then you need to use EncoderFactory because Camus uses DecoderFactory.
encoder.flush()
If sending Avro over Kafka and using EncoderFactory, then don't forget to call encoder.flush() after writing, as in this Avro Hello World example.

Saturday, March 9, 2013

Automatically deskew before machine learning in R

A common data pre-processing step in R is to deskew data, which is where if a histogram shows a lopsided distribution, apply a function such as log() before fitting a model. If there are a large number of columns, it can be tedious to eyeball each histogram, and manually substitute offending columns with their log() counterparts.

Helpfully, the e1071 package (notably for its support vector machine algorithms) provides a handy function to measure the skewness of data, called skewness(). Below is a function to automatically deskew an entire range of columns of a data frame.

deskew <- function(df, mincol=1, maxcol=ncol(df), threshold=1.10) {
  for (i in mincol:maxcol) {
    t <- log(1+df[[i]]-min(df[[i]]))
    if (abs(skewness(df[[i]])) > threshold * abs(skewness(t)))
      df[[i]] <- t
  }
  df
}

Deskewing data improves the performance of linear models, both regular lm()/glm() and linear svm() support vector machines. Understandably, it doesn't help with decision trees such as randomTrees().

Monday, March 4, 2013

Fixing Hive after Cloudera 4.20 update

The Cloudera 4.20 update breaks Hive, producing errors such as

*org.datanucleus.store.rdbms.exceptions.MissingTableException: Required
table missing : "`SKEWED_STRING_LIST`" in Catalog "" Schema "". DataNucleus
requires this table to perform its persistence operations. Either your
MetaData is incorrect, or you need to enable "datanucleus.autoCreateTables")

It's because the deployment script doesn't call the Hive metastore upgrade script.

To manually upgrade the Hive metastore from the 0.9 schema to 0.10 (assuming you are using MySQL for the Hive metastore as recommended by Cloudera):

cd /usr/lib/hive/scripts/metastore/upgrade/mysql
sudo mysql --user=root
use metastore;
source upgrade-0.9.0-to-0.10.0.mysql.sql

Friday, March 1, 2013

Strata 2013 Santa Clara trip report

Strata, the O'Reilly conference on Big Data, just wrapped up Feb 26-28, 2013 in Santa Clara. The story of 2013 is we're entering the post-Hadoop -- read all the juicy details about Big Data, Data Science, and Visualization in my daily trip reports:

Strata Trip Report: Days 0 & 1
Strata Trip Report: Day 2
Strata Trip Report: Day 3

Tuesday, February 19, 2013

Hive STRUCT tips

Hive supports SQL1999-style STRUCTs (and ARRAYs), whether they be native Hive tables, Avro-backed tables, or other SerDe.

Tip 1: INSERT

While querying is straightforward using the dot notation familiar from Java, etc.

SELECT z.a FROM oc;

INSERTing requires use of the peculiar named_struct:

echo 1,2 >twovalues.csv
hive
CREATE TABLE tc (x INT, y INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
LOAD DATA LOCAL INPATH 'twovalues.csv' INTO TABLE tc;
CREATE TABLE oc (z STRUCT<a: int, b: int>);
INSERT INTO TABLE oc SELECT named_struct('a', x, 'b', y) FROM tc;

Tip 2: NULLs

HiveQL DDL has no option in its CREATE TABLE syntax to specify that a column be NULLable. That's because in the Hadoop/HDFS/Big Data philosophy, data is in general allowed to be missing, and that translates nicely to the SQL concept of NULL in Hive/HiveQL.

However, that doesn't extend to STRUCTs in HiveQL. Apparently neither columns within STRUCTs nor the STRUCTs themselves can be NULLs.

Thus there is this odd restriction: while all regular Hive columns must be NULLable, all STRUCT Hive columns must be non-NULLable. That's quite a polarization of behavior for a DDL that doesn't even have a keyword "NULL".

The following:

INSERT INTO TABLE oc SELECT null FROM tc;

produces the error

FAILED: SemanticException [Error 10044]: Line 1:18 Cannot insert into target
table because column number/types are different 'oc': Cannot convert column 0
from void to struct<a:int,b:int>.

And the following:

INSERT INTO TABLE oc SELECT named_struct('a', null, 'b', null) FROM tc;

produces the error:

FAILED: SemanticException [Error 10044]: Line 1:18 Cannot insert into target
table because column number/types are different 'oc': Cannot convert column 0
from struct<a:void,b:void> to struct<a:int,b:int>.

Thursday, February 7, 2013

Avro can append in HDFS after all

As an update to the previous post, it turns out Avro can append to an HDFS file after all. The capability is so obscure that even the experts on the Apache Avro mailing list weren't sure of the succinct way of accomplishing it. Sample code from GitHub.

But the demand for this capability will only grow, now that Hive supports Avro directly as of the Hive 0.10 release in January.

Tuesday, February 5, 2013

Avro Tips

Apache Avro is a binary compressed fixed-schema file format meant for Hadoop. As of Hive 0.9.1 and 0.10, Hive can read Avro files directly. Cloudera also back-ported Avro compatibility for its Cloudera Hive 0.9.0.

Avro is highly structured, supporting nested records and arrays, and that makes it a good match for Hive, whose HiveQL syntax adopted SQL1999-style records/structures and arrays.

Avro is a great space-saving alternative to JSON, especially since it's not possible for Apache Pig to read gz-compressed JSON.

This is a collection of tips on using Avro -- most of them can be found here and there on the web, but here they all are in one place.

Schema tips

Below is an example schema that shows nested structures, nullable fields, optional sub-structures, arrays of structures, and arrays of strings.

{"name": "event", "type": "record", "fields": [
  {"name": "logLevel", "type": ["null", "int"], "default": null},
  {"name": "details", "type": 
    {"type": "record", "name": "detailstype", "fields": [
      {"name": "eventname", "type": ["null", "string"], "default": null},
    ]}
  },
  {"name": "metadata", "type": ["null",
    {"type": "record", "name": "metadatatype", "fields": [
      {"name": "referrer", "type": "string"},
    ]}]
  },
  {"name": "messages",   "type": ["null",
    {"type": "array", "name": "messagestype", "items": 
      {"name": "message", "type": "string"}
    }],
    "default": null
  }
]}

Declaring a 1-to-1 sub-record requires the verbose style above, where the "type" is declared to again to be a "type", which is of "record".
In order for a field to be "nullable", it must be declared as a UNION with type "null" as shown above for "type" using the JSON array syntax.
In order to avoid having to explicitly set a field when writing out an Avro record, it must be declared as "default": null
Combining the above two points, in order to declare a field nullable and default: null, the "null" in the UNION must appear first because it is the first type in a UNION list that determines the data type for the default, and "null" is an invalid default for all data types (e.g. "string", "int", etc.) with the exception of the "null" data type.
The above applies to sub-records as well. In the example above, "details" is a required sub-record (with an optional field "eventname"), whereas "metadata" is an optional sub-record (with, if the sub-record is present, a required field "referrer").
When writing out sub-records using the "generic" schema rather than the "specific" schema, the Java code to instantiate an empty sub-record is
```
GenericData.Record childdatum =
new GenericData.Record(schema.getField(fieldname).schema());
```
However, if the sub-record is optional (nullable), then to dig past the UNION, the Java code becomes
```
GenericData.Record childdatum =
new GenericData.Record(schema.getField(fieldname).schema().getTypes().get(1));
```
In both cases, the childdatum is populated and added to the parentdatum via
```
childdatum.put("aChildStringField", "A string value");
parentdatum.put(fieldname, childdatum);
```
Appending to existing Avro files in HDFS is not supported, even though appending to existing Avro files on the local filesystem and appending to existing non-Avro HDFS files are both supported. The open one-year-old unassigned JIRA ticket for such a feature is AVRO-1035.

To view a .avro file:

java -jar avro-tools-1.7.3.jar tojson mydata.avro

Hive using Avro

Example schema:

CREATE EXTERNAL TABLE mytable
  ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED as INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
  TBLPROPERTIES (
    'avro.schema.url'='hdfs:///user/cloudera/mytable.avsc');

[UPDATE 2013-07-18: The above Hive table creation TBLPROPERTIES no longer works with Hive 0.11. Instead one must use SERDEPROPERTIES.]

Although Apache documentation gives examples of 'avro.schema.url'='file:///home/...' that actually doesn't work for any non-trivial Hive query because map/reduce doesn't have access to the local filesystem.
It's not required to specify LOCATION for where the external Hive data will be stored. If it is not specified, then Hive creates the directory in HDFS of /user/hive/warehouse/<tablename>
It is possible to arbitrarily dump a bunch of Avro files into a Hive table directory, and Hive will automatically read them all when executing a query.
Not specific to Avro, but it is possible to specify one or more levels of PARTITION for a Hive table. Each level of PARTITION corresponds to another level of directory underneath /user/hive/warehouse/
(e.g. for a two-level PARTITION one might have a directory /user/hive/warehouse/mytable/datestring=2013-02-01/useridfirstchar=2). Using the point above, one can dump multiple Avro files into one such leaf directory of a PARTITIONed table.
However, one cannot just merely create partition directories in /user/hive/warehouse/<tablename> because the partition names also need to be added to the Hive metastore, which can be done with HiveQL via either the Hive CLI or JDBC:
```
ALTER TABLE mytable ADD IF NOT EXISTS PARTITION (ds='2013-02-01')
```
Also, note that contrary to examples in Apache documentation, "date" is not a valid partition name as it is a reserved word. "ds" (for "date string") is a popular alternative.
Hive supports SQL1999-style nested structures and nested arrays out of the box, and so is a good match for highly-structured Avro schemas. E.g., below is a valid Hive query against the example schema and CREATE TABLE from above:
```
SELECT messages[0].functionname FROM event
```
Hive also has functions for array handling such as "explode", which returns array elements from a single column as multiple result set rows. https://cwiki.apache.org/Hive/languagemanual-udf.html#LanguageManualUDF-explode
Not specific to Avro, but SHOW CREATE TABLE is not supported in Hive until 0.10. To do so in Hive 0.9, download HiveShowCreateTable.jar from https://issues.apache.org/jira/browse/HIVE-967 and then
```
hive --service jar HiveShowCreateTable.jar com.media6.hive2rdbms.job.ShowCreateTable -D db.name=default -D table.name=mytable 2>/dev/null
```

Thursday, January 24, 2013

ElephantBird now enables Hello World from Pig

In learning Apache Pig, I was surprised at how difficult it is to write "Hello World." From http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#Constants I would have thought the code below would have been legal:

-- Illegal Pig syntax
A = {('Hello'),('World')};
DUMP A;

However, it produces syntax errors. The only way to create a relation -- the basic "variable" in Pig -- is through LOADing a file. As a "language lawyer" within whatever team I was in dating back to the K&R era, it bothers me that a "Hello World" program is impossible to write within a single code file. It makes it seem like Pig Latin is an incomplete language.

The people at Twitter's ElephantBird project have come up with a custom solution in response to my request on the Pig User mailing list.
http://mail-archives.apache.org/mod_mbox/pig-user/201301.mbox/%3CCAE7pYjZtwuxYZs6Ov54P-6SFRCkKPuL9Jwac9i-Rr%2BYsdhasNw%40mail.gmail.com%3E

This ElephantBird Java class allows converting what normally would be the filename specified with the LOAD command into a tuple. A hack. That works. But not without invoking code not distributed with Pig and not without ugliness.

languages = load 'en,fr,jp' using LocationAsTuple(',');

Sunday, January 20, 2013

SQL HAVING in R

Coming from a SQL background, I'm learning R, and wanted to do the equivalent of GROUP BY HAVING (well, really embedded in a subquery in order to subset the data), but the most obvious Google searches turned up nothing. The answer is probably a no-brainer for R experts, but here it is in case future R-novices-SQL-experts Google for it.

Taking the example data set chickwts,

> data(chickwts)
> summary(chickwts)
    weight         feed
Min.   :108.0   casein   :12
1st Qu.:204.5   horsebean:10
Median :258.0 linseed :12
Mean   :261.3 meatmeal :11
3rd Qu.:323.5 soybean  :14
Max. :423.0 sunflower:12

and supposing we want to exclude "low" popularity feeds that occur in the data set fewer than 12 times (yes, this is a contrived example), the below will discard those low-popularity feeds, leaving only the high-popularity feeds.

x <- sapply(split(chickwts,chickwts$feed), nrow)
chickwts <- chickwts[chickwts$feed %in% names(x[x>=12]),]

Thursday, January 3, 2013

Laravel PHP/MySQL/CentOS garbled strings

A quick but obscure tidbit for something that consumed my day today (reported here only because it wasn't reported anywhere else on the web):

If you're using the Laravel PHP framework with MySQL running on CentOS, you probably need to change the charset in application/config/database.php to be "latin1" instead of Laravel's example of "utf8". Otherwise, you'll get garbled strings.

CentOS (even the current 6.3 distribution) comes with MySQL 5.1, which is from before Oracle acquired it and modernized it (e.g. with things like foreign key constraints). Another modernization is that whereas MySQL 5.1 defaults to the latin1 character set, its successor MySQL 5.5 defaults to the utf8 character set. Laravel's example database.php connection specifies utf8, so unless you've manually upgraded CentOS to MySQL 5.5, you will need to change database.php to specify the latin1 character set.