Technical Tidbit of the Day: July 2013

Monday, July 29, 2013

Choropleth in D3.js and Pandas (IPython Notebook)

UPDATE 2014-06-08: This post is outdated as it is for IPython Notebook 1.0. Please see GeoSparkGrams: Tiny histograms on map with IPython Notebook and d3.js for IPython Notebook 2.0.

A lot of people have done a mash-up of D3 with IPython Notebook. Some efforts created a floating overlay over the Notebook rather than creating the output in the standard Notebook inline format. More recent efforts have utilized the Notebook's publish_html() to generate the output inline. One of the more advanced such efforts, ipyD3, however, works only on Windows. I've forked his gist and modified the couple of lines to make it Mac compatible. There is a small chance it's still Windows-compatible with my changes, but I haven't tested it. I'm almost certain the changes allow it to work on Linux too, but again, I've only tested it on a Mac.
I posted a notebook that generates the Choropleth below.

Besides demonstrating how to use D3 from IPython Notebook, it also demonstrates use of geographical maps in D3, itself not straightforward (or at least not built-in).
To transfer a Pandas Dataframe to ipyD3, I convert it to 2D Numpy array. In this particular example, I could have instead just converted the Dataframe to a dict and then passed a dict to ipyD3, since that is one of the data types it is able to marshal to the Javascript, but I wanted to show a more general approach of passing any Dataframe to ipyD3. Numpy arrays preserve column order, unlike quick examples I found on the web of converting Dataframes to JSON (which use non-order preserving dicts as an intermediate form), but at the expense of stripping out the column names. If your custom D3.js code needs column names, you'll have to pass that in as an additional Javascript variable.
The map shape data comes from Wikipedia, which has each state conveniently identified by its two-letter postal code for the SVG id and by the SVG class name of "state". The unemployment data is just something I found on GitHub.
Before executing this example, you'll need to download the ipyD3.py from my gist and put it in the same directory as where you launch IPython Notebook from.
Printing is a challenge. The "long paper" PDF technique below works, but only on the first inline ipyD3. The publish_html technique employed by ipyD3 is not foolproof; full-fledged D3.js support is not expected in IPython Notebook until version 2.0 (and version 1.0 isn't even out yet).
UPDATE 2013-07-30: Forgot to mention that you also need to install and download PhantomJS.
UPDATE 2013-08-06: I discovered it's possible to convert a Pandas Dataframe to a Numpy array directly with just array() and dropping the .to_records().tolist(). Doing so drops the first column, the row indexes, but those are not usually needed. If you modify my example to omit the .to_records().tolist(), you'll also need to reduce each of the hard-coded Javascript column indexes by 1.
UPDATE 2014-06-08: This post is outdated as it is for IPython Notebook 1.0. Please see GeoSparkGrams: Tiny histograms on map with IPython Notebook and d3.js for IPython Notebook 2.0.

Friday, July 19, 2013

On Mac, only Firefox can PDF without page breaks

An alternative to nbconvert for IPython Notebook is to specify a long custom page size, such as 8.5"x60". This will work for any web page, not just IPython Notebooks. But on a Mac, this option has been removed from Safari 6 and is not available on the current Chrome version. Firefox still lets you, however:

In the Firefox drop-down menu, select File->Page Setup->Paper Size->Manage Custom Sizes...
In the Custom Paper Sizes dialog, click the "+" button beneath the list box on the left to add a new custom paper size.
Click the newly created name to change the name to something meaningful.
Change the Paper Size Height to something like 60 inches and click OK.
As normal, from the Firefox menu select File->Print and then in the Print dialog click PDF->Save as PDF.

Thursday, July 18, 2013

SERDEPROPERTIES required for Avro in Hive 0.11

To specify an Avro-backed Hive table, the Apache Wiki and the Cloudera Avro documentation both prescribe specifying the Avro schema in TBLPROPERTIES. This is no longer supported in Hive 0.11. It is now necessary to use SERDEPROPERTIES:

CREATE EXTERNAL TABLE mytable
  ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  WITH SERDEPROPERTIES (
  'avro.schema.url'='hdfs:///user/cloudera/mytable.avsc')
  STORED as INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat';

Otherwise if TBLPROPERTIES is used to specify the location of the Avro schema, Hive 0.11 won't be able to find it and the following exception will be thrown:

java.io.IOException(org.apache.hadoop.hive.serde2.avro.AvroSerdeException: Neither avro.schema.literal nor avro.schema.url specified, can't determine table schema)

Monday, July 15, 2013

Quick link to download nbconvert

As a quick update to my nbconvert post, the nbconvert code has been deleted from its github tip and merged into the main IPython project. The problem is, IPython 1.0 isn't out yet. To use nbconvert with the current IPython 0.13 release, you can download the last stand-alone nbconvert .zip archive from https://github.com/ipython/nbconvert/archive/173bb08dd86d02a7485801969c94d4816913cd09.zip.

UPDATE 2013-07-17: Announcement today on the ipython-dev mailing list that IPython 1.0 won't be released until early August.

Wednesday, July 3, 2013

How to install VS2012 Update 3

Update 3 for Microsoft Visual Studio 2012 was released on June 26, 2013, but installing it is obvious only via Internet-based install. I refuse to perform Internet-based installs for development tools, so that I can retain the ability to recompile my source code 10-20 years in the future, long after update servers have been shut down. Although an ISO for Update 2 was well-advertised, the ISO for Update 3, at this moment, is hidden and difficult to install. Below are the steps.

Download the Update 3 ISO from the non-advertised link http://go.microsoft.com/fwlink/?LinkId=301705.
Before executing the install, it may be necessary to update your Microsoft root certificates. This can be done via KB931125. Note that it may be necessary to visit that link twice -- once to give permissions and a second time to actually save/execute the needed rootsupd.exe.

Monday, July 1, 2013

nbconvert: PDF from iPython Notebook

UPDATE 2013-07-19: For details on the alternative to nbconvert briefly mentioned below, long custom PDF page sizes, see On Mac, only Firefox can PDF without page breaks.

UPDATE 2013-07-15: See Quick link to download nbconvert for updated download location.

The official iPython Notebook documentation states that to convert a notebook to PDF, you should use your browser's "Print to PDF" capability. The problem is that that chops charts and graphs in half due to PDF pagination (unless you are able to configure a custom PDF paper size e.g. 60 inches long).

A command-line utility, nbconvert, which will eventually be merged into IPython but is not yet, nicely converts Notebooks to PDF. It even includes nice instructions on installing on a Mac, but the instructions are only 98% complete. Below are the missing steps:

Download the nbconvert package as a Zip and unzip to your home directory.
Follow the steps from https://github.com/ipython/nbconvert/blob/173bb08dd86d02a7485801969c94d4816913cd09/README.rst, specifically:
1. pip install jinja2
2. pip install markdown
3. curl http://docutils.svn.sourceforge.net/viewvc/docutils/trunk/docutils/?view=tar > docutils.tgz
4. pip install -U docutils.tgz
5. pip install pygments
6. sudo easy_install -U sphinx
7. Install PanDoc via the installer http://code.google.com/p/pandoc/downloads/list
8. Install MacTex via the .pkg http://www.tug.org/mactex/

Execute to convert your .ipynb to .pdf:

export PYTHONPATH=~/nbconvert-master/nbconvert/utils
python ~/nbconvert-master/nbconvert1/nbconvert.py --format=pdf MyNotebook.ipynb