Thursday, August 29, 2013

Selecting Pandas data with list comprehension

Given a DataFrame, the Pandas "ix" indexer allows you to "query" the DataFrame with a condition that resembles a SQL WHERE clause:

from pandas import DataFrame

df=DataFrame({'rank':[70,21,1000000],
              'domain':['www.cnn.com','www.msn.com','s.down.bad']})
df
        domain     rank
0  www.cnn.com       70
1  www.msn.com       21
2   s.down.bad  1000000
df.ix[df['rank']<100,:]
        domain  rank
0  www.cnn.com    70
1  www.msn.com    21

But even though "df['rank']<100" resembles a SQL WHERE clause on the surface, recall that .ix, like the R data frame after which it was patterned, isn't really taking a WHERE clause as its first parameter. It's taking an array of booleans. The expression df['rank']<100 returns one boolean per row because the comparison against the scalar 100 is applied element-wise under NumPy's broadcasting rules.
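To make the "array of booleans" point concrete, here is a minimal sketch. It assumes a recent Pandas, where .ix has been removed, so it uses .loc instead (an assumption of this example, not what the post originally ran); the names mask and selected are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'rank': [70, 21, 1000000],
                   'domain': ['www.cnn.com', 'www.msn.com', 's.down.bad']})

# The comparison is evaluated element-wise and yields one boolean per row.
mask = df['rank'] < 100

# The boolean Series is what actually selects the rows; .loc stands in
# for the .ix call used in the post.
selected = df.loc[mask, :]
```

Printing mask shows the intermediate booleans, which is often the quickest way to debug a selection that returns unexpected rows.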

However, NumPy doesn't support every possible operator and function. For example, NumPy offers little in the way of string functions, since it is, after all, a numeric library. For composing Pandas DataFrame selections based on string functions, we can use a Python list comprehension to generate a list of booleans (which .ix will accept as its first indexing parameter). A Pandas equivalent of the SQL

SELECT *
FROM   df
WHERE  domain LIKE '%s%'

might be

df.ix[['s' in x for x in df['domain']],:]
        domain     rank
1  www.msn.com       21
2   s.down.bad  1000000

But as of Pandas 0.8.1 (released in 2012), Pandas supports vectorized operations on strings through the .str accessor, much as NumPy vectorizes numeric operations. Using .str, the above list comprehension can be replaced with the simpler alternative below. The .str.contains call returns the array of booleans that .ix needs.

df.ix[df['domain'].str.contains('s'),:]
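As a quick check that the two approaches agree, the sketch below builds both masks and compares them. It assumes a recent Pandas, so .loc replaces .ix, and it passes regex=False so that the pattern is matched literally; by default .str.contains treats its argument as a regular expression, which matters for patterns such as '.'. The variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'rank': [70, 21, 1000000],
                   'domain': ['www.cnn.com', 'www.msn.com', 's.down.bad']})

# Mask built with a plain Python list comprehension.
mask_list = ['s' in x for x in df['domain']]

# Mask built with the vectorized string accessor, matching literally.
mask_str = df['domain'].str.contains('s', regex=False)

# Both masks select the same rows.
by_list = df.loc[mask_list, :]
by_str = df.loc[mask_str, :]
```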

So a Python list comprehension is not needed for simple numeric conditions (due to NumPy's broadcasting) or simple string conditions (due to Pandas string vectorization). More complex conditions, though, may still require one. As an example, we can select from the above data frame only the rows whose websites are currently up:

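import urllib2

def isup(domain):
    # True only if an HTTP GET of the site's root returns status 200
    try:
        con = urllib2.urlopen(urllib2.Request('http://'+domain))
        return con.getcode() == 200
    except Exception:
        # DNS failure, refused connection, HTTP error, etc. all count as "down"
        return False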

df.ix[[isup(x) for x in df['domain']],:]
        domain  rank
0  www.cnn.com    70
1  www.msn.com    21
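For readers on Python 3, where urllib2 no longer exists, the same idea might look like the sketch below. It is an adaptation, not the code the post ran: urllib.request replaces urllib2, and the opener and timeout parameters are additions of this example so that the check can be exercised without touching the network.

```python
import urllib.request

def isup(domain, opener=urllib.request.urlopen, timeout=5):
    """Return True if http://<domain> answers with HTTP status 200.

    `opener` and `timeout` are hypothetical additions for this sketch;
    the default opener performs a real HTTP request.
    """
    try:
        con = opener('http://' + domain, timeout=timeout)
        try:
            return con.getcode() == 200
        finally:
            con.close()
    except Exception:
        # DNS failure, refused connection, HTTP error, timeout: treat as down.
        return False

def up_mask(domains, check=isup):
    """Build the list of booleans used to select rows, one per domain."""
    return [check(d) for d in domains]
```

With a DataFrame df as above, df.loc[up_mask(df['domain']), :] would then select the rows whose sites respond, at the cost of one HTTP request per row.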

Wednesday, August 21, 2013

Unsquish Pandas/Matplotlib bar chart x labels

For a line plot, Matplotlib intelligently chooses x axis ticks and labels. But for bar charts, it blindly tries to print one for each bar, regardless of how many bars there are or how small they are. This can result in labels overprinting each other.

For an example and the corresponding solution, see my IPython Notebook "Solving x axis overprinting on Pandas/Matplotlib bar charts" on GitHub.

Before:

After:

Friday, August 9, 2013

Added PNG support to ipyD3: better nbconvert compatibility

Following the suggestion of the original author of ipyD3, I added PNG capability to my fork of his ipyD3. Selecting "png" instead of "html" now generates output that appears nearly identical. Example invocation:

d3.render(mode=('show','png'))

The disadvantage of PNG over the former HTML rendering is that it precludes any possibility of mouse interaction and animation. The advantage is that nbconvert will convert multiple PNG D3 renderings in the same Notebook, whereas with the HTML renderings nbconvert seemed to give up after the first one. This is using the last version of nbconvert before it was merged into the IPython project. I have not tried the beta versions of IPython 1.0; I'm waiting for the Anaconda release.

Saturday, August 3, 2013

BootIt Bare Metal for testing semi-embedded systems

I use the term "semi-embedded system" to refer to a PC loaded with hardware, such as data acquisition A/D devices, motion controllers, digital I/O lines, etc. When I create installers for semi-embedded software I write, I like to make them as turnkey as possible. That means ensuring, through testing, that the installs work on fresh (and other not-so-fresh, but controlled and known) copies of Windows. BootIt Bare Metal (BIBM) by TeraByte Unlimited is indispensable for testing installs that include device drivers and other software that is not so easily uninstalled (sometimes because they bundle very specialized software as sub-installs, which often lack clean uninstallers).

BIBM is a multi-booter like Grub and the built-in Windows boot menu, but so much more. It's also a partition editor like PartitionMagic or GParted, and a backup facility like Ghost. Below is a screenshot of my boot menu.

As you can see, the three operating systems I can boot from are: Windows 7 main, Ubuntu, and Windows 7 test. Below is a list of the partitions I've configured with BIBM.

But wait, how can there be so many partitions? Aren't you limited to just four? BIBM supports its own type of extended partitions (information about which is stored in that special BootIt EMBRM partition) and swaps them in and out of the regular max-four-partition MBR on the fly. That not only makes it possible to multi-boot a large number of different operating systems (e.g. XP, W7, W8, Ubuntu 12, Ubuntu 13, etc.), it also enables the testing of installs. As you can see from the list of partitions, I keep clean copies of Windows 7 at the end of the hard drive: one that is just a super-fresh install from the DVD, and another that has had Windows Updates run on it.
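The four-partition limit comes from the layout of the classic MBR sector itself: the partition table sits at byte offset 446 and has room for exactly four 16-byte entries, followed by the 0x55 0xAA signature. The sketch below reads those four slots from a 512-byte sector. It illustrates only the standard MBR layout and says nothing about how BIBM stores its own extended partition data; the function name and the returned dictionary keys are made up for this example.

```python
import struct

MBR_SIZE = 512
TABLE_OFFSET = 446   # the partition table starts here
ENTRY_SIZE = 16      # each entry is 16 bytes
NUM_ENTRIES = 4      # room for exactly four entries

def parse_mbr_partitions(sector):
    """Return the four primary partition slots of a classic MBR sector."""
    if len(sector) != MBR_SIZE or sector[510:512] != b'\x55\xaa':
        raise ValueError('not a valid MBR sector')
    slots = []
    for i in range(NUM_ENTRIES):
        start = TABLE_OFFSET + i * ENTRY_SIZE
        raw = sector[start:start + ENTRY_SIZE]
        # status (1), CHS start (3, skipped), type (1), CHS end (3, skipped),
        # first LBA sector (4), sector count (4); all little-endian.
        status, ptype, lba_start, num_sectors = struct.unpack('<B3xB3xII', raw)
        slots.append({'bootable': status == 0x80,
                      'type': ptype,          # 0 means the slot is unused
                      'lba_start': lba_start,
                      'num_sectors': num_sectors})
    return slots
```

On Linux, the first sector of a disk can be read (with sufficient privileges) by opening the device file in binary mode and reading 512 bytes, then passed to the function above.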

To test an install, I just copy and paste one of those saved partitions to the remaining blank area of the hard drive (denoted by BIBM with the line of hyphens "---"). I can do this iteratively to develop and debug an install that includes driver installs as sub-installs, without fear that I get "only one shot" to test it on a fresh computer.

Even if an install doesn't involve drivers, this technique is useful from a licensing perspective. Windows 7 requires separate licenses for virtual machines, because Microsoft explicitly considers virtual machines to be separate machines. But I have not found where Microsoft forbids making backup copies on extra hard-drive partitions on a single machine. I am not a lawyer, so do not consider this to be legal advice that it is permissible to do so with a single license. But for VMs it is well-known you must have separate licenses.

Windows 8 and UEFI

Sadly, this technique is threatened with the advent of UEFI SecureBoot and Windows 8. BIBM still works, but the BIOS and OS must support a "legacy mode". It is reasonable to expect that "legacy mode" will become more rare in the future. TeraByte Unlimited is silent about whether it will support or bypass UEFI in the future.

Tips on installing Linux under BIBM

Installing Linux under BIBM requires a specific set of steps. The most important thing to remember is to install Grub onto the Linux partition (e.g. /dev/sda2) rather than the MBR (e.g. /dev/sda).

Additionally, the automatic install of Grub can sometimes fail. In that case, it is necessary to follow TeraByte Unlimited's instructions on manually installing Grub. If that doesn't work, it may be necessary to use a Linux live DVD to boot into the Linux installation on your hard drive, and then reinstall the Grub package:

sudo apt-get install --reinstall grub-pc