Thursday, August 29, 2013

Selecting Pandas data with list comprehension

Given a DataFrame, the Pandas ".ix" indexer allows you to "query" the DataFrame with a condition that resembles a SQL WHERE clause:

from pandas import DataFrame

df = DataFrame({'rank': [70, 21, 1000000],
                'domain': ['www.cnn.com', 'www.msn.com', 's.down.bad']})
df
        domain     rank
0  www.cnn.com       70
1  www.msn.com       21
2   s.down.bad  1000000
df.ix[df['rank']<100,:]
        domain  rank
0  www.cnn.com    70
1  www.msn.com    21

But even though "df['rank']<100" on its surface resembles a SQL WHERE clause, recall that .ix, like the R data frame indexing after which it was patterned, isn't really taking a WHERE clause as its first argument: it's taking an array of booleans. The expression df['rank']<100 returns an array of booleans because NumPy broadcasts the scalar 100 across the column and compares element-wise.
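To see this, evaluate the condition by itself; with the example data above it yields a boolean Series along these lines:

df['rank'] < 100
0     True
1     True
2    False
Name: rank, dtype: bool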

However, NumPy doesn't support every possible operator and function. For example, NumPy does not have string functions, since it is, after all, a numeric library. To compose DataFrame selections based on string conditions, we can use a Python list comprehension to generate a list of booleans (which .ix will accept as its first indexing argument). A Pandas equivalent of the SQL

SELECT *
FROM   df
WHERE  domain LIKE '%s%'

might be

df.ix[['s' in x for x in df['domain']],:]
        domain     rank
1  www.msn.com       21
2   s.down.bad  1000000

But as of Pandas 0.8.1 (released in 2012), Pandas supports vectorized operations on strings through the .str accessor, much as NumPy vectorizes numeric operations. Using .str, the list comprehension above can be eliminated and replaced with the simpler alternative below; .str.contains returns the array of booleans that .ix needs.

df.ix[df['domain'].str.contains('s'),:]
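Evaluated on its own, the mask produced by .str.contains should look roughly like this for the example data:

df['domain'].str.contains('s')
0    False
1     True
2     True
Name: domain, dtype: bool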

So Python list comprehension is not needed for simple numeric conditions (thanks to NumPy's broadcasting) or simple string conditions (thanks to Pandas string vectorization). More complex conditions, though, may still require a list comprehension. As an example, we can select from the above DataFrame only those rows whose websites are currently up:

import urllib2

def isup(domain):
    # Consider a site "up" if an HTTP GET returns status code 200.
    try:
        con = urllib2.urlopen(urllib2.Request('http://' + domain))
        return con.getcode() == 200
    except Exception:
        # Any failure (DNS error, timeout, HTTP error) counts as down.
        return False

df.ix[[isup(x) for x in df['domain']],:]
        domain  rank
0  www.cnn.com    70
1  www.msn.com    21
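For reference, the intermediate list that the comprehension hands to .ix would be something like the following, assuming (as the output above suggests) that the first two domains answer with HTTP 200 and the third does not:

[isup(x) for x in df['domain']]
[True, True, False]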
