Technical Tidbit of the Day: Selecting Pandas data with list comprehension

Given a Dataframe, the Pandas "ix" method allows you to "query" the Dataframe with a condition that resembles a SQL WHERE clause:

df=DataFrame({'rank':[70,21,1000000],
              'domain':['www.cnn.com','www.msn.com','s.down.bad']})
df

	domain	rank
0	www.cnn.com	70
1	www.msn.com	21
2	s.down.bad	1000000

df.ix[df['rank']<100,:]

	domain	rank
0	www.cnn.com	70
1	www.msn.com	21

But even though "df['rank']<100" on its surface resembles a SQL WHERE clause, recall that the .ix method, like the R data frame after which it was patterned, isn't really taking a WHERE clause as its first parameter. It's taking an array of booleans. The df['rank']<100 is returning an array of booleans due to NumPy's broadcasting rules.

However, NumPy doesn't support every possible operator and function. For example, Numpy does not have string functions, since it is, after all, a numeric library. For composing Pandas dataframe selections based on string functions, we can use Python "list comprehension" to generate a list of booleans (which the ix method will accept as its first indexing parameter). A Pandas equivalent of SQL

SELECT *
FROM   df
WHERE  domain LIKE '%s%'

might be

df.ix[['s' in x for x in df['domain']],:]

	domain	rank
1	www.msn.com	21
2	s.down.bad	1000000

But as of Pandas 0.8.1 (released in 2012), Pandas supports operations on vectors of strings, similar to NumPy via str. Using str, the above Python list comprehension can be eliminated and replaced with the more simple alternative below. The .str.contains returns the array of booleans that .ix needs.

df.ix[df['domain'].str.contains('s'),:]

So Python list comprehension is not needed for simple numeric conditions (due to NumPy's broadcasting) or simple string conditions (due to Pandas string vectorization). More complex conditions, though, may still require Python list comprehension. As an example, we can query rows from the above data frame where only those websites are currently up:

import urllib2
def isup(domain):
    try:
        con = urllib2.urlopen(urllib2.Request('http://'+domain))
        return con.getcode() == 200
    except:
        return False

df.ix[[isup(x) for x in df['domain']],:]

	domain	rank
0	www.cnn.com	70
1	www.msn.com	21

Technical Tidbit of the Day

Thursday, August 29, 2013

Selecting Pandas data with list comprehension

No comments: