Given a DataFrame, the Pandas "ix" method allows you to "query" the DataFrame with a condition that resembles a SQL WHERE clause:
from pandas import DataFrame

df = DataFrame({'rank': [70, 21, 1000000],
                'domain': ['www.cnn.com', 'www.msn.com', 's.down.bad']})
df
  | domain      | rank
0 | www.cnn.com | 70
1 | www.msn.com | 21
2 | s.down.bad  | 1000000
df.ix[df['rank']<100,:]
  | domain      | rank
0 | www.cnn.com | 70
1 | www.msn.com | 21
But even though df['rank']<100 on its surface resembles a SQL WHERE clause, recall that the .ix method, like the R data frame indexing after which it was patterned, isn't really taking a WHERE clause as its first parameter: it's taking an array of booleans. The expression df['rank']<100 returns that array of booleans (a boolean Series) thanks to NumPy's broadcasting rules.
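You can see this by evaluating the condition on its own, assuming the df defined above; the result is one True or False per row, which .ix then uses as a mask:

df['rank'] < 100

which returns roughly

0     True
1     True
2    False
Name: rank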
However, NumPy doesn't support every possible operator and function. For example, NumPy does not have string functions, since it is, after all, a numeric library. For composing Pandas DataFrame selections based on string conditions, we can use a Python list comprehension to generate a list of booleans (which the .ix method will accept as its first indexing parameter). A Pandas equivalent of the SQL
SELECT * FROM df WHERE domain LIKE '%s%'
might be
df.ix[['s' in x for x in df['domain']],:]
  | domain      | rank
1 | www.msn.com | 21
2 | s.down.bad  | 1000000
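For clarity, the mask produced by that list comprehension (evaluated against the df above) is just a plain Python list of booleans, which is why .ix accepts it:

['s' in x for x in df['domain']]

evaluates to [False, True, True], one entry per row, matching the two rows returned above.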
But as of Pandas 0.8.1 (released in 2012), Pandas supports vectorized operations on strings, similar to NumPy, via the .str accessor. Using .str, the above list comprehension can be eliminated and replaced with the simpler alternative below; .str.contains returns the array of booleans that .ix needs.
df.ix[df['domain'].str.contains('s'),:]
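Other SQL LIKE patterns map onto the same family of vectorized string methods. For instance, a sketch of WHERE domain LIKE 'www%' using .str.startswith (again against the df above) would be

df.ix[df['domain'].str.startswith('www'), :]

which keeps only the www.cnn.com and www.msn.com rows.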
So a Python list comprehension is not needed for simple numeric conditions (due to NumPy's broadcasting) or simple string conditions (due to Pandas string vectorization). More complex conditions, though, may still require a Python list comprehension. As an example, we can select from the above DataFrame only those rows whose websites are currently up:
import urllib2

def isup(domain):
    # Consider a domain "up" if an HTTP request to it returns status 200.
    try:
        con = urllib2.urlopen(urllib2.Request('http://' + domain))
        return con.getcode() == 200
    except:
        return False

df.ix[[isup(x) for x in df['domain']], :]
  | domain      | rank
0 | www.cnn.com | 70
1 | www.msn.com | 21
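As an aside, the same mask can also be built without a list comprehension by applying a plain Python function element-wise with Series.apply; using the df and isup defined above, the following sketch is equivalent, so the choice between it and the comprehension is mostly stylistic:

df.ix[df['domain'].apply(isup), :]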