Thursday, June 13, 2013

Query Hive from iPython Notebook

iPython Notebook together with pandas comprise a data analysis system that is essentially a clone of R. The advantage over R is that Python code can be more easily converted into production code and executed, for example, on a web server.

One limitation is that the Apache-recommended Python interface to Hive requires installing Hive locally, which can be problematic or inconvenient from, say, a laptop used for analysis. A nice alternative is remotely executing Hive on the Hadoop cluster (or adjacent Linux server with access to the cluster). The alternative does require installing sshpass. At the end of the code below, the query results are stored in a pandas DataFrame with column names and automatic detection of data types.

from pandas import *
from StringIO import StringIO

s = "sshpass -f myfilewithpassword ssh myusername@myhostname \"hive -S -e \\\"" \
"set hive.cli.print.header=true;" \
"SELECT * from mytable;\\\"\""

t = !$s
df = read_csv(StringIO(t.n), sep='\t')

UPDATE 2013-08-01: It's a bit cleaner if one uses Python's triple-quoting mechanism as below. This allows one to copy and paste queries between iPython Notebook and Hive and Hue without having to reformat with quotes and backslashes.

from pandas import *
from StringIO import StringIO

s = """
sshpass -f myfilewithpassword ssh myusername@myhostname \"
hive -S -e \\\"
set hive.cli.print.header=true;
SELECT * from mytable;
\\\"\"
"""

t = !$s
df = read_csv(StringIO(t.n), sep='\t')

7 comments:

Jeffrey W. McLellan said...

This is definitely a useful post, I'm trying to understand how "!$s" works both directly at the ipython prompt and in a scripting environment, but can't find any documentation on it.

Any ideas where to look?

vjparikh said...

this thing doesnt work for me?

when i enter df to see the output - i get an error saying:-

The system cannot find the path specified

Does the sshpass need to be installed on client side or on the server side. If server side does it need to be installed where Hive is located? On client side how do i install sshpass if I am using windows 7

Michael Malak said...

vjparikh:

I don't know of an sshpass for Windows. I did this on a MacBook, and presumably of course it also works on Linux.

You could create a Linux VM (e.g. using VirtualBox) and do all your work from within the VM.

Jeffrey:

Sorry I didn't notice your post earlier. Although not official documentation, this page describes using bang (exclamation mark) to execute system shell commands:

http://www.math.uiuc.edu/~gfrancis/illimath/windows/aszgard_mini/doc/IPython81Manual/node6.html#SECTION00067000000000000000

vjparikh said...

thank you michael for your prompt response!

so that means i need to install sshpass on my local machine. is that right? no installation is required on the hadoop / hive side, correct?

Michael Malak said...

Correct -- nothing to install on the server side

vjparikh said...

thank you!

last question:-

why does the last line have df = read_csv?

why read_csv? the data in hive will be in native sql database correct?

am i missing something?

thank you very much! :)

Michael Malak said...

It's to parse the stdout from running the hive command.

You have to understand: this is merely a trick to get Hive data into iPython Notebook. It's not what I would call a first-class mechanism.