Tuesday, April 30, 2013

Simulating theta joins in Hive

Hive famously only supports "equijoins" (equality of columns between two tables) and not "theta joins" (inequalities). Starting with Hive 0.10, a theta join can be simulated with a cross join. Instead of
-- Does not work!
SELECT a.x
FROM a
JOIN b
ON a.y <= b.y
One can do
-- Hive 0.10 and later
SELECT a.x
FROM a
CROSS JOIN b
WHERE a.y <= b.y
With CROSS JOIN being new to Hive 0.10, for earlier versions of Hive, the workaround is to create additional dummy columns in tables a and b called "one", populated all with the number 1. Then it becomes:
-- Hive 0.9 and earlier
SELECT a.x
FROM a
JOIN b
ON a.one=b.one
WHERE a.y <= b.y
Now, a cross join of course requires a lot of space, the product of the table sizes to be exact. It works fine if that product is in the millions, but if it's in the billions or more, YMMV.

Wednesday, April 24, 2013

Presentation on Spark

Last night I gave a presentation on Spark, the exciting replacement for Hadoop's Map/Reduce that's come out of Berkeley and not tied to any particular distribution of Hadoop. Slides and video are below (video is cued to start at my part at 12:25, and it goes for about 35 minutes).