Tuesday, April 30, 2013

Simulating theta joins in Hive

Hive famously only supports "equijoins" (equality of columns between two tables) and not "theta joins" (inequalities). Starting with Hive 0.10, a theta join can be simulated with a cross join. Instead of
-- Does not work!
SELECT a.x
FROM a
JOIN b
ON a.y <= b.y
One can do
-- Hive 0.10 and later
SELECT a.x
FROM a
CROSS JOIN b
WHERE a.y <= b.y
With CROSS JOIN being new to Hive 0.10, for earlier versions of Hive, the workaround is to create additional dummy columns in tables a and b called "one", populated all with the number 1. Then it becomes:
-- Hive 0.9 and earlier
SELECT a.x
FROM a
JOIN b
ON a.one=b.one
WHERE a.y <= b.y
Now, a cross join of course requires a lot of space, the product of the table sizes to be exact. It works fine if that product is in the millions, but if it's in the billions or more, YMMV.

2 comments:

Unknown said...
This comment has been removed by the author.
Unknown said...

I misread your CROSS JOIN statement originally, and didn't note the WHERE instead of another JOIN condition. This works perfectly well; thanks!