Tuesday, February 19, 2013

Hive STRUCT tips

Hive supports SQL1999-style STRUCTs (and ARRAYs), whether they be native Hive tables, Avro-backed tables, or other SerDe.

Tip 1: INSERT

While querying is straightforward using the dot notation familiar from Java, etc.
SELECT z.a FROM oc;
INSERTing requires use of the peculiar named_struct:
echo 1,2 >twovalues.csv
hive
CREATE TABLE tc (x INT, y INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
LOAD DATA LOCAL INPATH 'twovalues.csv' INTO TABLE tc;
CREATE TABLE oc (z STRUCT<a: int, b: int>);
INSERT INTO TABLE oc SELECT named_struct('a', x, 'b', y) FROM tc;

Tip 2: NULLs

HiveQL DDL has no option in its CREATE TABLE syntax to specify that a column be NULLable. That's because in the Hadoop/HDFS/Big Data philosophy, data is in general allowed to be missing, and that translates nicely to the SQL concept of NULL in Hive/HiveQL.

However, that doesn't extend to STRUCTs in HiveQL. Apparently neither columns within STRUCTs nor the STRUCTs themselves can be NULLs.

Thus there is this odd restriction: while all regular Hive columns must be NULLable, all STRUCT Hive columns must be non-NULLable. That's quite a polarization of behavior for a DDL that doesn't even have a keyword "NULL".

The following:
INSERT INTO TABLE oc SELECT null FROM tc;
produces the error
FAILED: SemanticException [Error 10044]: Line 1:18 Cannot insert into target
table because column number/types are different 'oc': Cannot convert column 0
from void to struct<a:int,b:int>.
And the following:
INSERT INTO TABLE oc SELECT named_struct('a', null, 'b', null) FROM tc;
produces the error:
FAILED: SemanticException [Error 10044]: Line 1:18 Cannot insert into target
table because column number/types are different 'oc': Cannot convert column 0
from struct<a:void,b:void> to struct<a:int,b:int>.

1 comment:

intransit said...

You can actually do this, it's just a major pain in the back side:

INSERT INTO TABLE oc SELECT named_struct('a', cast(null as int), 'b', cast(null as int)) FROM tc;