Tuesday, February 19, 2013

Hive STRUCT tips

Hive supports SQL1999-style STRUCTs (and ARRAYs), whether the tables are native Hive tables, Avro-backed tables, or tables backed by other SerDes.


Tip 1: INSERTing requires named_struct

While querying is straightforward using the dot notation familiar from Java, etc., INSERTing requires use of the peculiar named_struct:
echo 1,2 >twovalues.csv
hive
CREATE TABLE tc (x INT, y INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
LOAD DATA LOCAL INPATH 'twovalues.csv' INTO TABLE tc;
CREATE TABLE oc (z STRUCT<a: int, b: int>);
INSERT INTO TABLE oc SELECT named_struct('a', x, 'b', y) FROM tc;
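Reading the STRUCT back out uses the dot notation mentioned above; a quick sketch against the same oc table:

```sql
-- Dot notation selects individual STRUCT fields
SELECT z.a, z.b FROM oc;

-- The whole STRUCT can also be selected as a single column
SELECT z FROM oc;
```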

Tip 2: NULLs

HiveQL DDL has no option in its CREATE TABLE syntax to specify that a column be NULLable. That's because in the Hadoop/HDFS/Big Data philosophy, data is in general allowed to be missing, and that translates nicely to the SQL concept of NULL in Hive/HiveQL.

However, that doesn't extend to STRUCTs in HiveQL. Apparently neither columns within STRUCTs nor the STRUCTs themselves can be NULL.

Thus there is this odd restriction: while all regular Hive columns must be NULLable, all STRUCT Hive columns must be non-NULLable. That's quite a polarization of behavior for a DDL that doesn't even have a keyword "NULL".

The following:
INSERT INTO TABLE oc SELECT null FROM tc;
produces the error
FAILED: SemanticException [Error 10044]: Line 1:18 Cannot insert into target
table because column number/types are different 'oc': Cannot convert column 0
from void to struct<a:int,b:int>.
And the following:
INSERT INTO TABLE oc SELECT named_struct('a', null, 'b', null) FROM tc;
produces the error:
FAILED: SemanticException [Error 10044]: Line 1:18 Cannot insert into target
table because column number/types are different 'oc': Cannot convert column 0
from struct<a:void,b:void> to struct<a:int,b:int>.

Thursday, February 7, 2013

Avro can append in HDFS after all

As an update to the previous post, it turns out Avro can append to an HDFS file after all. The capability is so obscure that even the experts on the Apache Avro mailing list weren't sure of the succinct way of accomplishing it. Sample code is available on GitHub.

But the demand for this capability will only grow, now that Hive supports Avro directly as of the Hive 0.10 release in January.

Tuesday, February 5, 2013

Avro Tips

Apache Avro is a binary compressed fixed-schema file format meant for Hadoop. As of Hive 0.9.1 and 0.10, Hive can read Avro files directly. Cloudera also back-ported Avro compatibility for its Cloudera Hive 0.9.0.

Avro is highly structured, supporting nested records and arrays, and that makes it a good match for Hive, whose HiveQL syntax adopted SQL1999-style records/structures and arrays.

Avro is a great space-saving alternative to JSON, especially since it's not possible for Apache Pig to read gz-compressed JSON.

This is a collection of tips on using Avro -- most of them can be found here and there on the web, but here they all are in one place.

Schema tips

Below is an example schema that shows nested structures, nullable fields, optional sub-structures, arrays of structures, and arrays of strings.
{"name": "event", "type": "record", "fields": [
  {"name": "logLevel", "type": ["null", "int"], "default": null},
  {"name": "details", "type":
    {"type": "record", "name": "detailstype", "fields": [
      {"name": "eventname", "type": ["null", "string"], "default": null}
    ]}
  },
  {"name": "metadata", "type": ["null",
    {"type": "record", "name": "metadatatype", "fields": [
      {"name": "referrer", "type": "string"}
    ]}],
   "default": null
  },
  {"name": "messages", "type": ["null",
    {"type": "array", "items":
      {"type": "record", "name": "messagetype", "fields": [
        {"name": "functionname", "type": "string"},
        {"name": "tags", "type": {"type": "array", "items": "string"}}
      ]}
    }],
   "default": null
  }
]}
  • Declaring a 1-to-1 sub-record requires the verbose style above, where the "type" is declared again to be a "type", which is of "record".
  • In order for a field to be "nullable", it must be declared as a UNION with type "null" as shown above for "type" using the JSON array syntax.
  • In order to avoid having to explicitly set a field when writing out an Avro record, it must be declared with "default": null.
  • Combining the above two points: in order to declare a field nullable with a default of null, the "null" in the UNION must appear first. It is the first type in a UNION list that determines the data type of the default, and "null" is an invalid default for all data types (e.g. "string", "int", etc.) except the "null" data type itself.
  • The above applies to sub-records as well. In the example above, "details" is a required sub-record (with an optional field "eventname"), whereas "metadata" is an optional sub-record (with, if the sub-record is present, a required field "referrer").
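To make the ordering rule concrete, here is the same nullable field declared both ways (the field names are illustrative):

```json
{"name": "ok",  "type": ["null", "string"], "default": null}
{"name": "bad", "type": ["string", "null"], "default": null}
```

The first is accepted. The second is invalid per the Avro specification, because the default for a union is interpreted against the first branch, "string", and null is not a valid string default.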
  • When writing out sub-records using the "generic" schema rather than the "specific" schema, the Java code to instantiate an empty sub-record is
    GenericData.Record childdatum =
    new GenericData.Record(schema.getField(fieldname).schema());
    However, if the sub-record is optional (nullable), then to dig past the UNION, the Java code becomes
    GenericData.Record childdatum =
    new GenericData.Record(schema.getField(fieldname).schema().getTypes().get(1));
    In both cases, the childdatum is populated and added to the parentdatum via
    childdatum.put("aChildStringField", "A string value");
    parentdatum.put(fieldname, childdatum);
  • Appending to existing Avro files in HDFS is not supported, even though appending to existing Avro files on the local filesystem and appending to existing non-Avro HDFS files are both supported. The open one-year-old unassigned JIRA ticket for such a feature is AVRO-1035.
  • To view a .avro file:
    java -jar avro-tools-1.7.3.jar tojson mydata.avro

Hive using Avro

Example table creation:
CREATE EXTERNAL TABLE event
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES ('avro.schema.url'='hdfs:///...');
[UPDATE 2013-07-18: The above Hive table creation TBLPROPERTIES no longer works with Hive 0.11. Instead one must use SERDEPROPERTIES.]
  • Although the Apache documentation gives examples of 'avro.schema.url'='file:///home/...', that actually doesn't work for any non-trivial Hive query, because map/reduce tasks don't have access to the local filesystem.
  • It's not required to specify LOCATION for where the external Hive data will be stored. If it is not specified, Hive creates the directory /user/hive/warehouse/<tablename> in HDFS.
  • It is possible to arbitrarily dump a bunch of Avro files into a Hive table directory, and Hive will automatically read them all when executing a query.
  • Not specific to Avro, but it is possible to specify one or more levels of PARTITION for a Hive table. Each level of PARTITION corresponds to another level of directory underneath /user/hive/warehouse/ (e.g. for a two-level PARTITION one might have a directory /user/hive/warehouse/mytable/datestring=2013-02-01/useridfirstchar=2). Using the point above, one can dump multiple Avro files into one such leaf directory of a PARTITIONed table.
    However, one cannot merely create partition directories in /user/hive/warehouse/<tablename>, because the partition names also need to be added to the Hive metastore, which can be done in HiveQL via either the Hive CLI or JDBC:
    ALTER TABLE mytable ADD IF NOT EXISTS PARTITION (ds='2013-02-01')
    Also, note that contrary to examples in Apache documentation, "date" is not a valid partition name as it is a reserved word. "ds" (for "date string") is a popular alternative.
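As a sketch, a two-level partitioned table like the example above might be declared as follows (the table name, column names, and values are illustrative):

```sql
-- Partition columns are declared separately from data columns
CREATE TABLE mytable (userid STRING, action STRING)
PARTITIONED BY (datestring STRING, useridfirstchar STRING);

-- Each ADD PARTITION registers the partition in the metastore and
-- creates the corresponding subdirectory under the table directory
ALTER TABLE mytable ADD IF NOT EXISTS
  PARTITION (datestring='2013-02-01', useridfirstchar='2');
```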
  • Hive supports SQL1999-style nested structures and nested arrays out of the box, and so is a good match for highly-structured Avro schemas. E.g., below is a valid Hive query against the example schema and CREATE TABLE from above:
    SELECT messages[0].functionname FROM event
    Hive also has functions for array handling such as "explode", which returns array elements from a single column as multiple result set rows. https://cwiki.apache.org/Hive/languagemanual-udf.html#LanguageManualUDF-explode
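For example, to flatten all array elements rather than index a single one, explode is used with LATERAL VIEW (assuming the example event table and schema above):

```sql
-- Produces one result row per element of the messages array
SELECT msg.functionname
FROM event LATERAL VIEW explode(messages) msgtable AS msg;
```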
  • Not specific to Avro, but SHOW CREATE TABLE is not supported in Hive until 0.10. To do so in Hive 0.9, download HiveShowCreateTable.jar from https://issues.apache.org/jira/browse/HIVE-967 and then
    hive --service jar HiveShowCreateTable.jar com.media6.hive2rdbms.job.ShowCreateTable -D db.name=default -D table.name=mytable 2>/dev/null