Apache Avro is a compact, schema-based binary file format (with optional block compression) designed for Hadoop. As of Hive 0.9.1 and 0.10, Hive can read Avro files directly. Cloudera also back-ported Avro compatibility into its distribution of Hive 0.9.0.
Avro is highly structured, supporting nested records and arrays, and that makes it a good match for Hive, whose HiveQL syntax adopted SQL:1999-style records/structures and arrays.
Avro is a great space-saving alternative to JSON, especially since it's not possible for Apache Pig to read gz-compressed JSON.
This is a collection of tips on using Avro -- most of them can be found here and there on the web, but here they all are in one place.
Schema tips
Below is an example schema that shows nested structures, nullable fields, optional sub-structures, arrays of structures, and arrays of strings.
{"name": "event", "type": "record", "fields": [
{"name": "logLevel", "type": ["null", "int"], "default": null},
{"name": "details", "type":
{"type": "record", "name": "detailstype", "fields": [
{"name": "eventname", "type": ["null", "string"], "default": null},
]}
},
{"name": "metadata", "type": ["null",
{"type": "record", "name": "metadatatype", "fields": [
{"name": "referrer", "type": "string"},
]}]
},
{"name": "messages", "type": ["null",
{"type": "array", "name": "messagestype", "items":
{"name": "message", "type": "string"}
}],
"default": null
}
]}
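As a quick sanity check, a schema like the one above can be parsed and inspected with the Avro Java API. This is a sketch assuming an avro-1.7.x jar on the classpath; the class name is illustrative:

```java
import org.apache.avro.Schema;

public class SchemaCheck {
    // Parse a trimmed version of the example schema and return the
    // type of the nullable "logLevel" field (a UNION of null and int).
    static Schema.Type logLevelType() {
        String json =
              "{\"name\": \"event\", \"type\": \"record\", \"fields\": ["
            + "  {\"name\": \"logLevel\", \"type\": [\"null\", \"int\"],"
            + "   \"default\": null}"
            + "]}";
        Schema schema = new Schema.Parser().parse(json);
        return schema.getField("logLevel").schema().getType();
    }

    public static void main(String[] args) {
        // Prints UNION, confirming the nullable declaration took effect.
        System.out.println(logLevelType());
    }
}
```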
- Declaring a 1-to-1 sub-record requires the verbose style above, where the value of "type" is itself a full schema object whose own "type" is "record".
- In order for a field to be nullable, its "type" must be declared as a UNION that includes "null", using the JSON array syntax shown above.
- In order to avoid having to explicitly set a field when writing out an Avro record, it must be declared with "default": null.
- Combining the above two points: to declare a field both nullable and defaulted to null, the "null" must appear first in the UNION, because it is the first type in a UNION list that determines the data type of the default, and "null" is an invalid default for all data types (e.g. "string", "int", etc.) with the exception of the "null" data type.
- The above applies to sub-records as well. In the example above, "details" is a required sub-record (with an optional field "eventname"), whereas "metadata" is an optional sub-record (with, if the sub-record is present, a required field "referrer").
- When writing out sub-records using the "generic" API rather than the "specific" (code-generated) API, the Java code to instantiate an empty sub-record is:

    GenericData.Record childdatum =
        new GenericData.Record(schema.getField(fieldname).schema());

However, if the sub-record is optional (nullable), then to dig past the UNION, the Java code becomes:

    GenericData.Record childdatum =
        new GenericData.Record(
            schema.getField(fieldname).schema().getTypes().get(1));

In both cases, the childdatum is populated and added to the parentdatum via:

    childdatum.put("aChildStringField", "A string value");
    parentdatum.put(fieldname, childdatum);
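Putting the snippets above together, here is a minimal end-to-end sketch using the generic API that writes one record and reads it back. It assumes an avro-1.7.x jar on the classpath; the class name and the trimmed, inlined copy of the example schema are illustrative:

```java
import java.io.File;
import org.apache.avro.Schema;
import org.apache.avro.file.DataFileReader;
import org.apache.avro.file.DataFileWriter;
import org.apache.avro.generic.GenericData;
import org.apache.avro.generic.GenericDatumReader;
import org.apache.avro.generic.GenericDatumWriter;
import org.apache.avro.generic.GenericRecord;

public class WriteEvent {
    // The example schema, trimmed to the two sub-record fields.
    static final String SCHEMA_JSON =
          "{\"name\":\"event\",\"type\":\"record\",\"fields\":["
        + "{\"name\":\"details\",\"type\":"
        + "{\"type\":\"record\",\"name\":\"detailstype\",\"fields\":["
        + "{\"name\":\"eventname\",\"type\":[\"null\",\"string\"],"
        + "\"default\":null}]}},"
        + "{\"name\":\"metadata\",\"type\":[\"null\","
        + "{\"type\":\"record\",\"name\":\"metadatatype\",\"fields\":["
        + "{\"name\":\"referrer\",\"type\":\"string\"}]}]}]}";

    // Write one record to a temp file, read it back, and return the
    // referrer stored in the optional "metadata" sub-record.
    static String roundTrip() throws Exception {
        Schema schema = new Schema.Parser().parse(SCHEMA_JSON);
        GenericData.Record parentdatum = new GenericData.Record(schema);

        // Required sub-record: no UNION to dig past.
        GenericData.Record details =
            new GenericData.Record(schema.getField("details").schema());
        details.put("eventname", "pageview");
        parentdatum.put("details", details);

        // Optional sub-record: index 1 of the ["null", record] UNION.
        GenericData.Record metadata = new GenericData.Record(
            schema.getField("metadata").schema().getTypes().get(1));
        metadata.put("referrer", "http://example.com");
        parentdatum.put("metadata", metadata);

        File file = File.createTempFile("event", ".avro");
        DataFileWriter<GenericRecord> writer =
            new DataFileWriter<GenericRecord>(
                new GenericDatumWriter<GenericRecord>(schema));
        writer.create(schema, file);
        writer.append(parentdatum);
        writer.close();

        DataFileReader<GenericRecord> reader =
            new DataFileReader<GenericRecord>(
                file, new GenericDatumReader<GenericRecord>());
        GenericRecord readBack = reader.next();
        reader.close();
        // Strings come back as org.apache.avro.util.Utf8, hence toString().
        GenericRecord meta = (GenericRecord) readBack.get("metadata");
        return meta.get("referrer").toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(roundTrip());  // http://example.com
    }
}
```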
- Appending to existing Avro files in HDFS is not supported, even though appending to existing Avro files on the local filesystem and appending to existing non-Avro HDFS files are both supported. The open one-year-old unassigned JIRA ticket for such a feature is AVRO-1035.
- To view the contents of a .avro file as JSON:
java -jar avro-tools-1.7.3.jar tojson mydata.avro
Hive using Avro
Example table creation:
CREATE EXTERNAL TABLE mytable
ROW FORMAT SERDE
    'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT
    'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
    'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
TBLPROPERTIES (
    'avro.schema.url'='hdfs:///user/cloudera/mytable.avsc');
[UPDATE 2013-07-18: The above Hive table creation TBLPROPERTIES no longer works with Hive 0.11. Instead one must use SERDEPROPERTIES.]
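For reference, moving the schema URL into WITH SERDEPROPERTIES makes the Hive 0.11 table creation look like this (a sketch reusing the same illustrative table name and schema path as above):

CREATE EXTERNAL TABLE mytable
ROW FORMAT SERDE
    'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
WITH SERDEPROPERTIES (
    'avro.schema.url'='hdfs:///user/cloudera/mytable.avsc')
STORED AS INPUTFORMAT
    'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT
    'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat';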