Showing posts with label Hive.

Thursday, July 18, 2013

SERDEPROPERTIES required for Avro in Hive 0.11

To specify an Avro-backed Hive table, the Apache Wiki and the Cloudera Avro documentation both prescribe specifying the Avro schema in TBLPROPERTIES. This is no longer supported in Hive 0.11. It is now necessary to use SERDEPROPERTIES:

CREATE EXTERNAL TABLE mytable
  ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  WITH SERDEPROPERTIES (
  'avro.schema.url'='hdfs:///user/cloudera/mytable.avsc')
  STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat';

If TBLPROPERTIES is instead used to specify the location of the Avro schema, Hive 0.11 won't be able to find it, and the following exception will be thrown:

java.io.IOException(org.apache.hadoop.hive.serde2.avro.AvroSerdeException: Neither avro.schema.literal nor avro.schema.url specified, can't determine table schema)

Thursday, June 13, 2013

Query Hive from iPython Notebook

iPython Notebook together with pandas forms a data analysis system that is essentially a clone of R. The advantage over R is that Python code can be more easily converted into production code and executed, for example, on a web server.

One limitation is that the Apache-recommended Python interface to Hive requires installing Hive locally, which can be problematic or inconvenient on, say, a laptop used for analysis. A nice alternative is remotely executing Hive on the Hadoop cluster (or on an adjacent Linux server with access to the cluster). This alternative does require installing sshpass. At the end of the code below, the query results are stored in a pandas DataFrame, with column names and automatic detection of data types.

from pandas import *
from StringIO import StringIO

s = "sshpass -f myfilewithpassword ssh myusername@myhostname \"hive -S -e \\\"" \
"set hive.cli.print.header=true;" \
"SELECT * from mytable;\\\"\""

t = !$s
df = read_csv(StringIO(t.n), sep='\t')

UPDATE 2013-08-01: It's a bit cleaner if one uses Python's triple-quoting mechanism as below. This allows one to copy and paste queries between iPython Notebook and Hive and Hue without having to reformat with quotes and backslashes.

from pandas import *
from StringIO import StringIO

s = """
sshpass -f myfilewithpassword ssh myusername@myhostname \"
hive -S -e \\\"
set hive.cli.print.header=true;
SELECT * from mytable;
\\\"\"
"""

t = !$s
df = read_csv(StringIO(t.n), sep='\t')
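For use outside of iPython's ! shell magic, the same round trip can be sketched with Python's standard subprocess and csv modules. This is an illustrative sketch that has not been tested against a real cluster; the host, password file, and query strings are placeholders:

```python
import csv
import io
import subprocess

def hive_query(host, passfile, query):
    """Run a Hive query on a remote host over ssh and return its raw TSV output.

    Assumes sshpass and ssh are installed locally; host and passfile are
    placeholders, as in the sshpass example above.
    """
    remote = 'hive -S -e "set hive.cli.print.header=true; %s"' % query
    return subprocess.check_output(
        ["sshpass", "-f", passfile, "ssh", host, remote]).decode("utf-8")

def parse_tsv(text):
    """Parse Hive's tab-separated output; the first row is the header."""
    rows = list(csv.reader(io.StringIO(text), delimiter="\t"))
    header, data = rows[0], rows[1:]
    return [dict(zip(header, row)) for row in data]
```

The parsing half is what read_csv does above; pandas additionally infers column data types, which this plain-dict version does not.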

Tuesday, May 28, 2013

New in Hive 0.11: Windowing and analytics functions

It's not getting a lot of press, but Hive 0.11, released May 15, 2013, introduced a ton of new keywords. Hive 0.11 supports SQL analytic functions similar to those that popular RDBMSs introduced a decade ago with the SQL2003 standard, via the following new Hive keywords:

  • LEAD
  • LAG
  • FIRST_VALUE
  • LAST_VALUE
  • OVER
  • WINDOW
  • UNBOUNDED
  • PRECEDING
  • FOLLOWING
  • RANK
  • ROW_NUMBER
  • DENSE_RANK
  • CUME_DIST
  • PERCENT_RANK
  • NTILE

Sadly, the documentation is scanty, to say the least. For now, documentation and tutorials (e.g. sqlatoms.com) on the equivalent keywords in Oracle and SQL Server will have to suffice.

Being able to window obviates some of the need for subqueries (specifically, pre-SQL2003 subqueries that just provided a limited GROUP BY in a separate SELECT column), which of course Hive does not support.
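For intuition about what a few of these keywords compute, here is a plain-Python sketch of LAG, LEAD, and ROW_NUMBER over a PARTITION BY ... ORDER BY window. The function and column names are illustrative only, not part of any Hive API:

```python
from itertools import groupby
from operator import itemgetter

def window(rows, partition_key, order_key):
    """Emulate LAG(col), LEAD(col), and ROW_NUMBER()
    OVER (PARTITION BY partition_key ORDER BY order_key)."""
    out = []
    # Sort by partition then order, as the OVER clause would
    rows = sorted(rows, key=lambda r: (r[partition_key], r[order_key]))
    for _, grp in groupby(rows, key=itemgetter(partition_key)):
        grp = list(grp)
        for i, r in enumerate(grp):
            r = dict(r)
            r["row_number"] = i + 1                                  # ROW_NUMBER()
            r["lag"] = grp[i - 1][order_key] if i > 0 else None      # LAG(order_key)
            r["lead"] = grp[i + 1][order_key] if i + 1 < len(grp) else None  # LEAD(order_key)
            out.append(r)
    return out
```

Note that LAG and LEAD restart at each partition boundary, returning NULL (None here) where no preceding or following row exists within the partition.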

Tuesday, April 30, 2013

Simulating theta joins in Hive

Hive famously only supports "equijoins" (equality of columns between two tables) and not "theta joins" (inequalities). Starting with Hive 0.10, a theta join can be simulated with a cross join. Instead of
-- Does not work!
SELECT a.x
FROM a
JOIN b
ON a.y <= b.y
One can do
-- Hive 0.10 and later
SELECT a.x
FROM a
CROSS JOIN b
WHERE a.y <= b.y
With CROSS JOIN being new in Hive 0.10, the workaround for earlier versions of Hive is to create an additional dummy column in each of tables a and b called "one", populated entirely with the number 1. Then it becomes:
-- Hive 0.9 and earlier
SELECT a.x
FROM a
JOIN b
ON a.one=b.one
WHERE a.y <= b.y
Now, a cross join of course requires a lot of space: the product of the table sizes, to be exact. It works fine if that product is in the millions, but if it's in the billions or more, YMMV.
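The workaround is just a filtered Cartesian product, which can be sketched in plain Python (table contents and column names below are made up for illustration):

```python
from itertools import product

def theta_join(a, b, pred):
    """Simulate SELECT ... FROM a CROSS JOIN b WHERE pred(a_row, b_row).

    Like the Hive workaround, this materializes the full cross product,
    so the cost is len(a) * len(b) pairs before filtering.
    """
    return [(ra, rb) for ra, rb in product(a, b) if pred(ra, rb)]

# The a.y <= b.y condition from the example above
a = [{"x": 1, "y": 10}, {"x": 2, "y": 20}]
b = [{"y": 15}]
result = theta_join(a, b, lambda ra, rb: ra["y"] <= rb["y"])
```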

Monday, March 4, 2013

Fixing Hive after Cloudera 4.20 update

The Cloudera 4.20 update breaks Hive, producing errors such as
org.datanucleus.store.rdbms.exceptions.MissingTableException: Required
table missing : "`SKEWED_STRING_LIST`" in Catalog "" Schema "". DataNucleus
requires this table to perform its persistence operations. Either your
MetaData is incorrect, or you need to enable "datanucleus.autoCreateTables"
It's because the deployment script doesn't call the Hive metastore upgrade script.

To manually upgrade the Hive metastore from the 0.9 schema to 0.10 (assuming you are using MySQL for the Hive metastore as recommended by Cloudera):
cd /usr/lib/hive/scripts/metastore/upgrade/mysql
sudo mysql --user=root
use metastore;
source upgrade-0.9.0-to-0.10.0.mysql.sql

Tuesday, February 19, 2013

Hive STRUCT tips

Hive supports SQL1999-style STRUCTs (and ARRAYs), whether in native Hive tables, Avro-backed tables, or tables using other SerDes.

Tip 1: INSERT

While querying is straightforward using the dot notation familiar from Java, etc.
SELECT z.a FROM oc;
INSERTing requires use of the peculiar named_struct:
echo 1,2 >twovalues.csv
hive
CREATE TABLE tc (x INT, y INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',';
LOAD DATA LOCAL INPATH 'twovalues.csv' INTO TABLE tc;
CREATE TABLE oc (z STRUCT<a: int, b: int>);
INSERT INTO TABLE oc SELECT named_struct('a', x, 'b', y) FROM tc;

Tip 2: NULLs

HiveQL DDL has no option in its CREATE TABLE syntax to specify that a column be NULLable. That's because in the Hadoop/HDFS/Big Data philosophy, data is in general allowed to be missing, and that translates nicely to the SQL concept of NULL in Hive/HiveQL.

However, that doesn't extend to STRUCTs in HiveQL. Apparently neither columns within STRUCTs nor the STRUCT columns themselves can be NULL.

Thus there is this odd restriction: while all regular Hive columns must be NULLable, all STRUCT Hive columns must be non-NULLable. That's quite a polarization of behavior for a DDL that doesn't even have a keyword "NULL".

The following:
INSERT INTO TABLE oc SELECT null FROM tc;
produces the error
FAILED: SemanticException [Error 10044]: Line 1:18 Cannot insert into target
table because column number/types are different 'oc': Cannot convert column 0
from void to struct<a:int,b:int>.
And the following:
INSERT INTO TABLE oc SELECT named_struct('a', null, 'b', null) FROM tc;
produces the error:
FAILED: SemanticException [Error 10044]: Line 1:18 Cannot insert into target
table because column number/types are different 'oc': Cannot convert column 0
from struct<a:void,b:void> to struct<a:int,b:int>.

Tuesday, February 5, 2013

Avro Tips

Apache Avro is a binary compressed fixed-schema file format meant for Hadoop. As of Hive 0.9.1 and 0.10, Hive can read Avro files directly. Cloudera also back-ported Avro compatibility for its Cloudera Hive 0.9.0.

Avro is highly structured, supporting nested records and arrays, and that makes it a good match for Hive, whose HiveQL syntax adopted SQL1999-style records/structures and arrays.

Avro is a great space-saving alternative to JSON, especially since it's not possible for Apache Pig to read gz-compressed JSON.

This is a collection of tips on using Avro -- most of them can be found here and there on the web, but here they all are in one place.

Schema tips


Below is an example schema that shows nested structures, nullable fields, optional sub-structures, arrays of structures, and arrays of strings.
{"name": "event", "type": "record", "fields": [
  {"name": "logLevel", "type": ["null", "int"], "default": null},
  {"name": "details", "type": 
    {"type": "record", "name": "detailstype", "fields": [
      {"name": "eventname", "type": ["null", "string"], "default": null}
    ]}
  },
  {"name": "metadata", "type": ["null",
    {"type": "record", "name": "metadatatype", "fields": [
      {"name": "referrer", "type": "string"}
    ]}]
  },
  {"name": "messages",   "type": ["null",
    {"type": "array", "name": "messagestype", "items": 
      {"name": "message", "type": "string"}
    }],
    "default": null
  }
]}
  • Declaring a 1-to-1 sub-record requires the verbose style above, where the field's "type" is itself declared as an object whose "type" is "record".
  • In order for a field to be "nullable", it must be declared as a UNION with type "null" as shown above for "type" using the JSON array syntax.
  • In order to avoid having to explicitly set a field when writing out an Avro record, it must be declared with "default": null.
  • Combining the above two points, in order to declare a field nullable and default: null, the "null" in the UNION must appear first because it is the first type in a UNION list that determines the data type for the default, and "null" is an invalid default for all data types (e.g. "string", "int", etc.) with the exception of the "null" data type.
  • The above applies to sub-records as well. In the example above, "details" is a required sub-record (with an optional field "eventname"), whereas "metadata" is an optional sub-record (with, if the sub-record is present, a required field "referrer").
  • When writing out sub-records using the "generic" schema rather than the "specific" schema, the Java code to instantiate an empty sub-record is
    GenericData.Record childdatum =
    new GenericData.Record(schema.getField(fieldname).schema());
    However, if the sub-record is optional (nullable), then to dig past the UNION, the Java code becomes
    GenericData.Record childdatum =
    new GenericData.Record(schema.getField(fieldname).schema().getTypes().get(1));
    In both cases, the childdatum is populated and added to the parentdatum via
    childdatum.put("aChildStringField", "A string value");
    parentdatum.put(fieldname, childdatum);
  • Appending to existing Avro files in HDFS is not supported, even though appending to existing Avro files on the local filesystem and appending to existing non-Avro HDFS files are both supported. The open one-year-old unassigned JIRA ticket for such a feature is AVRO-1035.
  • To view a .avro file:
    java -jar avro-tools-1.7.3.jar tojson mydata.avro
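The union-ordering rule above, namely that the type of the default value must be listed first in the union, can be checked mechanically. Below is a small sketch using only Python's standard json module; the helper function is hypothetical, not part of any Avro library, and the field is taken from the example schema:

```python
import json

# The nullable-with-default field from the example schema above
field = json.loads('{"name": "logLevel", "type": ["null", "int"], "default": null}')

def default_matches_union(field):
    """Per the Avro spec, a union field's default value must match the FIRST
    branch of the union, so a field with "default": null must list "null" first."""
    t = field["type"]
    if isinstance(t, list) and "default" in field and field["default"] is None:
        return t[0] == "null"
    return True
```

A field declared as ["int", "null"] with "default": null fails this check, which is exactly the mistake the bullet points above warn against.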

Hive using Avro


Example table definition:
CREATE EXTERNAL TABLE mytable
  ROW FORMAT SERDE
  'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
  STORED AS INPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
  OUTPUTFORMAT
  'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
  TBLPROPERTIES (
    'avro.schema.url'='hdfs:///user/cloudera/mytable.avsc');
[UPDATE 2013-07-18: The above Hive table creation TBLPROPERTIES no longer works with Hive 0.11. Instead one must use SERDEPROPERTIES.]
  • Although Apache documentation gives examples of 'avro.schema.url'='file:///home/...', that actually doesn't work for any non-trivial Hive query, because map/reduce tasks don't have access to the local filesystem.
  • It's not required to specify LOCATION for where the external Hive data will be stored. If it is not specified, then Hive creates the directory /user/hive/warehouse/<tablename> in HDFS.
  • It is possible to arbitrarily dump a bunch of Avro files into a Hive table directory, and Hive will automatically read them all when executing a query.
  • Not specific to Avro, but it is possible to specify one or more levels of PARTITION for a Hive table. Each level of PARTITION corresponds to another level of directory underneath /user/hive/warehouse/ (e.g. for a two-level PARTITION one might have a directory /user/hive/warehouse/mytable/datestring=2013-02-01/useridfirstchar=2). Using the point above, one can dump multiple Avro files into one such leaf directory of a PARTITIONed table.
    However, one cannot merely create partition directories under /user/hive/warehouse/<tablename>, because the partition names also need to be added to the Hive metastore, which can be done with HiveQL via either the Hive CLI or JDBC:
    ALTER TABLE mytable ADD IF NOT EXISTS PARTITION (ds='2013-02-01')
    Also, note that contrary to examples in Apache documentation, "date" is not a valid partition name as it is a reserved word. "ds" (for "date string") is a popular alternative.
  • Hive supports SQL1999-style nested structures and nested arrays out of the box, and so is a good match for highly-structured Avro schemas. E.g., below is a valid Hive query against the example schema and CREATE TABLE from above:
    SELECT messages[0] FROM event
    Hive also has functions for array handling such as "explode", which returns array elements from a single column as multiple result set rows. https://cwiki.apache.org/Hive/languagemanual-udf.html#LanguageManualUDF-explode
  • Not specific to Avro, but SHOW CREATE TABLE is not supported in Hive until 0.10. To do so in Hive 0.9, download HiveShowCreateTable.jar from https://issues.apache.org/jira/browse/HIVE-967 and then
    hive --service jar HiveShowCreateTable.jar com.media6.hive2rdbms.job.ShowCreateTable -D db.name=default -D table.name=mytable 2>/dev/null
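The explode() behavior mentioned above, one result-set row per array element, can be sketched in plain Python (table and column names are illustrative):

```python
def explode(rows, array_col):
    """Emulate Hive's explode()/LATERAL VIEW: emit one output row per
    element of the array column, copying all other columns unchanged."""
    out = []
    for row in rows:
        for elem in row[array_col]:
            new = dict(row)       # copy the non-array columns
            new[array_col] = elem # replace the array with a single element
            out.append(new)
    return out
```

Note that, as with Hive's explode(), a row whose array is empty contributes no output rows at all.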