Thursday, January 24, 2013

ElephantBird now enables Hello World from Pig

In learning Apache Pig, I was surprised at how difficult it is to write "Hello World." Based on the Constants section of the Pig Latin reference (http://pig.apache.org/docs/r0.7.0/piglatin_ref2.html#Constants), I expected the code below to be legal:
-- Illegal Pig syntax
A = {('Hello'),('World')};
DUMP A;

However, it produces syntax errors. The only way to create a relation -- the basic "variable" in Pig -- is by LOADing a file. Having been the "language lawyer" on every team I have worked in since the K&R era, I am bothered that a "Hello World" program is impossible to write within a single code file. It makes Pig Latin seem like an incomplete language.

The people at Twitter's ElephantBird project have come up with a custom solution in response to my request on the Pig User mailing list.
http://mail-archives.apache.org/mod_mbox/pig-user/201301.mbox/%3CCAE7pYjZtwuxYZs6Ov54P-6SFRCkKPuL9Jwac9i-Rr%2BYsdhasNw%40mail.gmail.com%3E

This ElephantBird Java class allows converting what normally would be the filename specified with the LOAD command into a tuple. A hack. That works. But not without invoking code not distributed with Pig and not without ugliness.
languages = load 'en,fr,jp' using LocationAsTuple(',');

Sunday, January 20, 2013

SQL HAVING in R

Coming from a SQL background, I'm learning R, and wanted to do the equivalent of GROUP BY HAVING (well, really embedded in a subquery in order to subset the data), but the most obvious Google searches turned up nothing.  The answer is probably a no-brainer for R experts, but here it is in case future R-novices-SQL-experts Google for it.

Taking the example data set chickwts,

> data(chickwts)
> summary(chickwts)    
    weight          feed
Min.   :108.0   casein   :12
1st Qu.:204.5   horsebean:10
Median :258.0   linseed  :12
Mean   :261.3   meatmeal :11
3rd Qu.:323.5   soybean  :14
Max.   :423.0   sunflower:12

and supposing we want to exclude "low" popularity feeds that occur in the data set fewer than 12 times (yes, this is a contrived example), the code below discards those low-popularity feeds, leaving only the rows for the high-popularity feeds.

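# Count the rows for each feed (the GROUP BY ... COUNT(*) step)
x <- sapply(split(chickwts, chickwts$feed), nrow)
# Keep only the rows whose feed occurs at least 12 times (the HAVING step)
chickwts <- chickwts[chickwts$feed %in% names(x[x >= 12]), ]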
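For comparison, here is a minimal sketch of the same GROUP BY ... HAVING idea using table() for the counts. It is one possible variation in base R, not the only way to do it. The names min_count, feed_counts, popular_feeds and popular are made up for illustration, and the result goes into a new variable so the built-in data set is not overwritten.

```r
# Sketch only; variable names here are illustrative, not from the post above.
# Rough SQL equivalent:
#   SELECT * FROM chickwts WHERE feed IN
#     (SELECT feed FROM chickwts GROUP BY feed HAVING COUNT(*) >= 12)

data(chickwts)   # reload the built-in data set in case it was overwritten

min_count <- 12  # threshold from the contrived example above

# GROUP BY feed, COUNT(*): table() returns the number of rows per feed
feed_counts <- table(chickwts$feed)

# HAVING COUNT(*) >= min_count: names of the feeds that pass the threshold
popular_feeds <- names(feed_counts[feed_counts >= min_count])

# Outer query: keep only the rows whose feed passed the threshold
popular <- chickwts[chickwts$feed %in% popular_feeds, ]

# Subsetting keeps the discarded feeds as empty factor levels; drop them
popular$feed <- droplevels(popular$feed)

table(popular$feed)
```

The droplevels() step is optional, but without it the discarded feeds still show up with a count of 0 in later calls to summary() or table().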

Thursday, January 3, 2013

Laravel PHP/MySQL/CentOS garbled strings

A quick but obscure tidbit for something that consumed my day today (reported here only because it wasn't reported anywhere else on the web):

If you're using the Laravel PHP framework with MySQL running on CentOS, you probably need to change the charset in application/config/database.php to be "latin1" instead of Laravel's example of "utf8".  Otherwise, you'll get garbled strings.

CentOS (even the current 6.3 distribution) comes with MySQL 5.1, which predates Oracle's acquisition of MySQL and the modernizations in its successor, MySQL 5.5 (e.g. making InnoDB, the storage engine that supports foreign key constraints, the default). MySQL 5.1 as shipped with CentOS defaults to the latin1 character set, whereas Laravel's example database.php connection specifies utf8. So unless your MySQL server and tables have been set up to use utf8, you will need to change database.php to specify the latin1 character set.