Sunday, January 20, 2013


Coming from a SQL background and learning R, I wanted to do the equivalent of GROUP BY ... HAVING (well, really one embedded in a subquery in order to subset the data), but the most obvious Google searches turned up nothing. The answer is probably a no-brainer for R experts, but here it is in case future R novices who are SQL experts Google for it.

Taking the example data set chickwts,

> data(chickwts)
> summary(chickwts)    
    weight          feed
Min.   :108.0   casein   :12
1st Qu.:204.5   horsebean:10
Median :258.0   linseed  :12
Mean   :261.3   meatmeal :11
3rd Qu.:323.5   soybean  :14
Max.   :423.0   sunflower:12

and supposing we want to exclude "low-popularity" feeds that occur in the data set fewer than 12 times (yes, this is a contrived example), the code below discards those feeds, leaving only the ones with at least 12 observations.

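# Count the rows for each feed (the GROUP BY ... COUNT(*) step)
x <- sapply(split(chickwts, chickwts$feed), nrow)
# Keep only rows whose feed has at least 12 observations (the HAVING step)
chickwts <- chickwts[chickwts$feed %in% names(x[x >= 12]), ]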
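For comparison, here is a rough sketch of the same filter written two other ways in base R, with the SQL I had in mind shown as a comment. The variable names counts, keep, popular and same are just ones I made up for illustration, and the SQL is my guess at the equivalent query rather than something I ran against a database. The sketch starts from a fresh copy of chickwts and leaves it unmodified.

```r
# Roughly the SQL being imitated (assumed, not run):
#   SELECT * FROM chickwts
#   WHERE feed IN (SELECT feed FROM chickwts
#                  GROUP BY feed HAVING COUNT(*) >= 12);

data(chickwts)  # reload the original 71-row data set

# Variant 1: table() counts the rows per feed directly
counts <- table(chickwts$feed)
keep <- names(counts[counts >= 12])
popular <- chickwts[chickwts$feed %in% keep, ]

# Subsetting keeps the unused factor levels (horsebean, meatmeal);
# droplevels() removes them so later summaries do not list empty groups
popular <- droplevels(popular)

# Variant 2: ave() attaches each row's group size, so the filter is one line
same <- chickwts[ave(chickwts$weight, chickwts$feed, FUN = length) >= 12, ]
```

Both variants should select the same rows as the split/sapply version. The only extra step is droplevels(), which matters if you go on to call summary() or table() on the result and do not want the discarded feeds showing up with a count of zero.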
