Saturday, March 9, 2013

Automatically deskew before machine learning in R

A common data pre-processing step in R is to deskew data: if a histogram shows a lopsided distribution, apply a transformation such as log() before fitting a model. With a large number of columns, it is tedious to eyeball each histogram and manually replace the offending columns with their log() counterparts.

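Helpfully, the e1071 package (notable for its support vector machine algorithms) provides a handy function, skewness(), to measure the skewness of data. Below is a function to automatically deskew an entire range of columns of a data frame. A column is replaced by its log-transformed version only if the transform reduces the absolute skewness by more than the threshold factor (10% by default).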
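library(e1071)  # provides skewness()

# Log-transform every column in mincol:maxcol whose absolute skewness
# exceeds threshold times the absolute skewness of its transformed version.
deskew <- function(df, mincol=1, maxcol=ncol(df), threshold=1.10) {
  for (i in mincol:maxcol) {
    # Shift so the column minimum maps to log(1) = 0, keeping log() defined
    t <- log(1+df[[i]]-min(df[[i]]))
    if (abs(skewness(df[[i]])) > threshold * abs(skewness(t)))
      df[[i]] <- t
  }
  df
}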
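The per-column test inside the loop can be tried on a single vector. This is only a sketch on simulated data: the lognormal sample, the seed, and the variable names x and x_log are assumptions made for illustration, not part of the original function.

```r
library(e1071)  # provides skewness()

set.seed(42)
x <- rlnorm(1000)              # simulated right-skewed data (assumption)
x_log <- log(1 + x - min(x))   # same shifted log transform as in deskew()

skew_before <- skewness(x)
skew_after  <- skewness(x_log)

# The condition deskew() checks, with the default threshold of 1.10
would_replace <- abs(skew_before) > 1.10 * abs(skew_after)
print(c(before = skew_before, after = skew_after))
print(would_replace)
```

For a strongly right-skewed sample like this one, the transformed skewness should be much closer to zero, so the column would be replaced.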
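Deskewing data often improves the performance of linear models, both regular lm()/glm() and linear svm() support vector machines. Understandably, it doesn't help with tree-based models such as randomForest(), because tree splits depend essentially on the ordering of values, which a monotonic transformation like log() preserves.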
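To see the effect on a linear model, here is a minimal sketch on simulated data. It assumes the response is linear in the logarithm of a right-skewed predictor, which is the situation where deskewing should help most; real data may behave differently. All variable names are made up for this example.

```r
# Simulated data (an assumption for illustration): a right-skewed predictor
# whose effect on the response is linear on the log scale.
set.seed(1)
n <- 500
x <- rlnorm(n)
y <- 3 + 2 * log(x) + rnorm(n, sd = 0.5)

# Fit on the raw predictor and on the deskewed predictor
x_log <- log(1 + x - min(x))   # same shifted log transform as in deskew()
fit_raw <- lm(y ~ x)
fit_log <- lm(y ~ x_log)

r2_raw <- summary(fit_raw)$r.squared
r2_log <- summary(fit_log)$r.squared
print(c(raw = r2_raw, deskewed = r2_log))
```

Under these assumptions the deskewed fit should show the higher R-squared, since the transformed predictor is much closer to the scale on which the relationship is linear.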
