I have large data sets with rows that measure the same thing (essentially duplicates with some noise). As part of a larger function I am writing, I want the user to be able to collapse these rows with a function of their choosing (e.g. mean, median).
My problem is that calling the function directly is much faster than calling it via match.fun (which is what I need). MWE:
    require(data.table)
    rows <- 100000
    cols <- 1000
    dat <- data.table(id=sample(LETTERS, rows, replace=TRUE),
                      matrix(rnorm(rows*cols), nrow=rows))
    aggFn <- "median"
    system.time(dat[, lapply(.SD, median), by=id])
    system.time(dat[, lapply(.SD, match.fun(aggFn)), by=id])
On my system, timing results for the last 2 lines:
       user  system elapsed
      1.112   0.027   1.141
       user  system elapsed
      2.854   0.265   3.121
This becomes quite dramatic with larger data sets.
As a final point, I realize aggregate() can do this (and doesn't seem to suffer from this behavior), but I need to work with data.table objects due to data size.

Answer:
The reason is the GForce optimization data.table does for median. You can see that if you set options(datatable.verbose = TRUE). See help("GForce") for details.
If you compare using a function that is not recognized by name (here both calls go through the local variable fun, which GForce does not recognize), you get much more similar timings:
    fun <- median
    aggFn <- "fun"
    system.time(dat[, lapply(.SD, fun), by=id])
    system.time(dat[, lapply(.SD, match.fun(aggFn)), by=id])
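Another way to confirm that GForce is the cause (rather than anything about match.fun itself) is to lower the optimization level; assuming a data.table version where options(datatable.optimize) is honoured, GForce only applies at level 2 or above:

    # temporarily disable GForce (a level-2 optimization); restore afterwards
    old <- options(datatable.optimize = 1L)
    system.time(dat[, lapply(.SD, median), by=id])  # now comparable to the match.fun() timing
    options(old)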
A possible workaround to utilise the optimization, if the function happens to be supported, would be evaluating an expression built with it, e.g., using the dreaded eval(parse()):
    dat[, eval(parse(text = sprintf("lapply(.SD, %s)", aggFn))), by=id]
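A quick sanity check that the parsed version returns the same result as the direct call (a sketch; the row order matches because both group by id on the same data):

    aggFn <- "median"  # as in the question
    res_direct <- dat[, lapply(.SD, median), by=id]
    res_parsed <- dat[, eval(parse(text = sprintf("lapply(.SD, %s)", aggFn))), by=id]
    all.equal(res_direct, res_parsed)  # should be TRUE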
However, you would lose the small amount of security that using match.fun provides.
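That check is what you would be giving up: match.fun only ever returns an actual function, while parse()/eval() will run whatever string it is handed:

    match.fun("median")           # returns the median function
    try(match.fun("no_such_fn"))  # error: could not find function "no_such_fn"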
If you have a list of functions the users can choose from, you could do this:
    funs <- list(quote(mean), quote(median))
    fun <- funs[[1L]]  # select one of them
    expr <- bquote(lapply(.SD, .(fun)))
    a <- dat[, eval(expr), by=id]
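In the OP's setting, where the user chooses the aggregation by name, the quoting approach could be wrapped up roughly as follows. This is only a sketch: collapseRows is a hypothetical name, the mean/median whitelist is an assumption, and the scoping of eval(expr) inside functions can vary across data.table versions:

    # hypothetical wrapper around the quoting trick above
    collapseRows <- function(dt, aggFn = c("mean", "median"), by = "id") {
      aggFn <- match.arg(aggFn)                       # whitelist validates the user's choice
      expr <- bquote(lapply(.SD, .(as.name(aggFn))))  # builds e.g. lapply(.SD, median)
      dt[, eval(expr), by = by]                       # GForce can still recognise the symbol
    }
    collapseRows(dat, "median")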