Question

I often need to filter low-variance columns out of a data.table. The column names are not known in advance.

library(data.table)

dt = data.table(mtcars)

# flag columns whose standard deviation exceeds an arbitrary threshold of 1:
mask = dt[,lapply(.SD, function(x) sd(x, na.rm = TRUE) > 1)]

# The columns with the FALSE values in row 1 need to be removed
mask.t = t(mask)
mask.t = which(mask.t)
dt[,mask.t,with=FALSE]

The approach above is clunky. Is there a more elegant way to keep only the data.table columns for which a column statistic is TRUE?

Answer 1

Either of these works:

dt[, names(mask)[unlist(mask)], with=FALSE] 

dt[, names(which(unlist(mask))), with=FALSE]
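
Both forms depend on unlist(mask) being a named logical vector. As a minimal sketch, that vector can also be built directly with vapply, skipping the intermediate one-row table. The names keep and high_sd are illustrative and not part of the original post:

```r
library(data.table)

dt <- data.table(mtcars)

# Named logical vector: TRUE where the column's standard deviation exceeds 1.
keep <- vapply(dt, function(x) sd(x, na.rm = TRUE) > 1, logical(1))

# Select the flagged columns by name.
high_sd <- dt[, names(keep)[keep], with = FALSE]
names(high_sd)
```

vapply is used here instead of sapply because it guarantees one logical value per column and fails loudly otherwise.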

Putting it together:

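variance.filter = function(df) {
  # one-row table: TRUE where the column's standard deviation exceeds 1
  mask = df[, lapply(.SD, function(x) sd(x, na.rm = TRUE) > 1)]
  # return the kept columns visibly instead of ending on an assignment
  df[, names(mask)[unlist(mask)], with = FALSE]
}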
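
If the statistic and the cutoff vary between calls, the same idea can be parameterised. The following is only a sketch under that assumption: filter_cols_by_stat, stat and threshold are hypothetical names introduced here, not something from the original answer.

```r
library(data.table)

# Hypothetical helper: keep the columns for which stat(column) > threshold.
filter_cols_by_stat <- function(dt,
                                stat = function(x) sd(x, na.rm = TRUE),
                                threshold = 1) {
  # isTRUE() treats an NA comparison (e.g. an all-NA column) as "drop".
  keep <- vapply(dt, function(x) isTRUE(stat(x) > threshold), logical(1))
  dt[, names(keep)[keep], with = FALSE]
}

dt <- data.table(mtcars)

by_sd  <- filter_cols_by_stat(dt)                   # default: sd > 1
strict <- filter_cols_by_stat(dt, threshold = 100)  # much stricter cutoff
names(by_sd)
names(strict)
```

Passing the statistic as a function keeps the column selection logic in one place, so switching to, say, the variance or the range only changes the stat argument.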

Filter out data.table columns based on summary statistics
