Using data.table inside a function call in R

Date: 2016-08-22 05:04:44

Tags: r encoding data.table categorical-data binning

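I want to run a data.table task repeatedly from inside a function: Reduce number of levels for large categorical variables. My question is similar to "Data.table and get() command (R)" and "pass column name in data.table using variable in R", but I can't get it to work.

Without a function call, this works fine: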

# Load data.table
require(data.table)

# Some data
set.seed(1)
dt <- data.table(type = factor(sample(c("A", "B", "C"), 10e3, replace = TRUE)),
                 weight = rnorm(n = 10e3, mean = 70, sd = 20))

# Decide the minimum frequency a level needs...
min.freq <- 3350

# Levels that don't meet minimum frequency (using data.table)
fail.min.f <- dt[, .N, type][N < min.freq, type]

# Call all these level "Other"
levels(dt$type)[fail.min.f] <- "Other"

But when I wrap it in a function like this:

reduceCategorical <- function(variableName, min.freq){
  fail.min.f <- dt[, .N, variableName][N < min.freq, variableName]
  levels(dt[, variableName][fail.min.f]) <- "Other"
}

I only get errors like:

 reduceCategorical(dt$x, 3350)
Fehler in levels(df[, variableName][fail.min.f]) <- "Other" : 
 trying to set attribute of NULL value

and sometimes:

Error is: number of levels differs

2 answers:

Answer 0 (score: 3)

One possibility is to define your own releveling function using data.table::setattr, which modifies dt in place. Something like:

DTsetlvls <- function(x, newl)  
   setattr(x, "levels", c(setdiff(levels(x), newl), rep("other", length(newl))))

Then use it inside another predefined function:

f <- function(variableName, min.freq){
  fail.min.f <- dt[, .N, by = variableName][N < min.freq, get(variableName)]
  dt[, DTsetlvls(get(variableName), fail.min.f)]
  invisible()
}

f("type", min.freq)
levels(dt$type)
# [1] "C"     "other"

Some other data.table alternatives:

f <- function(var, min.freq) {
  fail.min.f <- dt[, .N, by = var][N < min.freq, get(var)]
  dt[get(var) %in% fail.min.f, (var) := "Other"]
  dt[, (var) := factor(get(var))]
}

Or using set / .I:

f <- function(var, min.freq) {
  fail.min.f <- dt[, .I[.N < min.freq], by = var]$V1
  set(dt, fail.min.f, var, "other")
  set(dt, NULL, var, factor(dt[[var]]))
}

Or combined with base R (this won't modify the original data set):

f <- function(df, variableName, min.freq){
  fail.min.f <- df[, .N, by = variableName][N < min.freq, get(variableName)]
  # Index by the column name passed in, rather than the hard-coded df$type
  levels(df[[variableName]])[fail.min.f] <- "Other"
  df
}

Alternatively, we could work with characters instead (if type is a character column), in which case you can do:

f <- function(var, min.freq) dt[, (var) := if(.N < min.freq) "other", by = var]
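
For the character-column case, a more explicit sketch is shown below. The helper name collapse_rare_chr, its arguments, the temporary column n_in_group and the toy data are assumptions made for illustration, not part of the original answer. It assumes the column is character rather than factor, and that the table has no existing column called n_in_group.

```r
library(data.table)

# Hypothetical helper (illustration only): relabel rare values of a
# character column by reference, given the column name as a string.
collapse_rare_chr <- function(dt, col, min_freq, other = "other") {
  dt[, n_in_group := .N, by = col]            # temporary per-value row count
  dt[n_in_group < min_freq, (col) := other]   # overwrite only the rare values
  dt[, n_in_group := NULL]                    # drop the temporary column
  invisible(dt)
}

toy <- data.table(type = c("A", "A", "A", "B", "C"))
collapse_rare_chr(toy, "type", min_freq = 2)
toy$type
# [1] "A"     "A"     "A"     "other" "other"
```

Passing the table as an argument, instead of relying on a global dt, keeps the helper reusable across tables; the update still happens by reference.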

Answer 1 (score: 1)

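You refer to things slightly differently inside the wrapper. Where the standalone code uses the column name type, the wrapper uses the whole of variableName, which in your call is actually a vector rather than a column name. The same applies when getting the levels: variableName is not used directly as a column name inside the function.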

The error occurs because the value of fail.min.f becomes NULL as a result of how the column is referenced.
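
To illustrate the referencing problem described above: dt$x evaluates to NULL when dt has no column x, so the wrapper receives NULL instead of a column name. The following is a minimal sketch of a wrapper that takes the column name as a string. The helper name reduce_levels, its arguments and the toy data are hypothetical and not from the original post; it assumes the column is a factor.

```r
library(data.table)

# Hypothetical helper (illustration only): collapse rare levels of one
# factor column, given the column name as a character string.
reduce_levels <- function(dt, col, min_freq, other = "Other") {
  counts <- dt[, .N, by = col]                        # rows per level
  rare <- as.character(counts[N < min_freq][[col]])   # labels below the threshold
  f <- dt[[col]]                                      # extract the column by name
  levels(f)[levels(f) %in% rare] <- other             # base R merges the relabelled levels
  set(dt, j = col, value = f)                         # write back by reference
  invisible(dt)
}

toy <- data.table(type = factor(c("A", "A", "A", "B", "C")))
is.null(toy$x)   # TRUE: there is no column x, so toy$x would pass NULL
reduce_levels(toy, "type", min_freq = 2)
levels(toy$type)
# [1] "A"     "Other"
```

Matching on level labels with %in% avoids relying on the position of the rare levels within levels().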