Question

I have a dataset like the following:

library(data.table)

set.seed(43)
dt <- data.table(
    a = rnorm(10),
    b = rnorm(10),
    c = rnorm(10),
    d = rnorm(10),
    e = sample(c("x","y"), 10, replace = TRUE),
    f = sample(c("t","s"), 10, replace = TRUE)
    )

For each value of column e (for example), I need the count of negative values in columns 1:4. The result must look like this:

   e neg_a_count neg_b_count neg_c_count neg_d_count
1: x           6           3           5           3
2: y           2           1           3          NA
1: s           4           2           3           1
2: t           4           2           5           2

Here is my code:

for (k in 5:6) { #these are the *by* columns
 for (i in 1:4) {#these are the columns whose negative values i'm counting
   n=paste("neg",names(dt[,i,with=F]),"count","by",names(dt[,k,with=F]),sep="_")
   dt[dt[[i]]<0, (n):=.N, by=names(dt[,k,with=F])]
  }
}

dcast(unique(melt(dt[,5:14], id=1, measure=3:6))[!is.na(value),],e~variable)
dcast(unique(melt(dt[,5:14], id=2, measure=7:10))[!is.na(value),],f~variable)

This obviously produces two tables rather than one:

   e neg_a_count_by_e neg_b_count_by_e neg_c_count_by_e neg_d_count_by_e
   1: x                6                3                5             3
   2: y                2                1                3             NA

   f neg_a_count_by_f neg_b_count_by_f neg_c_count_by_f neg_d_count_by_f
   1: s                4                2                3             1
   2: t                4                2                5             2

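so an rbind is needed to produce a single table. This approach modifies dt by adding 8 extra columns (4 data columns x 2 by columns), and the counts associated with the levels of e and f get recycled (as expected). I would like to know whether there is a cleaner way to get the result, one that does not modify dt. Also, casting after melting seems inefficient; there should be a better way, especially since my dataset has several columns like e and f.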

Answer 1

If there are only two grouping columns, we can aggregate by each one separately and then combine the results with rbindlist

rbindlist(list(dt[,lapply(.SD, function(x) sum(x < 0)) , .(e), .SDcols = a:d], 
  dt[,lapply(.SD, function(x) sum(x < 0)) , .(f), .SDcols = a:d]))
#   e a b c d
#1: y 2 1 3 0
#2: x 6 3 5 3
#3: s 4 2 3 1
#4: t 4 2 5 2

Or make it more dynamic by looping over the grouping column names

rbindlist(lapply(c('e', 'f'), function(x) dt[, lapply(.SD, 
           function(.x) sum(.x < 0)), by = x, .SDcols = a:d]))
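
A minimal sketch of the same loop that also records which grouping column each row came from, which can help when several grouping columns share level names. The table `toy`, the vectors `by_cols` and `val_cols`, and the output column names `group_col` and `level` are illustrative assumptions, not part of the question's data.

```r
library(data.table)

# Small hand-made table (hypothetical data, not the question's dt)
toy <- data.table(
  a = c(-1, 2, -3, 4),
  b = c(1, -2, -3, -4),
  e = c("x", "x", "y", "y"),
  f = c("s", "t", "s", "t")
)

by_cols  <- c("e", "f")   # grouping columns
val_cols <- c("a", "b")   # columns whose negative values are counted

# One aggregation per grouping column; each call returns a new table,
# so toy itself is never modified
parts <- lapply(by_cols, function(g) {
  res <- toy[, lapply(.SD, function(v) sum(v < 0)), by = g, .SDcols = val_cols]
  setnames(res, g, "level")   # common name so the pieces stack cleanly
  res
})
names(parts) <- by_cols

# idcol stores the list names, i.e. the grouping column behind each row
counts <- rbindlist(parts, idcol = "group_col")
setnames(counts, val_cols, paste0("neg_", val_cols, "_count"))
print(counts)
```

Because `sum(v < 0)` is used, a group with no negative values gets 0 rather than NA.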

Answer 2

You can melt before aggregating, like this:

cols <- c("a","b","c", "d")
melt(dt, id.vars=cols)[, 
    lapply(.SD, function(x) sum(x < 0)), by=value, .SDcols=cols]
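
One thing to watch with grouping by `value` alone: if two grouping columns happen to share a level, their rows are pooled into a single group. A hedged variation, assuming that is not wanted, groups by both the melted column name and its value. The table `toy` and the names `group_col` and `level` are made up for illustration.

```r
library(data.table)

# Hypothetical data in which e and f both contain the level "x"
toy <- data.table(
  a = c(-1, 2, -3, 4),
  b = c(1, -2, -3, -4),
  e = c("x", "x", "y", "y"),
  f = c("x", "t", "x", "t")
)

cols <- c("a", "b")

# Long format: one row per (original row, grouping column) pair
long <- melt(toy, id.vars = cols, measure.vars = c("e", "f"),
             variable.name = "group_col", value.name = "level")

# Grouping by both keys keeps e's "x" separate from f's "x"
counts <- long[, lapply(.SD, function(v) sum(v < 0)),
               by = .(group_col, level), .SDcols = cols]
print(counts)
```

Here `melt` returns a new table, so `toy` is left unchanged.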

Computing columns in a new data.table without changing the original data