I have a dataset that looks like this:
library(data.table)
set.seed(43)
dt <- data.table(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10),
  e = sample(c("x", "y"), 10, replace = TRUE),
  f = sample(c("t", "s"), 10, replace = TRUE)
)
I need the count of negative values in columns 1:4 for each value of, for example, column e. The result must look like this:
   e neg_a_count neg_b_count neg_c_count neg_d_count
1: x           6           3           5           3
2: y           2           1           3          NA
1: s           4           2           3           1
2: t           4           2           5           2
Here is my code:
for (k in 5:6) {   # these are the *by* columns
  for (i in 1:4) { # these are the columns whose negative values I'm counting
    n <- paste("neg", names(dt[, i, with = FALSE]), "count", "by", names(dt[, k, with = FALSE]), sep = "_")
    dt[dt[[i]] < 0, (n) := .N, by = names(dt[, k, with = FALSE])]
  }
}
dcast(unique(melt(dt[,5:14], id=1, measure=3:6))[!is.na(value),],e~variable)
dcast(unique(melt(dt[,5:14], id=2, measure=7:10))[!is.na(value),],f~variable)
This obviously produces two tables rather than one:
   e neg_a_count_by_e neg_b_count_by_e neg_c_count_by_e neg_d_count_by_e
1: x                6                3                5                3
2: y                2                1                3               NA
   f neg_a_count_by_f neg_b_count_by_f neg_c_count_by_f neg_d_count_by_f
1: s                4                2                3                1
2: t                4                2                5                2
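and an rbind is needed to produce a single table. This approach modifies dt by adding 8 additional columns (4 data columns x 2 by columns), and the counts associated with the levels of e and f are recycled (as expected). I would like to know whether there is a cleaner way to achieve the result, one that does not modify dt. Also, casting after melting seems inefficient, and there should be a better way, especially since my dataset has several e-like and f-like columns.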
Answer 0: (score: 0)
If there are only two grouping columns, we can group by each one separately and then combine the results with rbindlist:
rbindlist(list(dt[,lapply(.SD, function(x) sum(x < 0)) , .(e), .SDcols = a:d],
dt[,lapply(.SD, function(x) sum(x < 0)) , .(f), .SDcols = a:d]))
# e a b c d
#1: y 2 1 3 0
#2: x 6 3 5 3
#3: s 4 2 3 1
#4: t 4 2 5 2
Or make it more dynamic by looping over the grouping column names:
rbindlist(lapply(c('e', 'f'), function(x) dt[, lapply(.SD,
function(.x) sum(.x < 0)), by = x, .SDcols = a:d]))
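If the column names requested in the question matter, the same loop can be wrapped in a small helper. The following is only a sketch under stated assumptions: `count_neg_by`, `val_cols`, `by_cols`, and the output columns `by_col` and `level` are hypothetical names introduced here for illustration, not part of data.table or of the original post.

```r
library(data.table)

set.seed(43)
dt <- data.table(
  a = rnorm(10), b = rnorm(10), c = rnorm(10), d = rnorm(10),
  e = sample(c("x", "y"), 10, replace = TRUE),
  f = sample(c("t", "s"), 10, replace = TRUE)
)

val_cols <- c("a", "b", "c", "d")  # columns whose negative values are counted
by_cols  <- c("e", "f")            # grouping columns

# Hypothetical helper: counts negatives in val_cols for each level of column g.
# It returns a new table and leaves DT unmodified.
count_neg_by <- function(g, DT) {
  res <- DT[, lapply(.SD, function(v) sum(v < 0)), by = g, .SDcols = val_cols]
  # Rename all columns: the grouping column becomes "level"
  setnames(res, c("level", paste0("neg_", val_cols, "_count")))
  res
}

# Name the list elements so that idcol records which grouping column each row came from
out <- rbindlist(lapply(setNames(by_cols, by_cols), count_neg_by, DT = dt),
                 idcol = "by_col")
out
```

Note that `sum(v < 0)` yields 0 for a group with no negative values, whereas the loop in the question leaves NA there. If NA is required, the zeros would have to be replaced afterwards.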
Answer 1: (score: 0)
You can melt before aggregating, like this:
cols <- c("a","b","c", "d")
melt(dt, id.vars=cols)[,
lapply(.SD, function(x) sum(x < 0)), by=value, .SDcols=cols]
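To see why this works, it may help to look at the intermediate long table. The sketch below assumes the same `dt` as in the question; `long` and `res` are names chosen here for illustration. It also groups by `variable` as well as `value`, which is a variation on the code above: grouping by `value` alone would merge counts if two grouping columns happened to share a level name.

```r
library(data.table)

set.seed(43)
dt <- data.table(
  a = rnorm(10), b = rnorm(10), c = rnorm(10), d = rnorm(10),
  e = sample(c("x", "y"), 10, replace = TRUE),
  f = sample(c("t", "s"), 10, replace = TRUE)
)

cols <- c("a", "b", "c", "d")

# With id.vars given, the remaining columns (e and f) become the measure variables.
# Each original row appears once per grouping column, so long has 2 * nrow(dt) rows:
# a, b, c, d, variable ("e" or "f") and value (the level in that column).
long <- melt(dt, id.vars = cols)

# Grouping by both variable and value keeps levels from different
# grouping columns apart even if their names coincide.
res <- long[, lapply(.SD, function(x) sum(x < 0)),
            by = .(variable, value), .SDcols = cols]
res
```

Because `melt` returns a new table, `dt` itself is left unchanged, which addresses the concern in the question about adding columns to it.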