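Generate multiple new columns in data.table using multiple functions applied to multiple columns

Time: 2019-07-29 16:48:48

Tags: r data.table

I would like to apply multiple functions to several columns of a data.table and generate new columns from the output. I have found similar questions here, but the answers provided do not seem to address my exact problem, e.g.: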

Apply multiple functions to multiple columns in data.table

ddply to multiple columns equivalent in data.table

R data.table - Apply function A to some columns and function B to some others

Generate some data:

library(data.table)  # needed later for dcast() on a data.table
set.seed(1)
p <- rep(1:10, 4)
p

time1 <- sample(1:40, 40, replace=TRUE)
time2 <- sample(1:40, 40, replace=TRUE)
contact1 <- sample(rep(c("personal", "nonpersonal"),20), 40)
contact2 <- sample(rep(c("personal", "nonpersonal"),20), 40)
closeness1 <- sample(1:10, 40, replace=TRUE)
closeness2 <- sample(1:10, 40, replace=TRUE)

dt <- data.table::data.table(p, time1, time2, contact1, contact2, closeness1, closeness2)

This works, but it seems inefficient because I run the operation separately for each column:

# s1
dt[, c("scliq.s", "symgr.s") :=list(length(which(.SD<=7)), length(which(.SD>7 & .SD<=31))), .SDcols="time1", by = p]

# d1
dt[, c("scliq.d", "symgr.d") :=list(length(which(.SD<=7)), length(which(.SD>7 & .SD<=31))), .SDcols="time2", by = p]

# s2
dt[, c("pers.s", "npers.s") :=list(length(which(.SD=="personal"))/length(which(.SD=="personal" | .SD=="nonpersonal")), length(which(.SD=="nonpersonal"))/length(which(.SD=="personal" | .SD=="nonpersonal"))), .SDcols="contact1", by = p]

# d2
dt[, c("pers.d", "npers.d") :=list(length(which(.SD=="personal"))/length(which(.SD=="personal" | .SD=="nonpersonal")), length(which(.SD=="nonpersonal"))/length(which(.SD=="personal" | .SD=="nonpersonal"))), .SDcols="contact2", by = p]

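I tried to adapt similar solutions from the other posts. For simplicity, I only tried this for # s1 and # d1, but eventually I would like to do # s1, # d1, # s2 and # d2 in one go. I am not wedded to length(which); I just need to count the number of instances in each condition (table() would also work, but I could not get the output of table() saved in a data.table):

# option 1
my.summary = function(x) list(count1 = length(which(x<=7)), count2 = length(which(x>7 & x<=31)))
dt[, c("scliq.s", "symgr.s", "scliq.d", "symgr.d") :=unlist(lapply(.SD, my.summary)), .SDcols = c("time1", "time2"), by = p]

# option 2, note: I wasn't sure how to adapt sum/mean to a nested function call (i.e., length(which))
dt$dday <- 1 # add a constant column
dt <- dcast(dt, dday~dday, fun=list(sum, mean), value.var = c("time1", "time2"))

I succeed in generating the desired number of columns. However, each row of all four columns contains the same value, even though the values should differ, as the output of this snippet shows:

dt[, unlist(lapply(.SD, my.summary)), .SDcols = c("time1", "time2"), by = p]

The second step I would like to do is to compute the mean of the closeness1 and closeness2 columns based on the criteria for time1 and time2 above (again separately for each value of p, i.e., by = p), and to save the output in new columns, each following the "scliq"/"symgr" naming. For example, I want the mean of closeness1 over all rows where time1 is 7 or below, and over all rows where time1 is between 8 and 31 (likewise for closeness2 and time2).

I should also note that I know how to solve this with the tidyverse packages, but for the sake of simplicity and efficiency I am keen to learn how to do it in data.table. Any hints or actual solutions would be much appreciated.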

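1 answer:

Answer 0 (score: 1)

The reason your my.summary solution does not work is that unlist is recursive by default, so it ends up packing all the values from all the nested lists into a single vector, and data.table then silently recycles the values. Taking Jaap's comment into account, you could write: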

my.summary = function(x) list(sum(x<=7), sum(x>7 & x<=31))

dt[, c("scliq.s", "symgr.s", "scliq.d", "symgr.d") := unlist(lapply(.SD, my.summary), recursive = FALSE),
   .SDcols = c("time1", "time2"), by = p]
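To make the difference concrete, here is a minimal base R sketch of what unlist does to the nested result of lapply. The input list `cols` is a made-up stand-in for the two time columns of one group, not data from the question.

```r
# Hypothetical toy input: two numeric vectors standing in for time1 and time2.
cols <- list(time1 = c(3, 10, 35, 7), time2 = c(1, 2, 20, 40))

my.summary <- function(x) list(sum(x <= 7), sum(x > 7 & x <= 31))

# lapply returns a list of 2 elements, each itself a list of 2 counts.
nested <- lapply(cols, my.summary)

# Default (recursive = TRUE): everything is flattened into ONE atomic vector.
flat_vec <- unlist(nested)

# recursive = FALSE: only the outer level is removed, leaving a list of
# 4 separate scalars, i.e. one element per target column.
flat_list <- unlist(nested, recursive = FALSE)

str(flat_vec)
str(flat_list)
```

With := and four column names on the left-hand side, the list of four elements is the shape that maps one element to each new column, which is why recursive = FALSE is used above.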

For the means, I can think of 2 alternatives. The first uses .SD with by, which can sometimes be slow:

dt[, c("mean1", "mean2") := .(.SD[time1 <= 7, mean(closeness1)], 
                              .SD[time2 > 7 & time2 <= 31, mean(closeness2)]),
   by = p,
   .SDcols = time1:closeness2]
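As a small illustration of the .SD-with-by pattern, the sketch below runs the same idea on a made-up table called `toy` (its values are hypothetical and chosen so that every group has at least one row with time1 <= 7).

```r
library(data.table)

# Hypothetical toy data: two groups of three rows each.
toy <- data.table(
  p          = c(1, 1, 1, 2, 2, 2),
  time1      = c(3, 5, 20, 6, 10, 30),
  closeness1 = c(2, 4, 9, 6, 8, 1)
)

# Within each group, .SD holds that group's rows for the .SDcols columns.
# Subsetting .SD by the condition and taking the mean yields one value per
# group, which := then assigns to every row of that group.
toy[, mean1 := .SD[time1 <= 7, mean(closeness1)],
    by = p,
    .SDcols = c("time1", "closeness1")]

print(toy)
```

Here group 1 averages the closeness1 values 2 and 4, and group 2 uses only the value 6.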

The other alternative is to compute the means in a subset of the table and then join them back:

dt[dt[time1 <= 7, .(ans = mean(closeness1)), by = p], mean1 := ans, on = "p"]
dt[dt[time2 > 7 & time2 <= 31, .(ans = mean(closeness2)), by = p], mean2 := ans, on = "p"]
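The join variant can be read as two steps, aggregate and then update by reference. A minimal sketch on a made-up table `toy` (hypothetical values, with the intermediate result stored in a helper called `means` for readability):

```r
library(data.table)

# Hypothetical toy data: group 2 has no row with time1 <= 7.
toy <- data.table(
  p         = c(1, 1, 1, 2, 2),
  time1     = c(3, 5, 20, 10, 30),
  closeness = c(2, 4, 9, 6, 8)
)

# Step 1: aggregate only the rows that meet the condition, one row per group.
means <- toy[time1 <= 7, .(ans = mean(closeness)), by = p]

# Step 2: update join. Rows of `toy` whose p matches a row of `means`
# receive that row's `ans`; rows of groups absent from `means` stay NA.
toy[means, mean1 := ans, on = "p"]

print(toy)
```

Note that a group with no qualifying rows does not appear in `means` at all, so its rows are left as NA in the new column.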

Depending on your actual data, one may be faster than the other, so you should time both.
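One rough way to time the two alternatives is base R's system.time, sketched below on synthetic data. The table `big`, its size, and the number of groups are arbitrary assumptions; the relative timings on real data may look quite different.

```r
library(data.table)

set.seed(42)
n <- 2e4  # hypothetical number of rows
big <- data.table(
  p          = sample(1:500, n, replace = TRUE),
  time1      = sample(1:40, n, replace = TRUE),
  closeness1 = sample(1:10, n, replace = TRUE)
)

# Work on copies so each approach starts from the same untouched table.
a <- copy(big)
b <- copy(big)

# Alternative 1: subset .SD within each group.
t_sd <- system.time(
  a[, mean1 := .SD[time1 <= 7, mean(closeness1)],
    by = p, .SDcols = c("time1", "closeness1")]
)

# Alternative 2: aggregate the filtered rows, then update join on p.
t_join <- system.time(
  b[b[time1 <= 7, .(ans = mean(closeness1)), by = p], mean1 := ans, on = "p"]
)

# Elapsed wall-clock seconds for each approach.
print(c(sd_by = t_sd[["elapsed"]], join = t_join[["elapsed"]]))
```

A single system.time call is noisy, so repeating the measurement several times, or using a dedicated benchmarking package, gives a more reliable comparison.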