Question

I want to add a column to a data.table whose values are computed from other columns, and I have not found a way to do it.

library(data.table)

DT <- data.table("userId" = c("user1", "user1", "user1", "user1", "user2", "user2", "user2", "user2", "user2"),
                 "topicId" = "topic1",
                 "attempt" = c(1, 2, 3, 4, 1, 2, 3, 4, 5),
                 "score" = c(50, 0, 50, 20, 20, 100, 100, 100, 100))

   userId topicId attempt score
1:  user1  topic1       1    50
2:  user1  topic1       2     0
3:  user1  topic1       3    50
4:  user1  topic1       4    20
5:  user2  topic1       1    20
6:  user2  topic1       2   100
7:  user2  topic1       3   100
8:  user2  topic1       4   100
9:  user2  topic1       5   100

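Ultimately, I want to add the following two columns, which show, within each user/topic group, the difference between each score and the score on the first attempt and on the previous attempt: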

   userId topicId attempt score scoreDiffFromFirst scoreDiffFromPrev
1:  user1  topic1       1    50                 NA                NA
2:  user1  topic1       2     0                -50               -50
3:  user1  topic1       3    50                  0                50
4:  user1  topic1       4    20                -30               -30
5:  user2  topic1       1    20                 NA                NA
6:  user2  topic1       2   100                 80                80
7:  user2  topic1       3   100                 80                 0
8:  user2  topic1       4   100                 80                 0
9:  user2  topic1       5   100                 80                 0

I tried the following:

for(i in 2:max(DT$attempt)) {
  DT[attempt == i, scoreDiffFromFirst := score - DT[attempt == 1 & userId == userId & topicId == topicId, score]]
  DT[attempt == i, scoreDiffFromPrev := score - DT[attempt == i - 1 & userId == userId & topicId == topicId, score]]
}

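This works until it reaches an attempt number that is not present in every user/topic group. As a result the last row is wrong, as shown below (marked with **), because user1 has no fifth attempt: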

   userId topicId attempt score scoreDiffFromFirst scoreDiffFromPrev
1:  user1  topic1       1    50                 NA                NA
2:  user1  topic1       2     0                -50               -50
3:  user1  topic1       3    50                  0                50
4:  user1  topic1       4    20                -30               -30
5:  user2  topic1       1    20                 NA                NA
6:  user2  topic1       2   100                 80                80
7:  user2  topic1       3   100                 80                 0
8:  user2  topic1       4   100                 80                 0
9:  user2  topic1       5   100                 50**              80**

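I am happy to learn from any suggestion, but ideally I would like some variation of the above, because the speed benefit is substantial.

Thanks in advance.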

Answer 1

You can use by= to do the calculation per group:

DT[order(attempt), c("scoreDiffFromFirst", "scoreDiffFromPrev") := 
  .(replace(score, 1, NA) - first(score), score - shift(score))
, by=.(userId,topicId)]

# or
DT[order(attempt), `:=`(
  scoreDiffFromFirst = replace(score, 1, NA) - first(score), 
  scoreDiffFromPrev = score - shift(score)
), by=.(userId,topicId)]

which gives

   userId topicId attempt score scoreDiffFromFirst scoreDiffFromPrev
1:  user1  topic1       1    50                 NA                NA
2:  user1  topic1       2     0                -50               -50
3:  user1  topic1       3    50                  0                50
4:  user1  topic1       4    20                -30               -30
5:  user2  topic1       1    20                 NA                NA
6:  user2  topic1       2   100                 80                80
7:  user2  topic1       3   100                 80                 0
8:  user2  topic1       4   100                 80                 0
9:  user2  topic1       5   100                 80                 0
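To make the pieces of that expression easier to follow, here is a minimal sketch on a single group's scores (user1's values from the question). It only uses shift() and first() from data.table and replace() from base R, applied to a plain vector rather than inside DT[...]:

```r
library(data.table)

# Scores for one user/topic group, already ordered by attempt
score <- c(50, 0, 50, 20)

# shift() lags the vector by one position and pads the front with NA
prev <- shift(score)

# Difference from the previous attempt; the first element is NA
diff_prev <- score - prev

# replace(score, 1, NA) blanks the first element so that the first attempt
# yields NA instead of 0 after subtracting first(score)
diff_first <- replace(score, 1, NA) - first(score)

print(diff_prev)
print(diff_first)
```

With by=.(userId,topicId), data.table evaluates the same expressions once per group, so each group gets its own first score and its own lag.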

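For the intuition behind the syntax, you can look at the material mentioned in the startup message shown when the package is loaded: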

library(data.table)
data.table 1.10.4.3

The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way

Documentation: ?data.table, example(data.table) and browseVignettes("data.table")

Release notes, videos and slides: http://r-datatable.com
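As for why the loop in the question misfires, the following sketch may help. It rebuilds the question's table and inspects the inner lookups on their own; the variable names first_scores and prev_scores are introduced here purely for illustration:

```r
library(data.table)

DT <- data.table(userId  = c(rep("user1", 4), rep("user2", 5)),
                 topicId = "topic1",
                 attempt = c(1:4, 1:5),
                 score   = c(50, 0, 50, 20, 20, 100, 100, 100, 100))

# Inside DT[...], both sides of userId == userId refer to the same column,
# so the comparison is TRUE for every row and does not restrict the lookup
# to the current user.
first_scores <- DT[attempt == 1 & userId == userId & topicId == topicId, score]
print(first_scores)   # one score per user, not one score for "this" user

# The same holds for the previous-attempt lookup: for attempt 5 it returns
# the attempt-4 score of every user, although only user2 has an attempt 5.
prev_scores <- DT[attempt == 4 & userId == userId & topicId == topicId, score]
print(prev_scores)
```

For attempts 2 to 4 the lookup happens to return one value per row in matching order, so the loop looks correct. For attempt 5 there is a single row on the left but two values on the right, and the row ends up compared against user1's values, which matches the 50 and 80 in the question's last row. Grouping with by= avoids the problem because each group only ever sees its own rows.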

Answer 2

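dplyr makes this easy. first and lag retrieve the score at a relative position, while group_by groups the rows by userId and topicId so that the calculation is applied per group rather than over the whole table.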

library(dplyr)

DT <- DT %>%
    group_by(userId, topicId) %>%
    mutate(scoreDiffFromFirst = score - first(score),
           scoreDiffFromPrev = score - lag(score))

DT
# A tibble: 9 x 6
# Groups:   userId, topicId [2]
  userId topicId attempt score scoreDiffFromFirst scoreDiffFromPrev
  <fct>  <fct>     <dbl> <dbl>              <dbl>             <dbl>
1 user1  topic1       1.   50.                 0.               NA 
2 user1  topic1       2.    0.               -50.              -50.
3 user1  topic1       3.   50.                 0.               50.
4 user1  topic1       4.   20.               -30.              -30.
5 user2  topic1       1.   20.                 0.               NA 
6 user2  topic1       2.  100.                80.               80.
7 user2  topic1       3.  100.                80.                0.
8 user2  topic1       4.  100.                80.                0.
9 user2  topic1       5.  100.                80.                0.

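The only difference is that scoreDiffFromFirst is 0 rather than NA on the first-attempt rows. If that is not what you want, you can replace those values directly: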

DT[DT$attempt == 1, "scoreDiffFromFirst"] <- NA
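If you would rather get the NA in the same step, one possible variation (a sketch, not part of the original answer) is to test for the first row of each group with row_number() and if_else(). The data frame name scores and the result name result are placeholders chosen here, and the rows are sorted by attempt within each group in case the input is not already ordered:

```r
library(dplyr)

scores <- data.frame(
  userId  = c(rep("user1", 4), rep("user2", 5)),
  topicId = "topic1",
  attempt = c(1:4, 1:5),
  score   = c(50, 0, 50, 20, 20, 100, 100, 100, 100)
)

result <- scores %>%
  group_by(userId, topicId) %>%
  arrange(attempt, .by_group = TRUE) %>%
  mutate(
    # NA_real_ keeps both branches of if_else() the same type (double)
    scoreDiffFromFirst = if_else(row_number() == 1, NA_real_,
                                 score - first(score)),
    scoreDiffFromPrev  = score - lag(score)
  ) %>%
  ungroup()

print(result)
```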

Answer 3

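You can also use an ifelse approach. Note that this version groups by userId only and assumes the rows are already sorted by attempt: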

DT[, `:=`(
  scoreDiffFromFirst = ifelse(rowid(topicId) == 1, NA, score - first(score)), 
  scoreDiffFromPrev = c(NA, diff(score))), by = userId]
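If the table holds more than one topic per user, a variant that groups by both keys may be safer. The sketch below is an assumption about how that could look rather than the answerer's own code: it sorts with setorder() first, uses seq_len(.N) to find the first row of each group (grouping columns have length 1 inside j, so rowid(topicId) would no longer work there), and uses NA_real_ so the new column is numeric in every group:

```r
library(data.table)

DT <- data.table(userId  = c(rep("user1", 4), rep("user2", 5)),
                 topicId = "topic1",
                 attempt = c(1:4, 1:5),
                 score   = c(50, 0, 50, 20, 20, 100, 100, 100, 100))

# Sort by reference so that "first" and "previous" follow attempt order
setorder(DT, userId, topicId, attempt)

DT[, `:=`(
  # First row of each user/topic group gets NA, the rest the difference
  scoreDiffFromFirst = ifelse(seq_len(.N) == 1, NA_real_, score - first(score)),
  # diff() has one element fewer than score, so pad the front with NA
  scoreDiffFromPrev  = c(NA, diff(score))
), by = .(userId, topicId)]

print(DT)
```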
