R - Calculating values by relative subset

Date: 2018-04-26 13:59:46

Tags: r data.table

I want to add a column to a data.table whose values are calculated from other columns, and I have not yet found a way to do this.

library(data.table)

DT <- data.table("userId" = c("user1", "user1", "user1", "user1", "user2", "user2", "user2", "user2", "user2"),
                 "topicId" = "topic1",
                 "attempt" = c(1, 2, 3, 4, 1, 2, 3, 4, 5),
                 "score" = c(50, 0, 50, 20, 20, 100, 100, 100, 100))

   userId topicId attempt score
1:  user1  topic1       1    50
2:  user1  topic1       2     0
3:  user1  topic1       3    50
4:  user1  topic1       4    20
5:  user2  topic1       1    20
6:  user2  topic1       2   100
7:  user2  topic1       3   100
8:  user2  topic1       4   100
9:  user2  topic1       5   100

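Ultimately, I want to add the following two columns, which show, within each user/topic group, the difference in score from the first attempt and from the previous attempt: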

   userId topicId attempt score scoreDiffFromFirst scoreDiffFromPrev
1:  user1  topic1       1    50                 NA                NA
2:  user1  topic1       2     0                -50               -50
3:  user1  topic1       3    50                  0                50
4:  user1  topic1       4    20                -30               -30
5:  user2  topic1       1    20                 NA                NA
6:  user2  topic1       2   100                 80                80
7:  user2  topic1       3   100                 80                 0
8:  user2  topic1       4   100                 80                 0
9:  user2  topic1       5   100                 80                 0

I tried the following:

for(i in 2:max(DT$attempt)) {
  DT[attempt == i, scoreDiffFromFirst := score - DT[attempt == 1 & userId == userId & topicId == topicId, score]]
  DT[attempt == i, scoreDiffFromPrev := score - DT[attempt == i - 1 & userId == userId & topicId == topicId, score]]
}

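This works until it reaches an attempt number that is not present in every user/topic group. As a result the last row is incorrect, as shown below, because user1 has no fifth attempt: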

   userId topicId attempt score scoreDiffFromFirst scoreDiffFromPrev
1:  user1  topic1       1    50                 NA                NA
2:  user1  topic1       2     0                -50               -50
3:  user1  topic1       3    50                  0                50
4:  user1  topic1       4    20                -30               -30
5:  user2  topic1       1    20                 NA                NA
6:  user2  topic1       2   100                 80                80
7:  user2  topic1       3   100                 80                 0
8:  user2  topic1       4   100                 80                 0
9:  user2  topic1       5   100                 50**              80**
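The wrong values in the last row can be traced to the inner subset. A minimal sketch, assuming the intent of `userId == userId` was to match each row's own user: inside `DT[...]` both sides of that comparison refer to the same column, so the condition is always TRUE and the inner subset returns the scores of every user for that attempt. The variable names `first_scores` and `fifth` are only for illustration.

```r
library(data.table)

DT <- data.table(userId  = c("user1", "user1", "user1", "user1",
                             "user2", "user2", "user2", "user2", "user2"),
                 topicId = "topic1",
                 attempt = c(1, 2, 3, 4, 1, 2, 3, 4, 5),
                 score   = c(50, 0, 50, 20, 20, 100, 100, 100, 100))

# `userId == userId` compares the column with itself, so this filter
# reduces to `attempt == 1` and returns one score per user.
first_scores <- DT[attempt == 1 & userId == userId & topicId == topicId, score]
print(first_scores)

# Only user2 has a fifth attempt, so one score is compared against
# two first-attempt scores and the lengths no longer line up.
fifth <- DT[attempt == 5, score]
print(length(fifth))
print(length(first_scores))
print(fifth - first_scores)
```

The first element of the last result is 100 - 50, which is consistent with the 50 shown in the last row: user2's fifth score was compared against user1's first score.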

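I am happy to learn from any suggestion, but ideally I would like some variation of the above, because the speed benefit is extreme.

Thanks in advance.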

3 Answers:

Answer 0: (score: 4)

You can use by= to compute both columns per user/topic group:

DT[order(attempt), c("scoreDiffFromFirst", "scoreDiffFromPrev") := 
  .(replace(score, 1, NA) - first(score), score - shift(score))
, by=.(userId,topicId)]

# or
DT[order(attempt), `:=`(
  scoreDiffFromFirst = replace(score, 1, NA) - first(score), 
  scoreDiffFromPrev = score - shift(score)
), by=.(userId,topicId)]

which gives

   userId topicId attempt score scoreDiffFromFirst scoreDiffFromPrev
1:  user1  topic1       1    50                 NA                NA
2:  user1  topic1       2     0                -50               -50
3:  user1  topic1       3    50                  0                50
4:  user1  topic1       4    20                -30               -30
5:  user2  topic1       1    20                 NA                NA
6:  user2  topic1       2   100                 80                80
7:  user2  topic1       3   100                 80                 0
8:  user2  topic1       4   100                 80                 0
9:  user2  topic1       5   100                 80                 0
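To see what each piece of the expression contributes, here is a small sketch on the scores of a single group (user1's, in attempt order). The variable names `diff_prev` and `diff_first` are only for illustration; `shift`, `first` and `replace` are used as in the answer.

```r
library(data.table)

score <- c(50, 0, 50, 20)   # user1's scores, ordered by attempt

# shift() lags the vector by one position and pads the front with NA,
# so the first attempt has no previous score to compare against.
diff_prev <- score - shift(score)

# replace(score, 1, NA) blanks the first attempt so its difference is NA
# rather than 0; first(score) is the group's first score.
diff_first <- replace(score, 1, NA) - first(score)

print(diff_prev)
print(diff_first)
```

With by=.(userId, topicId), data.table evaluates the same expressions once per group, so no group ever sees another group's scores.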

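For the intuition behind the syntax, you can look at the material mentioned in the startup message shown when the package is loaded:

library(data.table)

data.table 1.10.4.3
  The fastest way to learn (by data.table authors): https://www.datacamp.com/courses/data-analysis-the-data-table-way
  Documentation: ?data.table, example(data.table) and browseVignettes("data.table")
  Release notes, videos and slides: http://r-datatable.com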

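Answer 1: (score: 2)

dplyr makes this easy. first and lag give you the score at a relative position, while group_by on userId and topicId groups the rows so that the calculation is applied per group rather than over the whole table.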

library(dplyr)

DT <- DT %>%
    group_by(userId, topicId) %>%
    mutate(scoreDiffFromFirst = score - first(score),
           scoreDiffFromPrev = score - lag(score))

DT
# A tibble: 9 x 6
# Groups:   userId, topicId [2]
  userId topicId attempt score scoreDiffFromFirst scoreDiffFromPrev
  <fct>  <fct>     <dbl> <dbl>              <dbl>             <dbl>
1 user1  topic1       1.   50.                 0.               NA 
2 user1  topic1       2.    0.               -50.              -50.
3 user1  topic1       3.   50.                 0.               50.
4 user1  topic1       4.   20.               -30.              -30.
5 user2  topic1       1.   20.                 0.               NA 
6 user2  topic1       2.  100.                80.               80.
7 user2  topic1       3.  100.                80.                0.
8 user2  topic1       4.  100.                80.                0.
9 user2  topic1       5.  100.                80.                0.

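The only difference is that scoreDiffFromFirst is 0, rather than NA, on the first-attempt rows. If that is not what you want, you can replace those values directly: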

DT[DT$attempt == 1, "scoreDiffFromFirst"] <- NA
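As an alternative to patching the values afterwards, the NA can be produced inside the pipeline. This is only a sketch, not part of the original answer: it assumes a plain data frame (here called `DF`, a hypothetical name) with the question's columns, and uses dplyr's `if_else` and `row_number` to blank the first row of each group.

```r
library(dplyr)

DF <- data.frame(userId  = c("user1", "user1", "user1", "user1",
                             "user2", "user2", "user2", "user2", "user2"),
                 topicId = "topic1",
                 attempt = c(1, 2, 3, 4, 1, 2, 3, 4, 5),
                 score   = c(50, 0, 50, 20, 20, 100, 100, 100, 100))

res <- DF %>%
  group_by(userId, topicId) %>%
  # Sort within each group so "first" and "previous" follow attempt order.
  arrange(attempt, .by_group = TRUE) %>%
  mutate(
    # NA_real_ keeps both branches of if_else() the same type (double).
    scoreDiffFromFirst = if_else(row_number() == 1, NA_real_,
                                 score - first(score)),
    scoreDiffFromPrev  = score - lag(score)
  ) %>%
  ungroup()

print(res)
```

Sorting with arrange is a precaution for data that does not arrive ordered by attempt; with the sample data the rows are already in order.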

Answer 2: (score: 0)

You can also use an ifelse approach.

# Assumes rows are already ordered by attempt within each user/topic group
DT[, `:=`(
  scoreDiffFromFirst = ifelse(seq_len(.N) == 1, NA_real_, score - first(score)), 
  scoreDiffFromPrev = c(NA_real_, diff(score))), by = .(userId, topicId)]