Assigning a variable to a group based on a filter on one event, using data.table

Time: 2017-06-06 06:06:06

Tags: r data.table

I need to create new columns based on whether a user has performed an action at least once.

 USER ACTION
 A    Attack
 A    Jump
 B    Attack
 B    Die
 C    Attack
 C    Die
 C    Jump
 D    Die
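
To make the example reproducible, the sample above can be built as a data.table. This is a minimal sketch; the object name `df` is chosen to match the code later in the question.

```r
library(data.table)

# Sample log copied from the table above: one row per user action
df <- data.table(
  USER   = c("A", "A", "B", "B", "C", "C", "C", "D"),
  ACTION = c("Attack", "Jump", "Attack", "Die", "Attack", "Die", "Jump", "Die")
)
```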

The desired result would be:

 ## If ACTION == something
 ## Create new column and apply '1' for that user for all rows 

 USER ACTION HAS_DIED HAS_JUMPED HAS_ATTACKED
 A    Attack    0         1            1
 A    Jump      0         1            1
 B    Attack    1         0            1
 B    Die       1         0            1
 C    Attack    1         1            1
 C    Die       1         1            1
 C    Jump      1         1            1
 D    Die       1         0            0

so that I can eventually get a unique list of USERs:

 USER  HAS_DIED HAS_JUMPED HAS_ATTACKED
 A       0         1            1
 B       1         0            1
 C       1         1            1
 D       1         0            0

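I have been filtering and merging for each feature with the approach below, but this becomes tedious for a large number of features. For example: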

 ## get unique list of users that have died 
 died_df <- unique(df[ACTION == "Die", .(USER, HAS_DIED = 1)])

 ## merge back onto the full log and change NAs to 0s 
 merged_df <- died_df[df, on = "USER"]
 merged_df[is.na(HAS_DIED), HAS_DIED := 0]

I am looking for a faster, more efficient way to achieve this!

2 answers:

Answer 0: (score: 2)

Since the initial object is a data.table, we can use dcast from data.table, which is very efficient:

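library(data.table)
# df1 holds the question's data; length counts the rows per USER/ACTION pair,
# which is 0 or 1 for the sample data
setnames(dcast(setDT(df1), USER ~ ACTION, length), -1, 
         c('HAS_ATTACKED', 'HAS_DIED', 'HAS_JUMPED'))[]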
#    USER HAS_ATTACKED HAS_DIED HAS_JUMPED
#1:    A            1        0          1
#2:    B            1        1          0
#3:    C            1        1          1
#4:    D            0        1          0
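
The result above has one row per user. If the flags should also be attached to every log row, as in the first desired table in the question, a grouped `:=` is one option. The following is a minimal sketch, assuming the question's data is in `df`; the lookup vector `flags` and the result name `users` are introduced here for illustration only.

```r
library(data.table)

df <- data.table(
  USER   = c("A", "A", "B", "B", "C", "C", "C", "D"),
  ACTION = c("Attack", "Jump", "Attack", "Die", "Attack", "Die", "Jump", "Die")
)

# Hypothetical lookup: new column name -> ACTION value that sets the flag
flags <- c(HAS_DIED = "Die", HAS_JUMPED = "Jump", HAS_ATTACKED = "Attack")

# For each flag, write 1 on every row of a user who has at least one
# matching ACTION, and 0 on every row of a user who has none
for (col in names(flags)) {
  df[, (col) := as.integer(any(ACTION == flags[[col]])), by = USER]
}

# One row per user: keep USER plus the flag columns and de-duplicate
users <- unique(df[, c("USER", names(flags)), with = FALSE])
```

Because `any()` is used instead of a count, a user who performs the same action several times still gets a flag of 1, and adding another feature only requires another entry in `flags`.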

Answer 1: (score: 1)

Using dplyr and tidyr:

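library(dplyr)
library(tidyr)

df %>% 
  mutate(n = 1) %>%                 # indicator value for each logged action
  spread(ACTION, n, fill = 0) %>%   # one column per ACTION, 0 where absent
  setNames(c('USER', 'HAS_ATTACKED', 'HAS_DIED', 'HAS_JUMPED'))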

#   USER HAS_ATTACKED HAS_DIED HAS_JUMPED
# 1    A            1        0          1
# 2    B            1        1          0
# 3    C            1        1          1
# 4    D            0        1          0
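
Note that `spread()` stops with an error when the same USER/ACTION pair appears more than once, which is likely in real logs. One possible variant de-duplicates first and uses `pivot_wider()`, the newer tidyr interface. This is a sketch, assuming tidyr 1.1 or later and the question's data in `df`; the result name `users` is introduced here for illustration.

```r
library(dplyr)
library(tidyr)

df <- data.frame(
  USER   = c("A", "A", "B", "B", "C", "C", "C", "D"),
  ACTION = c("Attack", "Jump", "Attack", "Die", "Attack", "Die", "Jump", "Die")
)

users <- df %>%
  distinct(USER, ACTION) %>%   # keep one row per user/action pair
  mutate(n = 1L) %>%           # indicator value to spread into columns
  pivot_wider(names_from = ACTION, values_from = n, values_fill = 0L) %>%
  # rename by name so the result does not depend on column order
  rename(HAS_ATTACKED = Attack, HAS_DIED = Die, HAS_JUMPED = Jump)
```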