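Matching values to a specific value, by group

Time: 2018-05-16 03:24:10

Tags: r

I have a dataset containing the vote r cast by each voter v on each decision d. My data look like this: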

d <- c(1,1,1,1,2,2,2,2,3,3,3,4,4,4,4)
v <- c(6,7,8,9,6,7,8,9,6,7,9,6,7,8,9)
r <- c('y','y','n','n','n','n','n','n','y','y','y','y','y','a','y')
df <- data.frame(d,v,r)

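Not every voter votes on every decision. What I want to do is see whether the other voters made the same call as one specific voter (let's say v == 8). Normally I would use dplyr: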

df %>% group_by(d) %>% mutate(like8 = ifelse(r == r[v == 8], 1, 0))

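The problem I run into is that the specific voter v == 8 does not have a recorded vote on every decision (this is different from an abstention, which is recorded). As a result, I get the following error.

Error in mutate_impl(.data, dots) : Column `like8` must be length 3 (the group size) or one, not 0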
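The error can be traced to an empty subset. The following base R sketch, using the example data with the votes quoted as character values, shows that for decision 3 the expression r[v == 8] has length 0, so the comparison also has length 0 instead of the group size. The names in3 and vote8 are introduced here for illustration only.

```r
# Example data from the question (votes quoted so they are character values)
d <- c(1,1,1,1,2,2,2,2,3,3,3,4,4,4,4)
v <- c(6,7,8,9,6,7,8,9,6,7,9,6,7,8,9)
r <- c('y','y','n','n','n','n','n','n','y','y','y','y','y','a','y')

# Decision 3 is the group in which voter 8 has no recorded vote
in3 <- d == 3
vote8 <- r[in3 & v == 8]

length(vote8)            # 0: the subset is empty
length(r[in3] == vote8)  # 0: comparing against an empty vector gives an empty result
```

Because the result for that group has length 0 rather than 3 or 1, mutate cannot recycle it to the group size.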

What I have done so far is write a combination of ifelse and loops to work around this.

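df$like8 <- NA
for (i in unique(df$d)) {
    # rows belonging to decision i
    rows <- df$d == i
    if (8 %in% df$v[rows]) {
        # voter 8's vote on decision i
        vote8 <- df$r[rows & df$v == 8]
        for (j in unique(df$r[rows])) {
            df$like8[rows & df$r == j] <- ifelse(j == vote8, 1, 0)
        }
    }
    # decisions without a vote from voter 8 keep the initial NA
}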

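- Please note: I have never been formally taught 'good' programming conventions, so my bracket placement may be unclear, and I am open to suggestions.

The problem is that my actual dataset has more than 500,000 observations, and this is very slow. I have seen a solution using data.table that works when no values are missing, but I do not understand data.table well enough to know how to adapt it to my situation.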

3 Answers:

Answer 0: (score: 1)

Try this:

df %>% 
    group_by(d) %>% 
    mutate(
      like8 = {
        if (sum(v == 8) > 0) as.numeric(r == r[v == 8])
        else NA
      }
    )

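It wraps the test in an if/else statement that checks whether voter 8 is present in the group. The as.numeric call does the same thing as what you wrote, but should be faster: it converts the logical comparison directly to 1/0.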
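The same per-group logic can be sketched in base R as a small function. The name like_voter and its target argument are hypothetical and are not part of the answer; the sketch assumes each voter has at most one vote per decision.

```r
# Hypothetical helper: compare one group's votes with the target voter's vote.
# v and r are the voter ids and votes of a single decision.
like_voter <- function(v, r, target = 8) {
  # No recorded vote from the target voter: return NA for the whole group
  if (!any(v == target)) return(rep(NA_integer_, length(r)))
  # Otherwise 1 where the vote matches the target voter's vote, else 0
  as.integer(r == r[v == target])
}

res_with    <- like_voter(c(6, 7, 8, 9), c('y', 'y', 'n', 'n'))  # 0 0 1 1
res_without <- like_voter(c(6, 7, 9), c('y', 'y', 'y'))          # NA NA NA
```

Under these assumptions it could be called once per group, for example as mutate(like8 = like_voter(v, r)) after group_by(d).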

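Answer 1: (score: 0)

The expected output is not clear. If we follow the approach in @Melissa Key's tidyverse answer, a similar approach in data.table (which the OP mentioned in the post) would be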

library(data.table)
setDT(df)[, like8 := if(8 %in% v) +(r == r[v == 8]) else NA_integer_, by = d]
df
#    d v r like8
# 1: 1 6 y     0
# 2: 1 7 y     0
# 3: 1 8 n     1
# 4: 1 9 n     1
# 5: 2 6 n     1
# 6: 2 7 n     1
# 7: 2 8 n     1
# 8: 2 9 n     1
# 9: 3 6 y    NA
#10: 3 7 y    NA
#11: 3 9 y    NA
#12: 4 6 y     0
#13: 4 7 y     0
#14: 4 8 a     1
#15: 4 9 y     0

Or we avoid the if/else by splitting it into two steps and assigning only to the groups that satisfy the condition (8 %in% v):

i1 <- setDT(df)[, .I[8 %in% v], by = d]$V1
df[i1, like8 := +(r == r[v==8]), by = d]

The other values in 'like8' are filled with NA by default.

Data

d <- c(1,1,1,1,2,2,2,2,3,3,3,4,4,4,4)
v <- c(6,7,8,9,6,7,8,9,6,7,9,6,7,8,9)
r <- c('y','y','n','n','n','n','n','n','y','y','y','y','y','a','y')
df <- data.frame(d,v,r)

Answer 2: (score: 0)

Another solution using 2 joins:

#initialize column
DT1[, like8 := NA_integer_][
    #set to 0 if voter 8 voted on decision
    DT1[v==8L], like8 := 0L, on=.(d)][
        #set to 1 if other voters voted the same in a particular decision
        DT1[v==8L], like8 := 1L, on=.(d, r)]
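To see what the two joins do, here is a sketch that applies the same three steps to the small example from the question instead of the generated benchmark data. The intermediate table voter8 is a name introduced here for readability; the chained form above does not need it.

```r
library(data.table)

# Small example data from the question
DT1 <- data.table(
  d = c(1,1,1,1,2,2,2,2,3,3,3,4,4,4,4),
  v = c(6,7,8,9,6,7,8,9,6,7,9,6,7,8,9),
  r = c('y','y','n','n','n','n','n','n','y','y','y','y','y','a','y')
)

# One row per decision on which voter 8 has a recorded vote
voter8 <- DT1[v == 8]

# Step 1: start with NA everywhere
DT1[, like8 := NA_integer_]
# Step 2: update join on d, so every row of a decision voter 8 voted on gets 0
DT1[voter8, like8 := 0L, on = .(d)]
# Step 3: update join on d and r, so rows with the same vote as voter 8 get 1
DT1[voter8, like8 := 1L, on = .(d, r)]

print(DT1)
```

On this data the like8 column matches the output shown in the previous answer: decision 3 stays NA because it never appears in voter8.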

Data:

library(data.table)
library(microbenchmark)

#generate dummy data
set.seed(0L)
numD <- 100L
numV <- 1e4L
DT <- unique(data.table(d=sample(numD, numD*numV, replace=TRUE),
    v=sample(numV, numD*numV, replace=TRUE)))
DT[, r:=sample(c('y','n','a'), .N, replace=TRUE)]
setorder(DT, d, v, r)

#set key to speed up the subsetting to voter
setkey(DT, d, v)

DT1 <- copy(DT)