I have a dataset that contains the result of the vote r of each voter v for a particular decision d. My data looks like this:
d <- c(1,1,1,1,2,2,2,2,3,3,3,4,4,4,4)
v <- c(6,7,8,9,6,7,8,9,6,7,9,6,7,8,9)
r <- c('y','y','n','n','n','n','n','n','y','y','y','y','y','a','y')
df <- data.frame(d,v,r)
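Not every voter votes in every election. What I want to do is see whether the other voters made the same call as a particular voter (let's say v == 8). Normally I would use dplyr: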
df %>% group_by(d) %>% mutate(like8 = ifelse(r == r[v == 8], 1, 0))
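The problem is that the particular voter v == 8 does not have a recorded vote for every decision (this is different from abstaining from a vote, which is recorded). As a result, I get the following error.
Error in mutate_impl(.data, dots) : Column `like8` must be length 3 (the group size) or one, not 0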
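To see where the length-0 result comes from, here is a minimal base R sketch that reproduces what happens inside the group d == 3, where voter 8 has no recorded vote. The variable names in_d3, ref and cmp are only illustrative.

```r
d <- c(1,1,1,1,2,2,2,2,3,3,3,4,4,4,4)
v <- c(6,7,8,9,6,7,8,9,6,7,9,6,7,8,9)
r <- c('y','y','n','n','n','n','n','n','y','y','y','y','y','a','y')

in_d3 <- d == 3                      # rows belonging to decision 3
ref   <- r[in_d3][v[in_d3] == 8]     # voter 8's vote in decision 3: empty
cmp   <- ifelse(r[in_d3] == ref, 1, 0)

length(ref)   # 0, because voter 8 did not vote on decision 3
length(cmp)   # 0, comparing against a zero-length vector gives a zero-length result
```

The group has 3 rows but the expression yields 0 values, which is what the error message reports.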
What I have done so far is write a combination of ifelse and loops to work around this.
with(df,
for (i in unique(d)) {
if(8 %in% v){
for (j in r[d == i]) {
df$like8[d == i & r == j] <- ifelse(j == r[v == 8], 1, 0)
}
} else {
for (j in r[d == i]){
df$like8[d == i & r == j] <- NA
}
}
}
)
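- Please note: I have never been formally taught 'good' programming conventions, so my bracket placement may be unclear, and I am open to suggestions.
The problem is that my actual dataset has more than 500,000 observations, and this is very slow. I have seen a solution that uses data.table for the case where no values are missing, but I don't understand data.table well enough to know how to adapt it to my situation.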
Answer 0 (score: 1)
Try this:
df %>%
group_by(d) %>%
mutate(
like8 = {
if (sum(v == 8) > 0) as.numeric(r == r[v == 8])
else NA
}
)
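It wraps the test in an if/else statement that checks whether voter 8 is present in the group. The as.numeric call does the same thing as what you wrote, but should be faster. The result is 1/0.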
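For readers without dplyr, the same guard can be sketched in base R. This is only an illustration under the assumption that each voter votes at most once per decision; like_voter is a hypothetical helper name, not part of any package.

```r
d <- c(1,1,1,1,2,2,2,2,3,3,3,4,4,4,4)
v <- c(6,7,8,9,6,7,8,9,6,7,9,6,7,8,9)
r <- c('y','y','n','n','n','n','n','n','y','y','y','y','y','a','y')
df <- data.frame(d, v, r, stringsAsFactors = FALSE)

# Hypothetical helper: compare every vote in one decision with the target voter's vote.
like_voter <- function(r, v, target = 8) {
  ref <- r[v == target]
  if (length(ref) == 0) {
    # Target voter has no recorded vote in this decision
    return(rep(NA_real_, length(r)))
  }
  as.numeric(r == ref[1])
}

# Apply the helper per decision, then put the pieces back in the original row order
by_d <- Map(like_voter, split(df$r, df$d), split(df$v, df$d))
df$like8 <- unsplit(by_d, df$d)
```

Returning a vector of NA with the group's length, rather than a zero-length result, is what keeps every group the right size.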
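Answer 1 (score: 0)
The expected output is not clear. If we follow the approach in @Melissa Key's tidyverse answer, a similar approach in data.table (which the OP mentioned in the post) would be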
library(data.table)
setDT(df)[, like8 := if(8 %in% v) +(r == r[v == 8]) else NA_integer_, by = d]
df
# d v r like8
# 1: 1 6 y 0
# 2: 1 7 y 0
# 3: 1 8 n 1
# 4: 1 9 n 1
# 5: 2 6 n 1
# 6: 2 7 n 1
# 7: 2 8 n 1
# 8: 2 9 n 1
# 9: 3 6 y NA
#10: 3 7 y NA
#11: 3 9 y NA
#12: 4 6 y 0
#13: 4 7 y 0
#14: 4 8 a 1
#15: 4 9 y 0
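The unary + wrapped around the comparison in the call above is a compact way to coerce a logical vector to integer, which is why the column pairs with NA_integer_ in the else branch. A small base R sketch, with illustrative variable names:

```r
same   <- c(TRUE, FALSE, NA)
as_int <- +same          # unary plus coerces logical to integer

class(as_int)            # "integer"
```

Writing as.integer(same) gives the same values and may read more clearly to people unfamiliar with the idiom.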
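Or we can avoid the if/else by splitting the work into two steps: first find the rows of the groups where 8 %in% v, then assign only to the rows that satisfy the condition
i1 <- setDT(df)[, .I[8 %in% v], by = d]$V1
df[i1, like8 := +(r == r[v==8]), by = d]
The other values in 'like8' are filled with NA by default.
Data:
d <- c(1,1,1,1,2,2,2,2,3,3,3,4,4,4,4)
v <- c(6,7,8,9,6,7,8,9,6,7,9,6,7,8,9)
r <- c('y','y','n','n','n','n','n','n','y','y','y','y','y','a','y')
df <- data.frame(d,v,r)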
Answer 2 (score: 0)
Another solution using 2 joins:
#initialize column
DT1[, like8 := NA_integer_][
#set to 0 if voter 8 voted on decision
DT1[v==8L], like8 := 0L, on=.(d)][
#set to 1 if other voters voted the same in a particular decision
DT1[v==8L], like8 := 1L, on=.(d, r)]
Data:
library(data.table)
library(microbenchmark)
#generate dummy data
set.seed(0L)
numD <- 100L
numV <- 1e4L
DT <- unique(data.table(d=sample(numD, numD*numV, replace=TRUE),
v=sample(numV, numD*numV, replace=TRUE)))
DT[, r:=sample(c('y','n','a'), .N, replace=TRUE)]
setorder(DT, d, v, r)
#set key to speed up the subsetting to voter
setkey(DT, d, v)
DT1 <- copy(DT)
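The join idea can also be sketched in base R as a vectorised lookup with match(), shown here on the small example from the question. This is a rough equivalent, not the data.table code above, and it assumes voter 8 votes at most once per decision; ref and r8 are illustrative names.

```r
d <- c(1,1,1,1,2,2,2,2,3,3,3,4,4,4,4)
v <- c(6,7,8,9,6,7,8,9,6,7,9,6,7,8,9)
r <- c('y','y','n','n','n','n','n','n','y','y','y','y','y','a','y')
df <- data.frame(d, v, r, stringsAsFactors = FALSE)

# Rows holding voter 8's recorded votes, one per decision
ref <- df[df$v == 8, ]

# Look up voter 8's vote for each row's decision; NA where voter 8 did not vote
r8 <- ref$r[match(df$d, ref$d)]

# Comparing against NA yields NA, so unmatched decisions stay NA
df$like8 <- as.integer(df$r == r8)
```

Because there is no per-group loop, this kind of lookup should scale reasonably to several hundred thousand rows, although it has not been benchmarked here.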