Suppose I have the following minimal data:
id choice relevant
1 0 0
1 0 1
1 1 1
2 1 0
2 0 1
2 0 1
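For each value of id, I want to compute the percentage of rows in which choice equals 1, counting only rows where relevant is 1, and add the result as a column to my original data frame. Specifically, I want: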
id choice relevant pct1
1 0 0 50
1 0 1 50
1 1 1 50
2 1 0 0
2 0 1 0
2 0 1 0
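*Updated to handle the subset. The original approach (which did not handle the subset) was great, and I encourage people to keep their original answers for the more general case. However, I tried extending the original solution from @DavideBottoli as follows: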
#let df stand in for the data frame above
x = df %>% group_by(id, relevant) %>%
mutate(pct1 = 100*sum(relevant==1 & choice==1)/n())
and got this:
id choice relevant pct1
1 0 0 0
1 0 1 50
1 1 1 50
2 1 0 0
2 0 1 0
2 0 1 0
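**Further update: the problem is more general than the case where choice is an integer. A good answer should assume that choice is a categorical variable, so {{1} calls ...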
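***After the further update: at the time of writing, only one solution attempts to handle the subset problem, and for unknown reasons it produced a vector whose length differed from that of the data in my actual problem. I ended up writing a quick for loop in Python that simply wrote the values to a spreadsheet.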
Answer 0 (score: 3)
A data.table solution.
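As a minimal sketch of what a data.table approach to the question might look like (not necessarily the answerer's code), a grouped := assignment can add the column by reference. The object name dt and the inline copy of the question's data are assumptions made for illustration.

```r
library(data.table)

# Hypothetical copy of the question's data
dt <- data.table(id       = c(1, 1, 1, 2, 2, 2),
                 choice   = c(0, 0, 1, 1, 0, 0),
                 relevant = c(0, 1, 1, 0, 1, 1))

# Within each id: share of relevant rows whose choice equals 1,
# written back to every row of that id (modifies dt in place)
dt[, pct1 := 100 * sum(choice == 1 & relevant == 1) / sum(relevant == 1),
   by = id]
```

Because the comparison choice == 1 is the only place the value is used, the same expression should also work for a character or factor choice by comparing against the category of interest instead.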
Answer 1 (score: 2)
library(dplyr)
df <- tibble(id = c(1,1,2,2),
choice = c(0,1,0,0))
output <- df %>%
group_by(id) %>%
mutate(pct1 = 100 * sum(choice == 1)/n())
Sorry for the delay, but if you want to update the first formula to handle the subset, you can use the following:
library(dplyr)
df <- tibble(id = c(1,1,1,2,2,2),
choice = c(0,0,1,1,0,0),
relevant = c(0,1,1,0,1,1))
output <- df %>%
group_by(id) %>%
mutate(pct1 = 100 * sum(choice == 1 & relevant == 1)/sum(relevant == 1 ))
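The formula above divides by sum(relevant == 1), which gives NaN for an id with no relevant rows, and the question's later update asks for a categorical choice. The sketch below is one possible way to cover both cases; the data, the variable name target, and the decision to return NA for an id without relevant rows are assumptions for illustration.

```r
library(dplyr)

# Hypothetical data: choice is categorical, and id 3 has no relevant rows
df <- tibble(id       = c(1, 1, 1, 2, 2, 2, 3),
             choice   = c("B", "B", "A", "A", "B", "B", "A"),
             relevant = c(0, 1, 1, 0, 1, 1, 0))
target <- "A"  # hypothetical name for the category of interest

output <- df %>%
  group_by(id) %>%
  mutate(pct1 = if (any(relevant == 1)) {
    # share of relevant rows in this id whose choice is the target
    100 * mean(choice[relevant == 1] == target)
  } else {
    NA_real_  # assumption: leave the percentage undefined
  }) %>%
  ungroup()
```

Replacing NA_real_ with 0 would instead match the convention used in another answer in this thread, which reports 0 for such ids.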
Answer 2 (score: 2)
Using ave:
dt <- data.frame(id = c(1,1,2,2),
choice = c(0,1,0,0))
within(dt, pct <- ave(choice, id, FUN = mean))
# id choice pct
# 1 1 0 0.5
# 2 1 1 0.5
# 3 2 0 0.0
# 4 2 0 0.0
Edit, to take the update to the question into account.
dt <- data.frame(id = c(1,1,1,2,2,2,3,3),
choice = c(0,0,"A","A","B",0,0,0), relevant = c(0,1,1,0,1,1,0,0))
chosen_value = "A"
# use by() to apply a custom function to the data frame split by id
within(dt, pct <- unlist(by(dt, dt$id, function(x)
  rep(
    if (sum(x$relevant == 1) == 0) 0 else {
      mean((x$choice == chosen_value)[x$relevant == 1])
    },
    nrow(x))
)))
# id choice relevant pct
# 1 1 0 0 0.5
# 2 1 0 1 0.5
# 3 1 A 1 0.5
# 4 2 A 0 0.0
# 5 2 B 1 0.0
# 6 2 0 1 0.0
# 7 3 0 0 0.0
# 8 3 0 0 0.0
Answer 3 (score: 1)
In base R:
df$pct <- 100*tapply(df$choice, df$id, mean)[df$id]
For the subset with relevant == 1:
df$pct <- 100*tapply(df$choice, df[, c('id', 'relevant')], mean)[df$id, "1"]
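Both lines above index the tapply result with df$id, which is a positional lookup and so relies on the ids being 1, 2, 3, and so on; they also take the mean of choice directly, which requires a numeric choice. The sketch below is one possible variation that looks ids up by name and compares choice to a category; the data and the variable name target are assumptions for illustration.

```r
# Hypothetical data: ids are not 1..n and choice is categorical
df <- data.frame(id       = c(10, 10, 10, 20, 20, 20),
                 choice   = c("B", "B", "A", "A", "B", "B"),
                 relevant = c(0, 1, 1, 0, 1, 1))
target <- "A"  # hypothetical name for the category of interest

# Per-id percentage among relevant rows; the result is named by id
rel <- df$relevant == 1
pct_by_id <- 100 * tapply(df$choice[rel] == target, df$id[rel], mean)

# Look up each row's id by name rather than by position
df$pct <- as.vector(pct_by_id[as.character(df$id)])
```

An id with no relevant rows has no entry in pct_by_id, so its rows would receive NA under this sketch.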
Answer 4 (score: 0)
For your example, this code does the job:
library(dplyr)
df <-data.frame(id = c(1,1,2,2),
choice = c(0,1,0,0))
df %>% group_by(id) %>%
mutate(percent=mean(choice)*100)
# # A tibble: 4 x 3
# # Groups: id [2]
#      id choice percent
#   <dbl>  <dbl>   <dbl>
# 1  1.00   0       50.0
# 2  1.00   1.00    50.0
# 3  2.00   0        0
# 4  2.00   0        0
Consider using mutate with group_by rather than summarise.
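To illustrate the difference on the same example data: mutate keeps one row per original row, which is what adding a column to the original data frame requires, whereas summarise collapses each group to a single row. The object names per_row and per_group are chosen here for illustration.

```r
library(dplyr)

df <- data.frame(id     = c(1, 1, 2, 2),
                 choice = c(0, 1, 0, 0))

# mutate: the group-level value is repeated on every row of the group
per_row <- df %>%
  group_by(id) %>%
  mutate(percent = mean(choice) * 100)

# summarise: one row per id, so the original rows are lost
per_group <- df %>%
  group_by(id) %>%
  summarise(percent = mean(choice) * 100)

nrow(per_row)    # 4
nrow(per_group)  # 2
```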
Answer 5 (score: 0)
A dplyr solution:
df %>%
filter(relevant==1) %>%
group_by(id) %>%
summarize(pct1 = 100*sum(choice==1)/n()) %>%
right_join(df, by = "id")
# # A tibble: 6 x 4
# id pct1 choice relevant
# <dbl> <dbl> <dbl> <dbl>
# 1 1 50 0 0
# 2 1 50 0 1
# 3 1 50 1 1
# 4 2 0 1 0
# 5 2 0 0 1
# 6 2 0 0 1