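The goal is to merge df2 into df1, where the key values in df2 are not unique but form groups in which each row carries a probability value. A simple example: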
df1
# key
#1 A
#2 B
#3 C
#4 C
#5 A
#6 A
#7 D
df2
# key code prob
#1 A 1 0.75
#2 A 2 0.25
#3 B 1 0.95
#4 B 2 0.05
#5 C 1 0.20
#6 C 2 0.25
#7 C 3 0.55
#8 D 1 0.33
#9 D 2 0.33
#10 D 3 0.33
The expected result would look something like the following, with code assigned according to the probabilities in df2:
# key code
#1 A 1
#2 B 1
#3 C 3
#4 C 3
#5 A 2
#6 A 1
#7 D 2
Data:
df1 <- structure(list(key = structure(c(1L, 2L, 3L, 3L, 1L, 1L, 4L), .Label = c("A",
"B", "C", "D"), class = "factor")), .Names = "key", class = "data.frame", row.names = c(NA,
-7L))
df2 <- structure(list(key = structure(c(1L, 1L, 2L, 2L, 3L, 3L, 3L,
4L, 4L, 4L), .Label = c("A", "B", "C", "D"), class = "factor"),
code = c(1L, 2L, 1L, 2L, 1L, 2L, 3L, 1L, 2L, 3L), prob = c(0.75,
0.25, 0.95, 0.05, 0.2, 0.25, 0.55, 0.33, 0.33, 0.33)), .Names = c("key",
"code", "prob"), class = "data.frame", row.names = c(NA, -10L
))
Answer 0 (score: 2)
I'm pretty sure you just want:
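library(dplyr)
# Note: the draw happens once per key, so every df1 row that shares a key
# receives the same code.
df2 %>%
  group_by(key) %>%
  sample_n(1, weight = prob) %>%
  right_join(df1, by = "key")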
Answer 1 (score: 2)
Use apply over each row of df1 and, for the current value of key, sample from the codes available in df2, weighted by prob:
df1$code = apply(df1, 1, function(x) {
sample(df2$code[df2$key==x["key"]], 1, prob=df2$prob[df2$key==x["key"]])
})
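A possible variation on the row-wise idea above, as a minimal base R sketch. The helper name draw_code is hypothetical and not part of the original answer. It samples a row index instead of the code itself, because sample(x, 1) treats a single number x as the range 1:x, which would give wrong results for a key that has only one code. Keys missing from df2 are assumed to map to NA.

```r
# Hypothetical helper (not from the original answers): draw one code for a
# single key, weighted by prob.
draw_code <- function(k, lookup) {
  rows <- which(lookup$key == k)
  if (length(rows) == 0L) return(NA_integer_)  # assumption: unknown key -> NA
  # Sample a position among the matching rows, then map it back to a code.
  picked <- rows[sample.int(length(rows), 1L, prob = lookup$prob[rows])]
  lookup$code[picked]
}

# Same data as in the question, rebuilt so the sketch runs on its own.
df1 <- data.frame(key = c("A", "B", "C", "C", "A", "A", "D"))
df2 <- data.frame(
  key  = rep(c("A", "B", "C", "D"), times = c(2, 2, 3, 3)),
  code = c(1L, 2L, 1L, 2L, 1L, 2L, 3L, 1L, 2L, 3L),
  prob = c(0.75, 0.25, 0.95, 0.05, 0.20, 0.25, 0.55, 0.33, 0.33, 0.33)
)

set.seed(42)  # only to make the draw reproducible
df1$code <- vapply(as.character(df1$key), draw_code, integer(1),
                   lookup = df2, USE.NAMES = FALSE)
df1
```

Each row of df1 gets its own independent draw, so repeated keys can end up with different codes.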
Answer 2 (score: 1)
I think this is what you want.
library(dplyr)
df1$id <- seq(nrow(df1))
df3 <- merge(df1, df2, by = "key", all.x = TRUE)
df3 %>% group_by(id) %>% sample_n(1, weight = prob)
I created an id variable for df1 and merged df1 with all possible codes from df2. dplyr::sample_n then performs a weighted draw for each id.
A typical result would be
Source: local data frame [7 x 4]
Groups: id
key id code prob
1 A 1 1 0.75
2 B 2 1 0.95
3 C 3 3 0.55
4 C 4 3 0.55
5 A 5 1 0.75
6 A 6 1 0.75
7 D 7 1 0.33
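For readers on dplyr 1.0.0 or later, where sample_n is superseded, the same merge-then-sample idea could be sketched with slice_sample and its weight_by argument. This is an adaptation under that version assumption, not the answerer's code; the result name and the final arrange/select steps are added only for illustration.

```r
library(dplyr)

# Same data as in the question, rebuilt so the sketch runs on its own.
df1 <- data.frame(key = c("A", "B", "C", "C", "A", "A", "D"))
df2 <- data.frame(
  key  = rep(c("A", "B", "C", "D"), times = c(2, 2, 3, 3)),
  code = c(1L, 2L, 1L, 2L, 1L, 2L, 3L, 1L, 2L, 3L),
  prob = c(0.75, 0.25, 0.95, 0.05, 0.20, 0.25, 0.55, 0.33, 0.33, 0.33)
)

df1$id <- seq_len(nrow(df1))  # one id per df1 row, so each row is its own group

set.seed(1)  # only to make the draw reproducible
result <- merge(df1, df2, by = "key", all.x = TRUE) %>%
  group_by(id) %>%
  slice_sample(n = 1, weight_by = prob) %>%  # one weighted draw per df1 row
  ungroup() %>%
  arrange(id) %>%                            # restore the original df1 order
  select(key, code)
result
```

Grouping by id rather than by key is what gives each df1 row an independent draw.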