I have a dataset that looks like this:
d=data.frame(ID = rep(1:7,1),
Group1=c('A','C','B','C','C','A','B'),
Group2=c('B','A','C','B','B','B','D'))
ID Group1 Group2
1 A B
2 C A
3 B C
4 C B
5 C B
6 A B
7 B D
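I need to randomly sample 1 case per Group1 type. Group1 has three types: A, B, C. I need to sample 1 row from each type.
At the same time, the Group2 types must not repeat within the sample.
For example, if I sample by Group1 only: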
dsample=d %>% group_by(Group1) %>%sample_n(size=1)
then the sample looks like this:
ID Group1 Group2
1 A B
7 B D
4 C B
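In this sample's Group2 column, B is repeated. To avoid repeating a Group2 type while still sampling by Group1 type, the sampling should pick ID = 2 for group C, so that the sample looks like this: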
ID Group1 Group2
1 A B
7 B D
2 C A
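The two requirements above (one row per Group1 type, no repeated Group2 type) can be written down as a small check. This is only a sketch in base R; `is_valid_sample` is a hypothetical helper name introduced here for illustration, not something from the question.

```r
# Hypothetical helper: TRUE when `s` holds exactly one row for every
# Group1 value present in `d` and no Group2 value appears twice.
is_valid_sample <- function(s, d) {
  one_per_group <- !anyDuplicated(s$Group1) &&
    setequal(as.character(s$Group1), as.character(d$Group1))
  no_repeat <- !anyDuplicated(s$Group2)
  one_per_group && no_repeat
}

d <- data.frame(ID = 1:7,
                Group1 = c('A','C','B','C','C','A','B'),
                Group2 = c('B','A','C','B','B','B','D'))

is_valid_sample(d[d$ID %in% c(1, 7, 4), ], d)  # FALSE: B repeats in Group2
is_valid_sample(d[d$ID %in% c(1, 7, 2), ], d)  # TRUE
```

A check like this can be reused to verify the output of any of the sampling approaches.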
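Answer 0 (score: 1)
One possible approach: keep resampling until you get the desired result (or until you have failed enough times to conclude that it cannot be reached):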
library(dplyr)  # provides %>%, group_by() and sample_n()

# data
d=data.frame(ID = rep(1:7,1),
Group1=c('A','C','B','C','C','A','B'),
Group2=c('B','A','C','B','B','B','D'))
# first attempt
dsample = d %>% group_by(Group1) %>% sample_n(size=1)
# if first attempt doesn't work, try again & again (I put an upper limit at 100 runs)
i = 1
while(length(unique(dsample$Group2)) < nrow(dsample) && i < 100){
dsample = d %>% group_by(Group1) %>% sample_n(size=1)
i = i + 1
}
> dsample
# A tibble: 3 x 3
# Groups: Group1 [3]
ID Group1 Group2
<int> <fctr> <fctr>
1 1 A B
2 3 B C
3 2 C A
If the desired unique combination cannot be obtained:
# example where "A" & "B" in Group 1 both have only "A" as Group2 values
d2=data.frame(ID = rep(1:7,1),
Group1=c('A','C','B','C','C','A','B'),
Group2=c('A','A','A','C','B','A','A'))
# same code as before
d2sample = d2 %>% group_by(Group1) %>% sample_n(size=1)
i = 1
while(length(unique(d2sample$Group2)) < nrow(d2sample) && i < 100){
d2sample = d2 %>% group_by(Group1) %>% sample_n(size=1)
i = i + 1
}
# fail after 100 rounds of resampling
> d2sample
# A tibble: 3 x 3
# Groups: Group1 [3]
ID Group1 Group2
<int> <fctr> <fctr>
1 6 A A
2 7 B A
3 5 C B
> i
[1] 100
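In the failing case the loop still leaves an invalid sample in `d2sample`, and the caller has to inspect `i` to notice. A possible variant, sketched here in base R, wraps the same resampling idea in a function that returns `NULL` when no valid sample is found. The name `sample_one_per_group` and the `max_tries` parameter are assumptions made for this illustration.

```r
# Sketch: rejection sampling with an explicit failure value.
# Draw one row per Group1 value; accept only if Group2 has no repeats.
sample_one_per_group <- function(d, max_tries = 100) {
  # row indices of d, grouped by Group1 (unused factor levels dropped)
  groups <- split(seq_len(nrow(d)), d$Group1, drop = TRUE)
  for (i in seq_len(max_tries)) {
    # pick one row index from each group
    idx <- vapply(groups,
                  function(rows) rows[sample.int(length(rows), 1)],
                  integer(1))
    s <- d[idx, ]
    if (!anyDuplicated(s$Group2)) return(s)  # valid sample found
  }
  NULL  # gave up after max_tries attempts
}

d <- data.frame(ID = 1:7,
                Group1 = c('A','C','B','C','C','A','B'),
                Group2 = c('B','A','C','B','B','B','D'))
d2 <- data.frame(ID = 1:7,
                 Group1 = c('A','C','B','C','C','A','B'),
                 Group2 = c('A','A','A','C','B','A','A'))

sample_one_per_group(d)            # a 3-row data frame with distinct Group2
is.null(sample_one_per_group(d2))  # TRUE: A and B can only ever yield "A"
```

Indexing with `sample.int()` instead of `sample(rows, 1)` avoids the base R surprise where `sample()` on a single number `n` samples from `1:n`.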
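Answer 1 (score: 1)
My first thought was a loop, but then I realized we can look at the sampling from a different angle. A better solution is to sample only one row at a time, then sample the next row from a pool that contains only rows whose Group1 and Group2 both differ from those already sampled. This should be much faster: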
f <- function(){
x <- sample_n(d,1)
x <- rbind(x,sample_n(d[which(!d$Group1 %in% x$Group1 & !d$Group2 %in% x$Group2),],1))
x <- rbind(x,sample_n(d[which(!d$Group1 %in% x$Group1 & !d$Group2 %in% x$Group2),],1))
print(x)
}
f()
ID Group1 Group2
6 6 A B
2 2 C A
3 3 B C
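As long as you know that at least 2 distinct valid samples exist, this gives random, non-repeating output every time.
If anyone has a suggestion for how to repeat the step inside the function more concisely, please let me know. Overall, though, this approach seems likely to be the most efficient.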
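One way the repeated `rbind` step might be written more concisely is with a loop that runs once per Group1 value. The sketch below uses base R only; `sample_sequential` is a hypothetical name. It also handles a case the fixed three-step version does not: the pool can become empty before every group is covered (with this data, for example, after a C row with Group2 = B is drawn first), so the function returns `NULL` there and the caller retries.

```r
# Sketch: sequential sampling, one row at a time, shrinking the pool.
sample_sequential <- function(d) {
  picked <- d[0, ]                        # empty frame, same columns as d
  pool <- d
  n_groups <- length(unique(d$Group1))
  while (nrow(picked) < n_groups) {
    if (nrow(pool) == 0) return(NULL)     # dead end: nothing left to draw
    row <- pool[sample.int(nrow(pool), 1), ]
    picked <- rbind(picked, row)
    # keep only rows whose Group1 and Group2 are both still unused
    pool <- pool[!pool$Group1 %in% row$Group1 &
                 !pool$Group2 %in% row$Group2, ]
  }
  picked
}

d <- data.frame(ID = 1:7,
                Group1 = c('A','C','B','C','C','A','B'),
                Group2 = c('B','A','C','B','B','B','D'))

# Retry a bounded number of times in case a draw hits a dead end.
s <- NULL
for (attempt in 1:100) {
  s <- sample_sequential(d)
  if (!is.null(s)) break
}
s
```

Because the first row is drawn from the whole table, this scheme does not necessarily give every valid sample the same probability; whether that matters depends on the use case.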
Answer 2 (score: 0)
Try this recursive function:
d=data.frame(ID = rep(1:7,1),
Group1=c('A','C','B','C','C','A','B'),
Group2=c('B','A','C','B','B','B','D'))
dsample=d %>% group_by(Group1) %>%sample_n(size=1)
myfun <- function(ans, allowed, restricted, counter, end) {
allowed <- setdiff(allowed, ans)
allowed1 <- setdiff(allowed, restricted[counter])
if (length(allowed) == 0 || counter > end) {
if (length(ans) < end) {
ans <- c(ans, rep(NA, end-length(ans)))
}
return(ans)
} else {
counter <- counter + 1
ans <- c(ans, sample(allowed1, 1))
myfun(ans, allowed, restricted, counter, end)
}
}
replicate(10,myfun(ans=NULL, unique(d$Group2), dsample$Group1, 1, nrow(dsample)))
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
[1,] "D" "C" "B" "C" "B" "B" "B" "B" "D" "D"
[2,] "C" "D" "D" "A" "D" "C" "C" "A" "A" "C"
[3,] "A" "B" "A" "D" "A" "A" "D" "D" "B" "B"
Note that the output of each replicate is organized by column.