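I want to randomly draw two IDs from a predefined set of IDs. However, using sample with dplyr::filter on a grouped data frame returns an unexpected result, namely varying sample sizes: when I run sample(x, 2) inside the filter, I sometimes get 2 IDs back and sometimes a number other than 2.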
df <- structure(list(ID = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 4L,
5L, 5L, 6L, 6L), Sub = structure(c(1L, 1L, 1L, 2L, 2L, 3L, 3L,
4L, 4L, 4L, 5L, 5L, 6L, 6L), .Label = c("a", "b", "c", "d", "f",
"g"), class = "factor")), class = "data.frame", row.names = c(NA,
-14L))
samp.vec <- c(1,2,3,4,5)
library(dplyr)
set.seed(123)
# Grouped: returns a varying number of IDs (not working)
df %>% group_by(ID)%>%filter(ID %in% sample(samp.vec,2)) %>% count(ID)
df %>% group_by(ID)%>%filter(ID %in% sample(samp.vec,2)) %>% count(ID)
set.seed(123)
# Ungrouped: returns exactly two IDs each time (working)
df %>% group_by(ID)%>% ungroup() %>% filter(ID %in% sample(samp.vec,2)) %>% count(ID)
df %>% group_by(ID)%>% ungroup() %>% filter(ID %in% sample(samp.vec,2)) %>% count(ID)
One workaround is to call ungroup() before filter. Does anyone know why this happens?
Answer 0 (score: 1)
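When the data frame is grouped, the operation is carried out on each group separately. So you do not have a single fixed pair of IDs, such as ID %in% c(2, 3); instead, sample(samp.vec, 2) is re-evaluated for every group. To make this clearer, let's leave out filter and look at what sample(samp.vec, 2) returns in each group: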
df %>%
group_by(ID) %>%
mutate(v1 = toString(sample(samp.vec, 2)))
# A tibble: 14 x 3
# Groups: ID [6]
# ID Sub v1
# <int> <fct> <chr>
# 1 1 a 2, 3
# 2 1 a 2, 3
# 3 1 a 2, 3
# 4 2 b 1, 4
# 5 2 b 1, 4
# 6 3 c 3, 1
# 7 3 c 3, 1
# 8 4 d 4, 5
# 9 4 d 4, 5
#10 4 d 4, 5
#11 5 f 4, 2
#12 5 f 4, 2
#13 6 g 2, 4
#14 6 g 2, 4
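So each group is filtered against its own pair of IDs, and a group is kept only if its ID happens to be in the pair drawn for it. That is why you sometimes end up with 2 IDs, sometimes 3, and sometimes all of them.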
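If the grouping has to stay in place, one possible way around this is to draw the pair once, before the pipeline, and reuse it inside filter. The following is only a minimal sketch of that idea, not the only fix; the name picked_ids is made up for illustration, and df is rebuilt here in a shorter form than the dput output in the question.

```r
library(dplyr)

# Same data as in the question, written out more compactly
df <- data.frame(
  ID  = c(1L, 1L, 1L, 2L, 2L, 3L, 3L, 4L, 4L, 4L, 5L, 5L, 6L, 6L),
  Sub = factor(c("a", "a", "a", "b", "b", "c", "c",
                 "d", "d", "d", "f", "f", "g", "g"))
)
samp.vec <- c(1, 2, 3, 4, 5)

set.seed(123)

# Draw the two IDs a single time, outside the grouped pipeline.
# picked_ids is a hypothetical name, not from the original post.
picked_ids <- sample(samp.vec, 2)

# Every group is now compared against the same fixed pair,
# so exactly two IDs survive the filter even though df is grouped.
res <- df %>%
  group_by(ID) %>%
  filter(ID %in% picked_ids) %>%
  count(ID)

res
```

Because sample is called only once here, the result no longer depends on how many groups there are, and ungroup() is not needed just to get a stable sample size.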