Conditional random sampling

Time: 2019-03-11 13:16:43

Tags: r

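I need to do conditional random sampling, but I'm not sure how to go about it, so any help would be appreciated :) Suppose my data frame looks like this: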

df <- data.frame(
  newspaper = sample(c("Newspaper 1", "Newspaper 2", "Newspaper 3", "Newspaper 4"), 90, replace = TRUE),
  event = sample(c("Event 1", "Event 2", "Event 3", "Event 4", "Event 5"), 90, replace = TRUE),
  article = sample(0:1, 90, replace = TRUE)
)
df <- subset(df, article > 0)

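[article = 1 means that an article exists. In the real dataset this column would hold the title of the actual article.]

Whenever a newspaper + event combination has more than two articles, I need to pick two of them at random; otherwise I keep all of the articles. I'm not sure how to build a loop to do this. Thanks! Fred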

1 Answer:

Answer 0: (score: 1)

We can group_by newspaper and event, then if a group has more than 2 rows, randomly select 2 of them, or else select all of its rows.

library(dplyr)

df %>%
  group_by(newspaper, event) %>%
  slice(if(n() > 2) sample(1:n(), 2) else 1:n())

# newspaper   event   article
#   <fct>       <fct>     <int>
# 1 Newspaper 1 Event 1       1
# 2 Newspaper 1 Event 1       1
# 3 Newspaper 1 Event 2       1
# 4 Newspaper 1 Event 2       1
# 5 Newspaper 1 Event 3       1
# 6 Newspaper 1 Event 3       1
# 7 Newspaper 1 Event 4       1
# 8 Newspaper 1 Event 4       1
# 9 Newspaper 2 Event 1       1
#10 Newspaper 2 Event 2       1
# … with 24 more rows
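
For readers who would rather not depend on dplyr, the same if/else rule can be sketched in base R. This is only an illustration, assuming the df built in the question; the seed and the names idx_by_group, keep and result are introduced here and are not part of the original answer.

```r
set.seed(42)  # assumed seed, only to make the sketch reproducible

# Same construction as in the question
df <- data.frame(
  newspaper = sample(c("Newspaper 1", "Newspaper 2", "Newspaper 3", "Newspaper 4"), 90, replace = TRUE),
  event = sample(c("Event 1", "Event 2", "Event 3", "Event 4", "Event 5"), 90, replace = TRUE),
  article = sample(0:1, 90, replace = TRUE)
)
df <- subset(df, article > 0)

# Row positions for each newspaper/event combination; drop = TRUE skips empty combinations
idx_by_group <- split(seq_len(nrow(df)), list(df$newspaper, df$event), drop = TRUE)

# Keep 2 random positions when a group has more than 2 rows, otherwise keep them all
keep <- unlist(lapply(idx_by_group, function(i) {
  if (length(i) > 2) i[sample.int(length(i), 2)] else i
}), use.names = FALSE)

result <- df[sort(keep), ]
```

Indexing with sample.int(length(i), 2) rather than calling sample(i, 2) directly sidesteps the case where sample() treats a single number as the range 1:i.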

Or we can avoid the if condition by using pmin, which samples the minimum of 2 and the number of rows in the group.

df %>%
  group_by(newspaper, event) %>%
  slice(sample(1:n(), pmin(2, n())))
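
As a further variant, newer dplyr releases (1.0.0 and later) provide slice_sample(), which applies the same "at most 2 per group" rule without any explicit size arithmetic: when n exceeds the group size, the result is truncated to the group size. The sketch below is not part of the original answer; the toy data frame and its title column are hypothetical and only stand in for the real article titles.

```r
library(dplyr)

set.seed(1)  # assumed seed, only to make the sketch reproducible

# Hypothetical toy data: A/E1 has 3 rows, C/E1 has 2, the other groups have 1
toy <- data.frame(
  newspaper = c("A", "A", "A", "A", "B", "C", "C"),
  event     = c("E1", "E1", "E1", "E2", "E1", "E1", "E1"),
  title     = paste("Article", 1:7)
)

sampled <- toy %>%
  group_by(newspaper, event) %>%
  slice_sample(n = 2) %>%   # groups with fewer than 2 rows are kept whole
  ungroup()

# Rows kept per newspaper/event combination
count(sampled, newspaper, event)
```

Only A/E1 is larger than the cap here, so it is the only group that loses a row; which of its three rows is dropped depends on the seed.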