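I need to do conditional random sampling, but I'm not sure how to achieve it, so any help would be appreciated :) Suppose my data frame looks like this: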
df <- data.frame(newspaper = sample(c("Newspaper 1", "Newspaper 2", "Newspaper 3", "Newspaper 4"), 90, replace = TRUE), event = sample(c("Event 1", "Event 2", "Event 3", "Event 4", "Event 5"), 90, replace = TRUE), article = sample(c(0:1), 90, replace = TRUE))
df <- subset(df, article >0)
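[article = 1 means that there is an article. In the real dataset this column would hold the title of the actual article.]
When a combination of newspaper + event has more than two articles, I basically need to pick two of them at random; otherwise I want to keep all of the articles.
I'm not quite sure how to build a loop to achieve this.
Thanks!
Fred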
Answer 0 (score: 1)
We can group_by newspaper and event, and if a group has more than 2 rows, randomly select 2 of them, or else select all rows.
library(dplyr)
df %>%
group_by(newspaper, event) %>%
slice(if(n() > 2) sample(1:n(), 2) else 1:n())
# newspaper event article
# <fct> <fct> <int>
# 1 Newspaper 1 Event 1 1
# 2 Newspaper 1 Event 1 1
# 3 Newspaper 1 Event 2 1
# 4 Newspaper 1 Event 2 1
# 5 Newspaper 1 Event 3 1
# 6 Newspaper 1 Event 3 1
# 7 Newspaper 1 Event 4 1
# 8 Newspaper 1 Event 4 1
# 9 Newspaper 2 Event 1 1
#10 Newspaper 2 Event 2 1
# … with 24 more rows
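As a possible alternative, newer dplyr versions (1.0.0 and later) provide slice_sample(), which, when sampling without replacement, truncates the requested n to the group size. The sketch below assumes such a dplyr version is installed; the set.seed() call and the name sampled are added here only to make the illustration reproducible and are not part of the original answer.

```r
library(dplyr)

# Rebuild the example data from the question (seed added for reproducibility)
set.seed(123)
df <- data.frame(
  newspaper = sample(c("Newspaper 1", "Newspaper 2", "Newspaper 3", "Newspaper 4"), 90, replace = TRUE),
  event = sample(c("Event 1", "Event 2", "Event 3", "Event 4", "Event 5"), 90, replace = TRUE),
  article = sample(c(0:1), 90, replace = TRUE)
)
df <- subset(df, article > 0)

# Draw at most 2 rows per newspaper/event group;
# groups with fewer than 2 rows are kept whole
sampled <- df %>%
  group_by(newspaper, event) %>%
  slice_sample(n = 2) %>%
  ungroup()
```

Every group in sampled should then contain either 2 rows or, for smaller groups, all of its original rows.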
Or we can avoid the if condition by using pmin, which takes the minimum of 2 and the number of rows in the group as the sample size.
df %>%
group_by(newspaper, event) %>%
slice(sample(1:n(), pmin(2, n())))
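Since the question asks about building a loop, here is a minimal base R sketch of the same idea that needs no extra packages. It splits the data by newspaper and event, samples up to two rows from each piece, and binds the pieces back together. The helper sample_up_to, the names groups and result, and the set.seed() call are hypothetical additions for illustration only.

```r
# Rebuild the example data from the question (seed added for reproducibility)
set.seed(123)
df <- data.frame(
  newspaper = sample(c("Newspaper 1", "Newspaper 2", "Newspaper 3", "Newspaper 4"), 90, replace = TRUE),
  event = sample(c("Event 1", "Event 2", "Event 3", "Event 4", "Event 5"), 90, replace = TRUE),
  article = sample(c(0:1), 90, replace = TRUE)
)
df <- subset(df, article > 0)

# Hypothetical helper: keep at most k randomly chosen rows of a data frame
sample_up_to <- function(d, k = 2) {
  idx <- sample(seq_len(nrow(d)), min(k, nrow(d)))
  d[idx, , drop = FALSE]
}

# One data frame per newspaper/event combination; drop = TRUE skips empty combinations
groups <- split(df, list(df$newspaper, df$event), drop = TRUE)

# Sample within each group, then stack the pieces back into one data frame
result <- do.call(rbind, lapply(groups, sample_up_to))
rownames(result) <- NULL
```

Using seq_len(nrow(d)) with min(k, nrow(d)) plays the same role as pmin(2, n()) in the dplyr version: small groups are returned whole, only in shuffled order.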