Question

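I have a data frame with the following columns: date, outcome (no or yes) and group (one or two):

set.seed(36)
Data <- data.frame(
  date = sample((as.Date(as.Date("2011-12-30"):as.Date("2012-01-04"), origin="1970-01-01")), 1000, replace = TRUE),
  group = sample(c("one", "two"), 1000, replace = TRUE),
  outcome = sample(c("no", "yes"), 1000, replace = TRUE))

I now cross outcome and group via

mytable <- table(Data$outcome, Data$group)
mytable

giving me a result like

      one two
  no  260 271
  yes 235 234

I now want to randomly drop rows from one of the cells (or sample from it; I am not sure which way is better, although I believe both should have the same effect), say the top-right cell (group two and outcome no), and retain 10% of the data.

Can someone point me in the right direction as to which commands and conditions I have to use?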

Answer 1

You could do something like this:

idx <- which(Data$group=="one" & Data$outcome=="no") #identify relevant group

Data2 <- Data[-sample(idx, 0.9*length(idx), replace=FALSE),] #sample 90% to remove

table(Data2$outcome, Data2$group)
      one two
  no   28 260
  yes 234 235

table(Data$outcome, Data$group)         
      one two
  no  271 260
  yes 234 235

Strangely, with your set.seed value I get the columns the opposite way round from yours!
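The approach above can be made a little more defensive. The following is a minimal sketch, not the answerer's code: it builds a simplified copy of the question's data (the date column is left out, so the counts will not match the tables shown here), rounds the number of rows to drop explicitly, and samples positions inside idx rather than idx itself. The names n_drop and drop are illustrative.

```r
# Simplified stand-in for the question's data (no date column), so the
# random draws and cell counts differ from the tables in the thread.
set.seed(36)
Data <- data.frame(
  group   = sample(c("one", "two"), 1000, replace = TRUE),
  outcome = sample(c("no", "yes"), 1000, replace = TRUE)
)

idx    <- which(Data$group == "one" & Data$outcome == "no")  # rows in the cell
n_drop <- floor(0.9 * length(idx))                           # whole number of rows to remove

# Sample positions within idx: sample(idx, ...) would treat a single
# row index k as "sample from 1:k" if the cell held only one row.
drop <- idx[sample.int(length(idx), n_drop)]

# Logical indexing instead of Data[-drop, ]: a negative index built from
# an empty vector selects zero rows rather than all of them.
Data2 <- Data[!(seq_len(nrow(Data)) %in% drop), ]

table(Data2$outcome, Data2$group)
```

For the data in this thread neither edge case arises, so the original two-liner gives the same kind of result; the guards only matter for very small or empty cells.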

Answer 2

Here is a tidyverse solution:

library(tidyverse)
Data2 <-
  Data %>%
  split(group_indices(.,group,outcome)) %>%
  purrr::modify_if(~first(.$group)=="two" & first(.$outcome)=="no",
                   ~slice(.,sample(nrow(.),round(nrow(.)/10)))) %>%
  bind_rows



table(Data2$outcome, Data2$group)
# one two
# no  271  26
# yes 234 235
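A shorter variant of the same idea, offered as a sketch under assumptions rather than as a replacement for the code above: it assumes dplyr 1.0.0 or later (for slice_sample()), uses a simplified copy of the question's data without the date column, and splits the frame into the target cell and everything else before recombining.

```r
library(dplyr)

# Simplified stand-in for the question's data (no date column).
set.seed(36)
Data <- data.frame(
  group   = sample(c("one", "two"), 1000, replace = TRUE),
  outcome = sample(c("no", "yes"), 1000, replace = TRUE)
)

Data2 <- bind_rows(
  # every row outside the target cell is kept unchanged
  Data %>% filter(!(group == "two" & outcome == "no")),
  # roughly 10% of the target cell is kept, chosen at random
  Data %>% filter(group == "two", outcome == "no") %>% slice_sample(prop = 0.1)
)

table(Data2$outcome, Data2$group)
```

Note that the sampled rows end up at the bottom of Data2, so the original row order is not preserved.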

Answer 3

Writing a function to make it more general:

get_reduced_data <- function(Data, group, outcome) {
   #Get indices of the subset which satisfies our condition
   indx = which(Data$group == group & Data$outcome == outcome)
   #Select only 10% from the subset and keep remaining rows as it is
   Data[c(sample(indx, length(indx) * 0.1), setdiff(seq(nrow(Data)), indx)), ]
}

df = get_reduced_data(Data, "two", "no")

table(df$outcome, df$group)

#      one two
#  no  271  26
#  yes 234 235

df = get_reduced_data(Data, "one", "no")

table(df$outcome, df$group)

#      one two
#  no   27 260
#  yes 234 235
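If the retained share should be adjustable too, the function can take it as an argument. The sketch below is an illustration only: reduce_cell and frac are hypothetical names, and the data are a simplified copy of the question's (no date column). It also shows why "keep 10%" and "drop 90%" can differ by one row: here the number kept is rounded down, whereas Answer 1 rounds down the number dropped, which is why that answer leaves 28 rows in the one/no cell and this answer leaves 27.

```r
# Hypothetical generalisation: keep a fraction `frac` of one cell.
reduce_cell <- function(Data, group, outcome, frac = 0.1) {
  stopifnot(frac >= 0, frac <= 1)
  indx   <- which(Data$group == group & Data$outcome == outcome)
  n_keep <- floor(length(indx) * frac)             # rows of the cell to retain
  keep   <- indx[sample.int(length(indx), n_keep)]
  other  <- setdiff(seq_len(nrow(Data)), indx)     # rows outside the cell
  Data[sort(c(keep, other)), ]                     # sort() preserves row order
}

# Simplified stand-in for the question's data (no date column).
set.seed(36)
Data <- data.frame(
  group   = sample(c("one", "two"), 1000, replace = TRUE),
  outcome = sample(c("no", "yes"), 1000, replace = TRUE)
)

df <- reduce_cell(Data, "two", "no", frac = 0.25)
table(df$outcome, df$group)
```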

Randomly drop rows subject to two conditions (crosstab)
