Question

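I want to draw a sample of 5 random rows 1,000 times and collect the results in a data frame. I have a problem with replace = FALSE, and I would like to know where replace = TRUE should go instead.

I have a dataset of 5,000 rows that looks like this (simplified):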

     Fund.ID Vintage Type Region.Focus Net.Multiple Size
[1,] 4716    2003    2    US           1.02         Small
[2,] 2237    1998    25   Europe       0.03         Medium
[3,] 1110    1992    2    Europe       1.84         Medium
[4,] 12122   1997    25   Asia         2.04         Large
[5,] 5721    2006    25   US           0.86         Mega
[6,] 730     1998    2    Europe       0.97         Small

Here is my function. It starts from one random row and applies a constraint to the 5 rows that are drawn:

simulate <- function(inv.period) {
  start <- sample_n(dataset, 1, replace=TRUE) #draw random first fund
  t <- start$Vintage:(start$Vintage + inv.period) #define investment period contingent on first fund
  fof <- dataset[sample(which(dataset$Vintage %in% t), 5, replace = FALSE), ] #include constraint, 5 funds in portfolio
}

#replicate this function 1,000 times
#and give out as a data frame with portfolios classified
library(plyr)
library(dplyr)
fof.5 <- rdply(1000, simulate(4))
rename(fof.5, FoF.ID = .n)

If I use replace = FALSE in the simulate function (in the line after fof <-), I get this error:

Error in sample.int(length(x), size, replace, prob) : 
  cannot take a sample larger than the population when 'replace = FALSE'

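If I set replace = TRUE, the whole expression works. However, that is not correct, because the same row can then be drawn twice within one sample, which I do not want.

Is there a way to use replace = FALSE when drawing the rows, but replace = TRUE across the whole dataset? In other words: a row may be drawn only once within a sample, but it may be drawn again in another sample.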

Answer 1

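I would suggest taking out the dplyr parts; they are not necessary. The error occurs whenever fewer than 5 funds fall inside the vintage window, because sample() cannot draw 5 distinct values from a shorter vector. So, second, store the matching rows in a variable called matches, then sample either the length of that vector or 5, whichever is smaller. Each call to simulate draws independently, so a row is never repeated within a sample but can still appear again in another sample. Finally, I would use data.table::rbindlist, which has an argument to create an index indicating which draw each row came from. The output is a data.table; if you are not familiar with that, you can convert it back to a data.frame at the end with as.data.frame(rbindlist(....)):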

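library(data.table)

simulate <- function(inv.period) {
  # draw one random first fund
  start <- dataset[sample(nrow(dataset), 1, replace=TRUE),]
  # vintages inside the investment period of the first fund
  t <- start$Vintage:(start$Vintage + inv.period)
  # row indices of all funds whose vintage falls in that period
  matches <- which(dataset$Vintage %in% t)
  # draw up to 5 distinct rows; indexing through sample.int() avoids
  # sample() treating a single index n as the range 1:n
  n <- min(length(matches), 5)
  dataset[matches[sample.int(length(matches), n)], ]
}

# one list element per draw, then bind them with a "draw" index column
r <- replicate(1000, simulate(5), simplify=FALSE)
rbindlist(r, idcol="draw")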
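
To check the behaviour without the real data, here is a minimal self-contained sketch. Everything in it is an assumption made for illustration: the toy dataset (200 funds with random vintages), the n.funds argument, and the names window, draws, fof, within.dups and across.dups. It uses base R only, so the draw column is built by hand as a stand-in for rbindlist(r, idcol="draw"). It also picks rows through sample.int(), because sample(x, ...) treats a single number x as the range 1:x, which would matter when only one fund matches.

```r
set.seed(42)  # only so that this sketch is reproducible

# Hypothetical toy data, not the real 5,000-row dataset
dataset <- data.frame(
  Fund.ID = 1:200,
  Vintage = sample(1990:2006, 200, replace = TRUE)
)

simulate <- function(inv.period, n.funds = 5) {
  # draw one random first fund
  start <- dataset[sample.int(nrow(dataset), 1), ]
  # vintages inside the investment period of the first fund
  window <- start$Vintage:(start$Vintage + inv.period)
  matches <- which(dataset$Vintage %in% window)
  # never ask for more rows than are available
  k <- min(length(matches), n.funds)
  # sample positions within matches, without replacement
  dataset[matches[sample.int(length(matches), k)], ]
}

draws <- replicate(1000, simulate(4), simplify = FALSE)

# Base-R stand-in for rbindlist(): bind rows and label each draw
fof <- do.call(rbind, draws)
fof$draw <- rep(seq_along(draws), times = vapply(draws, nrow, integer(1)))
rownames(fof) <- NULL

# TRUE would mean a fund was repeated inside a single draw
within.dups <- any(duplicated(fof[, c("draw", "Fund.ID")]))
# TRUE means funds are reused across different draws
across.dups <- any(duplicated(fof$Fund.ID))
```

Inspecting within.dups and across.dups is a quick way to confirm that sampling is without replacement inside each draw while rows remain available to later draws. With the real data, the same checks can be run on the output of rbindlist.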

R: replicating a sample function without replacement
