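I have some code that lets me take two random samples from a dataset, apply a function, and repeat the process a given number of times (see the code in this related question: How to bootstrap a function with replacement and return the output).
Example data: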
> dput(a)
structure(list(index = 1:30, val = c(14L, 22L, 1L, 25L, 3L, 34L,
35L, 36L, 24L, 35L, 33L, 31L, 30L, 30L, 29L, 28L, 26L, 12L, 41L,
36L, 32L, 37L, 56L, 34L, 23L, 24L, 28L, 22L, 10L, 19L), id = c(1L,
2L, 2L, 3L, 3L, 4L, 5L, 6L, 7L, 7L, 8L, 9L, 10L, 11L, 12L, 13L,
14L, 15L, 16L, 16L, 17L, 18L, 19L, 20L, 21L, 21L, 22L, 23L, 24L,
25L)), .Names = c("index", "val", "id"), class = "data.frame", row.names = c(NA,
-30L))
Code:
library(plyr)
extractDiff <- function(P){
  subA <- P[sample(nrow(P), 15, replace=TRUE), ] # takes a random sample of 15 rows
  subB <- P[sample(nrow(P), 15, replace=TRUE), ] # takes a second random sample of 15 rows
  meanA <- mean(subA$val)
  meanB <- mean(subB$val)
  diff <- abs(meanA-meanB)
  outdf <- c(mA = meanA, mB= meanB, diffAB = diff)
  return(outdf)
}
set.seed(42)
fin <- do.call(rbind, replicate(10, extractDiff(a), simplify=FALSE))
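Rather than taking two random samples of size 15, I would like to take one random sample of size 15 and then extract the remaining 15 rows of the dataset after the first sample has been drawn (i.e. subA would be the first random sample of 15 obs, and subB would be the 15 obs left over once subA has been taken out). I am really not sure how to go about this. Any help would be much appreciated. Thanks!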
Answer 0 (score: 1)
In this case I would simply shuffle the row numbers of P (stored in a variable called index), then take the first 15 for subA and the second 15 for subB.
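A minimal sketch of the shuffle idea described in this answer, not the answerer's own code. The function name extractDiffSplit and the n argument are hypothetical, and the data frame a is rebuilt here with only the val column from the question so that the block runs on its own.

```r
# Sketch only: shuffle the row numbers once, then split them into two
# non-overlapping groups. `a` holds just the val column from the question.
a <- data.frame(val = c(14, 22, 1, 25, 3, 34, 35, 36, 24, 35, 33, 31, 30, 30, 29,
                        28, 26, 12, 41, 36, 32, 37, 56, 34, 23, 24, 28, 22, 10, 19))

# extractDiffSplit is a hypothetical name; n is the size of each group.
extractDiffSplit <- function(P, n = 15) {
  index <- sample(nrow(P))                            # random permutation of the row numbers
  subA <- P[index[1:n], , drop = FALSE]               # first n shuffled rows
  subB <- P[index[(n + 1):(2 * n)], , drop = FALSE]   # next n shuffled rows
  meanA <- mean(subA$val)
  meanB <- mean(subB$val)
  c(mA = meanA, mB = meanB, diffAB = abs(meanA - meanB))
}

set.seed(42)
fin <- do.call(rbind, replicate(10, extractDiffSplit(a), simplify = FALSE))
```

Because subA and subB together cover all 30 rows exactly once, mA and mB always average to the overall mean of val; only the way the rows are divided changes between replicates.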
Answer 1 (score: 1)
I believe you can do this with a few small changes to your code.
extractDiff <- function(P){
  sampleset <- sample(nrow(P), 15, replace=FALSE) # selects 15 random row numbers; note replace=FALSE
  subA <- P[sampleset, ] # takes the 15 selected rows
  subB <- P[-sampleset, ] # takes the remaining rows in the set
  meanA <- mean(subA$val)
  meanB <- mean(subB$val)
  diff <- abs(meanA-meanB)
  outdf <- c(mA = meanA, mB= meanB, diffAB = diff)
  return(outdf)
}
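Note, however, that this is not compatible with bootstrapping, because bootstrapping requires sampling with replacement. If, on the other hand, you want to sample with replacement from the dataset and then sample from the rows that were not selected in the first sample, you could do the following.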
extractDiff <- function(P){
  sampleset1 <- sample(nrow(P), 15, replace=TRUE) # draws 15 row numbers with replacement; note replace=TRUE
  sampleset2 <- sample((1:nrow(P))[-unique(sampleset1)], 15, replace=TRUE) # draws only from rows not used in sampleset1
  subA <- P[sampleset1, ] # takes the 15 rows drawn first
  subB <- P[sampleset2, ] # takes the 15 rows drawn from the remaining set
  meanA <- mean(subA$val)
  meanB <- mean(subB$val)
  diff <- abs(meanA-meanB)
  outdf <- c(mA = meanA, mB= meanB, diffAB = diff)
  return(outdf)
}
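However, depending on your application, this may still be less than ideal, because the second sample is drawn from a smaller pool of rows and is therefore more likely than the first to contain multiple instances of the same value. This is less of a problem if you select a smaller proportion of the total set. You may be best off using the 'shuffle' approach to split the set into two halves and then sampling with replacement from each half, so that the two sets are more alike, but that would stop the first set from being a true bootstrap sample.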
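The last suggestion could be sketched roughly as follows. This is an illustration under stated assumptions rather than a tested recipe: the function name extractDiffHalves and the n argument are hypothetical, and a is rebuilt with only the val column from the question so that the block is self-contained.

```r
# Sketch only: shuffle the rows, split them into two halves, then sample
# with replacement inside each half. `a` holds just the val column.
a <- data.frame(val = c(14, 22, 1, 25, 3, 34, 35, 36, 24, 35, 33, 31, 30, 30, 29,
                        28, 26, 12, 41, 36, 32, 37, 56, 34, 23, 24, 28, 22, 10, 19))

# extractDiffHalves is a hypothetical name; n is the size of each resample.
extractDiffHalves <- function(P, n = 15) {
  index <- sample(nrow(P))                 # shuffle the row numbers
  half <- floor(nrow(P) / 2)
  poolA <- index[1:half]                   # rows available to the first sample
  poolB <- index[(half + 1):nrow(P)]       # rows available to the second sample
  # Sampling positions with sample.int() avoids sample()'s special handling
  # of a length-one numeric vector.
  rowsA <- poolA[sample.int(length(poolA), n, replace = TRUE)]
  rowsB <- poolB[sample.int(length(poolB), n, replace = TRUE)]
  meanA <- mean(P$val[rowsA])
  meanB <- mean(P$val[rowsB])
  c(mA = meanA, mB = meanB, diffAB = abs(meanA - meanB))
}

set.seed(42)
finHalves <- do.call(rbind, replicate(10, extractDiffHalves(a), simplify = FALSE))
```

Both samples here are drawn from pools of equal size, so neither is systematically more prone to repeated values, at the cost noted above: neither one is a bootstrap sample of the full dataset.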