Question

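I have a bank dataset in which 5% of the records are defaulters and the rest are good (non-defaulters).

I want to create a sample that has 30% defaulters and 70% non-defaulters.

Suppose my dataset is data and it has a column named "default" containing 0 or 1. Given that the original dataset has only 5% defaults, how do I get a sample with 30% defaults and 70% non-defaults?

Could someone please provide R code? That would be great. I tried the following to get a random sample of 100 with replacement:

data[sample(1:nrow(data),size=100,replace=TRUE),]

But how can I make sure the split is 30%/70%?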

Answer 1

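sample has a prob option, a vector of probability weights for the elements of the vector being sampled. The weights are matched to the elements in order, so with 1 marking a defaulter you can pass prob=c(0.7,0.3) to sample: 0 is then drawn with probability 0.7 and 1 with probability 0.3. Note that this draws 0/1 labels, not rows of the data.

For example:

sample(0:1, 100, replace=TRUE, prob=c(0.7,0.3))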

Answer 2

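Suppose df is your data frame and default is the column that flags defaults (1 or TRUE for a defaulter).

Sampling without replacement:

df[c(sample(which(df$default == 1),30), sample(which(df$default == 0),70)),]

Sampling with replacement (so records may be repeated):

df[c(sample(which(df$default == 1),30,TRUE), sample(which(df$default == 0),70,TRUE)),]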
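
The two calls above can be wrapped in a small helper that takes the total sample size and the target share of defaulters. This is only a sketch: stratified_sample and its arguments are hypothetical names, the data is simulated, and it assumes default is coded 0/1 (or FALSE/TRUE) with no missing values. Without replacement, each group must contain at least as many rows as are requested from it.

```r
# Hypothetical helper: draw n rows with a fixed share of defaulters.
stratified_sample <- function(df, n = 100, frac_default = 0.3, replace = FALSE) {
  n_def <- round(n * frac_default)        # number of defaulters to draw
  idx_def <- which(df$default == 1)       # row positions of defaulters
  idx_non <- which(df$default == 0)       # row positions of non-defaulters
  # Sample positions within each index vector; this avoids the pitfall that
  # sample(x, ...) treats a single number x as the range 1:x.
  pick_def <- idx_def[sample.int(length(idx_def), n_def, replace = replace)]
  pick_non <- idx_non[sample.int(length(idx_non), n - n_def, replace = replace)]
  df[c(pick_def, pick_non), , drop = FALSE]
}

# Simulated data with roughly 5% defaulters (an assumption, not real bank data).
set.seed(42)
toy <- data.frame(default = rbinom(2000, 1, 0.05), y = rnorm(2000))
s <- stratified_sample(toy, n = 100, frac_default = 0.3)
table(s$default)
```

Because the counts per group are fixed, the resulting split is exactly 30/70 rather than 30/70 on average.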

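Alternatively, if you don't want to specify the exact numbers of defaulters and non-defaulters, you can give each row a sampling probability: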

set.seed(1)
df <- data.frame(default=rbinom(250,1,.5), y=rnorm(250))

n <- 100 # could be any number, but the closer n gets to nrow(df), the less the weights matter
s <- sample(seq_along(df$default), n, prob=ifelse(df$default, .3, .7))
table(df$default[s])
#
#  0  1 
# 61 39 

n <- 150 # could be any number, but the closer n gets to nrow(df), the less the weights matter
s <- sample(seq_along(df$default), n, prob=ifelse(df$default, .3, .7))
table(df$default[s])
#
#  0  1 
# 97 53
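
Note that prob holds per-row weights, not target proportions, so the expected share of defaulters also depends on how many rows each group contains. One way to aim for roughly 30% defaulters regardless of the base rate is to divide each target share by its group size. The sketch below is illustrative only: it uses simulated data with about 5% defaulters, samples with replacement so that the expected share equals the target, and the names toy, target, and w are assumptions rather than anything from the question.

```r
set.seed(1)
# Simulated data with roughly 5% defaulters (not real bank data).
toy <- data.frame(default = rbinom(5000, 1, 0.05), y = rnorm(5000))

target <- 0.3                              # desired share of defaulters
n_def <- sum(toy$default == 1)             # number of defaulter rows
n_non <- sum(toy$default == 0)             # number of non-defaulter rows

# Per-row weights chosen so each group's total weight equals its target share.
w <- ifelse(toy$default == 1, target / n_def, (1 - target) / n_non)

# Draw row indices with replacement using the per-row weights.
s <- sample(nrow(toy), 1000, replace = TRUE, prob = w)
prop.table(table(toy$default[s]))          # share of defaulters should be near 0.3
```

Unlike fixing the counts per group, this gives a 30/70 split only on average, so individual samples will vary around the target.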

Increasing the number of defaulters in a sample
