Question

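I have a dataset with 130,000 records and 15 variables.

The variable I want to describe is IsActive. The problem is that only 15,000 records have this variable set to 1; the rest are set to 0.

First, I want to split the source data into two datasets: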

20% ~ 30k records -> training dataset

80% ~ 120k records -> validation dataset

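I want the training dataset to contain 5k records with active = 1 and the validation dataset to contain 10k records with active = 1, and I want these counts to be easy to adjust.

How can I do this?

Here is what I have done so far: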

set.seed(2)
ind <- sample(2, nrow(mydata), replace = TRUE, prob=c(0.8, 0.2))

When I want to get 80% of mydata:

newdata=mydata[ind == 1,]

Answer 1

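Your question still doesn't make sense: 20% of 130,000 is not 30,000. The simplest assumption that fixes all the logical inconsistencies is that the dataset has 150,000 records, so that is what I used.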
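The arithmetic behind that assumption can be checked in a couple of lines. This is only an illustration; the variable names `n_stated` and `n_assumed` are made up for the example.

```r
n_stated  <- 130000   # record count given in the question
n_assumed <- 150000   # record count assumed in this answer

0.2 * n_stated        # 26000, not the 30k stated in the question
0.2 * n_assumed       # 30000, matches the stated training size
0.8 * n_assumed       # 120000, matches the stated validation size
```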

Here is one approach:

# sample data
set.seed(1)                  # for reproducible example
df <- data.frame(id=1:150000,
                 IsActive=sample(0:1, 150000, replace=TRUE, prob=c(0.9, 0.1)),
                 x=rnorm(150000), y=runif(150000), z=rpois(150000, lambda=1))
sum(df$IsActive==1)          # validate
# [1] 14887

s1 <- sample(which(df$IsActive==1),5000)
s2 <- sample(which(df$IsActive==0),25000)
train <- df[c(s1,s2),]
test  <- df[c(-s1,-s2),]
# validate
any(test$id %in% train$id)   # train and test are disjoint
# [1] FALSE
sum(train$IsActive==1)       # 5000
# [1] 5000
sum(test$IsActive==1)        # the rest
# [1] 9887
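
Since the question asks for counts that are easy to adjust, the same idea can be wrapped in a function. The following is a minimal sketch and not part of the original answer: `split_by_class` and its parameter names are hypothetical, and it assumes the target column is coded 0/1 and that each class has at least as many rows as requested.

```r
# Hypothetical helper: draw a fixed number of rows per class for training
# and return every remaining row as the test set.
split_by_class <- function(data, n_active, n_inactive, target = "IsActive") {
  active_rows   <- which(data[[target]] == 1)
  inactive_rows <- which(data[[target]] == 0)
  stopifnot(n_active <= length(active_rows),
            n_inactive <= length(inactive_rows))
  # Sample positions (not values) so a class with a single row is handled correctly
  train_rows <- c(active_rows[sample.int(length(active_rows), n_active)],
                  inactive_rows[sample.int(length(inactive_rows), n_inactive)])
  # setdiff keeps the test set correct even when train_rows is empty
  test_rows <- setdiff(seq_len(nrow(data)), train_rows)
  list(train = data[train_rows, , drop = FALSE],
       test  = data[test_rows, , drop = FALSE])
}

# Usage on sample data shaped like the example above
set.seed(1)
df <- data.frame(id = 1:150000,
                 IsActive = sample(0:1, 150000, replace = TRUE, prob = c(0.9, 0.1)))
parts <- split_by_class(df, n_active = 5000, n_inactive = 25000)
sum(parts$train$IsActive == 1)   # 5000 by construction
nrow(parts$train)                # 30000
```

Changing `n_active` and `n_inactive` adjusts the class mix of the training set; the test set always receives whatever is left over.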

Splitting data into training and evaluation datasets with a non-representative class
