caret - creating stratified data sets based on multiple variables

Time: 2019-02-07 04:44:55

Tags: r r-caret

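In the R package caret, can we use the function createDataPartition() (or createFolds() for cross-validation) to create stratified training and test sets based on several variables?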

Here is an example for one variable:

inTrain = createDataPartition(df$yourFactor, p = 2/3, list = FALSE)

In the code above, the training and test sets are stratified by 'df$yourFactor'. But is it possible to stratify by several variables (e.g. 'df$yourFactor' and 'df$yourFactor2')? The following code seems to work, but I don't know if it is correct:

inTrain = createDataPartition(df$yourFactor, df$yourFactor2, p = 2/3, list = FALSE)

2 answers:

Answer 0: (score: 0)

There is a better way to do this.

set.seed(1)
n <- 1e4
d <- data.frame(yourFactor = sample(1:5,n,TRUE), 
                yourFactor2 = rbinom(n,1,.5),
                yourFactor3 = rbinom(n,1,.7))

Stratum indicator

d$group <- interaction(d[, c('yourFactor', 'yourFactor2')])

Sample selection

indices <- tapply(1:nrow(d), d$group, sample, 30 )

Get the subsample

subsampd <- d[unlist(indices, use.names = FALSE), ]

This draws a random stratified sample of size 30 for each combination of yourFactor and yourFactor2.
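The steps above draw a fixed 30 rows per stratum. For a training/test split like the one in the question, a proportional draw within each stratum is a natural variation. The following is only a minimal base R sketch, assuming the 2/3 training fraction from the question; the names `p`, `train_idx`, `train` and `test` are illustrative and not part of the answer.

```r
set.seed(1)
n <- 1e4
d <- data.frame(yourFactor  = sample(1:5, n, TRUE),
                yourFactor2 = rbinom(n, 1, .5))

# One stratum per observed combination of the two variables
d$group <- interaction(d$yourFactor, d$yourFactor2, drop = TRUE)

# Within each stratum, draw about 2/3 of the row indices (assumed fraction)
p <- 2/3
train_idx <- unlist(
  lapply(split(seq_len(nrow(d)), d$group),
         function(idx) idx[sample.int(length(idx), round(p * length(idx)))]),
  use.names = FALSE
)

train <- d[train_idx, ]   # rows drawn within each stratum
test  <- d[-train_idx, ]  # everything that was not drawn
```

The same stratum factor could presumably be passed as `y` to createDataPartition(), which is documented to sample within the levels of a factor; that variant is not run here. Note also that in the call createDataPartition(df$yourFactor, df$yourFactor2, ...) from the question, the second positional argument would be matched to `times` according to the documented signature, not treated as a second stratification variable.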

Answer 1: (score: 0)

If you use the tidyverse, this is very simple.

For example:

library(dplyr)

df <- df %>%
  mutate(n = row_number()) %>% # create a row number if you don't have one
  select(n, everything()) # put 'n' at the front of the dataset
train <- df %>%
  group_by(var1, var2) %>% # any number of variables you wish to partition by proportionally
  sample_frac(.7) %>% # '.7' is the proportion of each group you wish to sample
  ungroup()
test <- anti_join(df, train, by = "n") # test set: the observations not in 'train'
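Whichever way the split is made, it may be worth checking that train and test do not overlap and that each var1/var2 combination is represented in roughly the intended proportion. Below is a rough base R sketch of such a check. The toy data, the stand-in split and the name `share` are assumptions made for illustration only; `var1`, `var2` and `n` follow the answer's naming.

```r
set.seed(42)
# Toy data standing in for 'df' (assumed, not from the answer)
df <- data.frame(var1 = sample(c("a", "b", "c"), 600, TRUE),
                 var2 = sample(c("x", "y"), 600, TRUE))
df$n <- seq_len(nrow(df))  # row id, as in the answer

# Stand-in for the split: take about 70% of the row ids in each var1/var2 group
idx <- unlist(
  lapply(split(df$n, list(df$var1, df$var2), drop = TRUE),
         function(i) i[sample.int(length(i), round(0.7 * length(i)))]),
  use.names = FALSE
)
train <- df[df$n %in% idx, ]
test  <- df[!df$n %in% idx, ]

# Check 1: no row appears in both sets, and no row is lost
stopifnot(length(intersect(train$n, test$n)) == 0,
          nrow(train) + nrow(test) == nrow(df))

# Check 2: share of each var1/var2 combination that ended up in train
share <- table(train$var1, train$var2) / table(df$var1, df$var2)
print(round(share, 2))
```

If the stratification worked, every cell of `share` should be close to the sampling fraction; a cell far from it usually points to a very small group.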