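I am using the following function in R to split subjects/samples into training and test sets, and it works very well. However, in my data set the subjects fall into two groups (patients and control subjects), so I would like to split the data while keeping the proportion of patients and controls in each of the training and test sets the same as in the full data set. How can I do this in R? How can I modify the following function so that it takes group membership into account when splitting the data into training and test sets?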
# splitdf function will return a list of training and testing sets
splitdf <- function(dataframe, seed=NULL) {
  if (!is.null(seed))
    set.seed(seed)
  index <- 1:nrow(dataframe)
  trainindex <- sample(index, trunc(length(index)/2))
  trainset <- dataframe[trainindex, ]
  testset <- dataframe[-trainindex, ]
  list(trainset=trainset,testset=testset)
}
# apply the function
splits <- splitdf(Data, seed=808)
# it returns a list - two data frames called trainset and testset
str(splits)
# there are "n" observations in each data frame
lapply(splits,nrow)
# view the first few rows in each data frame
lapply(splits,head)
# save the training and testing sets as data frames
training <- splits$trainset
testing <- splits$testset
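Example: use the built-in iris data and split the data set into training and test sets. This data set has 150 samples and a factor called Species with 3 levels (setosa, versicolor and virginica).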
data(iris)
splits <- splitdf(iris, seed=808)
str(splits)
lapply(splits,nrow)
lapply(splits,head)
training <- splits$trainset
testing <- splits$testset
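As you can see here, the function splitdf does not take the group membership Species into account when splitting the data into training and test sets, so the numbers of setosa, versicolor and virginica samples in the training and test sets are not proportional to those in the full data set. So how can I modify the function so that it takes group membership into account when splitting the data into training and test sets?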
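A quick way to see the effect described above is to tabulate Species in the full data and in each half. This is only a minimal sketch: it repeats splitdf so that it runs on its own, and the names prop_full, prop_train and prop_test are introduced here for illustration. The exact proportions depend on the seed and on the R version's sampling algorithm, so no output is shown.

```r
# Self-contained copy of the splitdf() function from the question
splitdf <- function(dataframe, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  index <- seq_len(nrow(dataframe))
  trainindex <- sample(index, trunc(length(index) / 2))
  list(trainset = dataframe[trainindex, ], testset = dataframe[-trainindex, ])
}

data(iris)
splits <- splitdf(iris, seed = 808)

# Proportion of each species in the full data and in each half
prop_full  <- prop.table(table(iris$Species))
prop_train <- prop.table(table(splits$trainset$Species))
prop_test  <- prop.table(table(splits$testset$Species))

# One row per data set, one column per species
round(rbind(full = prop_full, train = prop_train, test = prop_test), 3)
```

With an unstratified split the train and test rows will generally differ from the full row, which is one third for each species.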
Answer 0 (score: 0)
Here is a solution using plyr and a simulated data set.
library(plyr)
set.seed(1001)
dat = data.frame(matrix(rnorm(1000), ncol = 10), treatment = sample(c("control", "control", "treatment"), 100, replace = TRUE) )
# divide data set into training and test sets
tr_prop = 0.5 # proportion of full dataset to use for training
training_set = ddply(dat, .(treatment), function(., seed) { set.seed(seed); .[sample(1:nrow(.), trunc(nrow(.) * tr_prop)), ] }, seed = 101)
test_set = ddply(dat, .(treatment), function(., seed) { set.seed(seed); .[-sample(1:nrow(.), trunc(nrow(.) * tr_prop)), ] }, seed = 101)
# check that proportions are equal across datasets
ddply(dat, .(treatment), function(.) nrow(.)/nrow(dat) )
ddply(training_set, .(treatment), function(.) nrow(.)/nrow(training_set) )
ddply(test_set, .(treatment), function(.) nrow(.)/nrow(test_set) )
c(nrow(training_set), nrow(test_set), nrow(dat)) # lengths of sets
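Here I use set.seed() to ensure identical behaviour of sample() when constructing the training and test sets with ddply. This strikes me as a bit of a hack; perhaps there is another way to achieve the same result with a single call to **ply (but returning two data frames). Another option (without the heavy-handed use of set.seed) would be to use dlply and then piece together the elements of the resulting list into training/test sets: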
set.seed(101) # for consistency with 'ddply' above
split_set = dlply(dat, .(treatment), function(.) { s = sample(1:nrow(.), trunc(nrow(.) * tr_prop)); list(.[s, ], .[-s,]) } )
# join together with ldply()
training_set = ldply(split_set, function(.) .[[1]])
test_set = ldply(split_set, function(.) .[[2]])
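The same idea can also be folded back into the question's splitdf without plyr. The following is a minimal base-R sketch, not a definitive implementation: splitdf_stratified and its group and prop arguments are hypothetical names introduced here. It samples the training rows within each group separately and sends every remaining row to the test set.

```r
# Hypothetical helper: a stratified variant of splitdf().
# `group` is the name of the grouping column, `prop` the training fraction.
splitdf_stratified <- function(dataframe, group, prop = 0.5, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  all_rows <- seq_len(nrow(dataframe))
  # Row indices of the data frame, split by group level
  idx_by_group <- split(all_rows, dataframe[[group]])
  # Draw the training rows within each group separately
  trainindex <- unlist(lapply(idx_by_group, function(idx) {
    idx[sample.int(length(idx), trunc(length(idx) * prop))]
  }), use.names = FALSE)
  # Everything not drawn for training goes to the test set
  testindex <- setdiff(all_rows, trainindex)
  list(trainset = dataframe[trainindex, , drop = FALSE],
       testset  = dataframe[testindex, , drop = FALSE])
}

data(iris)
splits <- splitdf_stratified(iris, group = "Species", seed = 808)
table(splits$trainset$Species)
table(splits$testset$Species)
```

For iris, each species has 50 rows, so with prop = 0.5 both sets contain 25 rows per species. Using setdiff rather than negative indexing keeps the test set correct even when no training rows are drawn; rows whose group value is NA are not assigned to any stratum and therefore end up in the test set.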