
Question

I have the following data.frame (the real one is longer than this example):

sub height  group
1   1.55    a
2   1.65    a
3   1.76    b
4   1.77    a
5   1.58    c
6   1.65    d
7   1.82    c
8   1.91    c
9   1.77    b
10  1.69    b
11  1.74    a
12  1.75    c
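
To make the example reproducible, the rows shown above can be rebuilt as a data.frame. This is only a sketch of the sample rows; the real data is longer, and the column types are assumed from how the values look.

```r
# Rebuild the sample rows shown above (the real data.frame is longer)
df <- data.frame(
  sub    = 1:12,
  height = c(1.55, 1.65, 1.76, 1.77, 1.58, 1.65,
             1.82, 1.91, 1.77, 1.69, 1.74, 1.75),
  group  = c("a", "a", "b", "a", "c", "d",
             "c", "c", "b", "b", "a", "c")
)

# Group sizes: d has a single row, which is why it is easy to miss
table(df$group)
```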

I partition the data with the following code:

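library("caret")
# list = FALSE returns the row indices as a matrix instead of a list,
# so the result can be used directly to subset df
train <- createDataPartition(df$group, p = 0.50, list = FALSE)
partition <- df[train, ]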

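So each group is sampled with probability 0.5. My problem is that, with data like the example above, a subject from group d is sometimes picked and sometimes not (because group d is really small). I want to add a constraint so that every partition I create contains at least 1 subject from each group.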

Is there an elegant solution?

I came up with a solution that does not look very elegant:

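allGroupSamles <- c()
# pick one random row from every group (relies on default integer row names)
for (i in unique(df$group))
{
  allGroupSamles <- c(allGroupSamles, sample(rownames(df[df$group == i, ]), 1, replace = TRUE))
}
allGroupSamles <- as.integer(allGroupSamles)

# regular stratified partition, then add the guaranteed rows
train <- createDataPartition(df$group, p = 0.50)[[1]]
train <- c(allGroupSamles, train)

# unique() drops rows that were picked twice
partition <- df[unique(train), ]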

Answer 1

You can use split on the data.frame and sample within each group, taking half the records or 1, whichever is larger:

# apply a function over the split data.frame
samples <- lapply(split(df, df$group), function(x) {

  # the function takes a random sample of half the records in each group
  # by using `ceiling`, it guarantees at least one record
  s <- sample(nrow(x), ceiling(nrow(x)/2))
  x[s,]
})

train <- do.call(rbind, samples)
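
The per-group sampling above can be wrapped in a small function and checked against toy data. This is a minimal sketch: the helper name `sample_per_group`, its parameter `p`, and the `toy` data are hypothetical and not part of the original answer.

```r
# Hypothetical helper: sample a fraction p of the rows in each group,
# rounding up so every non-empty group contributes at least one row.
sample_per_group <- function(data, group_col, p = 0.5) {
  # drop = TRUE skips unused factor levels, which would have zero rows
  pieces <- lapply(split(data, data[[group_col]], drop = TRUE), function(x) {
    n_keep <- max(1, ceiling(nrow(x) * p))
    x[sample.int(nrow(x), n_keep), , drop = FALSE]
  })
  do.call(rbind, pieces)
}

# Toy data with one singleton group ("d"), mirroring the question
toy <- data.frame(
  sub   = 1:12,
  group = c("a", "a", "b", "a", "c", "d", "c", "c", "b", "b", "a", "c")
)

set.seed(1)  # only so the illustration is repeatable
train <- sample_per_group(toy, "group")

# Every group present in the data should also be present in the sample
stopifnot(setequal(unique(train$group), unique(toy$group)))
table(train$group)
```

Because the count is rounded up, small groups are over-represented relative to `p`; that is the price of guaranteeing at least one row per group.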

Edit:

If you need a numeric vector of row indices instead:

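s <- tapply(seq_len(nrow(df)), df$group, function(x) {
  # index x by position: sample(x, ...) would draw from 1:x when a group has a single row
  x[sample.int(length(x), ceiling(length(x)/2))]
})

do.call(c, s)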
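
An index vector is convenient because it yields both halves of the split. The sketch below assumes toy data shaped like the question's; the names `toy`, `train_idx`, `train`, and `test` are illustrative only.

```r
# Toy data with one singleton group ("d"), mirroring the question
toy <- data.frame(
  sub   = 1:12,
  group = c("a", "a", "b", "a", "c", "d", "c", "c", "b", "b", "a", "c")
)

# Row positions per group
idx_by_group <- split(seq_len(nrow(toy)), toy$group)

# Sample half of each group's positions, rounded up. Indexing x by position
# avoids the sample() pitfall where a single number n is treated as 1:n.
train_idx <- unlist(lapply(idx_by_group, function(x) {
  x[sample.int(length(x), ceiling(length(x) / 2))]
}), use.names = FALSE)

train <- toy[train_idx, ]
test  <- toy[-train_idx, ]

# Every group appears in the training half, and no row is lost or duplicated
stopifnot(all(unique(toy$group) %in% train$group))
stopifnot(nrow(train) + nrow(test) == nrow(toy))
```

Note that a group with a single row always lands in `train`, so it never appears in `test`.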

R: Creating a data partition with an additional constraint
