Question

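I have a question about getting the leftovers from the sample function. For school, we have to make a test data frame and a train data frame. The data I have to validate only has a train data frame. The original data frame has 2158 observations. They made a train data frame with 1529 observations.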

set.seed(22)
train <- Gary[sample(1:nrow(Gary), 1529,
                 replace=FALSE),]

train[, 1] <- as.factor(unlist(train[, 1]))
train[, 2:201] <- as.numeric(as.factor(unlist(train[, 2:201])))

Now I want to have the "leftovers" in a separate data frame.

Does anyone know how to do this?

Answer 1

If you know the training indices, you can use negative indexing in R. So we only need to rewrite your first line:

set.seed(22)
train_indices <- sample(1:nrow(Gary), 1529, replace=FALSE)
train <- Gary[train_indices, ]
test <- Gary[-train_indices, ]
# Proceed with rest of script.
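
As a quick sanity check on the negative-indexing approach, here is a minimal sketch. It assumes a hypothetical one-column stand-in for Gary (the `id` column is invented for illustration, since the real data was not posted) and only confirms that the two pieces have the expected sizes and do not overlap.

```r
# Hypothetical stand-in for the OP's data: 2158 rows, one id column.
Gary <- data.frame(id = seq_len(2158))

set.seed(22)
train_indices <- sample(seq_len(nrow(Gary)), 1529, replace = FALSE)

# drop = FALSE keeps the result a data frame even with a single column.
train <- Gary[train_indices, , drop = FALSE]   # rows that were sampled
test  <- Gary[-train_indices, , drop = FALSE]  # every row that was not sampled

nrow(train)                            # 1529
nrow(test)                             # 629
length(intersect(train$id, test$id))   # 0, the two sets share no rows
```

With the real data frame, the same three checks should hold as long as the rows can be told apart by some identifier or by their row names.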

Answer 2

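This can be done with the setdiff() function.

Edit: Note that @AlexR has posted another answer that uses negative indexing, which is simpler if the indices are only used for subsetting.

But first we need to create some dummy data, because the OP did not provide any data with the question (for future use, please read How to make a great R reproducible example?):

Dummy data

Create a dummy data frame with 2158 rows and two columns: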

n <- 2158
Gary <- data.frame(V1 = seq_len(n), V2 = sample(LETTERS, n, replace = TRUE))
str(Gary)
#'data.frame':  2158 obs. of  2 variables:
# $ V1: int  1 2 3 4 5 6 7 8 9 10 ...
# $ V2: Factor w/ 26 levels "A","B","C","D",..: 21 11 24 10 5 17 18 1 25 7 ...

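Sampled and leftover rows

First, the vectors of sampled and leftover row numbers are computed, before Gary is subsetted in the subsequent steps: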

set.seed(22)
sampled_rows <- sample(seq_len(nrow(Gary)), 1529, replace=FALSE)
leftover_rows <- setdiff(seq_len(nrow(Gary)), sampled_rows)

train <- Gary[sampled_rows, ]
leftover <- Gary[leftover_rows, ]

str(train)
#'data.frame':  1529 obs. of  2 variables:
# $ V1: int  657 1025 2143 1123 1817 1558 1324 1590 898 801 ...
# $ V2: Factor w/ 26 levels "A","B","C","D",..: 19 16 25 15 2 5 8 14 20 3 ...
str(leftover)
#'data.frame':  629 obs. of  2 variables:
# $ V1: int  2 5 6 7 8 9 10 12 20 24 ...
# $ V2: Factor w/ 26 levels "A","B","C","D",..: 11 5 17 18 1 25 7 25 7 18 ...

leftover is a data frame containing the rows of Gary that have not been sampled.
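
To show how this relates to negative indexing, the following sketch compares the two approaches on a hypothetical one-column version of the dummy data (the names `via_setdiff` and `via_negative` are introduced here for illustration only). Both should select the same rows in the same order, because setdiff() keeps the order of its first argument.

```r
# Hypothetical one-column stand-in for the dummy data above.
Gary <- data.frame(V1 = seq_len(2158))

set.seed(22)
sampled_rows  <- sample(seq_len(nrow(Gary)), 1529, replace = FALSE)
leftover_rows <- setdiff(seq_len(nrow(Gary)), sampled_rows)

via_setdiff  <- Gary[leftover_rows, , drop = FALSE]  # explicit complement
via_negative <- Gary[-sampled_rows, , drop = FALSE]  # negative indexing

identical(via_setdiff$V1, via_negative$V1)   # TRUE
```

One difference worth keeping in mind: if the sampled vector were ever empty, `Gary[-sampled_rows, ]` would return zero rows instead of all rows, whereas the setdiff() version would still return the full complement.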

Verification

To verify, we combine train and leftover again and sort the rows so they can be compared with the original data frame:

recombined <- rbind(train, leftover)
identical(Gary, recombined[order(recombined$V1), ])
#[1] TRUE
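
If the data has no column like V1 to sort by, a looser check based on row names may be enough. This is only a sketch on a hypothetical one-column data frame; it confirms that the split is a partition (sizes add up, nothing shared, nothing lost) without comparing the cell values themselves.

```r
# Hypothetical stand-in data; subsetting keeps the original row names.
Gary <- data.frame(V1 = seq_len(2158))

set.seed(22)
sampled_rows <- sample(seq_len(nrow(Gary)), 1529, replace = FALSE)
train    <- Gary[sampled_rows, , drop = FALSE]
leftover <- Gary[-sampled_rows, , drop = FALSE]

# Sizes add up to the original number of rows.
sizes_ok <- nrow(train) + nrow(leftover) == nrow(Gary)
# No row appears in both pieces.
disjoint <- length(intersect(rownames(train), rownames(leftover))) == 0
# Every original row ends up in one of the two pieces.
complete <- setequal(c(rownames(train), rownames(leftover)), rownames(Gary))

c(sizes_ok, disjoint, complete)   # TRUE TRUE TRUE
```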
