Question

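I have a large data set that I'm trying to work with. I'm currently trying to split it into three different data frames to use at different test points.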

ind <- sample(3, nrow(df1), replace = TRUE, prob = c(0.40, 0.50, 0.10))
df2 <- as.data.frame(df1[ind == 1, 1:27])
df3 <- as.data.frame(df1[ind == 2, 1:27])
df4 <- as.data.frame(df1[ind == 3, 1:27])

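However, the first column in df1 is an invoice number, and because the data includes returns and errors, several rows can share the same invoice number. I'm trying to find a way to split the data randomly while keeping all rows with the same invoice number together.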

Any suggestions on how I could do this?

Answer 1

Instead of sampling rows, you can sample the unique invoice numbers and then select the rows that carry those invoice numbers.

## Some sample data
df1 = data.frame(invoice = sample(10, 20, replace = TRUE), V = rnorm(20))

## Sample a group label (1, 2 or 3) for each unique invoice number
ind = sample(3, length(unique(df1$invoice)), replace = TRUE)

## Select rows by sampled invoice number (here: group 1)
df1[df1$invoice %in% unique(df1$invoice)[ind == 1], 1:2]
   invoice           V
2        8 -0.67717939
6        9 -0.89222154
9        8 -0.71756069
14       8 -0.03539096
15       2  0.38453752
16       9 -0.16298835
17       9 -0.30823521
20       2 -0.60198259
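
To carry this through to the three data frames and the 40/50/10 proportions from the question, a minimal sketch could look like the following. It assumes the invoice number sits in a column named invoice, as in the sample data above; the names invoices, grp and row_grp are made up for illustration. Since the probabilities are applied per invoice rather than per row, the row counts will only roughly follow 40/50/10.

```r
set.seed(42)  # only so the sketch is reproducible

# Sample data in the same shape as above (assumed column name: invoice)
df1 <- data.frame(invoice = sample(10, 20, replace = TRUE), V = rnorm(20))

# One group label (1, 2 or 3) per unique invoice number
invoices <- unique(df1$invoice)
grp <- sample(3, length(invoices), replace = TRUE, prob = c(0.40, 0.50, 0.10))

# Look up each row's group through its invoice number
row_grp <- grp[match(df1$invoice, invoices)]

df2 <- df1[row_grp == 1, ]
df3 <- df1[row_grp == 2, ]
df4 <- df1[row_grp == 3, ]

# Sanity check: no invoice number appears in more than one data frame,
# and every row of df1 ended up in exactly one of them
stopifnot(
  length(intersect(df2$invoice, df3$invoice)) == 0,
  length(intersect(df2$invoice, df4$invoice)) == 0,
  length(intersect(df3$invoice, df4$invoice)) == 0,
  nrow(df2) + nrow(df3) + nrow(df4) == nrow(df1)
)
```

Using match() to map the per-invoice label back onto the rows avoids repeating the %in% filter once per group.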

Answer 2

# Row indices whose invoice number (first column) is 1, 2 or 3
ind1 <- which(df1[,1] == 1)
ind2 <- which(df1[,1] == 2)
ind3 <- which(df1[,1] == 3)

# Draw a random sample (with replacement) from each set of rows
df2 <- as.data.frame(df1[sample(ind1, length(ind1), replace = TRUE), 1:27])
df3 <- as.data.frame(df1[sample(ind2, length(ind2), replace = TRUE), 1:27])
df4 <- as.data.frame(df1[sample(ind3, length(ind3), replace = TRUE), 1:27])

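ind1, ind2 and ind3 identify which rows contain invoice numbers 1, 2 and 3, respectively. Then, to create the random data frames, just draw a random sample from only the rows you want. Because the sampling uses replace = TRUE, the same row can be drawn more than once. Hope this helps.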
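
One caveat when sampling row indices this way, shown as a small sketch on made-up data: if an invoice number matches only a single row, sample(ind, ...) receives a single number and, per R's documented behaviour, samples from 1:ind instead of from that one row. Indexing into the vector with sample.int() sidesteps this. The tiny data frame and the name picked are assumptions for illustration only.

```r
set.seed(1)  # only so the sketch is reproducible

# Made-up data: invoice 2 occurs on exactly one row (row 3)
df1 <- data.frame(invoice = c(1, 1, 2, 3, 3, 3), V = rnorm(6))

ind2 <- which(df1[, 1] == 2)  # a single index: 3

# sample(ind2, ...) would treat 3 as 1:3 and could return rows 1 or 2.
# Sampling positions within ind2 always stays inside the matched rows.
picked <- ind2[sample.int(length(ind2), length(ind2), replace = TRUE)]
df3 <- df1[picked, ]

# Every sampled row really belongs to invoice 2
stopifnot(all(df3$invoice == 2))
```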

Randomly split a data frame, but keep identical values together
