I am trying to split a dataset into training and test sets. In the code below, df_min_max_scaled is my normalized data and df is my non-normalized data, but I get an error:
import numpy as np
train_ind = df.sample(frac=0.65, replace=True)
train = df_min_max_scaled[train_ind,]
test = df_min_max_scaled[-train_ind,]
train_labels = df[train_ind, 12]
test_labels = df[-train_ind, 12]
#train_labels
Error:
TypeError Traceback (most recent call last)
<ipython-input-50-a640d18b42fc> in <module>
1 import numpy as np
2 train_ind = df.sample(frac=0.65, replace=True)
----> 3 train = df_min_max_scaled[train_ind,]
4 test = df_min_max_scaled[-train_ind,]
5 train_labels = df[train_ind, 12]
The error is reported on line 3. I am actually translating the following R code to Python using Pandas:
train_ind = sample(nrow(wine), floor(0.65 * nrow(wine)))
train = wine2[train_ind,]
test = wine2[-train_ind,]
train_labels = wine[train_ind, 12]
test_labels = wine[-train_ind, 12]
Answer 0 (score: 1)
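I suggest using sklearn's train_test_split. This could consist of the following steps:
1. Load your data (df = pd.read_csv(...), if your data comes from a CSV file).
2. Import the function (from sklearn.model_selection import train_test_split).
3. Split the data, where df is your input and labels is the true target (you can set test_size to whatever value you want):
train, test, train_labels, test_labels = train_test_split(df, labels, test_size=0.35)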
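The steps above can be sketched as follows. This is only an illustration under stated assumptions: the data is randomly generated as a stand-in for the asker's dataset, the column names c0 to c12 are hypothetical, and the label is assumed to sit in the last column. Passing random_state is optional and is used here only to make the split reproducible.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical stand-in for the asker's data: 100 rows, 13 columns,
# with the label assumed to be the last column ("c12").
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((100, 13)), columns=[f"c{i}" for i in range(13)])

# Min-max scale every column to [0, 1]; this plays the role of df_min_max_scaled.
df_min_max_scaled = (df - df.min()) / (df.max() - df.min())

features = df_min_max_scaled.drop(columns="c12")  # scaled inputs
labels = df["c12"]                                # unscaled labels

# Both objects are split with the same row assignment, so the
# features and labels stay aligned.
train, test, train_labels, test_labels = train_test_split(
    features, labels, test_size=0.35, random_state=0
)

print(train.shape, test.shape)
```

Because train_test_split splits every array it receives in the same way, the scaled features and the unscaled labels end up with matching indices without any manual bookkeeping.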
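If you really insist on using Pandas' sample function, you can do this:
1. Draw the training rows: train = df.sample(frac=0.65)
2. Take the remaining rows as the test set: test = df.drop(train.index)
3. Extract the labels: train_labels = train.iloc[:, 12] (if I understand correctly, 12 is the position of the label among the dataframe's columns; iloc is zero-based, so R's column 12 would be position 11)
4. Select the matching rows of the scaled data: df_min_max_scaled.loc[train.index]
Just make sure you use the same index for the scaled and unscaled data.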
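For context on why the original attempt fails: df.sample returns a DataFrame of sampled rows rather than row numbers, so it cannot be used as an indexer the way R's sample(nrow(...), ...) result can, and -train_ind negates the values instead of excluding rows. Also, replace=True samples with replacement, whereas R's sample does not by default. A minimal sketch of the Pandas-only approach, assuming random stand-in data and a hypothetical label_pos variable for the label's column position, might look like this:

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in data: 100 rows, 13 columns.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.random((100, 13)))
df_min_max_scaled = (df - df.min()) / (df.max() - df.min())

# Assumption: zero-based position of the label column.
# R's column 12 would be position 11 here; adjust as needed.
label_pos = 12

# Sample 65% of the rows without replacement and keep only their index.
train_idx = df.sample(frac=0.65, random_state=0).index

# Use the same index on the scaled and the unscaled frame.
train = df_min_max_scaled.loc[train_idx]
test = df_min_max_scaled.drop(train_idx)

# iloc[:, pos] selects a column by position; iloc[pos] would select a row.
train_labels = df.loc[train_idx].iloc[:, label_pos]
test_labels = df.drop(train_idx).iloc[:, label_pos]

print(len(train), len(test))
```

Keeping the sampled index in one variable and reusing it everywhere is what guarantees that rows and labels line up across the two frames.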