Question

I am using sklearn's train_test_split to create training and test sets from my data.

My script is below:

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn import neighbors

    # function to perform one hot encoding and dropping the original item
    # in this case it's the part number
    def encode_and_bind(original_dataframe, feature_to_encode):
        dummies = pd.get_dummies(original_dataframe[[feature_to_encode]])
        res = pd.concat([original_dataframe, dummies], axis=1)
        res = res.drop([feature_to_encode], axis=1)
        return res

    # read in data from csv
    data = pd.read_csv('export2.csv')

    # one hot encode the part number
    new = encode_and_bind(data, 'PART_NO')

    # create the labels, or field we are trying to estimate
    label = new['TOTAL_DAYS_TO_COMPLETE']
    # remove the header
    label = label[1:]

    # create the data, or the data that is to be estimated
    thedata = new.drop('TOTAL_DAYS_TO_COMPLETE', axis=1)
    # remove the header
    thedata = thedata[1:]

    print(label.shape)
    print(thedata.shape)

    # split into training and testing sets
    train_data, train_classes, test_data, test_classes = train_test_split(thedata, label, test_size = 0.3)

    # create a knn model
    knn = neighbors.KNeighborsRegressor()
    # fit it with our data
    knn.fit(train_data, train_classes)

Running it, I get the following:

    C:\Users\jerry\Desktop>python test.py
    (6262,)
    (6262, 253)
    Traceback (most recent call last):
      File "test.py", line 37, in <module>
        knn.fit(train_data, train_classes)
      File "C:\Python367-64\lib\site-packages\sklearn\neighbors\base.py", line 872, in fit
        X, y = check_X_y(X, y, "csr", multi_output=True)
      File "C:\Python367-64\lib\site-packages\sklearn\utils\validation.py", line 729, in check_X_y
        check_consistent_length(X, y)
      File "C:\Python367-64\lib\site-packages\sklearn\utils\validation.py", line 205, in check_consistent_length
        " samples: %r" % [int(l) for l in lengths])
    ValueError: Found input variables with inconsistent numbers of samples: [4383, 1879]

So it looks like my X and Y have the same number of rows (6262) but different numbers of columns, since I thought Y was supposed to be just the one column of labels or values you are trying to predict.

How can I use train_test_split to give me training and test datasets that I can use with a KNN regressor?

Answer 1

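As far as I can tell, you have swapped the outputs of train_test_split.

The function returns, in order: training features, test features, training labels, test labels.

The common naming convention is X_train, X_test, y_train, y_test = ..., where X is the features (the input columns) and y is the target (the labels, which I assume are the "classes" in your code).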
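To make the return order concrete, here is a minimal sketch on made-up data. The arrays X and y are hypothetical stand-ins for thedata and label, and random_state is set only so the split is repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 10 samples, 3 feature columns, 1 target value each.
X = np.arange(30).reshape(10, 3)
y = np.arange(10)

# Return order: both feature splits first (train, test),
# then both label splits (train, test).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Training features and training labels share a row count,
# and so do test features and test labels.
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)
```

Whatever pair you pass to fit must have matching row counts, which is what the check in the traceback enforces.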

You seem to be expecting it to return X_train, y_train, X_test, y_test instead.

Try this and see if it works for you:

train_data, test_data, train_classes, test_classes = train_test_split(thedata, label, test_size = 0.3)
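The two numbers in the error message fit this diagnosis. With the original unpacking, train_data holds the training features and train_classes actually holds the test features, so fit receives two inputs with different row counts. A rough check of the sizes, assuming the test set gets the rounded-up share of the rows and the training set gets the rest:

```python
import math

n_samples = 6262   # rows reported by print(label.shape) in the question
test_size = 0.3    # value passed to train_test_split

# Assumption: the test set gets ceil(test_size * n_samples) rows
# and the training set gets the remainder.
n_test = math.ceil(test_size * n_samples)
n_train = n_samples - n_test

print(n_train, n_test)  # 4383 1879
```

Those are the same two values as in ValueError: [4383, 1879], i.e. training rows paired with test rows.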

train_test_split produces inconsistent samples