train_test_split produces inconsistent numbers of samples

Time: 2019-07-02 16:59:15

Tags: python python-3.x scikit-learn

I am using sklearn's train_test_split to create training and test sets from my data.

My script is below:

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn import neighbors

    # function to perform one hot encoding and dropping the original item
    # in this case its the part number
    def encode_and_bind(original_dataframe, feature_to_encode):
        dummies = pd.get_dummies(original_dataframe[[feature_to_encode]])
        res = pd.concat([original_dataframe, dummies], axis=1)
        res = res.drop([feature_to_encode], axis=1)
        return(res)

    # read in data from csv
    data = pd.read_csv('export2.csv')

    # one hot encode the part number
    new = encode_and_bind(data, 'PART_NO')

    # create the labels, or field we are trying to estimate
    label = new['TOTAL_DAYS_TO_COMPLETE']
    # remove the header
    label = label[1:]

    # create the data, or the data that is to be estimated
    thedata = new.drop('TOTAL_DAYS_TO_COMPLETE', axis=1)
    # remove the header
    thedata = thedata[1:]

    print(label.shape)
    print(thedata.shape)

    # # split into training and testing sets
    train_data, train_classes, test_data, test_classes = train_test_split(thedata, label, test_size = 0.3)

    # create a knn model
    knn = neighbors.KNeighborsRegressor()
    # fit it with our data
    knn.fit(train_data, train_classes)

Running it, I get the following:

    C:\Users\jerry\Desktop>python test.py
    (6262,)
    (6262, 253)
    Traceback (most recent call last):
      File "test.py", line 37, in <module>
        knn.fit(train_data, train_classes)
      File "C:\Python367-64\lib\site-packages\sklearn\neighbors\base.py", line 872, in fit
        X, y = check_X_y(X, y, "csr", multi_output=True)
      File "C:\Python367-64\lib\site-packages\sklearn\utils\validation.py", line 729, in check_X_y
        check_consistent_length(X, y)
      File "C:\Python367-64\lib\site-packages\sklearn\utils\validation.py", line 205, in check_consistent_length
        " samples: %r" % [int(l) for l in lengths])
    ValueError: Found input variables with inconsistent numbers of samples: [4383, 1879]

So it appears that my X and Y have the same number of rows (6262) but different numbers of columns, since I thought Y was only supposed to be a single column holding the label or value you are trying to predict.

How can I use train_test_split to give me training and test datasets that I can use with a KNN regressor?

1 Answer:

Answer 0: (score: 2)

As far as I can tell, you have swapped the outputs of train_test_split.

The function returns, in order: training features, testing features, training labels, testing labels.

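The common naming convention is X_train, X_test, y_train, y_test = ..., where X is the features (the input columns) and y is the target (the label, or what I assume is "classes" in your code).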
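To make the return order concrete, here is a minimal sketch on synthetic data. The arrays X and y are made-up stand-ins rather than the asker's CSV, and random_state=0 is only there to make the run repeatable.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-ins (not the asker's data): 10 rows, 3 feature columns.
# Row i of X is [3i, 3i+1, 3i+2] and its label is i, so alignment is easy to check.
X = np.arange(30).reshape(10, 3)
y = np.arange(10)

# With two input arrays, the outputs come back grouped per input:
# (X train, X test, y train, y test).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

print(X_train.shape, X_test.shape)  # (7, 3) (3, 3)
print(y_train.shape, y_test.shape)  # (7,) (3,)

# Features and labels stay row-aligned within each split.
print(np.array_equal(X_train[:, 0] // 3, y_train))  # True
```

The first two outputs share the column count of X, and the last two share the shape of y, which is a quick way to spot a swapped unpacking.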

You appear to be expecting it to return X_train, y_train, X_test, y_test instead.
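The numbers in the traceback are consistent with this. As a rough sketch, assuming zero-filled arrays with the same shapes as those printed in the question (6262 rows, 253 columns), unpacking in the question's order pairs the training features with the test features:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder arrays with the shapes printed in the question; the values are irrelevant here.
thedata = np.zeros((6262, 253))
label = np.zeros(6262)

# Same unpacking order as the question's script: the second name actually
# receives the TEST features, not the training labels.
train_data, train_classes, test_data, test_classes = train_test_split(
    thedata, label, test_size=0.3
)

print(len(train_data), len(train_classes))  # 4383 1879
print(train_classes.shape)                  # (1879, 253): features, not labels
```

With test_size=0.3, 30% of 6262 rows rounds up to 1879 test rows, leaving 4383 for training, which matches the [4383, 1879] reported by the ValueError.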

Try this and see if it works for you:

    train_data, test_data, train_classes, test_classes = train_test_split(thedata, label, test_size = 0.3)
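For completeness, a small end-to-end sketch of the corrected call feeding a KNN regressor. The data here is randomly generated for illustration (the real script reads export2.csv), and the variable names simply mirror the question.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Randomly generated stand-in data: 200 rows, 5 feature columns (an assumption for this sketch).
rng = np.random.default_rng(0)
thedata = rng.normal(size=(200, 5))
label = 2.0 * thedata[:, 0] + rng.normal(scale=0.1, size=200)

# Corrected unpacking order: features first (train, test), then labels (train, test).
train_data, test_data, train_classes, test_classes = train_test_split(
    thedata, label, test_size=0.3, random_state=0
)

knn = KNeighborsRegressor()
knn.fit(train_data, train_classes)      # lengths now agree: 140 and 140
predictions = knn.predict(test_data)

print(predictions.shape)  # (60,)
```

The held-out test_data and test_classes can then be passed to knn.score to estimate how well the model generalizes.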