I am using sklearn's train_test_split to create a training set and a test set from my data.
My script is below:
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn import neighbors
# function to perform one hot encoding and dropping the original item
# in this case it's the part number
def encode_and_bind(original_dataframe, feature_to_encode):
    dummies = pd.get_dummies(original_dataframe[[feature_to_encode]])
    res = pd.concat([original_dataframe, dummies], axis=1)
    res = res.drop([feature_to_encode], axis=1)
    return(res)
# read in data from csv
data = pd.read_csv('export2.csv')
# one hot encode the part number
new = encode_and_bind(data, 'PART_NO')
# create the labels, or field we are trying to estimate
label = new['TOTAL_DAYS_TO_COMPLETE']
# remove the header
label = label[1:]
# create the data, or the data that is to be estimated
thedata = new.drop('TOTAL_DAYS_TO_COMPLETE', axis=1)
# remove the header
thedata = thedata[1:]
print(label.shape)
print(thedata.shape)
# split into training and testing sets
train_data, train_classes, test_data, test_classes = train_test_split(thedata, label, test_size = 0.3)
# create a knn model
knn = neighbors.KNeighborsRegressor()
# fit it with our data
knn.fit(train_data, train_classes)
Running it, I get the following:
C:\Users\jerry\Desktop>python test.py
(6262,)
(6262, 253)
Traceback (most recent call last):
  File "test.py", line 37, in <module>
    knn.fit(train_data, train_classes)
  File "C:\Python367-64\lib\site-packages\sklearn\neighbors\base.py", line 872, in fit
    X, y = check_X_y(X, y, "csr", multi_output=True)
  File "C:\Python367-64\lib\site-packages\sklearn\utils\validation.py", line 729, in check_X_y
    check_consistent_length(X, y)
  File "C:\Python367-64\lib\site-packages\sklearn\utils\validation.py", line 205, in check_consistent_length
    " samples: %r" % [int(l) for l in lengths])
ValueError: Found input variables with inconsistent numbers of samples: [4383, 1879]
So it appears that my X and Y have the same number of rows (6262) but a different number of columns, which is what I expected, since I thought Y was supposed to be a single column of the labels or values you are trying to predict.
How can I use train_test_split to give me training and testing data sets that I can use with a KNN regressor?
Answer 0 (score: 2)
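As far as I can tell, you have switched the outputs of train_test_split.
The function returns, in order: training features, testing features, training labels, testing labels.
The common naming convention is X_train, X_test, y_train, y_test = ..., where X is the features (the columns) and y is the targets (the labels, which I assume are the "classes" in your code).
You appear to be expecting it to return X_train, y_train, X_test, y_test.
Try this and see if it works for you: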
train_data, test_data, train_classes, test_classes = train_test_split(thedata, label, test_size = 0.3)
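For what it's worth, the numbers in the error fit this explanation: with test_size = 0.3, 6262 rows split into 4383 training rows and 1879 test rows, so the original call ended up pairing the 4383-row training features with the 1879-row test features. Below is a minimal sketch of the return order on synthetic data. The array contents, the sizes, and the random_state value are assumptions made for illustration and are not taken from the question's CSV.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor

# Synthetic stand-in for the question's data (assumed): 100 rows, 5 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = X.sum(axis=1)  # one target value per row

# Return order: both feature splits first, then both label splits.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# Features and labels in each split now have matching row counts.
print(X_train.shape, y_train.shape)
print(X_test.shape, y_test.shape)

# Fitting works because X_train and y_train describe the same rows.
knn = KNeighborsRegressor()
knn.fit(X_train, y_train)
predictions = knn.predict(X_test)
```

Printing the shapes right after the split, as the question already does before it, is a cheap way to catch this kind of mix-up before calling fit.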