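I used the sklearn NearestNeighbors package to classify a dataset. It worked fine until I tried to use 'distance' weighting in the KNN prediction. When I switch from 'uniform' weights to 'distance' weights, I get the error message "negative dimensions are not allowed". 'uniform' weights work fine.
The error message is as follows: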
/home/linux/.local/lib/python2.7/site-packages/sklearn/neighbors/regression.py:160: RuntimeWarning: invalid value encountered in divide
  y_pred[:, j] = num / denom
Traceback (most recent call last):
  File "analysis.py", line 333, in <module>
    main()
  File "analysis.py", line 330, in main
    ind_test_labels, trainIDs, ind_test_IDs, train_data_original, ind_test_data_original)
  File "analysis.py", line 297, in target1
    outfile = generate_result(X, feature_names, train_label, outfile, trainIDs, train_labels, best_k, train_data_original, ind_test_data_original)
  File "analysis.py", line 130, in generate_result
    predicted_label = regressor.predict(test)
  File "/home/linux/.local/lib/python2.7/site-packages/sklearn/neighbors/regression.py", line 144, in predict
    neigh_dist, neigh_ind = self.kneighbors(X)
  File "/home/linux/.local/lib/python2.7/site-packages/sklearn/neighbors/base.py", line 332, in kneighbors
    return_distance=return_distance)
  File "binary_tree.pxi", line 1313, in sklearn.neighbors.kd_tree.BinaryTree.query (sklearn/neighbors/kd_tree.c:10528)
  File "binary_tree.pxi", line 595, in sklearn.neighbors.kd_tree.NeighborsHeap.__init__ (sklearn/neighbors/kd_tree.c:4937)
ValueError: negative dimensions are not allowed
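I am confused by the error message. My only guess is that identical instances appear in both the training and test sets, so the inverse of their distance causes a division by zero. But that seems unlikely to happen with 6 features.
Can anyone point out what went wrong? Or can you suggest what further details I should provide?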
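To make the guess above concrete, here is a minimal NumPy sketch of what I think could be happening. It assumes the 'distance' weights are a plain 1/d with no special handling of d == 0, which is only my reading of the `num / denom` line in the warning, not something I have confirmed in the sklearn source. The helper names `inverse_distance_average` and `count_shared_rows` are made up for illustration.

```python
import numpy as np

def inverse_distance_average(distances, labels):
    """Weighted mean of labels with weights 1/d (assumed: no special case for d == 0)."""
    distances = np.asarray(distances, dtype=float)
    labels = np.asarray(labels, dtype=float)
    # Silence the warnings here so the resulting value can be inspected directly.
    with np.errstate(divide="ignore", invalid="ignore"):
        weights = 1.0 / distances        # a zero distance becomes inf
        num = np.sum(weights * labels)   # the sum is inf (or nan) once an inf weight is present
        denom = np.sum(weights)          # inf as well
        return num / denom               # inf / inf is nan: "invalid value encountered in divide"

def count_shared_rows(train, test):
    """Count test rows that also occur, value for value, in train."""
    train_rows = {tuple(row) for row in np.asarray(train).tolist()}
    return sum(tuple(row) in train_rows for row in np.asarray(test).tolist())
```

Calling `count_shared_rows(train, test)` inside each fold would tell me whether exact duplicates exist at all. If it returns 0 everywhere, the division-by-zero guess is probably wrong and the problem lies elsewhere.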
------ Update ------
I have pasted the code segment that fails. The training X is read and processed as follows:
train_data = np.loadtxt(...)
train_data = preprocessing.scale(train_data)
X_T = train_data.T
X = X_T[list(features)].T  # features is a tuple that contains columns to be selected for classification
# Then X is passed to generate_result below
#######################################
def generate_result(X, feature_names, train_label, outfile, IDs, labels, k, train_original, ind_test_original):
    """
    Purpose: this function does the analysis and outputs the result to file
    Inputs: training set, names of selected features, training set labels, file writer stream, IDs of training set,
            labels of training set, number of neighbors, original training set, independent test set
    Returns: file writer stream
    """
    cv = cross_validation.KFold(len(X), 10)  # 10-fold cross-validation
    feature_str = ','.join(feature_names)
    outfile.write('Best K = ' + str(k) + '\n')
    outfile.write('10-Fold Cross Validation begins \n')
    numCV = 1  # predicted_GFR_str = array_to_string(predicted_label)
    for traincv, testcv in cv:
        outfile.write('Iteration: ' + str(numCV) + '\n')
        outfile.write(complete_features + ',label' + str(numCV) + ',Catagory' + str(numCV) + '\n')
        train = X[traincv]
        test = X[testcv]
        ### run regression
        regressor = KNeighborsRegressor(n_neighbors=k, weights='distance', p=1)
        label_cv_train = train_label[traincv]
        regressor.fit(train, label_cv_train)
        test = X[testcv]
        label_cv_test = train_label[testcv]
        predicted_label = regressor.predict(test)  # THIS LINE IS CAUSING THE PROBLEM
        # more code below not pasted
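In case it helps narrow things down, below is a sketch of the sanity checks I could run just before the failing `predict` call. It is a generic diagnostic, not a known cause of the error. I am only assuming that the shapes, the value of `k`, and any NaN or inf values in the fold are worth ruling out, since the ValueError is raised while `kneighbors` sets up its query. The function name `check_knn_inputs` is hypothetical.

```python
import numpy as np

def check_knn_inputs(train, test, k):
    """Return a list of problems found in the fold; an empty list means none were detected."""
    problems = []
    train = np.asarray(train, dtype=float)
    test = np.asarray(test, dtype=float)
    # Both arrays should be (n_samples, n_features).
    if train.ndim != 2 or test.ndim != 2:
        problems.append("train and test must both be 2-D (samples x features)")
        return problems
    if train.shape[1] != test.shape[1]:
        problems.append("feature count differs: %d vs %d" % (train.shape[1], test.shape[1]))
    # NaN or inf can appear after scaling, e.g. for a constant or corrupted column.
    if not np.isfinite(train).all():
        problems.append("train contains non-finite values")
    if not np.isfinite(test).all():
        problems.append("test contains non-finite values")
    # k should be a positive integer no larger than the number of training rows.
    if not isinstance(k, (int, np.integer)) or k < 1:
        problems.append("k must be a positive integer, got %r" % (k,))
    elif k > train.shape[0]:
        problems.append("k=%d exceeds the %d training rows" % (k, train.shape[0]))
    return problems
```

Inside the loop this would be called as `check_knn_inputs(train, test, k)` right after `regressor.fit(...)`, writing any returned messages to `outfile` together with `numCV`, so I can see which fold (if any) looks suspicious.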