Heavily weighted distance returns the same results as the regular distance in kNN on the iris dataset

Date: 2018-05-09 15:39:45

Tags: python scikit-learn distance knn

I am experimenting with how weights on the distance metric affect the performance of the kNN algorithm, using a reproducible example based on the iris dataset.

To my surprise, weighting 2 of the predictors 100 times more heavily than the other 2 yields exactly the same predictions as the unweighted model. What explains this rather counter-intuitive finding?

My code is as follows:

import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler

iris = load_iris()

X_original = iris['data']
Y = iris['target']

sc = StandardScaler()  # Defines the parameters of the scaler

X = sc.fit_transform(X_original)  # Standardizes the original data and returns it

from sklearn.model_selection import StratifiedShuffleSplit

sss = StratifiedShuffleSplit(n_splits = 1, train_size = 0.8, test_size = 0.2)

split = sss.split(X, Y)

s = list(split)

train_index = s[0][0]

test_index = s[0][1]

X_train = X[train_index]

X_test = X[test_index]

Y_train = Y[train_index]

Y_test = Y[test_index]

from sklearn.neighbors import KNeighborsClassifier

knn = KNeighborsClassifier(n_neighbors = 6)

iris_fit = knn.fit(X_train, Y_train)  # The data can be passed as numpy arrays or pandas dataframes/series.
                                                  # All the data should be numeric
                                                  # There should be no NaNs

predictions_w1 = knn.predict(X_test)

weights = np.array([1, 1, 100, 100])
weights = weights / np.sum(weights)

knn_w = KNeighborsClassifier(n_neighbors = 6, metric='wminkowski', p=2, 
                           metric_params={'w': weights})

iris_fit_w = knn_w.fit(X_train, Y_train)  # The data can be passed as numpy arrays or pandas dataframes/series.
                                                  # All the data should be numeric
                                                  # There should be no NaNs

predictions_w100 = knn_w.predict(X_test)

(predictions_w1 != predictions_w100).sum()
0
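For context, the `'wminkowski'` metric used above multiplies each coordinate difference by its weight before raising it to the power `p`, so weighting the distance is exactly equivalent to rescaling the features themselves. A minimal NumPy sketch of this equivalence (the standalone `wminkowski` function here is illustrative, mirroring the metric's definition):

```python
import numpy as np

def wminkowski(x, y, p, w):
    # Weighted Minkowski distance: (sum_i |w_i * (x_i - y_i)|^p)^(1/p)
    return np.sum(np.abs(w * (x - y)) ** p) ** (1.0 / p)

x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([0.5, 1.0, 2.5, 3.0])
w = np.array([1, 1, 100, 100]) / 202.0

# Weighting the metric equals computing an ordinary Euclidean
# distance on features rescaled by w:
d_weighted = wminkowski(x, y, 2, w)
d_scaled = np.linalg.norm(w * x - w * y)
print(np.isclose(d_weighted, d_scaled))  # True
```

This also shows why normalizing the weights by their sum cannot change the predictions: dividing all weights by a constant scales every pairwise distance by the same factor, which leaves the neighbor ranking untouched.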

1 answer:

Answer 0 (score: 0)

They are not always the same. Add a random state to your train/test split and you will see the predictions change for different values:

 StratifiedShuffleSplit(n_splits = 1, train_size = 0.8, test_size = 0.2, random_state=3)

Moreover, a weighted Minkowski distance with such extreme weights on the 3rd (petal length) and 4th (petal width) features essentially gives the same result as if you ran kNN with the unweighted Minkowski distance on those two features alone. And since those two features are highly informative, it is not surprising that you get very similar results to the case where all 4 features are considered. See the Wikipedia image below.
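This point can be checked directly: a kNN model trained only on the two petal features (columns 2 and 3 of the standardized data) should agree with the all-feature model on most test points. A sketch under that assumption, reusing the question's setup with `random_state=3` (the exact agreement fraction depends on the split):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.neighbors import KNeighborsClassifier

iris = load_iris()
X = StandardScaler().fit_transform(iris['data'])
Y = iris['target']

sss = StratifiedShuffleSplit(n_splits=1, train_size=0.8, test_size=0.2,
                             random_state=3)
train_idx, test_idx = next(sss.split(X, Y))

# kNN on all four features vs. kNN on petal length/width only
knn_all = KNeighborsClassifier(n_neighbors=6).fit(X[train_idx], Y[train_idx])
knn_petal = KNeighborsClassifier(n_neighbors=6).fit(X[train_idx][:, 2:4],
                                                    Y[train_idx])

pred_all = knn_all.predict(X[test_idx])
pred_petal = knn_petal.predict(X[test_idx][:, 2:4])

print('agreement:', (pred_all == pred_petal).mean())
```

If the petal-only model and the all-feature model agree on nearly every test point, the extreme weighting in the question cannot move many predictions either.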

From wiki