我正在试验距离上的权重影响kNN算法性能的方式,以及我正在使用虹膜数据集的可重复示例。
令我惊讶的是,加权2个预测因子比其他2个预测因子高100倍,与未加权模型产生相同的预测。这个相当违反直觉的发现是什么?
我的代码如下:
X_original = iris['data']
Y = iris['target']
sc = StandardScaler() # Defines the parameters of the Scaler
X = sc.fit_transform(X_original) # Transforms the original data to standardized data and returns them
from sklearn.model_selection import StratifiedShuffleSplit
sss = StratifiedShuffleSplit(n_splits = 1, train_size = 0.8, test_size = 0.2)
split = sss.split(X, Y)
s = list(split)
train_index = s[0][0]
test_index = s[0][1]
X_train = X[train_index, ]
X_test = X[test_index, ]
Y_train = Y[train_index]
Y_test = Y[test_index]
from sklearn.neighbors import KNeighborsClassifier
knn = KNeighborsClassifier(n_neighbors = 6)
iris_fit = knn.fit(X_train, Y_train) # The data can be passed as numpy arrays or pandas dataframes/series.
# All the data should be numeric
# There should be no NaNs
predictions_w1 = knn.predict(X_test)
weights = np.array([1, 1, 100, 100])
weights =weights/np.sum(weights)
knn_w = KNeighborsClassifier(n_neighbors = 6, metric='wminkowski', p=2,
metric_params={'w': weights})
iris_fit_w = knn_w.fit(X_train, Y_train) # The data can be passed as numpy arrays or pandas dataframes/series.
# All the data should be numeric
# There should be no NaNs
predictions_w100 = knn_w.predict(X_test)
(predictions_w1 != predictions_w100).sum()
0
答案 0 :(得分:0)