Why does sklearn's Perceptron predict with accuracy, precision, etc. of 1?

Time: 2016-09-11 09:50:41

Tags: python scikit-learn

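I am using sklearn.linear_model.Perceptron on a synthetic dataset that I created. The data consists of 2 classes, each a multivariate Gaussian distribution, with a common non-diagonal covariance matrix. The class centroids are close enough that there should be significant overlap.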

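import numpy as np
from sklearn.model_selection import train_test_split

# Class means: all ones for class 1, all twos for class 2 (20 features)
mean1 = np.ones((20,))
mean2 = 2 * np.ones((20,))

# Shared non-diagonal covariance matrix, positive semidefinite by construction
A = 0.1 * np.random.randn(20,20)
cov = np.dot(A, A.T)

# Draw 2000 samples per class
class1 = np.random.multivariate_normal(mean1, cov, 2000)
class2 = np.random.multivariate_normal(mean2, cov, 2000)

# Append the class label (1 or 2) as column 20
class1 = np.concatenate((class1, np.ones((len(class1), 1))), axis=1)
class2 = np.concatenate((class2, 2*np.ones((len(class2), 1))), axis=1)

# Split each class 70/30, then merge the per-class splits
class1_train, class1_test = train_test_split(class1, test_size=0.3)
class2_train, class2_test = train_test_split(class2, test_size=0.3)
train = np.concatenate((class1_train, class2_train), axis=0)
test = np.concatenate((class1_test, class2_test), axis=0)

# Shuffle rows, then separate features (columns 0-19) from labels (column 20)
np.random.shuffle(train)
np.random.shuffle(test)
y_train = train[:,20]
x_train = train[:,0:20]
y_test = test[:,20]
x_test = test[:,0:20]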

After saving this data, I simply used:

import sklearn.linear_model
import sklearn.metrics

# Fit a Perceptron with default settings and score it on the held-out data
classifier = sklearn.linear_model.Perceptron()
classifier.fit(x_train, y_train)
predicted_test = classifier.predict(x_test)
accuracy = sklearn.metrics.accuracy_score(y_test, predicted_test)
precision = sklearn.metrics.precision_score(y_test, predicted_test)
recall = sklearn.metrics.recall_score(y_test, predicted_test)
f_measure = sklearn.metrics.f1_score(y_test, predicted_test)
print(accuracy, precision, recall, f_measure)

The data overlap by design, yet the linear classifier predicts perfectly: accuracy, precision, recall, and F-measure are all 1.

1 answer:

Answer 0: (score: -1)

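The correct way to use train_test_split (in sklearn.cross_validation at the time of writing; in sklearn.model_selection since that module was removed in scikit-learn 0.20) is to give it the complete dataset and let it partition the data into x_train, x_test, y_train, y_test.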

The following code works better:

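from sklearn import model_selection  # replaces sklearn.cross_validation (removed in 0.20)

# Sample both classes (mean1, mean2 and cov as defined in the question)
class1 = np.random.multivariate_normal(mean1, cov, 2000)
class2 = np.random.multivariate_normal(mean2, cov, 2000)

# Append the class label (1 or 2) as column 20
class1 = np.concatenate((class1, np.ones((len(class1), 1))), axis=1)
class2 = np.concatenate((class2, 2*np.ones((len(class2), 1))), axis=1)

# Stack both classes into a single dataset
dataset = np.concatenate((class1, class2), axis=0)

# Optional: train_test_split already shuffles by default
np.random.shuffle(dataset)

# A single call splits features and labels consistently
x_train, x_test, y_train, y_test = \
    model_selection.train_test_split(dataset[:,:20], dataset[:,20], test_size=0.3)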
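For reference, here is a minimal end-to-end sketch of the single-split approach. It assumes scikit-learn 0.20 or later, where train_test_split lives in sklearn.model_selection. The fixed seed, the random_state values, and the stratify argument are assumptions added for illustration and are not part of the original code; stratify=y keeps both classes at exactly 30% in the test set, which is what the per-class split in the question achieved by hand.

```python
import numpy as np
from sklearn.linear_model import Perceptron
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

np.random.seed(0)  # assumption: fixed seed so the run is repeatable

# Same construction as in the question: two Gaussians with a shared covariance
mean1 = np.ones(20)
mean2 = 2 * np.ones(20)
A = 0.1 * np.random.randn(20, 20)
cov = np.dot(A, A.T)

# Keep features and labels in separate arrays instead of a label column
X = np.concatenate((np.random.multivariate_normal(mean1, cov, 2000),
                    np.random.multivariate_normal(mean2, cov, 2000)), axis=0)
y = np.concatenate((np.ones(2000), 2 * np.ones(2000)))

# One call shuffles and splits features and labels together
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

classifier = Perceptron()
classifier.fit(x_train, y_train)
accuracy = accuracy_score(y_test, classifier.predict(x_test))
print(x_train.shape, x_test.shape, accuracy)
```

Passing X and y separately also avoids having to slice the label column back out afterwards.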

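Note that the Perceptron really can reach 100% accuracy on this data: because A is scaled by 0.1, the spread of each class is small relative to the distance between the means, so the two classes barely overlap. Try adding some noise to get a feel for it.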

For example:

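from sklearn import model_selection  # replaces sklearn.cross_validation (removed in 0.20)

# Unit-variance Gaussian noise, one value per sample and feature
noise = np.random.normal(0,1,(4000, 20))

# Perturb only the feature columns; the label column (20) stays untouched
dataset[:, 0:20] = dataset[:, 0:20] + noise

x_train, x_test, y_train, y_test = \
    model_selection.train_test_split(dataset[:,:20], dataset[:,20], test_size=0.3)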
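To put a number on how much the classes overlap before and after adding noise, one option is the Mahalanobis distance d between the two class means. For two equally likely Gaussians with a shared covariance, the best achievable error rate is Phi(-d/2), where Phi is the standard normal CDF. The sketch below is only an illustration under stated assumptions: the helper names mahalanobis, bayes_error, and holdout_accuracy are hypothetical, the seed is fixed for repeatability, and unit-variance noise is modelled as adding the identity matrix to the covariance.

```python
import math

import numpy as np
from sklearn.linear_model import Perceptron
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)  # assumption: fixed seed for repeatability

mean1 = np.ones(20)
mean2 = 2 * np.ones(20)
A = 0.1 * rng.standard_normal((20, 20))
cov = A @ A.T


def mahalanobis(delta, covariance):
    """Distance between two class means under a shared covariance."""
    return math.sqrt(delta @ np.linalg.solve(covariance, delta))


def bayes_error(d):
    """Optimal error for two equal-prior Gaussians at distance d: Phi(-d/2)."""
    return 0.5 * math.erfc(d / (2 * math.sqrt(2)))


def holdout_accuracy(features, labels):
    """Fit a default Perceptron on 70% of the data and score it on the rest."""
    x_tr, x_te, y_tr, y_te = train_test_split(
        features, labels, test_size=0.3, random_state=0)
    return Perceptron(random_state=0).fit(x_tr, y_tr).score(x_te, y_te)


delta = mean2 - mean1
d_clean = mahalanobis(delta, cov)
# Independent N(0, 1) noise on every feature adds the identity to the covariance
d_noisy = mahalanobis(delta, cov + np.eye(20))

X = np.vstack([rng.multivariate_normal(mean1, cov, 2000),
               rng.multivariate_normal(mean2, cov, 2000)])
y = np.concatenate([np.ones(2000), 2 * np.ones(2000)])
X_noisy = X + rng.normal(0, 1, X.shape)

acc_clean = holdout_accuracy(X, y)
acc_noisy = holdout_accuracy(X_noisy, y)

print("clean:", d_clean, bayes_error(d_clean), acc_clean)
print("noisy:", d_noisy, bayes_error(d_noisy), acc_noisy)
```

A large d on the clean data means the optimal error is practically zero, which is consistent with a linear classifier scoring 1 on every metric. With the noise added, d can be at most sqrt(20), so some errors become unavoidable.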