I am trying to fit an LDA model on a dataset whose classes have different sample sizes.
I have 7 classes with 3 samples each, and one class with only 2 samples:
tortle -14,6379 -17,3731
tortle -14,9339 -17,4379
bull -11,7777 -13,1383
bull -11,6207 -13,4596
bull -11,4616 -12,9811
hawk -9,01229 -12,777
hawk -8,88177 -12,4383
hawk -8,93559 -13,0143
pikachu -6,50024 -7,92564
pikachu -6,00418 -8,59305
pikachu -6,0769 -6,00419
pizza 2,02872 3,07972
pizza 2,084 2,73762
pizza 2,20269 2,90577
sangoku -3,14428 -3,14415
sangoku -4,02675 -3,55358
sangoku -3,26119 -2,95265
charizard -0,159746 0,434694
charizard 0,0191964 0,514596
charizard 0,0422884 0,512207
tomatoe -1,15295 -2,09673
tomatoe -0,562748 -1,80215
tomatoe -0,716941 -1,83503
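For reference, this is roughly how such a file could be read into Python (a minimal sketch; the filename data.txt is a placeholder, and the example below simply hard-codes the same values):
import pandas as pd
# Whitespace-separated columns, with a comma as the decimal separator.
df = pd.read_csv('data.txt', sep=r'\s+', decimal=',', header=None,
                 names=['analyte', 'dim1', 'dim2'])
analytes = df['analyte'].tolist()
dimensions = [df['dim1'].tolist(), df['dim2'].tolist()]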
Here is a working example:
#!/usr/bin/python
# coding: utf-8
from matplotlib import pyplot as plt
import numpy as np
from sklearn import preprocessing
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
analytes = ['tortle', 'tortle', 'bull', 'bull', 'bull', 'hawk', 'hawk', 'hawk', 'pikachu', 'pikachu', 'pikachu', 'pizza', 'pizza', 'pizza', 'sangoku', 'sangoku', 'sangoku', 'charizard', 'charizard', 'charizard', 'tomatoe', 'tomatoe', 'tomatoe']
# Transform the names of the samples into integers
lb = preprocessing.LabelEncoder().fit(analytes)
analytes = lb.transform(analytes)
# Create an array w/ the measurements
dimensions = [[-14.6379, -14.9339, -11.7777, -11.6207, -11.4616, -9.01229, -8.88177, -8.93559, -6.50024, -6.00418, -6.0769, 2.02872, 2.084, 2.20269, -3.14428, -4.02675, -3.26119, -0.159746, 0.0191964, 0.0422884, -1.15295, -0.562748, -0.716941], [-17.3731, -17.4379, -13.1383, -13.4596, -12.9811, -12.777, -12.4383, -13.0143, -7.92564, -8.59305, -6.00419, 3.07972, 2.73762, 2.90577, -3.14415, -3.55358, -2.95265, 0.434694, 0.514596, 0.512207, -2.09673, -1.80215, -1.83503]]
# Transform the array of the results
all_samples = np.array(dimensions).T
# Normalize the data
preprocessing.scale(all_samples, axis=0, with_mean=True, with_std=True,
copy=False)
# Train the LDA classifier. Use the eigen solver
lda = LDA(solver='eigen', n_components=2)
transformed = lda.fit_transform(all_samples, analytes)
# Fit the LDA classifier on the new subspace
lda.fit(transformed, analytes)
fig = plt.figure()
plt.plot(transformed[:, 0], transformed[:, 1], 'o')
# Get the limits of the graph. Used for adapted color areas
x_min, x_max = fig.axes[0].get_xlim()
y_min, y_max = fig.axes[0].get_ylim()
# Step size of the mesh. Decrease it to increase the resolution of the
# decision regions over [x_min, x_max] x [y_min, y_max].
# h = 0.01
h = 0.001
# Create a grid for incoming plottings
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))
# Predict the class for each unit of the grid
Z = lda.predict(np.c_[xx.ravel(), yy.ravel()])
Z = Z.reshape(xx.shape)
# Plot the areas
plt.imshow(Z, extent=(x_min, x_max, y_min, y_max), aspect='auto', origin='lower', alpha=0.6)
plt.show()
Here is the output:
As you can see, the two points on the right are assimilated to the purple class, when they should not be. They should belong to the yellow class, which becomes visible if I extend the limits of the plot:
Basically, my problem is that lda.predict() does not work properly when the classifier is trained on classes that do not all have the same number of samples.
Is there a workaround?
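For what it is worth, one idea I have not verified is to pass explicit uniform class priors to the classifier, so that the class with only two samples is not weighted differently from the others. A minimal sketch, reusing analytes and all_samples from the example above:
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA
# Hypothetical workaround: force uniform class priors instead of the
# default priors estimated from the (unbalanced) class frequencies.
n_classes = len(np.unique(analytes))
lda = LDA(solver='eigen', n_components=2,
          priors=np.full(n_classes, 1.0 / n_classes))
transformed = lda.fit_transform(all_samples, analytes)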
Answer 0 (score: 0)
It took me a while to figure this one out. The preprocessing step is responsible for the misclassification. Changing
preprocessing.scale(all_samples, axis=0, with_mean=True, with_std=True,
copy=False)
to
preprocessing.scale(all_samples, axis=0, with_mean=True, with_std=True)
solved my problem. However, my data is no longer being scaled: without copy=False, preprocessing.scale returns a scaled copy that the example discards, so the classifier is now trained on the unscaled values.
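If the data should still be standardized, a minimal sketch (not tested against the plots above) is to keep the scaled copy that preprocessing.scale returns:
from sklearn import preprocessing
# Assign the scaled copy back to all_samples instead of relying on
# in-place modification via copy=False.
all_samples = preprocessing.scale(all_samples, axis=0, with_mean=True, with_std=True)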