LDA: different sample sizes per class

Time: 2016-09-20 13:59:17

Tags: python scikit-learn

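I am trying to fit an LDA model on a dataset whose classes have different sample sizes.

TL;DR

If I train the classifier with classes that do not all have the same number of samples, lda.predict() does not work properly.

Long explanation

I have 7 classes with 3 samples each and one class with only 2 samples: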

tortle    -14,6379  -17,3731
tortle    -14,9339  -17,4379
bull      -11,7777  -13,1383
bull      -11,6207  -13,4596
bull      -11,4616  -12,9811
hawk      -9,01229  -12,777
hawk      -8,88177  -12,4383
hawk      -8,93559  -13,0143
pikachu   -6,50024  -7,92564
pikachu   -6,00418  -8,59305
pikachu   -6,0769   -6,00419
pizza     2,02872   3,07972
pizza     2,084     2,73762
pizza     2,20269   2,90577
sangoku   -3,14428  -3,14415
sangoku   -4,02675  -3,55358
sangoku   -3,26119  -2,95265
charizard -0,159746 0,434694
charizard 0,0191964 0,514596
charizard 0,0422884 0,512207
tomatoe   -1,15295  -2,09673
tomatoe   -0,562748 -1,80215
tomatoe   -0,716941 -1,83503
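
The table above uses decimal commas, so it cannot be passed to float() directly. As a minimal sketch (the names raw, labels, samples and counts are illustrative and not part of the original script), the first few rows can be parsed and the per-class sample counts checked like this:

```python
from collections import Counter

import numpy as np

# Excerpt of the table above, copied verbatim (decimal commas included).
raw = """\
tortle    -14,6379  -17,3731
tortle    -14,9339  -17,4379
bull      -11,7777  -13,1383
bull      -11,6207  -13,4596
bull      -11,4616  -12,9811
"""

labels, rows = [], []
for line in raw.splitlines():
    name, x, y = line.split()
    labels.append(name)
    # Replace the decimal comma with a dot before converting to float.
    rows.append([float(x.replace(',', '.')), float(y.replace(',', '.'))])

samples = np.array(rows)   # shape: (n_samples, 2)
counts = Counter(labels)   # number of samples per class
print(samples.shape, dict(counts))
```

On the full table the same loop should report 2 samples for tortle and 3 for every other class.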

Here is a working example:

#!/usr/bin/python
# coding: utf-8

from matplotlib import pyplot as plt
import numpy as np
from sklearn import preprocessing
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis as LDA

analytes = ['tortle', 'tortle', 'bull', 'bull', 'bull', 'hawk', 'hawk', 'hawk', 'pikachu', 'pikachu', 'pikachu', 'pizza', 'pizza', 'pizza', 'sangoku', 'sangoku', 'sangoku', 'charizard', 'charizard', 'charizard', 'tomatoe', 'tomatoe', 'tomatoe']

# Transform the names of the samples into integers
lb = preprocessing.LabelEncoder().fit(analytes)
analytes = lb.transform(analytes)


# Create an array w/ the measurements
dimensions = [[-14.6379, -14.9339, -11.7777, -11.6207, -11.4616, -9.01229, -8.88177, -8.93559, -6.50024, -6.00418, -6.0769, 2.02872, 2.084, 2.20269, -3.14428, -4.02675, -3.26119, -0.159746, 0.0191964, 0.0422884, -1.15295, -0.562748, -0.716941], [-17.3731, -17.4379, -13.1383, -13.4596, -12.9811, -12.777, -12.4383, -13.0143, -7.92564, -8.59305, -6.00419, 3.07972, 2.73762, 2.90577, -3.14415, -3.55358, -2.95265, 0.434694, 0.514596, 0.512207, -2.09673, -1.80215, -1.83503]]

# Transform the array of the results
all_samples = np.array(dimensions).T

# Normalize the data
preprocessing.scale(all_samples, axis=0, with_mean=True, with_std=True,
                    copy=False)

# Train the LDA classifier. Use the eigen solver
lda = LDA(solver='eigen', n_components=2)
transformed = lda.fit_transform(all_samples, analytes)


# Fit the LDA classifier on the new subspace
lda.fit(transformed, analytes)

fig = plt.figure()

plt.plot(transformed[:, 0], transformed[:, 1], 'o')

# Get the limits of the graph. Used for adapted color areas
x_min, x_max = fig.axes[0].get_xlim()
y_min, y_max = fig.axes[0].get_ylim()

# Step size of the mesh. Decrease to increase the quality of the plot.
# A class is predicted for each point in the mesh [x_min, x_max] x [y_min, y_max].
# h = 0.01
h = 0.001

# Create a grid for incoming plottings
xx, yy = np.meshgrid(np.arange(x_min, x_max, h), np.arange(y_min, y_max, h))

# Predict the class for each unit of the grid
Z = lda.predict(np.c_[xx.ravel(), yy.ravel()])

Z = Z.reshape(xx.shape)

# Plot the areas
plt.imshow(Z, extent=(x_min, x_max, y_min, y_max), aspect='auto', origin='lower', alpha=0.6)

plt.show()

Here is the output:

enter image description here

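As you can see, the two points on the right are assigned to the purple class, which they should not be. They should belong to the yellow class, which becomes visible if I widen the limits of the plot: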

enter image description here

Basically, my problem is that lda.predict() does not work properly if I train the classifier with classes that do not have the same number of samples.
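
One way to check whether class imbalance is involved at all: by default, scikit-learn's LinearDiscriminantAnalysis estimates the class priors from the class frequencies in the training data, so a smaller class gets a smaller prior. The sketch below uses hypothetical toy data (not the measurements above) and compares the default priors with uniform ones passed through the priors parameter. It is only a diagnostic under those assumptions, and it does not establish that priors cause the regions seen in the plot.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# Hypothetical toy data: three 2-D classes, the first with only 2 samples.
rng = np.random.RandomState(0)
X = np.vstack([rng.normal(loc=center, scale=0.3, size=(n, 2))
               for center, n in [(-3.0, 2), (0.0, 3), (3.0, 3)]])
y = np.repeat([0, 1, 2], [2, 3, 3])

# Default: priors are inferred from the class frequencies (2/8, 3/8, 3/8).
default_lda = LinearDiscriminantAnalysis(solver='eigen').fit(X, y)

# Alternative: give every class the same prior.
n_classes = len(np.unique(y))
uniform_lda = LinearDiscriminantAnalysis(
    solver='eigen', priors=np.full(n_classes, 1.0 / n_classes)).fit(X, y)

print(default_lda.priors_)
print(uniform_lda.priors_)

# Fraction of points on a coarse grid where the two models disagree.
gx, gy = np.meshgrid(np.linspace(-5, 5, 50), np.linspace(-5, 5, 50))
grid = np.c_[gx.ravel(), gy.ravel()]
disagreement = np.mean(default_lda.predict(grid) != uniform_lda.predict(grid))
print(disagreement)
```

If the disagreement is close to zero on the real data as well, the unequal class sizes are unlikely to explain the misassigned points.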

Is there a workaround?

1 answer:

Answer 0 (score: 0)

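It took me a while to figure this out. The preprocessing step is responsible for the misclassification. Changing

preprocessing.scale(all_samples, axis=0, with_mean=True, with_std=True,
                copy=False)

to

preprocessing.scale(all_samples, axis=0, with_mean=True, with_std=True)

solved my problem. However, my data is now no longer scaled the same way.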
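
A note on why the scaling differs: with the default copy=True, preprocessing.scale returns a scaled copy and leaves its input untouched, so calling it without keeping the return value leaves all_samples unscaled. A minimal sketch, assuming a toy array X built from four rows of the question's data (the names X, original and X_scaled are illustrative):

```python
import numpy as np
from sklearn import preprocessing

# Four rows taken from the measurements in the question.
X = np.array([[-14.6379, -17.3731],
              [-11.7777, -13.1383],
              [2.02872, 3.07972],
              [-0.159746, 0.434694]])
original = X.copy()

# Default copy=True: the scaled array is returned and discarded here,
# so X itself is left as it was.
preprocessing.scale(X, axis=0, with_mean=True, with_std=True)
unchanged = np.array_equal(X, original)

# Keeping the return value makes the scaling explicit.
X_scaled = preprocessing.scale(X, axis=0, with_mean=True, with_std=True)

print(unchanged)
print(X_scaled.mean(axis=0), X_scaled.std(axis=0))
```

If scaled input is wanted, assigning the result (for example all_samples = preprocessing.scale(all_samples)) avoids relying on in-place behaviour, which the scikit-learn documentation does not guarantee for copy=False.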