Let's generate an array:
import numpy as np
data = np.arange(30).reshape(10, 3)
data = data * data
array([[  0,   1,   4],
       [  9,  16,  25],
       [ 36,  49,  64],
       [ 81, 100, 121],
       [144, 169, 196],
       [225, 256, 289],
       [324, 361, 400],
       [441, 484, 529],
       [576, 625, 676],
       [729, 784, 841]])
Then find the eigenvalues and eigenvectors of the covariance matrix:
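import numpy.linalg as la
mn = np.mean(data, axis=0)
# data has an integer dtype, so the float result of an in-place
# `data -= mn` cannot be stored in it; build a new float array instead
data = data - mn
C = np.cov(data.T)
evals, evecs = la.eig(C)
# order the eigenvectors (columns) by decreasing eigenvalue
idx = np.argsort(evals)[::-1]
evecs = evecs[:, idx]
print(evecs)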
array([[-0.53926461, -0.73656433, 0.40824829],
[-0.5765472 , -0.03044111, -0.81649658],
[-0.61382979, 0.67568211, 0.40824829]])
Now let's run the matplotlib.mlab.PCA function on the data:
import matplotlib.mlab as mlab
mpca = mlab.PCA(data)
print(mpca.Wt)
[[ 0.57731894 0.57740574 0.57732612]
[ 0.72184459 -0.03044628 -0.69138514]
[ 0.38163232 -0.81588947 0.43437443]]
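Why are the two matrices different? I thought that to compute a PCA you first find the eigenvectors of the covariance matrix, and that these would be exactly equal to the weights.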
Answer 0 (score: 6)
You need to normalize your data, not just center it, and the output of np.linalg.eig has to be transposed to match the output of mlab.PCA:
>>> n_data = (data - data.mean(axis=0)) / data.std(axis=0)
>>> evals, evecs = np.linalg.eig(np.cov(n_data.T))
>>> evecs = evecs[:, np.argsort(evals)[::-1]].T
>>> mlab.PCA(data).Wt
array([[ 0.57731905, 0.57740556, 0.5773262 ],
[ 0.72182079, -0.03039546, -0.69141222],
[ 0.38167716, -0.8158915 , 0.43433121]])
>>> evecs
array([[-0.57731905, -0.57740556, -0.5773262 ],
[-0.72182079, 0.03039546, 0.69141222],
[ 0.38167716, -0.8158915 , 0.43433121]])
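The rows above agree up to sign, which is expected: an eigenvector is only defined up to a factor of -1, so two routines can legitimately return opposite signs for the same axis. Since mlab.PCA is not available in recent Matplotlib releases, here is a rough NumPy-only sketch of the same comparison. The helper name principal_axes and its sign convention (largest-magnitude entry made positive) are my own choices for illustration, not part of NumPy or Matplotlib, and np.linalg.eigh is swapped in for eig because a covariance matrix is symmetric.

```python
import numpy as np

# Same data as in the question: squares of 0..29 as 10 samples x 3 features.
# Cast to float so that centering and scaling do not hit integer dtype issues.
data = (np.arange(30).reshape(10, 3) ** 2).astype(float)


def principal_axes(x):
    """Hypothetical helper: rows are unit eigenvectors of cov(x),
    ordered by decreasing eigenvalue."""
    # eigh is intended for symmetric matrices; eigenvalues come back ascending.
    evals, evecs = np.linalg.eigh(np.cov(x, rowvar=False))
    axes = evecs[:, ::-1].T  # reverse to descending order, one axis per row
    # Resolve the sign ambiguity with an arbitrary convention:
    # flip each row so its largest-magnitude entry is positive.
    pivot = np.abs(axes).argmax(axis=1)
    signs = np.sign(axes[np.arange(axes.shape[0]), pivot])
    return axes * signs[:, None]


centered = data - data.mean(axis=0)          # centering only (the question)
standardized = centered / data.std(axis=0)   # centering + scaling (the answer)

axes_centered = principal_axes(centered)
axes_standardized = principal_axes(standardized)

print(axes_centered)
print(axes_standardized)
```

The two printed matrices differ because scaling each column by its standard deviation changes the matrix being decomposed (it becomes the correlation matrix rather than the covariance matrix), and the eigenvectors change with it. Only the standardized version should line up with the Wt rows shown above, again up to sign.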