Question

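My question mainly comes from this post: https://stats.stackexchange.com/questions/53/pca-on-correlation-or-covariance

In that post, the author plots a vector for each variable, with a direction and a length. As I understand it, after performing PCA all we get are the eigenvectors and eigenvalues. For a dataset of dimension M x N, each eigenvector should be a 1 x N vector. So my question is: the length of the plotted vector might be the eigenvalue, but how is the direction of the vector for each variable computed mathematically? And what is the physical meaning of the vector's length?

Also, if possible, can I do something similar in Python with the scikit-learn PCA function?

Thanks!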

Answer 1

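This plot is called a biplot, and it is very useful for understanding PCA results. The direction and length of each vector come from the values that each feature/variable has on the principal components (also known as the PCA loadings).

Example:

These loadings can be accessed with print(pca.components_). Using the Iris dataset, the loadings are: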

  [[ 0.52106591, -0.26934744,  0.5804131 ,  0.56485654],
   [ 0.37741762,  0.92329566,  0.02449161,  0.06694199],
   [-0.71956635,  0.24438178,  0.14212637,  0.63427274],
   [-0.26128628,  0.12350962,  0.80144925, -0.52359713]]

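Here, each row is a PC and each column corresponds to a variable/feature. So feature/variable 1 has a value of 0.5211 on PC1 and 0.3774 on PC2. These are the values used to draw the vectors you see in the biplot. Look at the coordinates of Var1 in the plot: they are exactly those values.

Finally, to create this plot in Python you can use sklearn like this: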
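To make the link between the eigenvectors and the arrow coordinates concrete, here is a minimal sketch. It assumes the standardized Iris data used in this answer; the names `arrows`, `lengths`, and `scaled_loadings` are illustrative and not part of scikit-learn. It checks that the rows of `pca.components_` are the eigenvectors of the covariance matrix (up to sign), and it also shows one common alternative convention in which each eigenvector is scaled by the square root of its eigenvalue. That scaling is not what the plotting code in this answer does.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize the data, as in the answer's code
X = StandardScaler().fit_transform(load_iris().data)
pca = PCA().fit(X)

# Eigen-decomposition of the covariance matrix of the standardized data
cov = np.cov(X, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)      # eigh returns ascending eigenvalues
order = np.argsort(eigvals)[::-1]           # reorder to descending
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Rows of components_ are the eigenvectors; the sign of each one is arbitrary
same_up_to_sign = np.allclose(np.abs(pca.components_), np.abs(eigvecs.T), atol=1e-6)

# The arrow for variable j is (components_[0, j], components_[1, j])
arrows = pca.components_[:2, :].T           # shape (n_features, 2)
lengths = np.linalg.norm(arrows, axis=1)    # arrow length in the PC1-PC2 plane

# Alternative convention (an assumption, not used in the answer's plot):
# scale each eigenvector by sqrt(eigenvalue)
scaled_loadings = arrows * np.sqrt(pca.explained_variance_[:2])

print(same_up_to_sign)
print(arrows)
print(lengths)
```

Because every eigenvector has unit length, an arrow drawn from only the first two PCs has length at most 1. A variable whose arrow is close to length 1 is represented almost entirely by PC1 and PC2.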

import numpy as np
import matplotlib.pyplot as plt
from sklearn import datasets
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

iris = datasets.load_iris()
X = iris.data
y = iris.target

#In general it is a good idea to scale the data
scaler = StandardScaler()
scaler.fit(X)
X=scaler.transform(X)

pca = PCA()
pca.fit(X,y)
x_new = pca.transform(X)   

def myplot(score, coeff, labels=None):
    # score: (n_samples, 2) projected data; coeff: (n_features, 2) loadings
    xs = score[:,0]
    ys = score[:,1]
    n = coeff.shape[0]

    plt.scatter(xs ,ys, c = y) #without scaling
    for i in range(n):
        # One arrow per variable, from the origin to its (PC1, PC2) loadings
        plt.arrow(0, 0, coeff[i,0], coeff[i,1],color = 'r',alpha = 0.5)
        if labels is None:
            plt.text(coeff[i,0]* 1.15, coeff[i,1] * 1.15, "Var"+str(i+1), color = 'g', ha = 'center', va = 'center')
        else:
            plt.text(coeff[i,0]* 1.15, coeff[i,1] * 1.15, labels[i], color = 'g', ha = 'center', va = 'center')

    plt.xlabel("PC{}".format(1))
    plt.ylabel("PC{}".format(2))
    plt.grid()

#Call the function. Transpose so that each row holds one variable's (PC1, PC2) values.
myplot(x_new[:,0:2], np.transpose(pca.components_[0:2, :]))
plt.show()

See also this post: https://stackoverflow.com/a/50845697/5025009

Answer 2

Try the 'pca' library. It plots the explained variance and creates a biplot.

pip install pca

A small example:

from pca import pca

# Initialize to reduce the data to the number of components that explains 95% of the variance.
model = pca(n_components=0.95)

# Or reduce the data towards 2 PCs
model = pca(n_components=2)

# Load example dataset
import pandas as pd
import sklearn
from sklearn.datasets import load_iris
X = pd.DataFrame(data=load_iris().data, columns=load_iris().feature_names, index=load_iris().target)

# Fit transform
results = model.fit_transform(X)

# Plot explained variance
fig, ax = model.plot()

[Figure: explained variance of the PCs]

# Scatter first 2 PCs
fig, ax = model.scatter()

# Make biplot with the number of features
fig, ax = model.biplot(n_feat=4)

[Figure: PCA biplot]

The result is a dictionary containing many statistics, such as the PCs, the loadings, and so on.
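If you prefer to stay with plain scikit-learn, the "keep enough components to explain 95% of the variance" step can be sketched roughly as follows. This is not the 'pca' library's implementation, only an approximate equivalent under the assumption that the data is standardized first. The variable names `pca95` and `cumulative` are illustrative.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Standardize first (an assumption of this sketch)
X = StandardScaler().fit_transform(load_iris().data)

# A float n_components in (0, 1) asks scikit-learn to keep the smallest number
# of components whose cumulative explained variance exceeds that fraction.
# This mode is supported by the "full" SVD solver.
pca95 = PCA(n_components=0.95, svd_solver="full").fit(X)

# Cumulative explained variance ratio of the kept components
cumulative = np.cumsum(pca95.explained_variance_ratio_)

print(pca95.n_components_)
print(cumulative)
```

The loadings of the kept components are then available as `pca95.components_`, in the same row-per-PC layout described in the first answer.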

How do I plot the principal vectors of each variable after performing PCA?
