Cumulative explained variance for PCA in Python

Time: 2017-04-11 19:52:57

Tags: python r pandas scikit-learn pca

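I have a simple R script that runs FactoMineR's PCA on a small data frame in order to find the cumulative percentage of variance explained by each principal component: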

library(FactoMineR)
a <- c(1, 2, 3, 4, 5)
b <- c(4, 2, 9, 23, 3)
c <- c(9, 8, 7, 6, 6)
d <- c(45, 36, 74, 35, 29)

df <- data.frame(a, b, c, d)

df_pca <- PCA(df, ncp = 4, graph=F)
print(df_pca$eig$`cumulative percentage of variance`)

This returns:

> print(df_pca$eig$`cumulative percentage of variance`)
[1]  58.55305  84.44577  99.86661 100.00000

I am trying to do the same thing in Python using scikit-learn's decomposition package, as follows:

import pandas as pd
from sklearn import decomposition

a = [1, 2, 3, 4, 5]
b = [4, 2, 9, 23, 3]
c = [9, 8, 7, 6, 6]
d = [45, 36, 74, 35, 29]

df = pd.DataFrame({'a': a,
                  'b': b,
                  'c': c, 
                  'd': d})

pca = decomposition.PCA(n_components = 4)
pca.fit(df)
transformed_pca = pca.transform(df)

# accumulate the explained variance ratio of each component
cum_explained_var = []
for i in range(0, len(pca.explained_variance_ratio_)):
    if i == 0:
        cum_explained_var.append(pca.explained_variance_ratio_[i])
    else:
        cum_explained_var.append(pca.explained_variance_ratio_[i] + 
                                 cum_explained_var[i-1])
print(cum_explained_var)

But the result is:

[0.79987089715487936, 0.99224337624509307, 0.99997254568237226, 1.0]

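As you can see, both correctly sum to 100%, but the contribution of each component differs between the R and Python versions. Does anyone know where these differences come from, or how to correctly replicate the R result in Python?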

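EDIT: Thanks to Vlo, I now know that the difference arises because the FactoMineR PCA function scales the data by default. By scaling my data with sklearn's preprocessing package (pca_data = preprocessing.scale(df)) before running PCA, my results match the R output.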

1 answer:

Answer 0 (score: 1)

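Thanks to Vlo, I learned that the difference between the FactoMineR PCA function and the sklearn PCA function is that FactoMineR scales the data by default. Simply adding a scaling step to my Python code let me reproduce the results.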

import pandas as pd
from sklearn import decomposition, preprocessing

a = [1, 2, 3, 4, 5]
b = [4, 2, 9, 23, 3]
c = [9, 8, 7, 6, 6]
d = [45, 36, 74, 35, 29]


df = pd.DataFrame({'a': a,
                  'b': b,
                  'c': c, 
                  'd': d})


pca_data = preprocessing.scale(df)

pca = decomposition.PCA(n_components = 4)
pca.fit(pca_data)
transformed_pca = pca.transform(pca_data)

cum_explained_var = []
for i in range(0, len(pca.explained_variance_ratio_)):
    if i == 0:
        cum_explained_var.append(pca.explained_variance_ratio_[i])
    else:
        cum_explained_var.append(pca.explained_variance_ratio_[i] + 
                                 cum_explained_var[i-1])

print(cum_explained_var)

Output:

[0.58553054049052267, 0.8444577483783724, 0.9986661265687754, 0.99999999999999978]
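
As a side note, the accumulation loop above could probably be written more compactly with numpy's cumsum. The following is only a sketch of the same approach, not a different method: it assumes StandardScaler as a stand-in for preprocessing.scale (both standardize each column to zero mean and unit variance) and multiplies by 100 so the numbers are on the same percentage scale as the FactoMineR output.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Same toy data as in the question
df = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [4, 2, 9, 23, 3],
    "c": [9, 8, 7, 6, 6],
    "d": [45, 36, 74, 35, 29],
})

# Standardize each column (zero mean, unit variance) before PCA,
# mirroring the default scaling that FactoMineR's PCA applies
scaled = StandardScaler().fit_transform(df)

pca = PCA(n_components=4).fit(scaled)

# Running total of the explained variance ratios, expressed in percent
cum_explained_var = np.cumsum(pca.explained_variance_ratio_) * 100
print(cum_explained_var)
```

With the data above this should give roughly 58.55, 84.45, 99.87 and 100, in line with the R result quoted in the question.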