Question

When I run this code with sklearn.__version__ 0.15.0, I get a strange result:

import numpy as np
from scipy import sparse
from sklearn.decomposition import RandomizedPCA

a = np.array([[1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              [1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
              [1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              [1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              [1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0],
              [1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
              [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]])

s = sparse.csr_matrix(a)

pca = RandomizedPCA(n_components=20)
pca.fit_transform(s)

With 0.15.0 I get:

>>> pca.explained_variance_ratio_.sum()
2.1214285714285697

With '0.14.1' I get:

>>> pca.explained_variance_ratio_.sum()
0.99999999999999978

The sum should not be greater than 1

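Percentage of variance explained by each of the selected components. If k is not set, then all components are stored and the sum of explained variances is equal to 1.0.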

What is going on here?

Answer 1

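The behavior in 0.14.1 was a bug: explained_variance_ratio_.sum() used to always return 1.0, regardless of the number of components extracted (the truncation). In 0.15.0 this has been fixed for dense arrays, as demonstrated here: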

>>> RandomizedPCA(n_components=3).fit(a).explained_variance_ratio_.sum()
0.86786547849848206
>>> RandomizedPCA(n_components=4).fit(a).explained_variance_ratio_.sum()
0.95868429631268515
>>> RandomizedPCA(n_components=5).fit(a).explained_variance_ratio_.sum()
1.0000000000000002

Your data has rank 5 (100% of the variance is explained by 5 components).
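The rank claim can be checked directly. The following is a minimal sketch using only NumPy, assuming `a` is the array from the question. Because PCA centers the data first, the rank that matters is that of the column-centered matrix, which is why the sketch computes the rank both with and without centering (the variable names are illustrative, not part of scikit-learn).

```python
import numpy as np

# Same data as in the question.
a = np.array([[1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              [1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
              [1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              [1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              [1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0],
              [1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
              [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]],
             dtype=float)

# PCA works on column-centered data, so center before measuring the rank.
a_centered = a - a.mean(axis=0)

rank_raw = np.linalg.matrix_rank(a)
rank_centered = np.linalg.matrix_rank(a_centered)

print("rank of raw data:     ", rank_raw)
print("rank of centered data:", rank_centered)
```

The centered rank bounds the number of principal components that can carry nonzero variance, so asking for more components than that cannot add any explained variance.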

If you try to call RandomizedPCA on a sparse matrix, you will get:

DeprecationWarning: Sparse matrix support is deprecated and will be dropped in 0.16. Use TruncatedSVD instead.

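Using RandomizedPCA on sparse data is incorrect, because we cannot center the data without destroying its sparsity, which could blow up memory on realistically sized sparse data. However, PCA requires centering.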
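To illustrate why centering destroys sparsity, here is a small sketch on synthetic data. The matrix size, the 1% density, and the variable names are assumptions chosen for illustration only. Subtracting the column means turns every zero in a column with a nonzero mean into a nonzero value, so the centered matrix is almost fully dense.

```python
import numpy as np
from scipy import sparse

# Hypothetical sparse matrix: 1000 x 500 with roughly 1% nonzero entries.
rng = np.random.RandomState(0)
X = sparse.random(1000, 500, density=0.01, format="csr", random_state=rng)

# Column means of the sparse matrix, as a flat 1-D array.
col_means = np.asarray(X.mean(axis=0)).ravel()

# Centering requires a dense array: every zero becomes -mean.
X_centered = X.toarray() - col_means

n_cells = X.shape[0] * X.shape[1]
density_before = float(X.nnz) / n_cells
density_after = float(np.count_nonzero(X_centered)) / n_cells

print("density before centering: %.4f" % density_before)
print("density after centering:  %.4f" % density_after)
```

On real sparse data with millions of rows and columns, the dense centered copy would typically not fit in memory, which is the problem described above.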

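TruncatedSVD will give you correct explained variance ratios for sparse data (but keep in mind that it is not exactly the same thing as PCA on dense data):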

>>> from sklearn.decomposition import TruncatedSVD
>>> TruncatedSVD(n_components=3).fit(s).explained_variance_ratio_.sum()
0.67711305361490826
>>> TruncatedSVD(n_components=4).fit(s).explained_variance_ratio_.sum()
0.8771350212934137
>>> TruncatedSVD(n_components=5).fit(s).explained_variance_ratio_.sum()
0.95954459082530097
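The difference between the two sets of numbers comes down to centering. The sketch below reproduces the idea with plain NumPy rather than scikit-learn: `explained_ratio` is a hypothetical helper, not a library function, and it assumes the ratio is the variance of the data projected onto the top k right singular vectors, divided by the total variance. With `center=True` the singular vectors come from the centered data (PCA-like); with `center=False` they come from the raw data (TruncatedSVD-like).

```python
import numpy as np

# Same data as in the question.
a = np.array([[1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              [1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0, 1, 0, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0],
              [1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              [1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              [1, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0],
              [0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 1, 0, 0, 0, 0, 0, 0, 0],
              [1, 0, 1, 0, 1, 1, 1, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0],
              [1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0]],
             dtype=float)


def explained_ratio(X, k, center):
    """Fraction of total variance captured by the top-k singular directions.

    Hypothetical helper for illustration: center=True mimics PCA,
    center=False mimics an SVD applied to the raw (uncentered) data.
    """
    X_work = X - X.mean(axis=0) if center else X
    # Rows of Vt are the right singular vectors, ordered by singular value.
    _, _, Vt = np.linalg.svd(X_work, full_matrices=False)
    projected = X @ Vt[:k].T
    return projected.var(axis=0).sum() / X.var(axis=0).sum()


pca_like = [explained_ratio(a, k, center=True) for k in (3, 4, 5)]
svd_like = [explained_ratio(a, k, center=False) for k in (3, 4, 5)]

print("centered (PCA-like):  ", pca_like)
print("uncentered (SVD-like):", svd_like)
```

Without centering, part of the leading singular direction is spent on the column means rather than on the variance, so the same number of components explains a smaller share of the variance.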

RandomizedPCA .explained_variance_ratio_ sums to greater than one in sklearn 0.15.0
