Question

I have a dataset with (n_samples, n_features) = (466000, 4338093). I want to run PCA on this data, so I am using IncrementalPCA from Python's scikit-learn.

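Because the data is so large, I split it into 466 chunks of 1000 samples each, so every chunk has (n_samples, n_features) = (1000, 4338093). Each chunk is stored as a hickle file, and the matrices are in sparse format.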

I have set the PCA's n_components to min(n_samples, n_features), i.e. 466000.
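To put the sizes above in perspective, here is a rough back-of-the-envelope sketch of the memory a densified chunk and a 466000-row components matrix would need. It assumes 8-byte float64 values, which the question does not state; `dense_bytes` is a hypothetical helper written only for this illustration.

```python
def dense_bytes(n_rows: int, n_cols: int, itemsize: int = 8) -> int:
    """Bytes needed for a dense n_rows x n_cols array (itemsize=8 assumes float64)."""
    return n_rows * n_cols * itemsize


n_features = 4338093
chunk_rows = 1000      # samples per chunk
total_rows = 466000    # samples overall, also the requested n_components

# One chunk after converting it from sparse to dense.
chunk_bytes = dense_bytes(chunk_rows, n_features)
# A components matrix with one row per requested component.
components_bytes = dense_bytes(total_rows, n_features)

print(f"dense chunk:       {chunk_bytes / 1e9:.1f} GB")
print(f"components matrix: {components_bytes / 1e12:.1f} TB")
```

Under that float64 assumption, a single dense chunk comes to about 34.7 GB and a 466000-component matrix to about 16.2 TB, so both the densifying step and the chosen n_components are worth a second look on memory grounds alone.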

Here is how I am processing 2 chunks:

import os

import hickle
from sklearn.decomposition import IncrementalPCA

# Collect the paths of all chunk files.
fv_list = []
for file_name in os.listdir("PATH_TO_DIR"):
    if file_name.endswith(".hkl"):
        fv_list.append(os.path.join("PATH_TO_DIR", file_name))

# Read the chunk shape (n_samples, n_features) from the first chunk.
data_shape = hickle.load(fv_list[0]).shape
# n_components = min(total n_samples, n_features)
ipca = IncrementalPCA(n_components=min(len(fv_list) * data_shape[0], data_shape[1]))

for each_chunk in fv_list:
    part = hickle.load(each_chunk)
    # Convert the sparse chunk to dense before fitting.
    ipca.partial_fit(part.todense())

Now, when I call the PCA's partial_fit method, I get the following error on the second iteration:

ValueError: Number of input features has changed from 1000 to 466000 between calls to partial_fit! Try setting n_components to a fixed value.

I am puzzled as to why this ValueError occurs. Is my approach wrong?
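For comparison, here is a minimal sketch of the same chunked workflow on toy data. As far as I know, scikit-learn expects n_components to be no larger than the number of rows in a batch passed to partial_fit, so the sketch caps it at the chunk size. That may be related to the error above, but it is not confirmed for the asker's library version. The sizes, the random sparse chunks, and the value 10 are all made up for illustration and stand in for the real hickle files.

```python
from scipy import sparse
from sklearn.decomposition import IncrementalPCA

# Toy stand-ins for the real chunks (hypothetical sizes, random sparse data).
n_chunks, chunk_rows, n_features = 3, 50, 200
chunks = [
    sparse.random(chunk_rows, n_features, density=0.05, format="csr", random_state=i)
    for i in range(n_chunks)
]

# Keep n_components within both the rows of one chunk and the feature count.
# The extra cap of 10 is an arbitrary choice for this example.
n_components = min(chunk_rows, n_features, 10)
ipca = IncrementalPCA(n_components=n_components)

for chunk in chunks:
    # toarray() returns a plain ndarray; todense() would return np.matrix.
    ipca.partial_fit(chunk.toarray())

# Project one chunk onto the fitted components.
reduced = ipca.transform(chunks[0].toarray())
print(ipca.components_.shape, reduced.shape)
```

With these toy sizes, every call to partial_fit sees 50 rows and 200 features, and the fitted model keeps 10 components. Applying the same cap to the real data would mean at most 1000 components per the chunk size described above.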

scikit-learn incremental PCA confusion

0 answers: