Why does multiplying the sklearn PCA output by pca.components_ give data different from the original?

Time: 2018-02-11 09:36:33

Tags: python machine-learning scikit-learn pca

I am trying to decompose my data with PCA.

Here is my code:

import xlrd
import xlwt
import numpy as np
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt
data = xlrd.open_workbook('x.xlsx')
sh = data.sheet_by_index(1)
# One array row per data row; the first sheet row holds labels and is skipped
inputData = np.empty([sh.nrows - 1, sh.ncols])
for curr_row in range(1, sh.nrows):
    for col_ind, el in enumerate(sh.row(curr_row)):
        inputData[curr_row - 1, col_ind] = el.value

print(inputData.shape)
pca = PCA(n_components=3)
newData = pca.fit_transform(inputData)
print(inputData - np.dot(newData, pca.components_))

I expected the difference between inputData and np.dot(newData, pca.components_) to be very small, but the result is actually far from the original data.

Can you help me?

1 answer:

Answer 0: (score: 2)

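You need to add the mean back. PCA centers the data (it subtracts pca.mean_) before projecting, so np.dot(newData, pca.components_) reconstructs the centered data, not the original. To reconstruct: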

rec = np.dot(newData, pca.components_) + pca.mean_

print(inputData - rec)
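
To check this without the spreadsheet, here is a minimal sketch on synthetic data. The array X, its shape, the offset of 10.0 and the random seed are all made up for illustration and stand in for inputData. It compares the reconstruction without the mean, the reconstruction with pca.mean_ added, and scikit-learn's own pca.inverse_transform, which should agree with the manual formula when whitening is off (the default).

```python
import numpy as np
from sklearn.decomposition import PCA

# Hypothetical stand-in for inputData: 50 samples, 5 features.
# The data has rank 3 around a nonzero offset, so 3 components suffice.
rng = np.random.default_rng(0)
latent = rng.normal(size=(50, 3))
mixing = rng.normal(size=(3, 5))
X = latent @ mixing + 10.0

pca = PCA(n_components=3)
Z = pca.fit_transform(X)  # projections of the centered data

no_mean = np.dot(Z, pca.components_)                # what the question computes
with_mean = np.dot(Z, pca.components_) + pca.mean_  # what the answer suggests
via_api = pca.inverse_transform(Z)                  # built-in reconstruction

err_no_mean = np.abs(X - no_mean).max()
err_with_mean = np.abs(X - with_mean).max()

print("max error without mean:", err_no_mean)
print("max error with mean:   ", err_with_mean)
print("matches inverse_transform:", np.allclose(with_mean, via_api))
```

Note that the error only drops to floating-point noise here because the synthetic data has exactly three directions of variation. If your real data has more than three, some residual will remain even after adding the mean, since the discarded components are lost.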