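I want to interpret the weights of a regression model whose input data is preprocessed with PCA. In practice I have more than 100 highly correlated input dimensions, so I know PCA is useful. For illustration, however, I will use the Iris dataset.
The sklearn code below illustrates my problem: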
import numpy as np
import sklearn.datasets, sklearn.decomposition
from sklearn.linear_model import LinearRegression
# load data
X = sklearn.datasets.load_iris().data
w = np.array([0.3, 10, -0.1, -0.01])
Y = np.dot(X, w)
# set number of components to keep from PCA
n_components = 4
# reconstruct w
reg = LinearRegression().fit(X, Y)
w_hat = reg.coef_
print(w_hat)
# apply PCA
pca = sklearn.decomposition.PCA(n_components=n_components)
pca.fit(X)
X_trans = pca.transform(X)
# reconstruct w
reg_trans = LinearRegression().fit(X_trans, Y)
w_trans_hat = np.dot(reg_trans.coef_, pca.components_)
print(w_trans_hat)
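Running this code shows that the weights are reproduced well.
However, if I set the number of components to 3 (i.e. n_components = 3), the printed weights deviate substantially from the true weights.
Am I misunderstanding how to transform these weights back? Or is it due to the information lost by PCA when going from 4 components to 3?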
Answer 0 (score: 1)
I think this works fine; it is just that I was looking at w_trans_hat rather than the reconstructed Y:
import numpy as np
import sklearn.datasets, sklearn.decomposition
from sklearn.linear_model import LinearRegression
# load data
X = sklearn.datasets.load_iris().data
# create fake loadings
w = np.array([0.3, 10, -0.1, -0.01])
# centre X
X = np.subtract(X, np.mean(X, 0))
# calculate Y
Y = np.dot(X, w)
# set number of components to keep from PCA
n_components = 3
# reconstruct w using linear regression
reg = LinearRegression().fit(X, Y)
w_hat = reg.coef_
print(w_hat)
# apply PCA
pca = sklearn.decomposition.PCA(n_components=n_components)
pca.fit(X)
X_trans = pca.transform(X)
# regress Y on principal components
reg_trans = LinearRegression().fit(X_trans, Y)
# reconstruct Y using regressed weights and transformed X
# (the intercept is omitted: X and Y are both centred, so it is ~0)
Y_trans = np.dot(X_trans, reg_trans.coef_)
# show MSE to original Y
print(np.mean((Y - Y_trans) ** 2))
# show w implied by reduced model in original space
w_trans_hat = np.dot(reg_trans.coef_, pca.components_)
print(w_trans_hat)
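To make the "information loss" reading concrete, here is a minimal sketch (my own illustration, not part of the original post) that checks one way to understand the numbers: when Y is an exact linear function of the centred X, the weights implied by the reduced model should equal the true w projected onto the subspace spanned by the retained components. The helper name implied_weights is hypothetical, and the sketch assumes the same Iris setup and fake weights as above.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression

# Same setup as above: centred Iris features and fake "true" weights.
X = load_iris().data
X = X - X.mean(axis=0)
w = np.array([0.3, 10, -0.1, -0.01])
Y = X @ w


def implied_weights(n_components):
    """Hypothetical helper: return (weights implied by PCA regression,
    true w projected onto the retained principal components)."""
    pca = PCA(n_components=n_components).fit(X)
    reg = LinearRegression().fit(pca.transform(X), Y)
    V = pca.components_              # shape (n_components, n_features)
    w_implied = reg.coef_ @ V        # map coefficients back to feature space
    w_projected = V.T @ V @ w        # orthogonal projection of w onto span(V)
    return w_implied, w_projected


for k in (4, 3):
    w_implied, w_projected = implied_weights(k)
    print(k, w_implied, np.allclose(w_implied, w_projected))
```

With all 4 components the projection is the identity, so w comes back unchanged. With 3 components, the part of w that lies along the discarded direction is lost, which would explain why w_trans_hat differs from w even though the transformation back to the original space is done correctly.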