如何在滚动PCA中有效地存储主成分系数?

时间:2019-06-22 19:58:27

标签: python pca rolling-computation

我正在尝试将每20天滚动窗口最后一天的前三个PC系数,得分,特征值和残差(原始数据-重构)存储在20个具有不同经验和108个CDS指数的每日收益数据集上交易日。我正试图在没有scikit.learn的帮助下自己编写整个算法,以用于教学目的。 不幸的是,我似乎无法将PC的系数存储在我先前初始化的DataFrames coeff1roll,coeff2roll,coeff3roll中。其他一切似乎都正常。

请帮助我,并随时让我注意我的程序中的任何其他错误。

这是我的代码:

d = data
t = 20

scores1roll = np.zeros((n-t))  # scores
scores2roll = np.zeros((n-t))
scores3roll = np.zeros((n-t))

coeff1roll = pd.DataFrame(data = np.zeros((n-t, m)))  # loadings
coeff2roll = pd.DataFrame(data = np.zeros((n-t, m)))
coeff3roll = pd.DataFrame(data = np.zeros((n-t, m)))

eigvalroll1 = np.zeros((n-t))  # eigenvalues
eigvalroll2 = np.zeros((n-t))
eigvalroll3 = np.zeros((n-t))

res_roll = pd.DataFrame(data = np.zeros((n-t, m)))  # Residuals


# Loop with PCA step by step

for i in range(n-t):
    droll = d.iloc[i:i+t-1,:]  # window of data
    r,c = droll.shape
    meanroll = np.mean(droll, axis=0)  # mean by column
    stddevroll = np.sqrt(np.var(droll, axis=0))  # standard deviation by col
    dstdrollss = (droll - meanroll)/stddevroll
    # mean should be 0 and std dev 1
    cor_mat = (dstdrollss.T.dot(dstdrollss))/(r - 1)  # correlation matrix
    eigvalues, eigvectors = np.linalg.eig(cor_mat)
    eig_pairs = [(np.abs(eigval[i]), eigvec[:, i]) for i in 
range(len(eigval))]   # associating each eigvec to the correspondent eigval
    eig_pairs.sort(reverse=True)   # sorting according to abs val of eigenval
    loadingsroll = np.hstack((eig_pairs[0][1].reshape(20, 1),
                  eig_pairs[1][1].reshape(20, 1),
                  eig_pairs[2][1].reshape(20, 1)))  # selecting first 3
    scoresroll = dstdrollss.dot(loadingsroll)  # projections on new directions

    coeff1roll.iloc[i, :] = loadingsroll[:, 0] # renamed coeff  
    coeff2roll.iloc[i, :] = loadingsroll[:, 1]
    coeff3roll.iloc[i, :] = loadingsroll[:, 2]

    scores1roll[i] = scoresroll.iloc[0,0]
    scores2roll[i] = scoresroll.iloc[0,1]
    scores3roll[i] = scoresroll.iloc[0,2]

    eigvalroll1[i] = eigvalues[0]
    eigvalroll2[i] = eigvalues[1]
    eigvalroll3[i] = eigvalues[2]

    d_hatroll = scoresroll.dot(loadingsroll.T)  # back to raw normalized data
    d_rawroll = d_hatroll * stddevroll.values + meanroll.values # raw data
    c = dict(zip(d_rawroll.columns, droll.columns))
    resrolling = droll.subtract(d_rawroll.rename(columns=c),
                            axis='column')  # calculate residuals

    res_roll.iloc[i, :] = resrolling.iloc[0, :].values

我希望coeff1roll,coeff2roll,coeff3roll的每个数据点具有不同的值,但是循环返回的值相等的列恰好是最近20天窗口中最近一天的系数的值:

coeff1roll.head()

        0         1         2   ...        17        18       19
0  0.223961  0.223526  0.223531  ...  0.223876  0.223852  0.22368
1  0.223961  0.223526  0.223531  ...  0.223876  0.223852  0.22368
2  0.223961  0.223526  0.223531  ...  0.223876  0.223852  0.22368
3  0.223961  0.223526  0.223531  ...  0.223876  0.223852  0.22368
4  0.223961  0.223526  0.223531  ...  0.223876  0.223852  0.22368

[5 rows x 20 columns]

0 个答案:

没有答案