Python scikit-learn PCA to augment missing data in a historical VaR calculation

Time: 2015-05-19 12:45:22

Tags: python scikit-learn pca missing-data

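I have a set of time series that I want to use to calculate historical VaR for a large equity portfolio.

Many instruments in the portfolio have incomplete time series, and I need a systematic way to generate plausible values for the missing observations.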
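To make the problem concrete, here is a minimal sketch of a historical VaR calculation on simulated returns. The function name `historical_var`, the column names and the weights are made up for illustration. It shows how a single short history shrinks the usable sample for the whole portfolio.

```python
import numpy as np
import pandas as pd

def historical_var(returns: pd.DataFrame, weights: pd.Series, level: float = 0.99) -> float:
    """One-day historical VaR, returned as a positive loss fraction.

    Days on which any instrument has no return are dropped, which is
    exactly the loss of history that augmentation is meant to avoid.
    """
    complete = returns[weights.index].dropna(how="any")
    portfolio = complete.to_numpy() @ weights.to_numpy()
    return float(-np.quantile(portfolio, 1.0 - level))

# Simulated daily returns for three hypothetical instruments
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.normal(0.0, 0.01, size=(500, 3)), columns=["A", "B", "C"])
demo.iloc[:100, 2] = np.nan  # instrument C has a short history

w = pd.Series([0.5, 0.3, 0.2], index=["A", "B", "C"])
var_99 = historical_var(demo, w)
n_usable = len(demo.dropna(how="any"))  # only the days where every column is present
```

With one instrument missing its first 100 days, only the remaining 400 days contribute to the quantile, even though the other two instruments have full histories.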

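I am considering using PCA to augment the missing data, estimating factor exposures over the period where there is enough data, and have tried the following Python implementation (following Carol Alexander):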

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.pipeline import make_pipeline

# get index time series returns
df = pd.read_csv('IndexData.csv',index_col=0)
df = df.ffill().pct_change().dropna(how='all')
rics = df.columns.tolist()

# add 'missing' data
df.iloc[0:30, df.columns.get_loc('GDAXI')] = np.nan

# see stack overflow reference below
pipeline = make_pipeline(StandardScaler(), PCA(n_components = len(rics) - 1))

# Step 1 - PCA for sub-period with GDAXI data - training period
dfSub = df.iloc[30:]
pipeline.fit(np.array(dfSub))
# named_steps is the public accessor; copy so the refit in Step 2 cannot alter it
sub_components = pipeline.named_steps['pca'].components_.copy()

# Step 2 - PCA for entire period with no GDAXI - 
dfFull = df.loc[:,df.columns != 'GDAXI']
full_transf = pipeline.fit_transform(np.array(dfFull))

# Step 3 - Apply missing asset factor exposures in stage 1 to stage 2
#          to augment missing data
synthetic =  np.dot(full_transf, sub_components[:,rics.index('GDAXI')])

# rescaling??
df.iloc[0:30, df.columns.get_loc('GDAXI')] = synthetic[0:30]

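The attached example assumes IndexData.csv contains price data for several European indices, including the DAX. I would like to be able to run this on baskets of countries/sectors that are fairly highly correlated.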
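Since IndexData.csv is not included, here is a self-contained sketch on simulated data. Note that it swaps in a different technique from the code above: principal component regression, where the factors come from the complete columns over the full period and GDAXI is regressed on those factor scores over the period where it is observed. The helper name `pcr_backfill`, the one-factor simulation and every ticker other than GDAXI are assumptions for illustration. This is a sketch, not a confirmed answer to the rescaling question.

```python
import numpy as np
import pandas as pd
from sklearn.decomposition import PCA
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

def pcr_backfill(returns: pd.DataFrame, target: str, n_components: int) -> pd.Series:
    """Fill missing returns of `target` from principal components of the
    other columns (assumed complete over the whole period)."""
    others = returns.drop(columns=target)
    known = returns[target].notna().to_numpy()

    # Factor scores from the complete columns, standardized first
    scores = PCA(n_components=n_components).fit_transform(
        StandardScaler().fit_transform(others.to_numpy())
    )

    # Exposures are estimated only where the target is observed.
    # The regression is on raw returns, so the fit itself supplies the
    # mean (intercept) and scale (betas) of the synthetic returns.
    model = LinearRegression().fit(scores[known], returns[target].to_numpy()[known])

    filled = returns[target].copy()
    filled.loc[~known] = model.predict(scores[~known])
    return filled

# Simulated one-factor market; all tickers except GDAXI are hypothetical
rng = np.random.default_rng(42)
market = rng.normal(0.0, 0.01, size=500)
betas = {"GDAXI": 1.1, "FCHI": 1.0, "FTSE": 0.8, "STOXX50E": 1.0}
sim = pd.DataFrame(
    {name: beta * market + rng.normal(0.0, 0.002, size=500) for name, beta in betas.items()}
)

truth = sim["GDAXI"].iloc[:30].copy()
sim.iloc[:30, sim.columns.get_loc("GDAXI")] = np.nan  # same 'missing' window as above

filled = pcr_backfill(sim, "GDAXI", n_components=2)
corr = np.corrcoef(filled.iloc[:30], truth)[0, 1]  # synthetic vs. withheld returns
```

One caveat that matters for VaR: the fitted values leave out the idiosyncratic residual, so the synthetic returns are somewhat less volatile than the real ones would have been.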

Questions

  1. What sigma and mean should I use to rescale the computed returns?
  2. Do any alternative Python libraries already provide this functionality?
  3. [Attempt 2]
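On question 2, the closest built-in I am aware of is scikit-learn's own `IterativeImputer`, which is still flagged experimental. It regresses each incomplete column on the other columns rather than on principal components, so it is not the same method as above. A minimal sketch on simulated data, where the one-factor setup is assumed for illustration:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 (needed to expose IterativeImputer)
from sklearn.impute import IterativeImputer

# Three simulated, highly correlated return series driven by one common factor
rng = np.random.default_rng(1)
market = rng.normal(0.0, 0.01, size=300)
X = np.column_stack(
    [beta * market + rng.normal(0.0, 0.002, size=300) for beta in (1.1, 1.0, 0.8)]
)

X_missing = X.copy()
X_missing[:30, 0] = np.nan  # first series is missing its first 30 observations

# Each column with gaps is modelled from the remaining columns
imputer = IterativeImputer(max_iter=10, random_state=0)
X_filled = imputer.fit_transform(X_missing)
```

Observed values pass through unchanged; only the NaN cells are replaced by model predictions.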

References

Sebastian Raschka - Implementing PCA in Python Step-By-Step

Stack Overflow - How to normalize with pca and scikit-learn

scikit-learn.org

0 answers:

No answers