Python correlation matrix for a 3D DataFrame

Time: 2015-09-21 13:27:22

Tags: python-3.x pandas correlation

I have a table of historical returns in SQL Server, keyed by date and asset ID, like this:

[Date] [Asset] [1DRet]
jan   asset1   0.52
jan   asset2   0.12
jan   asset3   0.07
feb   asset1   0.41
feb   asset2   0.33
feb   asset3   0.21

...

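So I need to compute the correlation matrix over a given date range for every pair of assets: A1,A2; A1,A3; A2,A3

I am using pandas, and my SQL select filters the date range and orders by date.

I tried to do this with pandas df.corr(), numpy.corrcoef and SciPy, but I could not get any of them to work with my n-variable data frame.

I have seen some examples, but they always assume a data frame with one column per asset and one row per day.

Here is the code block where I do this: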

qryRet = "Select * from IndexesValue where Date > '20100901' and Date < '20150901' order by Date"

result = conn.execute(qryRet)

df = pd.DataFrame(data=list(result),columns=result.keys())

df1d = df[['Date','Id_RiskFactor','1DReturn']]

corr = df1d.set_index(['Date','Id_RiskFactor']).unstack().corr()
corr.columns = corr.columns.droplevel()
corr.index = corr.columns.tolist()
corr.index.name = 'symbol_1'
corr.columns.name = 'symbol_2'
print(corr)

conn.close()

Running it gives me this message:

corr.columns = corr.columns.droplevel()
AttributeError: 'Index' object has no attribute 'droplevel'

**print(df1d.head())**
         Date  Id_RiskFactor         1DReturn
0  2010-09-02            149            0E-12
1  2010-09-02            150  -0.004242875148
2  2010-09-02             33   0.000590000011
3  2010-09-02             28   0.000099999997
4  2010-09-02             34  -0.000010000000

**print(df.head())**
         Date  Id_RiskFactor           Value         1DReturn         5DReturn
0  2010-09-02            149  0.040096000000            0E-12            0E-12
1  2010-09-02            150  1.736700000000  -0.004242875148  -0.013014321215
2  2010-09-02             33  2.283000000000   0.000590000011   0.001260000048
3  2010-09-02             28  2.113000000000   0.000099999997   0.000469999999
4  2010-09-02             34  0.615000000000  -0.000010000000   0.000079999998

**print(corr.columns)**
Index([], dtype='object')

1 Answer:

Answer 0: (score: 1)

Create a sample DataFrame:

import pandas as pd
import numpy as np

df = pd.DataFrame({'daily_return': np.random.random(15), 
                   'symbol': ['A'] * 5 + ['B'] * 5 + ['C'] * 5, 
                   'date': np.tile(pd.date_range('1-1-2015', periods=5), 3)})

>>> df
    daily_return       date symbol
0       0.011467 2015-01-01      A
1       0.613518 2015-01-02      A
2       0.334343 2015-01-03      A
3       0.371809 2015-01-04      A
4       0.169016 2015-01-05      A
5       0.431729 2015-01-01      B
6       0.474905 2015-01-02      B
7       0.372366 2015-01-03      B
8       0.801619 2015-01-04      B
9       0.505487 2015-01-05      B
10      0.946504 2015-01-01      C
11      0.337204 2015-01-02      C
12      0.798704 2015-01-03      C
13      0.311597 2015-01-04      C
14      0.545215 2015-01-05      C

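I assume you have already filtered the DataFrame down to the relevant dates. You then need a pivot table with the unique dates as the index, the symbols as separate columns, and the daily returns as the values. Finally, call corr() on the result.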

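# Pivot to one column per symbol (rows are dates), then correlate the columns.
corr = df.set_index(['date','symbol']).unstack().corr()
# Drop the 'daily_return' level so only the symbol labels remain.
corr.columns = corr.columns.droplevel()
corr.index = corr.columns.tolist()
corr.index.name = 'symbol_1'
corr.columns.name = 'symbol_2'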
>>> corr
symbol_2         A         B         C
symbol_1                              
A         1.000000  0.188065 -0.745115
B         0.188065  1.000000 -0.688808
C        -0.745115 -0.688808  1.000000
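
For the data in the question, one possible cause of the empty `corr.columns` is that the `1DReturn` column holds `decimal.Decimal` objects (the `0E-12` values hint at that), which some pandas versions do not treat as numeric when computing `corr()`. The question does not confirm this, so treat it as an assumption. A minimal sketch under that assumption, with made-up values and the question's column names, casts the column to float and reshapes with `pivot`:

```python
from decimal import Decimal

import pandas as pd

# Hypothetical rows shaped like the question's table; the values are made up.
df1d = pd.DataFrame({
    'Date': ['2010-09-02'] * 3 + ['2010-09-03'] * 3 + ['2010-09-06'] * 3,
    'Id_RiskFactor': [149, 150, 33] * 3,
    '1DReturn': [Decimal('0.001'), Decimal('-0.004'), Decimal('0.0006'),
                 Decimal('0.002'), Decimal('0.003'), Decimal('-0.0001'),
                 Decimal('-0.001'), Decimal('0.001'), Decimal('0.0004')],
})

# Assumption: the returns arrive as Decimal objects (object dtype), so cast to float.
df1d['1DReturn'] = df1d['1DReturn'].astype(float)

# One row per date, one column per asset, returns as the values.
wide = df1d.pivot(index='Date', columns='Id_RiskFactor', values='1DReturn')
corr = wide.corr()
print(corr)
```

Note that `pivot` raises an error if the same (Date, Id_RiskFactor) pair appears more than once, so duplicated rows would need to be removed or aggregated first.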

You can select a subset of the DataFrame by date as follows:

start_date = pd.Timestamp('2015-1-4')
end_date = pd.Timestamp('2015-1-5')
>>> df.loc[df.date.between(start_date, end_date), :]
    daily_return       date symbol
3       0.371809 2015-01-04      A
4       0.169016 2015-01-05      A
8       0.801619 2015-01-04      B
9       0.505487 2015-01-05      B
13      0.311597 2015-01-04      C
14      0.545215 2015-01-05      C
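
The date filter and the pivot can be combined into one step. The sketch below wraps both in a helper; the name `corr_for_range` and the seeded random data are hypothetical and only serve the illustration:

```python
import numpy as np
import pandas as pd


def corr_for_range(df, start, end):
    """Correlation matrix of daily returns for dates in [start, end], inclusive."""
    mask = df['date'].between(pd.Timestamp(start), pd.Timestamp(end))
    # One row per date, one column per symbol.
    wide = df.loc[mask].pivot(index='date', columns='symbol', values='daily_return')
    return wide.corr()


# Sample data in the same layout as the example above (seeded for repeatability).
rng = np.random.default_rng(0)
df = pd.DataFrame({'daily_return': rng.random(15),
                   'symbol': ['A'] * 5 + ['B'] * 5 + ['C'] * 5,
                   'date': np.tile(pd.date_range('2015-01-01', periods=5), 3)})

corr_3d = corr_for_range(df, '2015-01-02', '2015-01-04')
print(corr_3d)
```

Keep in mind that a range containing only two dates yields correlations of exactly +1 or -1, so a longer window is needed for meaningful values.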

If you want to flatten the correlation matrix:

>>> corr.stack().reset_index()
  symbol_1 symbol_2         0
0        A        A  1.000000
1        A        B  0.188065
2        A        C -0.745115
3        B        A  0.188065
4        B        B  1.000000
5        B        C -0.688808
6        C        A -0.745115
7        C        B -0.688808
8        C        C  1.000000
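
The flattened result lists every pair twice, plus the diagonal. Since the question asks only for the distinct pairs (A1,A2; A1,A3; A2,A3), one option is to mask everything except the upper triangle before stacking. This is a sketch rather than part of the original answer; it reuses the correlation values printed above, and the column name `correlation` is an arbitrary choice:

```python
import numpy as np
import pandas as pd

# Correlation values copied from the example above.
corr = pd.DataFrame([[1.000000, 0.188065, -0.745115],
                     [0.188065, 1.000000, -0.688808],
                     [-0.745115, -0.688808, 1.000000]],
                    index=['A', 'B', 'C'], columns=['A', 'B', 'C'])
corr.index.name = 'symbol_1'
corr.columns.name = 'symbol_2'

# True only for cells strictly above the diagonal.
upper = np.triu(np.ones(corr.shape, dtype=bool), k=1)

# Masked-out cells become NaN; dropna() removes them regardless of how
# the installed pandas version treats NaN in stack().
pairs = corr.where(upper).stack().dropna().reset_index(name='correlation')
print(pairs)
```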