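DataFrame correlation with a rotating value (loop?)

Time: 2018-03-31 13:34:08

Tags: loops dataframe correlation

I have a DataFrame in the following format, and I am trying to create df['New'], a rotating value built as shown below, which I will use to compute the correlation between Alpha and New.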

Date       Alpha Bravo Charlie   New                         Correlation
2018-01-03    1     3      2       3 (from bravo column)          NaN
2018-01-04    2     6      4       6 (from bravo column)          NaN
2018-01-05    3     9      6       9 (from bravo column)          NaN
2018-01-06    4    12      8      12 (from bravo column)          NaN
2018-01-07    5    15     10      10 (from Charlie column)         X

On the next date:

Date       Alpha Bravo Charlie   New                         Correlation
2018-01-03    1     3      2       3 (from bravo column)          NaN
2018-01-04    2     6      4       6 (from bravo column)          NaN
2018-01-05    3     9      6       9 (from bravo column)          NaN
2018-01-06    4    12      8      12 (from bravo column)          NaN
2018-01-07    5    15     10      15 (from bravo column)           X  
2018-01-08    6    18     12      12 (from Charlie column)         Y

df['Correlation'] = df['Alpha'].rolling(window=5).corr(other=df['New'])

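Any suggestions on how to create this new column with the rotating value? (The earlier correlation must stay unchanged at X. My end goal is the Correlation column; the New column is only used to compute it.)

In other words, each time the Correlation column is computed, it uses the latest value from Charlie, while all earlier values in the window come from Bravo.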

Another way to put it: the correlation with Alpha is always computed from the Charlie value on the last date of the window together with the Bravo values of the previous 4 days.
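
The rule described above can be sketched as a plain loop. This is only a minimal illustration of the requested behaviour, not a tuned solution: the helper name `rotating_corr` is hypothetical, the sample data is copied from the second table in the question, and the window length of 5 is taken from the `rolling(window=5)` call.

```python
import numpy as np
import pandas as pd

# Sample data copied from the second table in the question.
df = pd.DataFrame({
    "Date": pd.date_range("2018-01-03", periods=6),
    "Alpha": [1, 2, 3, 4, 5, 6],
    "Bravo": [3, 6, 9, 12, 15, 18],
    "Charlie": [2, 4, 6, 8, 10, 12],
})

N = 5  # window length, as in rolling(window=5)


def rotating_corr(frame, window):
    """Hypothetical helper: for each row, correlate the Alpha window with a
    window made of the previous window-1 Bravo values plus the current
    Charlie value."""
    alpha = frame["Alpha"].to_numpy(dtype=float)
    bravo = frame["Bravo"].to_numpy(dtype=float)
    charlie = frame["Charlie"].to_numpy(dtype=float)
    out = np.full(len(frame), np.nan)  # rows without a full window stay NaN
    for i in range(window - 1, len(frame)):
        start = i - window + 1
        new = bravo[start:i + 1].copy()  # Bravo values for the whole window
        new[-1] = charlie[i]             # swap in Charlie for the latest date
        out[i] = np.corrcoef(alpha[start:i + 1], new)[0, 1]
    return out


df["Correlation"] = rotating_corr(df, N)
print(df)
```

Because each row rebuilds its own window, a correlation that was computed for an earlier date does not change when new rows are appended. The loop is easy to check by hand but will be slow on long frames.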

1 Answer:

Answer 0 (score: 1)

I think you need to prepend NaNs first, then build rolling windows with strides as in this solution, and then compute the correlation:

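import numpy as np
import pandas as pd

def rolling_window(a, window):
    # Build a strided 2-D view in which each row is one window of length `window`.
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)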

N = 5
a = np.concatenate([[np.nan] * (N-1), df['Bravo'].values])
b = np.concatenate([[np.nan] * (N-1), df['Alpha'].values])
a1 = rolling_window(a, N)
a2 = rolling_window(b, N)
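
To make the padding and striding step concrete, here is a small standalone sketch on the first six Bravo values. It also compares the result with `np.lib.stride_tricks.sliding_window_view`, which is not part of the original answer and is assumed to be available (NumPy 1.20 or later); the variable names `bravo`, `padded`, `windows` and `alt` are illustrative only.

```python
import numpy as np


def rolling_window(a, window):
    # Same helper as in the answer: each row of the result is one window.
    shape = a.shape[:-1] + (a.shape[-1] - window + 1, window)
    strides = a.strides + (a.strides[-1],)
    return np.lib.stride_tricks.as_strided(a, shape=shape, strides=strides)


N = 5
bravo = np.array([3.0, 6.0, 9.0, 12.0, 15.0, 18.0])

# Prepend N-1 NaNs so that the number of windows equals the number of rows.
padded = np.concatenate([[np.nan] * (N - 1), bravo])

windows = rolling_window(padded, N)
# Assumed alternative: a built-in helper that returns a read-only view.
alt = np.lib.stride_tricks.sliding_window_view(padded, N)

print(windows.shape)   # (6, 5): one window per original row
print(windows[-1])     # last window holds the five most recent Bravo values
print(np.array_equal(windows, alt, equal_nan=True))
```

The NaN padding is what keeps the windows aligned with the DataFrame index: row i of `windows` ends at element i of the original series, so the first N-1 rows contain NaN and yield no correlation.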

Drop the last column of a1 and append the values of the Charlie column in its place:

c = np.c_[a1[:, :-1], df['Charlie'].values[:, None]] 
print (c)
[[nan nan nan nan  2.]
 [nan nan nan  3.  4.]
 [nan nan  3.  6.  6.]
 [nan  3.  6.  9.  8.]
 [ 3.  6.  9. 12. 10.]
 [ 6.  9. 12. 15. 12.]
 [ 9. 12. 15. 18. 15.]]

Create DataFrames and use iloc to drop the first rows, which contain NaN:

a = pd.DataFrame(a2, index=df.index).iloc[N-1:]
b = pd.DataFrame(c, index=df.index).iloc[N-1:]
df['Correlation1'] = a.corrwith(b, axis=1)
# for better performance, use the row-wise function from
# https://stackoverflow.com/a/41703623/2901002
df['Correlation2'] = corr2_coeff_rowwise(a2, c)

print (df)
        Date  Alpha  Bravo  Charlie  Correlation1  Correlation2
0 2018-01-03      1      3        2           NaN           NaN
1 2018-01-04      2      6        4           NaN           NaN
2 2018-01-05      3      9        6           NaN           NaN
3 2018-01-06      4     12        8           NaN           NaN
4 2018-01-07      5     15       10      0.894427      0.894427
5 2018-01-08      6     18       12      0.832050      0.832050
6 2018-01-09      7     21       15      0.832050      0.832050
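
The snippet above calls `corr2_coeff_rowwise` without defining it; it comes from the linked answer. As a rough stand-in, a vectorised row-wise Pearson correlation could look like the sketch below. The name `rowwise_corr` is hypothetical and the code is not claimed to be identical to the linked implementation; the input windows are the three complete rows from the arrays shown earlier.

```python
import numpy as np


def rowwise_corr(x, y):
    """Pearson correlation between matching rows of two 2-D arrays.
    Hypothetical stand-in, not the linked answer's exact code."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Centre every row on its own mean.
    xc = x - x.mean(axis=1, keepdims=True)
    yc = y - y.mean(axis=1, keepdims=True)
    # Covariance term divided by the product of the row norms.
    num = (xc * yc).sum(axis=1)
    den = np.sqrt((xc ** 2).sum(axis=1) * (yc ** 2).sum(axis=1))
    return num / den


# Complete windows only: Alpha windows and the Bravo/Charlie windows from c.
alpha_windows = np.array([[1, 2, 3, 4, 5],
                          [2, 3, 4, 5, 6],
                          [3, 4, 5, 6, 7]])
new_windows = np.array([[3, 6, 9, 12, 10],
                        [6, 9, 12, 15, 12],
                        [9, 12, 15, 18, 15]])

result = rowwise_corr(alpha_windows, new_windows)
print(np.round(result, 6))
```

The three values agree with the last three rows of the Correlation1 and Correlation2 columns above. Rows that contain NaN propagate NaN through the means, so passing the full padded arrays leaves the first N-1 results as NaN.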