DataFrame of the mean of the top N most correlated columns

Time: 2016-06-07 23:32:55

Tags: python pandas

I have a dataframe df1 in which each column represents a time series of returns. I want to create a new dataframe df2 with one column for each column of df1, where each column of df2 is the mean of the 5 columns of df1 that are most correlated with the corresponding df1 column.

import pandas as pd
import numpy as np
from string import ascii_letters

np.random.seed([3,1415])
df1 = pd.DataFrame(np.random.randn(100, 10).round(2),
                   columns=list(ascii_letters[26:36]))

print(df1.head())

      A     B     C     D     E     F     G     H     I     J
0 -2.13 -1.27 -1.97 -2.26 -0.35 -0.03  0.32  0.35  0.72  0.77
1 -0.61  0.35 -0.35 -0.42 -0.91 -0.14  0.75 -1.50  0.61  0.40
2 -0.96  1.49 -0.35 -1.47  1.06  1.06  0.59  0.30 -0.77  0.83
3  1.49  0.26 -0.90  0.38 -0.52  0.05  0.95 -1.03  0.95  0.73
4  1.24  0.16 -1.34  0.16  1.26  0.78  1.34 -1.64 -0.20  0.13

I expect the head of the resulting dataframe, rounded to two decimal places, to be:

      A     B     C     D     E     F     G     H     I     J
0 -0.78 -0.70 -0.53 -0.45 -0.99 -0.10 -0.47 -0.86 -0.31 -0.64
1 -0.49 -0.11 -0.45 -0.03 -0.04  0.10 -0.26  0.11 -0.06 -0.10
2  0.03  0.13  0.54  0.33 -0.13  0.27  0.22  0.32  0.41  0.27
3 -0.22  0.13  0.19  0.58  0.63  0.24  0.34  0.51  0.32  0.22
4 -0.04  0.31  0.23  0.52  0.43  0.24  0.07  0.31  0.73  0.43

2 answers:

Answer 0 (score: 3)

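For each column of the correlation matrix, take the six largest values and discard the first (every column is 100% correlated with itself). A dictionary comprehension does this for every column.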

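A second dictionary comprehension looks up those columns in df1 and takes their row-wise mean. Build a dataframe from the result and reorder its columns to match df1 by appending [df1.columns].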

corr = df1.corr()
most_correlated_cols = {col: corr[col].nlargest(6)[1:].index
                        for col in corr}

df2 = pd.DataFrame({col: df1.loc[:, most_correlated_cols[col]].mean(axis=1) 
                    for col in df1})[df1.columns]

>>> df2.head()
       A      B      C      D      E      F      G      H      I      J
0 -0.782 -0.698 -0.526 -0.452 -0.994 -0.102 -0.472 -0.856 -0.310 -0.638
1 -0.486 -0.106 -0.454 -0.032 -0.042  0.100 -0.258  0.108 -0.064 -0.102
2  0.026  0.132  0.544  0.330 -0.130  0.272  0.224  0.320  0.414  0.274
3 -0.224  0.128  0.186  0.582  0.626  0.242  0.344  0.506  0.318  0.224
4 -0.044  0.310  0.230  0.518  0.428  0.238  0.068  0.306  0.734  0.432
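
To make the "six largest, drop the first" step concrete, here is a minimal sketch that applies it to the single column A and computes that column's mean by hand. It rebuilds df1 with the question's setup code; the names top6, peers and manual_mean are illustrative and are not part of the answer.

```python
import numpy as np
import pandas as pd
from string import ascii_letters

# Same setup as in the question.
np.random.seed([3, 1415])
df1 = pd.DataFrame(np.random.randn(100, 10).round(2),
                   columns=list(ascii_letters[26:36]))

corr = df1.corr()

# The six largest correlations with A; the first should be A itself.
top6 = corr["A"].nlargest(6)
print(top6)

# Drop the self-correlation and keep the five peer columns.
peers = top6.index[1:]

# Row-wise mean of the five peer columns.
manual_mean = df1[peers].mean(axis=1)
print(manual_mean.head().round(3))
```

If the random data reproduce the question's, manual_mean.head() should agree with column A of df2.head() above.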

Timings for this approach:

%%timeit
corr = df1.corr()
most_correlated_cols = {
   col: corr[col].nlargest(6)[1:].index
   for col in corr}
df2 = pd.DataFrame({col: df1.loc[:, most_correlated_cols[col]].mean(axis=1) 
                    for col in df1})[df1.columns]
100 loops, best of 3: 10 ms per loop

Timings for the approach in the other answer, which defines argsort and avg_of:

%%timeit
corr = df1.corr()
df2 = corr.apply(argsort).head(5).apply(lambda x: avg_of(x, df1))
100 loops, best of 3: 16 ms per loop
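
Since the title asks for the top N rather than a fixed 5, the same idea can be wrapped in a function. The following is only a sketch: mean_of_top_correlated and its parameter n are hypothetical names, and it removes the self-correlation by label with drop(col) instead of slicing off the first entry, which is an assumption about how one might want to handle a column that is perfectly correlated with another.

```python
import numpy as np
import pandas as pd


def mean_of_top_correlated(df, n=5):
    """Hypothetical helper: for each column, average its n most correlated peers."""
    corr = df.corr()
    result = {}
    for col in df.columns:
        # Drop the column's own entry by label, then keep the n largest correlations.
        peers = corr[col].drop(col).nlargest(n).index
        # Row-wise mean of the selected peer columns.
        result[col] = df[peers].mean(axis=1)
    # Passing `columns` keeps the original column order.
    return pd.DataFrame(result, columns=df.columns)


# Small demonstration frame (made-up data, not the question's).
rng = np.random.default_rng(0)
demo = pd.DataFrame(rng.standard_normal((50, 6)), columns=list("ABCDEF"))

out = mean_of_top_correlated(demo, n=2)
print(out.head())
```

With the question's data, mean_of_top_correlated(df1, n=5) is intended to match df2 from this answer whenever each column's largest correlation is the one with itself.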

Answer 1 (score: 0)

Setup

# Assumption: `df` is the input frame of returns, i.e. `df1` from the question.
df = df1

Solution

corr = df.corr()

# A security's correlation with itself should not be included.
# Because `corr` is symmetric, each column's name also appears in its index.
def remove_self(x):
    return x.loc[x.index != x.name]

# Drop the self-correlation, sort by correlation in descending order,
# and return the resulting index labels.
def argsort(x):
    return pd.Series(remove_self(x).sort_values(ascending=False).index)

# Select the columns of `df` named in `x` and take their row-wise mean.
def avg_of(x, df):
    return df.loc[:, x].mean(axis=1)

# Putting it all together.
df2 = corr.apply(argsort).head(5).apply(lambda x: avg_of(x, df))

print(df2.round(2).head())

      A     B     C     D     E     F     G     H     I     J
0 -0.78 -0.70 -0.53 -0.45 -0.99 -0.10 -0.47 -0.86 -0.31 -0.64
1 -0.49 -0.11 -0.45 -0.03 -0.04  0.10 -0.26  0.11 -0.06 -0.10
2  0.03  0.13  0.54  0.33 -0.13  0.27  0.22  0.32  0.41  0.27
3 -0.22  0.13  0.19  0.58  0.63  0.24  0.34  0.51  0.32  0.22
4 -0.04  0.31  0.23  0.52  0.43  0.24  0.07  0.31  0.73  0.43
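
The chained apply calls in this answer are easier to follow once the intermediate table is visible. The sketch below uses a small made-up four-column frame and a hypothetical function ranked_peers (a condensed stand-in for remove_self plus argsort) to show that the first apply produces a table of column names, ranked from most to least correlated, and that head(k) then keeps the top k names per column.

```python
import numpy as np
import pandas as pd

# Small made-up frame so the intermediate table is easy to read.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.standard_normal((50, 4)), columns=list("ABCD"))
corr = df.corr()


def ranked_peers(x):
    # `x` is one column of `corr`; `x.name` is that column's label.
    others = x.loc[x.index != x.name]
    # Labels of the other columns, most correlated first.
    return pd.Series(others.sort_values(ascending=False).index)


# Each column of `ranked` lists the other columns' names in rank order.
ranked = corr.apply(ranked_peers)
print(ranked)

# Keep the top 2 names per column, then average those columns row by row.
top2 = ranked.head(2)
means = top2.apply(lambda names: df[list(names)].mean(axis=1))
print(means.head())
```

Row 0 of ranked holds each column's most correlated peer, row 1 the second most correlated, and so on, which is why head(5) in the answer selects the five best peers.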