Fastest way to compute the average of a list of DataFrames

Date: 2018-05-07 07:39:31

Tags: python pandas refactoring

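I have a list of DataFrames of identical shape. I want to compute the element-wise average DataFrame across the list, and then the single overall mean value.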

Sample data can be generated as follows:

import numpy as np
import pandas as pd

d1 = np.random.normal(0, 0.5, 20).reshape(-1, 4)
d2 = np.random.normal(0, 0.5, 20).reshape(-1, 4)
d3 = np.random.normal(0, 0.5, 20).reshape(-1, 4)
df1 = pd.DataFrame(d1)
df2 = pd.DataFrame(d2)
df3 = pd.DataFrame(d3)

df_list = [df1, df2, df3]

What I currently do:

from functools import reduce
average_df = reduce(lambda x, y: x + y, df_list)/len(df_list)
average_df
          0         1         2         3
0 -0.034682 -0.022264 -0.138824 -0.146104
1  0.419488  0.383894 -0.152312 -0.306009
2  0.155335  0.317097 -0.225921 -0.178944
3 -0.383138 -0.120236  0.069074  0.050598
4  0.050671  0.368507  0.010924  0.394945

average_value = average_df.mean().mean()
average_value 
0.02560486119

Question: I don't think this is the best approach. Is there a faster way, or a built-in function, to do this?

2 Answers:

Answer 0: (score: 1)

If you don't mind discarding the axis labels, I would simply convert everything to a single np.array:

In [16]: df_list = [df1, df2, df3]

In [17]: df_list
Out[17]:
[          0         1         2         3
 0  0.132306 -0.364364  0.596958  0.406588
 1  0.831853  0.049103 -0.606819 -0.858509
 2  0.251377  0.656292 -0.402637 -0.079849
 3 -0.803913  0.047060  0.684442 -0.593213
 4 -0.376936  0.213803 -0.684231  0.042000,
           0         1         2         3
 0 -0.178879 -0.016869 -0.232023 -0.166521
 1 -0.588778  0.013769  0.540631  0.381502
 2 -0.995349  0.155972  0.023558 -0.307145
 3  0.462249 -0.742847  0.235321  0.395132
 4 -0.053568 -0.329233  0.132231  0.917006,
           0         1         2         3
 0  0.352663 -0.832304  0.072619 -0.393198
 1  1.038936  0.923296  0.657013 -0.034282
 2  0.090368  0.433762 -0.305223 -0.378425
 3  0.046863  0.248066 -0.418274 -0.522701
 4  0.222447 -0.322698 -0.262695 -0.718779]

In [18]: arr = np.array([df.values for df in df_list])

In [19]: arr
Out[19]:
array([[[ 0.1323056 , -0.36436411,  0.59695824,  0.4065878 ],
        [ 0.83185277,  0.04910304, -0.60681886, -0.85850892],
        [ 0.25137706,  0.6562918 , -0.4026369 , -0.07984943],
        [-0.80391254,  0.04706034,  0.68444161, -0.59321321],
        [-0.37693554,  0.21380315, -0.68423123,  0.04199972]],

       [[-0.17887865, -0.01686896, -0.23202261, -0.16652074],
        [-0.58877762,  0.01376924,  0.54063094,  0.38150206],
        [-0.99534857,  0.15597235,  0.02355771, -0.30714476],
        [ 0.46224899, -0.74284654,  0.23532056,  0.39513248],
        [-0.05356796, -0.3292326 ,  0.13223064,  0.91700633]],

       [[ 0.35266324, -0.83230408,  0.07261917, -0.39319835],
        [ 1.03893574,  0.92329583,  0.65701318, -0.03428247],
        [ 0.0903683 ,  0.43376195, -0.30522277, -0.37842503],
        [ 0.04686314,  0.24806568, -0.41827387, -0.52270129],
        [ 0.22244721, -0.32269779, -0.2626949 , -0.71877921]]])

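Then all you need is the mean along the first axis for the average frame, and the mean with no axis for the overall value: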

In [20]: arr.mean(axis=0)
Out[20]:
array([[ 0.10203006, -0.40451238,  0.1458516 , -0.05104376],
       [ 0.42733697,  0.3287227 ,  0.19694175, -0.17042977],
       [-0.21786774,  0.41534203, -0.22810065, -0.25513974],
       [-0.0982668 , -0.14924017,  0.16716277, -0.24026067],
       [-0.0693521 , -0.14604242, -0.27156516,  0.08007561]])

In [21]: arr.mean()
Out[21]: -0.021917893552528236
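
If the labels do matter, they can be reattached after the NumPy computation. The following is a minimal sketch, assuming every DataFrame in the list shares the same index and columns as the first one. The seeded generator and the use of `np.stack` with `to_numpy()` are illustrative choices, not part of the answer above.

```python
import numpy as np
import pandas as pd

# Illustrative sample data (seeded so the run is repeatable).
rng = np.random.default_rng(0)
df_list = [pd.DataFrame(rng.normal(0, 0.5, (5, 4))) for _ in range(3)]

# Stack into one array of shape (n_frames, n_rows, n_cols).
arr = np.stack([df.to_numpy() for df in df_list])

# Element-wise average across frames, with the labels of the first
# frame reattached (assumption: all frames are aligned identically).
average_df = pd.DataFrame(
    arr.mean(axis=0),
    index=df_list[0].index,
    columns=df_list[0].columns,
)

# Overall mean of every value in every frame.
average_value = arr.mean()
```

Unlike `df1 + df2`, this path does not align on labels, so frames whose rows or columns are ordered differently would be averaged positionally.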

Answer 1: (score: 0)

sum

sum(df_list) / len(df_list)

          0         1         2         3
0  0.102030 -0.404512  0.145851 -0.051044
1  0.427337  0.328723  0.196942 -0.170430
2 -0.217868  0.415342 -0.228101 -0.255140
3 -0.098267 -0.149240  0.167163 -0.240261
4 -0.069352 -0.146043 -0.271565  0.080076

pd.concat

pd.concat(df_list).groupby(level=0).mean()  # .mean(level=0) was removed in pandas 2.0

          0         1         2         3
0  0.102030 -0.404512  0.145851 -0.051044
1  0.427337  0.328723  0.196942 -0.170430
2 -0.217868  0.415342 -0.228101 -0.255140
3 -0.098267 -0.149240  0.167163 -0.240261
4 -0.069352 -0.146043 -0.271565  0.080076
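
To decide which variant is actually fastest, it may help to measure them on your own data. The sketch below is one possible way to do that, not a definitive benchmark: the `candidates` dictionary and its labels are hypothetical names introduced here, and the relative timings will depend on frame size, list length, and library versions.

```python
import timeit
from functools import reduce

import numpy as np
import pandas as pd

# Illustrative sample data (seeded so the run is repeatable).
rng = np.random.default_rng(0)
df_list = [pd.DataFrame(rng.normal(0, 0.5, (5, 4))) for _ in range(3)]

# The approaches discussed in this thread, wrapped as zero-argument callables.
candidates = {
    "reduce": lambda: reduce(lambda x, y: x + y, df_list) / len(df_list),
    "sum": lambda: sum(df_list) / len(df_list),
    "concat+groupby": lambda: pd.concat(df_list).groupby(level=0).mean(),
    "numpy": lambda: np.stack([df.to_numpy() for df in df_list]).mean(axis=0),
}

reference = np.asarray(candidates["reduce"]())
timings = {}
for name, fn in candidates.items():
    # Every approach should produce the same element-wise average.
    assert np.allclose(np.asarray(fn()), reference)
    timings[name] = timeit.timeit(fn, number=200)

# Print the fastest approach first.
for name, seconds in sorted(timings.items(), key=lambda kv: kv[1]):
    print(f"{name:>15}: {seconds:.4f} s for 200 calls")
```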