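Complex/multi-column operations on groupby

Time: 2017-01-19 14:43:34

Tags: python pandas

I want to perform multi-column operations (i.e. the correlate below) as well as operations that use the results of earlier calculations (i.e. the diff calculation below), without a for loop and using native pandas functionality such as groupby and agg. Is this possible?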

import pandas as pd
import datetime
import numpy as np

np.random.seed(0)
df = pd.DataFrame({'date': [datetime.datetime(2010,1,1)+datetime.timedelta(days=i*15) 
                            for i in range(0,100)],
                   'invested': np.random.random(100)*1e6,
                   'return': np.random.random(100),
                   'side': np.random.choice([-1, 1], 100)})

df['year'] = df['date'].apply(lambda x: x.year)

# want to get rid of the for loop below
ret_year = []
for year in df['year'].unique():
    df_this_year = df[df['year'] == year]
    min_short = df_this_year[df_this_year['side'] == -1]['return'].max()
    min_long = df_this_year[df_this_year['side'] == -1]['return'].min()
    min_diff = min_long - min_short
    avg_inv = df_this_year['invested'].mean()
    corr = np.correlate(df_this_year['invested'], df_this_year['return'])[0]
    ret_year.append({'year': year, 'min_short': min_short, 'min_long': min_long,
                     'min_diff': min_diff, 'avg_inv': avg_inv, 'corr': corr})

print(pd.DataFrame(ret_year))

Result:

         avg_inv          corr  min_diff  min_long  min_short  year
0  590766.254452  8.821215e+06 -0.664752  0.297437   0.962189  2010
1  490224.532564  6.122306e+06 -0.900289  0.019193   0.919483  2011
2  438330.806563  4.768964e+06 -0.929680  0.069167   0.998847  2012
3  373038.880789  4.677380e+06 -0.779678  0.164694   0.944372  2013
4  416817.752705  5.014249e+04  0.000000  0.434417   0.434417  2014


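1 answer:

Answer 0 (score: 2)

Instead of iterating with a for loop, use pandas groupby + apply. Put the date column in the index and group by year with pd.TimeGrouper('A'), where 'A' is the pandas date offset alias for year-end frequency.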

def calculate(x):
    min_short = x.loc[x['side'] == -1, 'return'].max()
    min_long = x.loc[x['side'] == -1, 'return'].min()
    min_diff = min_long - min_short
    avg_inv = x['invested'].mean()
    corr = np.correlate(x['invested'], x['return'])[0]
    return pd.Series([avg_inv, corr, min_diff, min_long, min_short], 
                     index=['avg_inv','corr','min_diff','min_long','min_short'])

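# set_index('date') puts the dates in the index, which the time-based grouper needs.
# pd.TimeGrouper was later deprecated in favor of pd.Grouper(freq='A').
df.set_index('date').groupby(pd.TimeGrouper('A')).apply(calculate).to_period('A')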


            avg_inv          corr  min_diff  min_long  min_short
date                                                            
2010  590766.254452  8.821215e+06 -0.664752  0.297437   0.962189
2011  490224.532564  6.122306e+06 -0.900289  0.019193   0.919483
2012  438330.806563  4.768964e+06 -0.929680  0.069167   0.998847
2013  373038.880789  4.677380e+06 -0.779678  0.164694   0.944372
2014  416817.752705  5.014249e+04  0.000000  0.434417   0.434417
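
Since pd.TimeGrouper was deprecated in later pandas releases, the same groupby + apply idea can also be sketched without any time-based grouper, by grouping on the year extracted from the date column. This is a minimal sketch rather than the code from the answer above: it assumes the same synthetic data as the question, builds the dates with pd.date_range for brevity, and passes df['date'].dt.year as the grouping key.

```python
import numpy as np
import pandas as pd

# Same synthetic data as in the question (dates are 15 days apart).
np.random.seed(0)
df = pd.DataFrame({
    'date': pd.date_range('2010-01-01', periods=100, freq='15D'),
    'invested': np.random.random(100) * 1e6,
    'return': np.random.random(100),
    'side': np.random.choice([-1, 1], 100),
})


def calculate(x):
    # x is the sub-frame holding the rows of a single year.
    # As in the question, both extremes are taken over the side == -1 rows.
    short_returns = x.loc[x['side'] == -1, 'return']
    min_short = short_returns.max()
    min_long = short_returns.min()
    return pd.Series({
        'avg_inv': x['invested'].mean(),
        'corr': np.correlate(x['invested'], x['return'])[0],
        'min_diff': min_long - min_short,  # reuses the two values above
        'min_long': min_long,
        'min_short': min_short,
    })


# Group on the calendar year of each date; no index change is needed.
by_year = df.groupby(df['date'].dt.year)
result = by_year[['invested', 'return', 'side']].apply(calculate)
result.index.name = 'year'
print(result)
```

Because calculate receives the whole sub-frame for each year, it can combine several columns and reuse intermediate values, which a plain agg with one function per column cannot do.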