Python pandas使用以矢量化方式滚动应用于groupby对象来计算机动车辆测试版

时间:2016-01-15 01:36:17

标签: python pandas vectorization apply beta

我有一个大数据框df,包含4列:

             id           period  ret_1m   mkt_ret_1m
131146       CAN00WG0     199609 -0.1538    0.047104
133530       CAN00WG0     199610 -0.0455   -0.014143
135913       CAN00WG0     199611  0.0000    0.040926
138334       CAN00WG0     199612  0.2952    0.008723
140794       CAN00WG0     199701 -0.0257    0.039916
143274       CAN00WG0     199702 -0.0038   -0.025442
145754       CAN00WG0     199703 -0.2992   -0.049279
148246       CAN00WG0     199704 -0.0919   -0.005948
150774       CAN00WG0     199705  0.0595    0.122322
153318       CAN00WG0     199706 -0.0337    0.045765

             id           period  ret_1m   mkt_ret_1m
160980       CAN00WH0     199709  0.0757    0.079293
163569       CAN00WH0     199710 -0.0741   -0.044000
166159       CAN00WH0     199711  0.1000   -0.014644
168782       CAN00WH0     199712 -0.0909   -0.007072
171399       CAN00WH0     199801 -0.0100    0.001381
174022       CAN00WH0     199802  0.1919    0.081924
176637       CAN00WH0     199803  0.0085    0.050415
179255       CAN00WH0     199804 -0.0168    0.018393
181880       CAN00WH0     199805  0.0427   -0.051279
184516       CAN00WH0     199806 -0.0656   -0.011516

             id           period  ret_1m   mkt_ret_1m
143275       CAN00WO0     199702 -0.1176   -0.025442
145755       CAN00WO0     199703 -0.0074   -0.049279
148247       CAN00WO0     199704 -0.0075   -0.005948
150775       CAN00WO0     199705  0.0451    0.122322

我正在尝试使用一个函数来计算一个称为beta的常见财务指标,该指标包含两个列,ret_1m,每月stock_return和ret_1m_mkt,同一时期的市场1个月回报(period_id)。我想应用一个函数(calc_beta)来计算12个月滚动基础上该函数的12个月结果。

为此,我创建了一个groupby对象:

grp = df.groupby('id')

我想做的是使用类似的东西:

period = 12
for stock, sub_df in grp:
    arg = sub_df[['ret_1m', 'mkt_ret_1m']]
    beta = pd.rolling_apply(arg, period, calc_beta, min_periods = period)

现在,这是第一个问题。根据文档,pd.rolling_apply arg可以是系列或数据框架。但是,似乎我提供的数据帧被转换为一个只能包含一列数据的numpy数组,而不是我试图提供的两个数据。所以我在下面的calc_beta代码不起作用,因为我需要传递股票和市场回报:

def calc_beta(np_array)
    s = np_array[:,0] # stock returns are column zero from numpy array
    m = np_array[:,1] # market returns are column one from numpy array

    covariance = np.cov(s,m) # Calculate covariance between stock and market
    beta = covariance[0,1]/covariance[1,1]
return beta

所以我的问题如下,我认为以这种方式列出它们是有意义的:

(i)  How can I pass a data frame/multiple series/numpy array with more than one column to calc_beta using rolling_apply?
(ii) How can I return more than one value (e.g. the beta) from the calc_beta function? 
(iii) Having calculated rolling quantities, how can I recombined with the original dataframe df so that I have the rolling quantities corresponding to the correct date in the period column?
(iv) Is there a better (vectorized) way of achieving this?  I have seen some similar questions using e.g. df.apply(pd.rolling_apply,period,??) but I did not understand how these worked.

我认为rolling_apply以前无法处理数据框,但文档表明它现在能够这样做。我的熊猫。版本是0.16.1。

感谢您的帮助!我已经失去了1.5天试图解决这个问题并且完全被难倒了。

最终,我想要的是这样的:

             id           period  ret_1m   mkt_ret_1m  beta  other_quantities
131146       CAN00WG0     199609 -0.1538    0.047104  0.521  xxx
133530       CAN00WG0     199610 -0.0455   -0.014143  0.627  xxxx
135913       CAN00WG0     199611  0.0000    0.040926  0.341  xxx
138334       CAN00WG0     199612  0.2952    0.008723  0.567  xx
140794       CAN00WG0     199701 -0.0257    0.039916  0.4612 xxx
143274       CAN00WG0     199702 -0.0038   -0.025442  0.215  xxx
145754       CAN00WG0     199703 -0.2992   -0.049279  0.4678  xxx
148246       CAN00WG0     199704 -0.0919   -0.005948  -0.4225  xxx
150774       CAN00WG0     199705  0.0595    0.122322  0.780  xxx
153318       CAN00WG0     199706 -0.0337    0.045765  0.623  xxx

             id           period  ret_1m   mkt_ret_1m  beta  other_quantities
160980       CAN00WH0     199709  0.0757    0.079293  -0.913  xx
163569       CAN00WH0     199710 -0.0741   -0.044000  0.894  xxx
166159       CAN00WH0     199711  0.1000   -0.014644  0.563  xxx
168782       CAN00WH0     199712 -0.0909   -0.007072  0.734  xxx
171399       CAN00WH0     199801 -0.0100    0.001381  0.894  xxxx
174022       CAN00WH0     199802  0.1919    0.081924  0.789  xx
176637       CAN00WH0     199803  0.0085    0.050415  0.1563  xxxx
179255       CAN00WH0     199804 -0.0168    0.018393  -0.64  xxxx
181880       CAN00WH0     199805  0.0427   -0.051279  -0.742  xxx
184516       CAN00WH0     199806 -0.0656   -0.011516  0.925  xxx

             id           period  ret_1m   mkt_ret_1m  beta
143275       CAN00WO0     199702 -0.1176   -0.025442  -1.52  xx
145755       CAN00WO0     199703 -0.0074   -0.049279  -0.632  xxx
148247       CAN00WO0     199704 -0.0075   -0.005948  1.521  xx
150775       CAN00WO0     199705  0.0451    0.122322  0.0321  xxx

5 个答案:

答案 0 :(得分:3)

我猜pd.rolling_apply在这种情况下没有帮助,因为在我看来它基本上只需要Series(即使数据帧通过,它也会处理一个列一次)。但是你总是可以自己编写一个带有数据帧的rolling_apply。

import pandas as pd
import numpy as np
from StringIO import StringIO

df = pd.read_csv(StringIO('''              id  period  ret_1m  mkt_ret_1m
131146  CAN00WG0  199609 -0.1538    0.047104
133530  CAN00WG0  199610 -0.0455   -0.014143
135913  CAN00WG0  199611  0.0000    0.040926
138334  CAN00WG0  199612  0.2952    0.008723
140794  CAN00WG0  199701 -0.0257    0.039916
143274  CAN00WG0  199702 -0.0038   -0.025442
145754  CAN00WG0  199703 -0.2992   -0.049279
148246  CAN00WG0  199704 -0.0919   -0.005948
150774  CAN00WG0  199705  0.0595    0.122322
153318  CAN00WG0  199706 -0.0337    0.045765
160980  CAN00WH0  199709  0.0757    0.079293
163569  CAN00WH0  199710 -0.0741   -0.044000
166159  CAN00WH0  199711  0.1000   -0.014644
168782  CAN00WH0  199712 -0.0909   -0.007072
171399  CAN00WH0  199801 -0.0100    0.001381
174022  CAN00WH0  199802  0.1919    0.081924
176637  CAN00WH0  199803  0.0085    0.050415
179255  CAN00WH0  199804 -0.0168    0.018393
181880  CAN00WH0  199805  0.0427   -0.051279
184516  CAN00WH0  199806 -0.0656   -0.011516
143275  CAN00WO0  199702 -0.1176   -0.025442
145755  CAN00WO0  199703 -0.0074   -0.049279
148247  CAN00WO0  199704 -0.0075   -0.005948
150775  CAN00WO0  199705  0.0451    0.122322'''), sep='\s+')



def calc_beta(df):
    np_array = df.values
    s = np_array[:,0] # stock returns are column zero from numpy array
    m = np_array[:,1] # market returns are column one from numpy array

    covariance = np.cov(s,m) # Calculate covariance between stock and market
    beta = covariance[0,1]/covariance[1,1]
    return beta

def rolling_apply(df, period, func, min_periods=None):
    if min_periods is None:
        min_periods = period
    result = pd.Series(np.nan, index=df.index)

    for i in range(1, len(df)+1):
        sub_df = df.iloc[max(i-period, 0):i,:] #I edited here
        if len(sub_df) >= min_periods:
            idx = sub_df.index[-1]
            result[idx] = func(sub_df)
    return result

df['beta'] = np.nan
grp = df.groupby('id')
period = 6 #I'm using 6  to see some not NaN values, since sample data don't have longer than 12 groups
for stock, sub_df in grp:
    beta = rolling_apply(sub_df[['ret_1m','mkt_ret_1m']], period, calc_beta, min_periods = period)  
    beta.name = 'beta'
    df.update(beta)
print df

输出

            id  period  ret_1m  mkt_ret_1m      beta
131146  CAN00WG0  199609 -0.1538    0.047104       NaN
133530  CAN00WG0  199610 -0.0455   -0.014143       NaN
135913  CAN00WG0  199611  0.0000    0.040926       NaN
138334  CAN00WG0  199612  0.2952    0.008723       NaN
140794  CAN00WG0  199701 -0.0257    0.039916       NaN
143274  CAN00WG0  199702 -0.0038   -0.025442 -1.245908
145754  CAN00WG0  199703 -0.2992   -0.049279  2.574464
148246  CAN00WG0  199704 -0.0919   -0.005948  2.657887
150774  CAN00WG0  199705  0.0595    0.122322  1.371090
153318  CAN00WG0  199706 -0.0337    0.045765  1.494095
...          ...     ...     ...         ...       ...
171399  CAN00WH0  199801 -0.0100    0.001381       NaN
174022  CAN00WH0  199802  0.1919    0.081924  1.542782
176637  CAN00WH0  199803  0.0085    0.050415  1.605407
179255  CAN00WH0  199804 -0.0168    0.018393  1.571015
181880  CAN00WH0  199805  0.0427   -0.051279  1.139972
184516  CAN00WH0  199806 -0.0656   -0.011516  1.101890
143275  CAN00WO0  199702 -0.1176   -0.025442       NaN
145755  CAN00WO0  199703 -0.0074   -0.049279       NaN
148247  CAN00WO0  199704 -0.0075   -0.005948       NaN
150775  CAN00WO0  199705  0.0451    0.122322       NaN

答案 1 :(得分:3)

尝试pd.rolling_cov()和pd.rolling.var(),如下所示:

import pandas as pd
import numpy as np
from StringIO import StringIO

    df = pd.read_csv(StringIO('''              id  period  ret_1m  mkt_ret_1m
    131146  CAN00WG0  199609 -0.1538    0.047104
    133530  CAN00WG0  199610 -0.0455   -0.014143
    135913  CAN00WG0  199611  0.0000    0.040926
    138334  CAN00WG0  199612  0.2952    0.008723
    140794  CAN00WG0  199701 -0.0257    0.039916
    143274  CAN00WG0  199702 -0.0038   -0.025442
    145754  CAN00WG0  199703 -0.2992   -0.049279
    148246  CAN00WG0  199704 -0.0919   -0.005948
    150774  CAN00WG0  199705  0.0595    0.122322
    153318  CAN00WG0  199706 -0.0337    0.045765
    160980  CAN00WH0  199709  0.0757    0.079293
    163569  CAN00WH0  199710 -0.0741   -0.044000
    166159  CAN00WH0  199711  0.1000   -0.014644
    168782  CAN00WH0  199712 -0.0909   -0.007072
    171399  CAN00WH0  199801 -0.0100    0.001381
    174022  CAN00WH0  199802  0.1919    0.081924
    176637  CAN00WH0  199803  0.0085    0.050415
    179255  CAN00WH0  199804 -0.0168    0.018393
    181880  CAN00WH0  199805  0.0427   -0.051279
    184516  CAN00WH0  199806 -0.0656   -0.011516
    143275  CAN00WO0  199702 -0.1176   -0.025442
    145755  CAN00WO0  199703 -0.0074   -0.049279
    148247  CAN00WO0  199704 -0.0075   -0.005948
    150775  CAN00WO0  199705  0.0451    0.122322'''), sep='\s+')

    df['beta'] = pd.rolling_cov(df['ret_1m'], df['mkt_ret_1m'], window=6) / pd.rolling_var(df['mkt_ret_1m'], window=6)

print df

输出:

              id  period  ret_1m  mkt_ret_1m      beta
131146  CAN00WG0  199609 -0.1538    0.047104       NaN
133530  CAN00WG0  199610 -0.0455   -0.014143       NaN
135913  CAN00WG0  199611  0.0000    0.040926       NaN
138334  CAN00WG0  199612  0.2952    0.008723       NaN
140794  CAN00WG0  199701 -0.0257    0.039916       NaN
143274  CAN00WG0  199702 -0.0038   -0.025442 -1.245908
145754  CAN00WG0  199703 -0.2992   -0.049279  2.574464
148246  CAN00WG0  199704 -0.0919   -0.005948  2.657887
150774  CAN00WG0  199705  0.0595    0.122322  1.371090
153318  CAN00WG0  199706 -0.0337    0.045765  1.494095
160980  CAN00WH0  199709  0.0757    0.079293  1.616520
163569  CAN00WH0  199710 -0.0741   -0.044000  1.630411
166159  CAN00WH0  199711  0.1000   -0.014644  0.651220
168782  CAN00WH0  199712 -0.0909   -0.007072  0.652148
171399  CAN00WH0  199801 -0.0100    0.001381  0.724120
174022  CAN00WH0  199802  0.1919    0.081924  1.542782
176637  CAN00WH0  199803  0.0085    0.050415  1.605407
179255  CAN00WH0  199804 -0.0168    0.018393  1.571015
181880  CAN00WH0  199805  0.0427   -0.051279  1.139972
184516  CAN00WH0  199806 -0.0656   -0.011516  1.101890
143275  CAN00WO0  199702 -0.1176   -0.025442  1.372437
145755  CAN00WO0  199703 -0.0074   -0.049279  0.031939
148247  CAN00WO0  199704 -0.0075   -0.005948 -0.535855
150775  CAN00WO0  199705  0.0451    0.122322  0.341747

答案 2 :(得分:1)

def rolling_apply(df, period, func, min_periods=None):
    if min_periods is None:
        min_periods = period
    result = pd.Series(np.nan, index=df.index)

    for i in range(1, len(df)):
        sub_df = df.iloc[max(i-period, 0):i,:] #get a subsample to run
        if len(sub_df) >= min_periods:
            idx = sub_df.index[-1]+1 # mind the forward looking bias,your return in time t should not be inclued in the beta calculating in time t
            result[idx] = func(sub_df)
    return result

我修复了Happy001's code的前瞻性偏见。这是一个财务问题,所以应该谨慎。

我发现vlmercado的答案很错误。如果仅使用pd.rolling_cov和pd.rolling_var,那么您在财务上会犯错误。首先,很明显第二个股票CAN00WH0没有任何NaN beta,因为它使用了CAN00WG0的返回值,这是完全错误的。其次,考虑这样一种情况:股票停产十年,您也可以将该样本纳入beta计算。

我发现pandas.rolling也适用于Timestamp,但是groupby似乎不行。因此,我更改了Happy001's code的代码。这不是最快的方法,但至少比原始代码快20倍。

crsp_daily['date']=pd.to_datetime(crsp_daily['date'])
crsp_daily=crsp_daily.set_index('date') # rolling needs a time serie index
crsp_daily.index=pd.DatetimeIndex(crsp_daily.index)
calc=crsp_daily[['permno','ret','mkt_ret']]
grp = calc.groupby('permno') #rolling beta for each stock
beta=pd.DataFrame()
for stock, sub_df in grp:
        sub2_df=sub_df[['ret','mkt_ret']].sort_index() 
        beta_m = sub2_df.rolling('1825d',min_periods=150).cov() # 5yr rolling beta , note that d for day, and you cannot use w/m/y, s/d are availiable.
        beta_m['beta']=beta_m['ret']/beta_m['mkt_ret']
        beta_m=beta_m.xs('mkt_ret',level=1,axis=0)
        beta=beta.append(pd.merge(sub_df,pd.DataFrame(beta_m['beta'])))
beta=beta.reset_index()
beta=beta[['date','permno','beta']]

答案 3 :(得分:1)

顺便说一句,由于没有人问python中的多变量滚动回归,我也找到了解决这个问题的方法。关键是先把它们堆成一列,然后在函数中重塑dataframe。

这是代码

import pandas as pd
import numpy as np
import timeit
from numba import jit 

@jit(nopython=True, cache=True,fastmath=True) # numba only support numpy but pandas, so pd.DataFrame is forbiddened.
def coefcalc(df,coefpos,varnum): 
# coefpos: which coef you need, for example I want alpha, so I would set the coefpos as 5 to obtain the alpha since the constant is in the last column in X. varnum: how many variables you put in df except "stkcd and date", in this sample, it's 7 (return,five fama factors, and a constant)

    if np.mod(df.shape[0],varnum)==0:
        df=df.reshape(df.shape[0]//varnum,varnum) # reshape the one column to n column for reg.
        Y=np.ascontiguousarray(df[:,0]) # rebuild a contigous numpy array for faster reg
        X=np.ascontiguousarray(df[:,1:])
        try:
            b=(np.linalg.inv(X.T@X))@X.T@Y
            result=b[coefpos]
        except:
            result=np.nan
        return result
    else:
        return np.nan
    
calc2=pd.read_csv(r'sample.csv')

# A way for rolling beta/alpha
calc2=calc2.set_index(['date','F_INFO_WINDCODE'])
calc2=calc2.dropna() # regression will drop Nan automatically
calc2=calc2.stack().reset_index().set_index('date') # put n columns into one columns, and let datetime64 variable (date) to be the index.
localtime = time.asctime( time.localtime(time.time()) )
print(localtime)
order_of_variable=5  # expect for y (return), start from zero.
total_number_variable=7 # stkcd/date/return/fama5/constant
required_sample=30*total_number_variable # Monthly data

# the parallel kwarg may require loops in def so I turn it off.
alphaest=calc2.groupby('F_INFO_WINDCODE').rolling('1095d',min_periods=required_sample)[0].apply(lambda x:coefcalc(x,5,7),engine='numba', raw=True,engine_kwargs={'nopython': True, 'nogil': True, 'parallel': False})

# as the pandas's document numba engine is faster than cpython when obs is over 1 million.
# you can check in https://pandas.pydata.org/pandas-docs/stable/user_guide/window.html , the numba engine part.

localtime = time.asctime( time.localtime(time.time()) )
print(localtime)

sample format

答案 4 :(得分:1)

好消息!在pandas 1.3.0中,rolling.apply增加了一个新的方法“Table”,一切都解决了!

这是一个示例代码。

def coefcalc(df):
Y=np.ascontiguousarray(df[:,0]) # rebuild a contigous numpy array for faster reg
X=np.ascontiguousarray(df[:,1:])
try:
    b=(np.linalg.inv(X.T@X))@X.T@Y
    return np.nan,b[0],b[1],b[2],b[3],b[4]
except:
    return np.nan,np.nan,np.nan,np.nan,np.nan,np.nan

fama5=pd.read_csv('F-F_Research_Data_5_Factors_2x3.csv')
fama5=fama5.iloc[0:695,:].rename(columns={'Mkt-RF':'Mkt_Rf'})
fama5['date']=pd.to_datetime(fama5.date,format='%Y%m',errors='ignore')
for var in ['Mkt_Rf','SMB','HML','RMW','CMA','RF']:
    fama5[var]=pd.to_numeric(fama5[var])/100
    fama5[var]=fama5[var].astype('float64')

fama5=fama5.set_index('date')
fama5=fama5.drop('RF',axis=1)

beta=fama5.rolling('100d', method="table", min_periods=0).apply(coefcalc, raw=True, engine="numba")

您可以从 Ken French 的主页下载 F-F_Research_Data_5_Factors_2x3.csv https://mba.tuck.dartmouth.edu/pages/faculty/ken.french/ftp/F-F_Research_Data_5_Factors_2x3_CSV.zip