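Python groupwise winsorization and linear regression

Time: 2016-05-16 16:10:24

Tags: python pandas linear-regression statsmodels

I am new to the Python world and have to work with a financial dataset. Suppose I have a DataFrame like the following: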

TradingDate StockCode       Size     ILLIQ
0    20050131    000001  13.980320   77.7522
1    20050131    000002  14.071253   19.1471
2    20050131    000004  10.805564  696.2428
3    20050131    000005  11.910485  621.3723
4    20050131    000006  11.631550  339.0952
*** ***

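What I want to do is run a groupwise OLS regression, where the grouping variable is TradingDate, the dependent variable is 'Size', and the independent variable is 'ILLIQ'. I want to append the regression residuals to the original DataFrame as a new column named 'Residual'. How can I do this?

The following code does not seem to work: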

def regress(data,yvar,xvars):
    Y = data[yvar]
    X = data[xvars]
    X['intercept']=1.
    result = sm.OLS(Y,X).fit()
    return result.resid()

by_Date = df.groupby('TradingDate')
by_Date.apply(regress,'ILLIQ',['Size'])

2 Answers:

Answer 0 (score: 0)

You only need to use .resid to access the residuals: .resid is an attribute, not a method (see the docs). A simplified illustration:

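import pandas as pd
import statsmodels.formula.api as sm
# set_index returns a new DataFrame (with inplace=True it would return None)
df = df.set_index('TradingDate')
# .values ignores the index, so this is safest when rows are already sorted by TradingDate
df['residuals'] = df.groupby(level=0).apply(lambda x: pd.DataFrame(sm.ols(formula="Size ~ ILLIQ", data=x).fit().resid)).values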

             StockCode       Size     ILLIQ  residuals
TradingDate                                           
20050131             1  13.980320   77.7522   0.299278
20050131             2  14.071253   19.1471   0.132318
20050131             4  10.805564  696.2428  -0.153800
20050131             5  11.910485  621.3723   0.621652
20050131             6  11.631550  339.0952  -0.899448
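
The `.values` assignment above matches residuals to rows by position. As an alternative, here is a minimal sketch that fits each date separately and lets pandas align the concatenated residuals on the original row index. The helper name `group_residuals` is hypothetical, the 20050228 rows are made-up sample values added only to have a second group, and the rows are deliberately unsorted; the column name `Residual` follows the question.

```python
import pandas as pd
import statsmodels.formula.api as smf

# Rows for 20050131 come from the question; rows for 20050228 are made up.
df = pd.DataFrame({
    "TradingDate": [20050228, 20050131, 20050131, 20050228, 20050131,
                    20050131, 20050228, 20050131, 20050228],
    "StockCode": ["000001", "000001", "000002", "000002", "000004",
                  "000005", "000004", "000006", "000005"],
    "Size": [13.95, 13.980320, 14.071253, 14.10, 10.805564,
             11.910485, 10.90, 11.631550, 11.85],
    "ILLIQ": [80.10, 77.7522, 19.1471, 21.30, 696.2428,
              621.3723, 650.00, 339.0952, 600.00],
})

def group_residuals(frame, formula="Size ~ ILLIQ"):
    """Fit one OLS per TradingDate; return residuals indexed like `frame`."""
    parts = [
        smf.ols(formula=formula, data=group).fit().resid  # attribute, not a call
        for _, group in frame.groupby("TradingDate")
    ]
    return pd.concat(parts)

# Assigning a Series aligns on the row index, so row order does not matter.
df["Residual"] = group_residuals(df)
print(df)
```

Building the result with `pd.concat` over the groups, rather than `groupby.apply`, sidesteps the question of what shape `apply` returns when every group yields a Series.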

Answer 1 (score: 0)

Setup

from io import StringIO
import pandas as pd
import statsmodels.api as sm

text = """TradingDate StockCode       Size     ILLIQ
0    20050131    000001  13.980320   77.7522
1    20050131    000002  14.071253   19.1471
2    20050131    000004  10.805564  696.2428
3    20050131    000005  11.910485  621.3723
4    20050131    000006  11.631550  339.0952"""

df = pd.read_csv(StringIO(text), sep=r'\s+',
                 converters=dict(TradingDate=pd.to_datetime))

Solution

def regress(data,yvar,xvars):
    # I changed this a bit to ensure proper dimensional alignment
    Y = data[[yvar]].copy()
    X = data[xvars].copy()
    X['intercept'] = 1
    result = sm.OLS(Y,X).fit()
    # resid is an attribute not a method
    return result.resid

def append_resids(df, yvar, xvars):
    """New helper to return DataFrame object within groupby apply"""
    df = df.copy()
    df['residuals'] = regress(df, yvar, xvars)
    return df

df.groupby('TradingDate').apply(lambda x: append_resids(x, 'ILLIQ', ['Size']))
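
Note that the call above regresses ILLIQ on Size, as in the question's code, while the question text names Size as the dependent variable. To show what the residual column contains in that second reading, here is a minimal sketch that computes per-date OLS residuals with an intercept using `numpy.linalg.lstsq` in place of statsmodels. The function name `ols_residuals` is hypothetical, and only the five sample rows from the question are used.

```python
import numpy as np
import pandas as pd

# The five sample rows from the question (a single trading date).
df = pd.DataFrame({
    "TradingDate": [20050131] * 5,
    "StockCode": ["000001", "000002", "000004", "000005", "000006"],
    "Size": [13.980320, 14.071253, 10.805564, 11.910485, 11.631550],
    "ILLIQ": [77.7522, 19.1471, 696.2428, 621.3723, 339.0952],
})

def ols_residuals(group, yvar, xvars):
    """Residuals of yvar regressed on xvars plus an intercept, for one group."""
    y = group[yvar].to_numpy(dtype=float)
    # Design matrix: a column of ones (intercept) followed by the regressors.
    X = np.column_stack([np.ones(len(group)), group[xvars].to_numpy(dtype=float)])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    # Residual = observed value minus fitted value, indexed like the group.
    return pd.Series(y - X @ beta, index=group.index)

parts = [ols_residuals(g, "Size", ["ILLIQ"]) for _, g in df.groupby("TradingDate")]
df["Residual"] = pd.concat(parts)
print(df)
```

With Size as the dependent variable, the residuals for these rows agree with the values printed in Answer 0.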