I'm new to Python and have to work with a financial dataset. Suppose I have a DataFrame like the following:
TradingDate StockCode Size ILLIQ
0 20050131 000001 13.980320 77.7522
1 20050131 000002 14.071253 19.1471
2 20050131 000004 10.805564 696.2428
3 20050131 000005 11.910485 621.3723
4 20050131 000006 11.631550 339.0952
*** ***
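What I want to do is run a grouped OLS regression, where the grouping variable is TradingDate, the dependent variable is 'Size', and the independent variable is 'ILLIQ'. I want to append the regression residuals to the original DataFrame as a new column named 'Residual'. How can I do this?
The following code doesn't seem to work: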
def regress(data,yvar,xvars):
    Y = data[yvar]
    X = data[xvars]
    X['intercept']=1.
    result = sm.OLS(Y,X).fit()
    return result.resid()
by_Date = df.groupby('TradingDate')
by_Date.apply(regress,'ILLIQ',['Size'])
Answer 0 (score: 0)
You just need to use .resid to access the residuals. .resid is an attribute, not a method (see docs). A simplified illustration:
import pandas as pd
import statsmodels.formula.api as sm
# set_index returns a new DataFrame; with inplace=True it returns None, so do not combine the two
df = df.set_index('TradingDate')
df['residuals'] = df.groupby(level=0).apply(lambda x: pd.DataFrame(sm.ols(formula="Size ~ ILLIQ", data=x).fit().resid)).values
StockCode Size ILLIQ residuals
TradingDate
20050131 1 13.980320 77.7522 0.299278
20050131 2 14.071253 19.1471 0.132318
20050131 4 10.805564 696.2428 -0.153800
20050131 5 11.910485 621.3723 0.621652
20050131 6 11.631550 339.0952 -0.899448
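As a cross-check that does not rely on row order the way assigning `.values` does, here is a minimal sketch that fits each TradingDate group separately and writes the residuals back by index. It swaps in NumPy least squares (`np.linalg.lstsq`) for statsmodels, and the helper name `ols_residuals` is hypothetical, not part of either library. It uses only the five rows shown in the question.

```python
import numpy as np
import pandas as pd

# The five rows shown in the question.
df = pd.DataFrame({
    "TradingDate": [20050131] * 5,
    "StockCode": ["000001", "000002", "000004", "000005", "000006"],
    "Size": [13.980320, 14.071253, 10.805564, 11.910485, 11.631550],
    "ILLIQ": [77.7522, 19.1471, 696.2428, 621.3723, 339.0952],
})

def ols_residuals(group, yvar, xvars):
    """Hypothetical helper: residuals of yvar regressed on xvars plus an intercept."""
    y = group[yvar].to_numpy(dtype=float)
    # Design matrix: a column of ones (intercept) followed by the regressors.
    X = np.column_stack([np.ones(len(group)), group[xvars].to_numpy(dtype=float)])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    # Keep the group's index so the result aligns with the original rows.
    return pd.Series(y - X @ beta, index=group.index)

# One fit per trading date; concat restores a single index-aligned Series.
parts = [ols_residuals(g, "Size", ["ILLIQ"]) for _, g in df.groupby("TradingDate")]
df["Residual"] = pd.concat(parts)
print(df)
```

Because the assignment aligns on the index, the residuals land on the right rows even if the DataFrame is not sorted by TradingDate.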
Answer 1 (score: 0)
from io import StringIO
import pandas as pd
import statsmodels.api as sm
text = """TradingDate StockCode Size ILLIQ
0 20050131 000001 13.980320 77.7522
1 20050131 000002 14.071253 19.1471
2 20050131 000004 10.805564 696.2428
3 20050131 000005 11.910485 621.3723
4 20050131 000006 11.631550 339.0952"""
# sep=r'\s+' splits on runs of whitespace
df = pd.read_csv(StringIO(text), sep=r'\s+',
                 converters=dict(TradingDate=pd.to_datetime))
def regress(data, yvar, xvars):
    # I changed this a bit to ensure proper dimensional alignment
    Y = data[[yvar]].copy()
    X = data[xvars].copy()
    X['intercept'] = 1
    result = sm.OLS(Y, X).fit()
    # resid is an attribute not a method
    return result.resid
def append_resids(df, yvar, xvars):
    """New helper to return DataFrame object within groupby apply"""
    df = df.copy()
    df['residuals'] = regress(df, yvar, xvars)
    return df
# The arguments mirror the call in the question; pass 'Size', ['ILLIQ'] to make
# Size the dependent variable as the question describes
df.groupby('TradingDate').apply(lambda x: append_resids(x, 'ILLIQ', ['Size']))
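The comment about `resid` being an attribute is the reason the question's code fails: writing `result.resid()` first retrieves the residuals and then tries to call them like a function. A small sketch of that failure, using a plain pandas Series as a stand-in for the residuals (an assumption for illustration; no model is fitted here):

```python
import pandas as pd

# Stand-in for result.resid: with pandas inputs, the residuals come back as a Series.
resid = pd.Series([0.3, -0.3], name="resid")

try:
    resid()  # same mistake as result.resid() in the question
except TypeError as exc:
    message = str(exc)

print(message)
```

Dropping the parentheses, as in `return result.resid`, avoids the error.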