Deprecated rolling window option in OLS from Pandas to Statsmodels

Time: 2016-05-19 08:22:54

Tags: python pandas deprecated statsmodels

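As the title suggests, where has the rolling-window option of the ols command in Pandas been migrated to in statsmodels? I can't seem to find it. Pandas tells me doom is in the works:

FutureWarning: The pandas.stats.ols module is deprecated and will be removed in a future version. We refer to external packages like statsmodels, see some examples here: http://statsmodels.sourceforge.net/stable/regression.html
  model = pd.ols(y=series_1, x=mmmm, window=50)

In fact, if you do something similar with statsmodels and pass the same window argument to its OLS, you get results (the window argument does not stop the code from running), but you only get the parameters of a regression run over the entire period, not the series of parameters for each rolling period that it is supposed to produce.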

4 Answers:

Answer 0: (score: 9)

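I created an ols module designed to mimic pandas' deprecated MovingOLS; it is available as pyfinance.ols.

It has three core classes:

  • OLS: static (single-window) ordinary least-squares regression. The outputs are NumPy arrays.
  • RollingOLS: rolling (multi-window) ordinary least-squares regression. The outputs are higher-dimensional NumPy arrays.
  • PandasRollingOLS: wraps the results of RollingOLS in pandas Series & DataFrames. Designed to mimic the look of the deprecated pandas module.

Note that the module is part of a package (which I'm currently in the process of uploading to PyPI) and requires one inter-package import.

The first two classes above are implemented entirely in NumPy and primarily use matrix algebra. RollingOLS also makes extensive use of broadcasting. The attributes largely mimic statsmodels' OLS RegressionResultsWrapper.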
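To make the "matrix algebra plus broadcasting" idea concrete, here is a minimal sketch of how rolling OLS coefficients can be computed in pure NumPy. It is an illustration only, not the module's actual implementation: the function name rolling_ols_coefs is hypothetical, and it assumes NumPy 1.20 or later for sliding_window_view.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def rolling_ols_coefs(y, x, window):
    """Hypothetical helper: OLS coefficients for every full trailing window.

    y: shape (n,), x: shape (n,) or (n, k).
    Returns an array of shape (n - window + 1, k + 1), intercept first.
    """
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    X = np.column_stack([np.ones(len(x)), x])  # prepend the constant

    # sliding_window_view appends the window axis last: (m, k + 1, window).
    # Transpose so each window is a regular (window, k + 1) design matrix.
    Xw = sliding_window_view(X, window, axis=0).transpose(0, 2, 1)
    yw = sliding_window_view(y, window)  # (m, window)

    # Batched normal equations: solve (X'X) b = X'y for all windows at once.
    Xt = Xw.transpose(0, 2, 1)
    XtX = Xt @ Xw                 # (m, k + 1, k + 1)
    Xty = Xt @ yw[..., None]      # (m, k + 1, 1)
    return np.linalg.solve(XtX, Xty)[..., 0]
```

Each row of the result corresponds to one window, so the first row covers observations 0 to window - 1. Solving the normal equations in a batch avoids a Python loop, at the cost of materialising one small matrix per window.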

An example:

import urllib.parse
import pandas as pd
from pyfinance.ols import PandasRollingOLS

# You can also do this with pandas-datareader; here's the hard way
url = "https://fred.stlouisfed.org/graph/fredgraph.csv"

syms = {
    "TWEXBMTH" : "usd", 
    "T10Y2YM" : "term_spread", 
    "GOLDAMGBD228NLBM" : "gold",
}

params = {
    "fq": "Monthly,Monthly,Monthly",
    "id": ",".join(syms.keys()),
    "cosd": "2000-01-01",
    "coed": "2019-02-01",
}

data = pd.read_csv(
    url + "?" + urllib.parse.urlencode(params, safe=","),
    na_values={"."},
    parse_dates=["DATE"],
    index_col=0
).pct_change().dropna().rename(columns=syms)
print(data.head())
#                  usd  term_spread      gold
# DATE                                       
# 2000-02-01  0.012580    -1.409091  0.057152
# 2000-03-01 -0.000113     2.000000 -0.047034
# 2000-04-01  0.005634     0.518519 -0.023520
# 2000-05-01  0.022017    -0.097561 -0.016675
# 2000-06-01 -0.010116     0.027027  0.036599

y = data.usd
x = data.drop('usd', axis=1)

window = 12  # months
model = PandasRollingOLS(y=y, x=x, window=window)

print(model.beta.head())  # Coefficients excluding the intercept
#             term_spread      gold
# DATE                             
# 2001-01-01     0.000033 -0.054261
# 2001-02-01     0.000277 -0.188556
# 2001-03-01     0.002432 -0.294865
# 2001-04-01     0.002796 -0.334880
# 2001-05-01     0.002448 -0.241902

print(model.fstat.head())
# DATE
# 2001-01-01    0.136991
# 2001-02-01    1.233794
# 2001-03-01    3.053000
# 2001-04-01    3.997486
# 2001-05-01    3.855118
# Name: fstat, dtype: float64

print(model.rsq.head())  # R-squared
# DATE
# 2001-01-01    0.029543
# 2001-02-01    0.215179
# 2001-03-01    0.404210
# 2001-04-01    0.470432
# 2001-05-01    0.461408
# Name: rsq, dtype: float64

Answer 1: (score: 6)

Rolling beta using sklearn
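As a rough illustration of the idea (a sketch under stated assumptions, not the answerer's code), a rolling beta can be obtained by refitting scikit-learn's LinearRegression on each trailing window. The function name rolling_beta and the single-regressor setup are assumptions made for this example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression


def rolling_beta(y, x, window):
    """Hypothetical helper: slope of y on x over each trailing window.

    y and x are pandas Series of equal length; the result is NaN until
    a full window of observations is available.
    """
    betas = pd.Series(np.nan, index=y.index, name="beta")
    model = LinearRegression()  # fits an intercept by default
    for end in range(window, len(y) + 1):
        xs = x.iloc[end - window:end].to_numpy().reshape(-1, 1)
        ys = y.iloc[end - window:end].to_numpy()
        betas.iloc[end - 1] = model.fit(xs, ys).coef_[0]
    return betas
```

Refitting in a Python loop is simple but not fast; for long series a vectorised NumPy approach is likely to be quicker.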


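Answer 2: (score: 0)

Adding, for completeness, a faster NumPy-only solution that limits the computation to the regression coefficients and the final estimate.

NumPy rolling regression function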

import numpy as np
import pandas as pd

def rolling_regression(y, x, window=60):
    """
    Rolling OLS of y on x plus a constant; y and x must be pandas.Series.
    Returns the fitted value at the last observation of each window,
    or None if fewer than `window` common observations are available.
    """
    # === Clean-up ========================================================
    x = x.dropna()
    y = y.dropna()
    # === Trim to the common index ========================================
    common = x.index.intersection(y.index)
    x = x[common]
    y = y[common]
    # === Verify enough data ==============================================
    if x.index.size < window:
        return None
    # === Add a constant ==================================================
    X = x.to_frame()
    X['c'] = 1
    X_values = X.values
    y_values = y.values
    # === Loop... this can be improved ====================================
    estimate_data = []
    for i in range(window, x.index.size + 1):
        # always index in np as opposed to pandas, much faster
        X_slice = X_values[i-window:i, :]
        y_slice = y_values[i-window:i]
        coeff = np.dot(np.dot(np.linalg.inv(np.dot(X_slice.T, X_slice)), X_slice.T), y_slice)
        # fitted value at the last observation of the current window
        estimate_data.append(coeff[0] * X_values[i-1, 0] + coeff[1])
    # === Assemble ========================================================
    estimate = pd.Series(data=estimate_data, index=x.index[window-1:])
    return estimate

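Notes

In some specific use cases that only require the final estimate of the regression, x.rolling(window=60).apply(my_ols) appears to be somewhat slow.

As a reminder, the regression coefficients can be computed as a matrix product, as you can read on Wikipedia's least squares page. Doing this with NumPy's matrix multiplication can be somewhat faster than using the ols in statsmodels. This product is expressed in the line starting with coeff = ...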
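For reference, the matrix product in question is the normal-equations solution, beta = (X'X)^-1 X'y. The short sketch below, on synthetic data invented for illustration, checks that it agrees with NumPy's built-in least-squares solver, which is generally the more numerically robust choice when X'X is close to singular.

```python
import numpy as np

# Synthetic data for illustration only: one regressor followed by a
# constant column, matching the column order used in the function above.
rng = np.random.default_rng(0)
X = np.column_stack([rng.normal(size=60), np.ones(60)])
y = 0.5 * X[:, 0] + 2.0 + rng.normal(scale=0.1, size=60)

# Normal equations: beta = (X'X)^-1 X'y
beta_normal = np.linalg.inv(X.T @ X) @ X.T @ y

# Same fit via NumPy's least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
```

Both vectors hold the slope first and the intercept second.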

Answer 3: (score: 0)

For a rolling trend in one column, just use:

import numpy as np
def calc_trend(window:int = 30):
    df['trend'] = df.rolling(window = window)['column_name'].apply(lambda x: np.polyfit(np.array(range(0,window)), x, 1)[0], raw=True)

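However, in my case I wanted to find the trend with respect to a date, and the date was in another column. I had to build the functionality by hand, but it is easy. First, convert the datetime column to a number representing the days elapsed since t_0: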

xdays = (df['Date'].values.astype('int64') - df['Date'][0].value) / (1e9*86400)
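The int64 conversion above assumes the column is stored with nanosecond resolution. As an alternative sketch that does not depend on the storage unit, the elapsed days can be derived from a timedelta; the small frame below is made-up data for illustration.

```python
import pandas as pd

# Made-up example frame; only the 'Date' column name comes from the answer.
df = pd.DataFrame(
    {"Date": pd.to_datetime(["2020-01-01", "2020-01-02", "2020-01-05"])}
)

# Days elapsed since the first row, as floats
xdays = (df["Date"] - df["Date"].iloc[0]).dt.total_seconds() / 86400
```

Here xdays is a pandas Series; call xdays.to_numpy() if a plain array is needed for slicing.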

Then:

def calc_trend(window:int=30):
    for t in range(len(df)):
        if t < window//2:
            continue
        i0 = t  - window//2 # Start window
        i1 = i0 + window    # End window
        xvec = xdays[i0:i1]
        yvec = df['column_name'][i0:i1].values
        df.loc[t,('trend')] = np.polyfit(xvec, yvec, 1)[0]