Pandas rolling OLS bug in version 0.12.0

Time: 2014-02-13 15:55:36

Tags: python pandas typeerror linear-regression

I have the following sample data for a rolling OLS calculation (here I am running from inside the debugger):

(Pdb) rhs
['Yield']

(Pdb) lhs
'Returns'

(Pdb) min_periods
12

(Pdb) window
60

(Pdb) intercept
True

(Pdb) print df[rhs].to_string()
                 Yield
EndOfMonthDate        
2001-08-31      0.0561
2001-09-28      0.0360
2001-10-31      0.0500
2001-11-30      0.0500
2001-12-31      0.0500
2002-01-31      0.0191
2002-02-28      0.0563
2002-03-29      0.0557
2002-04-30      0.0600
2002-05-31      0.0569
2002-06-28      0.0571
2002-07-31      0.0450
2002-08-30      0.0416
2002-09-30      0.0360
2002-10-31      0.0395
2002-11-29      0.0422
2010-05-31      0.0323
2010-06-30      0.0311
2010-07-30      0.0300
2010-07-30      0.0300
2010-08-31      0.0251
2010-08-31      0.0251
2010-09-30      0.0250
2010-10-29      0.0271
2010-11-30      0.0287
2010-12-31      0.0347
2010-12-31      0.0347
2012-01-31      0.0201
2012-02-29      0.0197
2012-03-30      0.0220
2012-04-30      0.0199
2012-07-31      0.0141

(Pdb) print df[lhs].to_string()
2001-08-31        -0.005519
2001-09-28        -0.350356
2001-10-31        10.003698
2001-11-30         3.230476
2001-12-31        -3.776050
2002-01-31         9.153807
2002-02-28        -4.175085
2002-03-29        46.890701
2002-04-30       -15.747041
2002-05-31         2.797472
2002-06-28        -1.000851
2002-07-31       -13.398200
2002-08-30        -1.707745
2002-09-30         2.054250
2002-10-31         0.000620
2002-11-29        -9.790426
2010-05-31         0.000012
2010-06-30         0.000012
2010-07-30        -1.745182
2010-07-30        -0.000006
2010-08-31       -20.779633
2010-08-31         0.000000
2010-09-30        -0.000006
2010-10-29        -0.000012
2010-11-30        -0.000006
2010-12-31        30.165554
2010-12-31        -2.549851
2012-01-31        -6.892008
2012-02-29        -1.638216
2012-03-30         4.295588
2012-04-30        -7.094216
2012-07-31        -0.041252

When I try the rolling OLS:

(Pdb) pandas.ols(y=df[lhs], x=df[rhs], window=window, min_periods=min_periods, intercept=intercept)
*** TypeError: unsupported operand type(s) for +: 'slice' and 'int'

But if I just try an ordinary OLS over the whole data range, it appears to work fine:

(Pdb) pandas.ols(y=df[lhs], x=df[rhs], intercept=intercept)

-------------------------Summary of Regression Analysis-------------------------

Formula: Y ~ <Yield> + <intercept>

Number of Observations:         38
Number of Degrees of Freedom:   2

R-squared:         0.0226
Adj R-squared:    -0.0046

Rmse:             12.5182

F-stat (1, 36):     0.8321, p-value:     0.3677

Degrees of Freedom: model 1, resid 36

-----------------------Summary of Estimated Coefficients------------------------
      Variable       Coef    Std Err     t-stat    p-value    CI 2.5%   CI 97.5%
--------------------------------------------------------------------------------
         Yield   146.6702   160.7874       0.91     0.3677  -168.4732   461.8135
     intercept    -4.6083     6.0652      -0.76     0.4523   -16.4961     7.2795
---------------------------------End of Summary---------------------------------

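Is this a known bug in pandas.ols when attempting a rolling regression? The data set is small, but I see no obvious defect in it that would prevent a rolling regression over 12 to 60 observations in this case.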

Here is the full traceback I get when running outside the debugger:

  File "properties.pyx", line 31, in pandas.lib.cache_readonly.__get__ (pandas/lib.c:28841)
  File "/opt/epd/7.3-2_pandas0.12/lib/python2.7/site-packages/pandas/stats/ols.py", line 656, in beta
    return DataFrame(self._beta_raw,
  File "properties.pyx", line 31, in pandas.lib.cache_readonly.__get__ (pandas/lib.c:28841)
  File "/opt/epd/7.3-2_pandas0.12/lib/python2.7/site-packages/pandas/stats/ols.py", line 775, in _beta_raw
    beta, indices, mask = self._rolling_ols_call
  File "properties.pyx", line 31, in pandas.lib.cache_readonly.__get__ (pandas/lib.c:28841)
  File "/opt/epd/7.3-2_pandas0.12/lib/python2.7/site-packages/pandas/stats/ols.py", line 789, in _rolling_ols_call
    return self._calc_betas(self._x_trans, self._y_trans)
  File "/opt/epd/7.3-2_pandas0.12/lib/python2.7/site-packages/pandas/stats/ols.py", line 803, in _calc_betas
    cum_xx = self._cum_xx(x)
  File "/opt/epd/7.3-2_pandas0.12/lib/python2.7/site-packages/pandas/stats/ols.py", line 865, in _cum_xx
    x_slice = slicer(x, date)
  File "/opt/epd/7.3-2_pandas0.12/lib/python2.7/site-packages/pandas/stats/ols.py", line 856, in slicer
    return df.values[i:i + 1, :]
TypeError: unsupported operand type(s) for +: 'slice' and 'int'

The offending code appears to be in this function from ols.py in pandas 0.12:

def _cum_xx(self, x):
    dates = self._index
    K = len(x.columns)
    valid = self._time_has_obs
    cum_xx = []

    slicer = lambda df, dt: df.truncate(dt, dt).values
    if not self._panel_model:
        _get_index = x.index.get_loc

        def slicer(df, dt):
            i = _get_index(dt)
            return df.values[i:i + 1, :]

    last = np.zeros((K, K))

    for i, date in enumerate(dates):
        if not valid[i]:
            cum_xx.append(last)
            continue

        x_slice = slicer(x, date)
        xx = last = last + np.dot(x_slice.T, x_slice)
        cum_xx.append(xx)

    return cum_xx

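_get_index is a proxy for x.index.get_loc, whose documentation says it can return a slice object. But the code right after it assumes that the value i obtained this way is an integer, so that i + 1 makes sense.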
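To show the failure in isolation, here is a minimal sketch in plain Python with no pandas involved. The helper `rows_for` and the `rows` list are hypothetical names made up for illustration; they are not part of ols.py. The sketch assumes the location is either an int or a slice with concrete start and stop values, and it does not handle the boolean-mask case.

```python
# Hypothetical stand-in for the two shapes a label lookup may return:
# an int for a unique label, or a slice for a repeated label.
def rows_for(loc):
    """Return (start, stop) row bounds for an int or slice location."""
    if isinstance(loc, slice):
        # Assumes start and stop are set and step is None.
        return loc.start, loc.stop
    return loc, loc + 1

# Made-up rows standing in for df.values.
rows = [[1.0, 0.05], [1.0, 0.03], [1.0, 0.03], [1.0, 0.025]]

loc = slice(1, 3)  # the kind of result that breaks `i + 1`
try:
    rows[loc:loc + 1]  # same pattern as df.values[i:i + 1, :]
except TypeError as exc:
    message = str(exc)

# Normalizing the location first avoids the arithmetic on a slice.
start, stop = rows_for(loc)
selected = rows[start:stop]
print(message)
print(selected)
```

Whether summing over every matching row is the right behavior for the regression is a separate question. The sketch only shows why the arithmetic fails and one way to guard against it.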

I tracked down the source of get_loc. It turns out that x.index.get_loc is a proxy for x.index._engine.get_loc. In my case, the relevant _engine_type of the index when the error occurs is just ObjectEngine, defined in this source location, and get_loc is defined there:

cpdef get_loc(self, object val):
    if is_definitely_invalid_key(val):
        raise TypeError

    if self.over_size_threshold and self.is_monotonic:
        if not self.is_unique:
            return self._get_loc_duplicates(val)
        values = self._get_index_values()
        loc = _bin_search(values, val) # .searchsorted(val, side='left')
        if util.get_value_at(values, loc) != val:
            raise KeyError(val)
        return loc

    self._ensure_mapping_populated()
    if not self.unique:
        return self._get_loc_duplicates(val)

    self._check_type(val)

    try:
        return self.mapping.get_item(val)
    except TypeError:
        raise KeyError(val)

I am investigating when and why get_loc would return a slice for me (there are definitely no duplicates in the index, which is the only way the documentation suggests this can happen). In the meantime, any suggestions along these lines would be helpful.
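One way to test the no-duplicates assumption is sketched below. The printed sample above appears to list some dates twice (2010-07-30, for example), so a check like this may be worthwhile. The index here is a small made-up one that borrows a few dates and yields from the sample. The calls are from recent pandas releases and may not all exist in 0.12. The deduplication step at the end is an assumed workaround, not something the pandas documentation prescribes for pandas.ols.

```python
import pandas as pd

# Illustrative index; the repeated date mirrors rows in the printed sample.
idx = pd.to_datetime(["2010-06-30", "2010-07-30", "2010-07-30", "2010-08-31"])

print(idx.is_unique)              # False when any label repeats
repeated = idx[idx.duplicated()]  # second and later occurrences of a label
print(repeated)

# For a sorted index with a repeated label, get_loc returns a slice
# covering all matching positions rather than a single integer.
loc = idx.get_loc(pd.Timestamp("2010-07-30"))
print(type(loc).__name__, loc)

# Assumed workaround: keep the first row per date before the regression.
frame = pd.DataFrame({"Yield": [0.0311, 0.0300, 0.0300, 0.0251]}, index=idx)
deduped = frame[~frame.index.duplicated(keep="first")]
print(deduped)
```

If the rows sharing a date are distinct observations rather than accidental repeats, dropping them changes the regression. Aggregating per date, or building a unique index some other way, may suit the data better.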

0 answers:

No answers