Question

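I can use statsmodels' WLS (weighted least squares regression) without trouble when I have many data points. But when I try to use WLS on a single sample from the dataset, I seem to run into a problem with numpy arrays.

What I mean is: if my dataset X is a 2D array with many rows, WLS works fine. But if I try to run it on a single row, it doesn't. You can see what I mean in the code below: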

import sys
from sklearn.externals.six.moves import xrange
from sklearn.metrics import accuracy_score
import pylab as pl
from sklearn.externals.six.moves import zip
import numpy as np
import statsmodels.api as sm
from statsmodels.sandbox.regression.predstd import wls_prediction_std

# this is my dataset X, with 10 rows
X = np.array([[1,2,3],[1,2,3],[4,5,6],[1,2,3],[4,5,6],[1,2,3],[1,2,3],[4,5,6],[4,5,6],[1,2,3]])
# this is my response vector, y, also with 10 rows
y = np.array([1, 1, 0, 1, 0, 1, 1, 0, 0, 1])
# weights, 10 rows
weights = np.array([ 0.1 , 0.1, 0.1 , 0.1, 0.1 , 0.1, 0.1 , 0.1, 0.1 , 0.1 ])

# the line below, using all 10 rows of X, gives no errors but is commented out
# mod_wls = sm.WLS(y, X, weights)
# and this is the line I need, which is giving errors:
mod_wls = sm.WLS(np.array(y[0]), np.array([X[0]]),np.array([weights[0]]))

The last line above was originally just mod_wls = sm.WLS(y[0], X[0], weights[0])

But that gave me errors like object of type 'numpy.float64' has no len(), so I turned them into arrays. Now I keep getting this error instead:

Traceback (most recent call last):
  File "C:\Users\app\Documents\Python Scripts\test.py", line 53, in <module>
    mod_wls = sm.WLS(np.array(y[0]), np.array([X[0]]),np.array([weights[0]]))
  File "C:\Users\app\Anaconda\lib\site-packages\statsmodels\regression\linear_model.py", line 383, in __init__
    weights=weights, hasconst=hasconst)
  File "C:\Users\app\Anaconda\lib\site-packages\statsmodels\regression\linear_model.py", line 79, in __init__
    super(RegressionModel, self).__init__(endog, exog, **kwargs)
  File "C:\Users\app\Anaconda\lib\site-packages\statsmodels\base\model.py", line 136, in __init__
    super(LikelihoodModel, self).__init__(endog, exog, **kwargs)
  File "C:\Users\app\Anaconda\lib\site-packages\statsmodels\base\model.py", line 52, in __init__
    self.data = handle_data(endog, exog, missing, hasconst, **kwargs)
  File "C:\Users\app\Anaconda\lib\site-packages\statsmodels\base\data.py", line 401, in handle_data
    return klass(endog, exog=exog, missing=missing, hasconst=hasconst, **kwargs)
  File "C:\Users\app\Anaconda\lib\site-packages\statsmodels\base\data.py", line 78, in __init__
    self._check_integrity()
  File "C:\Users\app\Anaconda\lib\site-packages\statsmodels\base\data.py", line 249, in _check_integrity
    print len(self.endog)
TypeError: len() of unsized object

So, to see what was wrong with the lengths, I did this:

print "y size: "
print len(np.array([y[0]]))
print "X size"
print len (np.array([X[0]]))
print "weights size"
print len(np.array([weights[0]]))

and got this output:

y size: 
1
X size
1
weights size
1

Then I tried this:

print "x shape"
print X[0].shape
print "y shape"
print y[0].shape

The output was:

x shape
(3L,)
y shape
()

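Line 249 of data.py (the line the error refers to) is inside this function, where I added a bunch of print statements for the sizes to see what was going on: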

def _check_integrity(self):
    if self.exog is not None:
        print "exog size: " 
        print len(self.exog)            
        print "endog size"
        print len(self.endog) # <-- this, and the line below are causing the error
        if len(self.exog) != len(self.endog):
            raise ValueError("endog and exog matrices are different sizes")

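len(self.endog) seems to be the problem. Yet when I print len(np.array([y[0]])), it just outputs 1. Somehow, when y gets into the _check_integrity function and becomes endog, it doesn't behave the same way... or is something else going on?

What should I do? I'm working with an algorithm where I really do need to run WLS separately for each row of X.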

Answer 1

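There is no such thing as WLS for a single observation. When the weights are normalized to sum to 1, a single weight simply becomes 1. If you want to do this anyway (I would not), just use OLS. The solution will be a consequence of the SVD rather than of any actual relationship in the data.

OLS solution using pinv / SVD: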
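To make the point about the weight concrete, here is a small NumPy-only sketch. It is not the statsmodels implementation: it treats weighted least squares as ordinary least squares on data scaled by the square root of the weight, and the helper name `wls_single_row` and the weight values are made up for illustration. With a single row the scale factor cancels, so every positive weight should give the same coefficients.

```python
import numpy as np

x_row = np.array([[1.0, 2.0, 3.0]])   # one observation, kept 2-D: shape (1, 3)
y_row = np.array([1.0])               # matching response, shape (1,)

def wls_single_row(weight):
    # Weighted least squares written as OLS on rows scaled by sqrt(weight).
    # Illustrative helper only, not a statsmodels function.
    scale = np.sqrt(weight)
    beta, _, _, _ = np.linalg.lstsq(scale * x_row, scale * y_row, rcond=None)
    return beta

beta_small = wls_single_row(0.1)      # the weight used in the question
beta_large = wls_single_row(1000.0)   # a very different weight
beta_ols = wls_single_row(1.0)        # plain OLS

print(np.allclose(beta_small, beta_ols), np.allclose(beta_large, beta_ols))
```

Both comparisons should come out true, which is the sense in which a one-row WLS is just OLS.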

np.dot(np.linalg.pinv(X[[0]]), y[0])
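Note the `X[[0]]` indexing in that line: indexing with a list keeps the row axis, so the result is still a 2D array. This also seems to explain the `len() of unsized object` error in the question, where the failing call passes `np.array(y[0])` (no inner brackets) while the length check was run on `np.array([y[0]])`. A short sketch of the difference, using a shortened copy of the question's data:

```python
import numpy as np

X = np.array([[1, 2, 3], [4, 5, 6]])
y = np.array([1, 0])

scalar_y = np.array(y[0])      # 0-d array, shape (): len() raises TypeError
wrapped_y = np.array([y[0]])   # 1-d array, shape (1,)
row_y = y[[0]]                 # list index keeps the axis, shape (1,)
row_X = X[[0]]                 # shape (1, 3): a 2-D array with a single row
flat_X = X[0]                  # shape (3,): the row axis is dropped

try:
    len(scalar_y)
    has_len = True
except TypeError:
    has_len = False            # a 0-d array is "unsized"

print(scalar_y.shape, wrapped_y.shape, row_y.shape, row_X.shape, flat_X.shape)
```

Passing one-row arrays such as `y[[0]]` and `X[[0]]` avoids the length error, though it does not change the point above that a one-row fit is not a meaningful regression.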

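Though you could make up any answer that fits and get the same result. I'm not sure offhand exactly how the properties of the SVD solution compare with those of the other non-unique solutions.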

[~/]
[26]: beta = [-.5, .25, 1/3.]

[~/]
[27]: np.dot(beta, X[0])
[27]: 1.0
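One property that can be checked directly: the Moore-Penrose pseudoinverse returns the minimum-Euclidean-norm solution among all coefficient vectors that fit the row exactly. The sketch below compares the pinv solution with the made-up `beta` from the session above. The variable names are illustrative.

```python
import numpy as np

x0 = np.array([[1.0, 2.0, 3.0]])    # single row, shape (1, 3)
y0 = 1.0

# Pseudoinverse (SVD-based) solution, flattened to shape (3,)
beta_pinv = np.dot(np.linalg.pinv(x0), y0).ravel()

# The hand-picked coefficients from the session above
beta_alt = np.array([-0.5, 0.25, 1 / 3.0])

# Both reproduce y0 exactly, so the fit alone cannot tell them apart
fit_pinv = np.dot(x0, beta_pinv)[0]
fit_alt = np.dot(x0, beta_alt)[0]

# The pinv solution has the smaller Euclidean norm
norm_pinv = np.linalg.norm(beta_pinv)
norm_alt = np.linalg.norm(beta_alt)

print(fit_pinv, fit_alt, norm_pinv < norm_alt)
```

For a single row the pinv solution is just the row scaled by y0 / (x0 · x0), here [1, 2, 3] / 14, which again shows that it reflects the geometry of the one data point rather than any relationship in the data.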

Python error: len() of unsized object when using statsmodels with one row of data
