statsmodels: What are the allowable formats to give to result.predict() for out-of-sample prediction using a formula?

Time: 2016-06-27 03:05:14

Tags: python pandas statsmodels

I am trying to impute some values in a Pandas DataFrame using statsmodels in Python.

The third and fourth attempts below (df2 and df3) raise the following error. It seems a strange error, since a DataFrame never has such an attribute.

*** AttributeError: 'DataFrame' object has no attribute 'design_info'

In any case, I do not understand what I should pass to predict() in order to predict the missing value of A in df2. It would also be nice if the df3 case gave me a prediction containing np.nan for the last element.

import pandas as pd
import numpy as np
import statsmodels.formula.api as sm

df0 = pd.DataFrame({"A": [10, 20, 30, 324, 2353],
                    "B": [20, 30, 10, 100, 2332],
                    "C": [0, -30, 120, 11, 2]})
result0 = sm.ols(formula="A ~ B + C", data=df0).fit()
print(result0.summary())
test0 = result0.predict(df0)  # works
print(test0)

df1 = pd.DataFrame({"A": [10, 20, 30, 324, 2353],
                    "B": [20, 30, 10, 100, 2332],
                    "C": [0, -30, 120, 11, 2]})
result1 = sm.ols(formula="A ~ B + I(C**2)", data=df1).fit()
print(result1.summary())
test1 = result1.predict(df1)  # works
print(test1)

df2 = pd.DataFrame({"A": [10, 20, 30, 324, 2353, np.nan],
                    "B": [20, 30, 10, 100, 2332, 2332],
                    "C": [0, -30, 120, 11, 2, 2]})
result2 = sm.ols(formula="A ~ B + C", data=df2).fit()
print(result2.summary())
test2 = result2.predict(df2)  # Fails
newvals = df2[['B', 'C']].dropna()
test2 = result2.predict(newvals)  # Fails
test2 = result2.predict(dict([[vv, df2[vv].values] for vv in newvals.columns]))  # Fails

df3 = pd.DataFrame({"A": [10, 20, 30, 324, 2353, 2353],
                    "B": [20, 30, 10, 100, 2332, np.nan],
                    "C": [0, -30, 120, 11, 2, 2]})
result3 = sm.ols(formula="A ~ B + C", data=df3).fit()
print(result3.summary())
test3 = result3.predict(df3)  # Fails

Update using prerelease statsmodels

Using a new release candidate for statsmodels 0.8, the df2 example above now works fine. However, the df3 example fails on result3.predict(df3) with:

ValueError: Wrong number of items passed 5, placement implies 6

Dropping the last row, which contains the np.nan, gives correct predictions for the rows that can be predicted.

It would still be nice to have an option to pass the entire df3 and receive np.nan as the prediction for the last row.
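One possible workaround, sketched here under the assumption of a reasonably recent statsmodels and pandas (this is not an option built into predict()): predict only on the rows whose predictors are complete, then reindex the result against the full frame so that rows with missing predictors come back as np.nan. The names complete, pred and pred_full are illustrative only.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as sm

df3 = pd.DataFrame({"A": [10, 20, 30, 324, 2353, 2353],
                    "B": [20, 30, 10, 100, 2332, np.nan],
                    "C": [0, -30, 120, 11, 2, 2]})

# The formula interface drops rows with missing values when fitting.
result3 = sm.ols(formula="A ~ B + C", data=df3).fit()

# Keep only the rows where every predictor is present.
complete = df3[["B", "C"]].dropna()

# Wrap the prediction in a Series carrying the index of the complete rows
# (np.asarray guards against versions that return a bare array).
pred = pd.Series(np.asarray(result3.predict(complete)), index=complete.index)

# Reindexing against the full frame inserts NaN for the rows that were dropped.
pred_full = pred.reindex(df3.index)
print(pred_full)
```

The result has one entry per row of df3, with np.nan wherever B or C was missing.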

1 Answer:

Answer 0 (score: 0)

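By way of answering this question, here is my resulting method for filling in some values in a DataFrame using an arbitrary (OLS) model. It drops np.nans as needed before predicting.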

#!/usr/bin/python
import statsmodels.formula.api as sm
import pandas as pd
import numpy as np

def df_impute_values_ols(adf,outvar,model,  verbose=True):
    """Specify a Pandas DataFrame with some null (eg. np.nan) values in column <outvar>.
    Specify a string model (in statsmodels format, which is like R) to use to predict them when they are missing. Nonlinear transformations can be specified in this string.

    e.g.: model='  x1 + np.sin(x1) + I((x1-5)**2) '

    At the moment, this uses OLS, so outvar should be continuous. 

    Two DataFrames are returned: one containing just the updated rows and a
    subset of columns, and a version of the incoming DataFrame in which the
    null values of <outvar> are filled in (for rows that have all of the
    model variables), using single imputation.

    This is written to work with statsmodels 0.6.1 (see https://github.com/statsmodels/statsmodels/issues/2171 ) ie this is written in order to avoid ANY NaN's in the modeldf. That should be less necessary in future versions.

    To do: 
    - Add plots to  verbose mode 
    - Models other than OLS should be offered

    Issues:
    - the "horrid kluge" line below will give trouble if there are        
      column names that are part of other column names. This kludge should be 
      temporary, anyway, until statsmodels 0.8 is fixed and released. 

    The latest version of this method will be at 
     https://github.com/cpbl/cpblUtilities/ in stats/
    """
    formula=outvar+' ~ '+model
    rhsv=[vv for vv in adf.columns if vv in model] # This is a horrid kluge! Ne
    updateIndex= adf[pd.isnull(adf[outvar]) ] [rhsv].dropna().index
    modeldf=adf[[outvar]+rhsv].dropna()
    results=sm.ols(formula, data=modeldf).fit()
    if verbose:
        print(results.summary())
    newvals=adf[pd.isnull(adf[outvar])][rhsv].dropna()
    newvals[outvar] = results.predict(newvals)
    adf.loc[updateIndex,outvar]=newvals[outvar]
    if verbose:
        print(' %d rows updated for %s'%(len(newvals),outvar))
    return(newvals, adf)


def test_df_impute_values_ols():
    # Find missing values and fill them in:
    df = pd.DataFrame({"A": [10, 20, 30, 324, 2353, np.nan],
                       "B": [20, 30, 10, 100, 2332, 2332],
                       "C": [0, np.nan, 120, 11, 2, 2 ]})
    newv,df2=df_impute_values_ols(df,'A',' B + C ',  verbose=True)
    print(df2)
    # Compare with a tolerance: exact float equality is fragile across platforms.
    assert np.isclose(df2.iloc[-1]['A'], 2357.5427562610648)
    assert df2.size==18

    # Can we handle some missing values which also have missing predictors?
    df = pd.DataFrame({"A": [10, 20, 30,     324, 2353, np.nan, np.nan],
                       "B": [20, 30, 10,     100, 2332, 2332,   2332],
                       "C": [0, np.nan, 120, 11,   2,    2,     np.nan ]})
    newv,df2=df_impute_values_ols(df,'A',' B + C + I(C**2) ',  verbose=True)
    print(df2)

    assert pd.isnull(  df2.iloc[-1]['A'] )
    assert np.isclose(df2.iloc[-2]['A'], 2352.999999999995)
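
The docstring above notes that the substring test used to build rhsv can go wrong when one column name is contained in another. A minimal sketch of one way to reduce that risk is to match column names only as whole identifiers in the model string. The helper name columns_in_formula is hypothetical and not part of statsmodels or of the function above, and it remains a heuristic rather than a real formula parser (for example, a column literally named I would still match I(...)).

```python
import re

def columns_in_formula(columns, model):
    """Return the columns that appear in `model` as whole identifiers.

    Hypothetical helper: a name is accepted only if it is not preceded by a
    word character or a dot (so "sin" in "np.sin" is skipped) and is not
    followed by a word character (so "x" does not match inside "x1").
    """
    found = []
    for col in columns:
        pattern = r"(?<![\w.])" + re.escape(col) + r"(?!\w)"
        if re.search(pattern, model):
            found.append(col)
    return found

print(columns_in_formula(["x", "x1", "sin", "y"],
                         " x1 + np.sin(x1) + I((x1-5)**2) "))
```

In df_impute_values_ols, the rhsv line could then be written as rhsv = columns_in_formula(adf.columns, model), assuming all column names are strings.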