I am trying to use statsmodels in Python to impute some values in a Pandas DataFrame.
The third and fourth attempts below (df2 and df3) give this error:
*** AttributeError: 'DataFrame' object has no attribute 'design_info'
This seems a strange error, since DataFrames never have such an attribute.
In any case, I do not understand what I should pass to predict() in order to predict the missing value of A in df2. It might also be nice if the df3 case gave me a prediction with np.nan as its last element.
import pandas as pd
import numpy as np
import statsmodels.formula.api as sm

# Complete data, linear terms only
df0 = pd.DataFrame({"A": [10, 20, 30, 324, 2353],
                    "B": [20, 30, 10, 100, 2332],
                    "C": [0, -30, 120, 11, 2]})
result0 = sm.ols(formula="A ~ B + C ", data=df0).fit()
print(result0.summary())
test0 = result0.predict(df0)  # works
print(test0)

# Complete data, with a transformed term
df1 = pd.DataFrame({"A": [10, 20, 30, 324, 2353],
                    "B": [20, 30, 10, 100, 2332],
                    "C": [0, -30, 120, 11, 2]})
result1 = sm.ols(formula="A ~ B+ I(C**2) ", data=df1).fit()
print(result1.summary())
test1 = result1.predict(df1)  # works
print(test1)

# Missing value in the outcome A
df2 = pd.DataFrame({"A": [10, 20, 30, 324, 2353, np.nan],
                    "B": [20, 30, 10, 100, 2332, 2332],
                    "C": [0, -30, 120, 11, 2, 2]})
result2 = sm.ols(formula="A ~ B + C", data=df2).fit()
print(result2.summary())
test2 = result2.predict(df2)  # Fails
newvals = df2[['B', 'C']].dropna()
test2 = result2.predict(newvals)  # Fails
test2 = result2.predict(dict([[vv, df2[vv].values] for vv in newvals.columns]))  # Fails

# Missing value in the predictor B
df3 = pd.DataFrame({"A": [10, 20, 30, 324, 2353, 2353],
                    "B": [20, 30, 10, 100, 2332, np.nan],
                    "C": [0, -30, 120, 11, 2, 2]})
result3 = sm.ols(formula="A ~ B + C", data=df3).fit()
print(result3.summary())
test3 = result3.predict(df3)  # Fails

Update using pre-release statsmodels
With the new release candidate for statsmodels 0.8, the df2 example above now works. However, the third (df3) example fails on
result3.predict(df3)
with
ValueError: Wrong number of items passed 5, placement implies 6
Dropping the last row, which contains the np.nan, and predicting on the remaining rows correctly predicts the rows for which a prediction can be made.
It would still be nice to have the option of passing the entire df3 and receiving np.nan as the prediction for the last row.
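The behaviour asked for here can be approximated by hand, without relying on any particular statsmodels option. The following is only a minimal sketch: it predicts on the rows whose predictors are all present and then reindexes the result to the full frame, so that the remaining rows come back as np.nan. The helper name predict_keep_nan and its predictors argument are hypothetical and are not part of statsmodels or pandas.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as sm


def predict_keep_nan(result, df, predictors):
    """Predict where all predictors are present; return NaN for other rows.

    Hypothetical helper for illustration only.
    """
    complete = df[predictors].dropna()  # rows that can actually be predicted
    predicted = np.asarray(result.predict(complete))
    # Align to the original index; rows dropped above become NaN.
    return pd.Series(predicted, index=complete.index).reindex(df.index)


df3 = pd.DataFrame({"A": [10, 20, 30, 324, 2353, 2353],
                    "B": [20, 30, 10, 100, 2332, np.nan],
                    "C": [0, -30, 120, 11, 2, 2]})
result3 = sm.ols(formula="A ~ B + C", data=df3).fit()  # the fit skips the NaN row
pred3 = predict_keep_nan(result3, df3, ["B", "C"])
print(pred3)
```

The result has one entry per row of df3, so it can be assigned straight back as a new column.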
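Answer 0 (score: 0)
Prompted by this question, here is the method I ended up with for filling in some values in a DataFrame using an arbitrary (OLS) model. It drops np.nans as needed before predicting.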
#!/usr/bin/python
import statsmodels.formula.api as sm
import pandas as pd
import numpy as np

def df_impute_values_ols(adf, outvar, model, verbose=True):
    """Specify a Pandas DataFrame with some null (eg. np.nan) values in column <outvar>.
    Specify a string model (in statsmodels format, which is like R) to use to
    predict them when they are missing. Nonlinear transformations can be
    specified in this string.

    e.g.: model=' x1 + np.sin(x1) + I((x1-5)**2) '

    At the moment, this uses OLS, so outvar should be continuous.

    Two dfs are returned: one containing just the updated rows and a
    subset of columns, and a version of the incoming DataFrame with some
    null values filled in (those that have the model variables), using
    single imputation.

    This is written to work with statsmodels 0.6.1
    (see https://github.com/statsmodels/statsmodels/issues/2171 ), ie this is
    written in order to avoid ANY NaN's in the modeldf. That should be less
    necessary in future versions.

    To do:
    - Add plots to verbose mode
    - Models other than OLS should be offered

    Issues:
    - the "horrid kluge" line below will give trouble if there are
      column names that are part of other column names. This kludge should be
      temporary, anyway, until statsmodels 0.8 is fixed and released.

    The latest version of this method will be at
    https://github.com/cpbl/cpblUtilities/ in stats/
    """
    formula = outvar + ' ~ ' + model
    # Columns whose names appear anywhere in the model string
    rhsv = [vv for vv in adf.columns if vv in model]  # This is a horrid kluge! Ne
    # Rows where outvar is missing but every model variable is present
    updateIndex = adf[pd.isnull(adf[outvar])][rhsv].dropna().index
    # Fit only on rows with no NaNs at all
    modeldf = adf[[outvar] + rhsv].dropna()
    results = sm.ols(formula, data=modeldf).fit()
    if verbose:
        print(results.summary())
    newvals = adf[pd.isnull(adf[outvar])][rhsv].dropna()
    newvals[outvar] = results.predict(newvals)
    # Note: this fills the values into the incoming DataFrame in place
    adf.loc[updateIndex, outvar] = newvals[outvar]
    if verbose:
        print(' %d rows updated for %s' % (len(newvals), outvar))
    return newvals, adf

def test_df_impute_values_ols():
    # Find missing values and fill them in:
    df = pd.DataFrame({"A": [10, 20, 30, 324, 2353, np.nan],
                       "B": [20, 30, 10, 100, 2332, 2332],
                       "C": [0, np.nan, 120, 11, 2, 2]})
    newv, df2 = df_impute_values_ols(df, 'A', ' B + C ', verbose=True)
    print(df2)
    # Compare floats with a tolerance rather than exact equality
    assert np.isclose(df2.iloc[-1]['A'], 2357.5427562610648)
    assert df2.size == 18

    # Can we handle some missing values which also have missing predictors?
    df = pd.DataFrame({"A": [10, 20, 30, 324, 2353, np.nan, np.nan],
                       "B": [20, 30, 10, 100, 2332, 2332, 2332],
                       "C": [0, np.nan, 120, 11, 2, 2, np.nan]})
    newv, df2 = df_impute_values_ols(df, 'A', ' B + C + I(C**2) ', verbose=True)
    print(df2)
    assert pd.isnull(df2.iloc[-1]['A'])
    assert np.isclose(df2.iloc[-2]['A'], 2352.999999999995)
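
The docstring above notes that the substring test used to build rhsv gives trouble when one column name is part of another. One possible alternative, sketched here on the assumption that column names are plain Python identifiers, is to match whole identifiers with a regular expression. The helper name columns_in_formula is hypothetical and is not taken from the repository mentioned above.

```python
import re


def columns_in_formula(columns, model):
    """Return the names in `columns` that occur in `model` as whole identifiers.

    Hypothetical helper for illustration; assumes column names are plain
    Python identifiers.
    """
    found = []
    for col in columns:
        # Not preceded by a word character or a dot (so np.sin is not a column),
        # and not followed by a word character (so "x" does not match "x1").
        pattern = r"(?<![\w.])" + re.escape(col) + r"(?!\w)"
        if re.search(pattern, model):
            found.append(col)
    return found


model = " x1 + np.sin(x1) + I((x1-5)**2) "
columns = ["x", "x1", "sin"]
substring_matches = [c for c in columns if c in model]  # the substring test
identifier_matches = columns_in_formula(columns, model)
print(substring_matches)
print(identifier_matches)
```

With these example columns the substring test picks up all three names, while the identifier match keeps only x1.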