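I want to select variables in a multiple regression analysis. I tried the code from http://planspace.org/20150423-forward_selection_with_statsmodels/. The problem is that I want to select from 50 variables, and it takes too much time. I used Numba to speed it up and wrote the following code: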
@jit
def forward_selected(data, response):
    """Linear model designed by forward selection.

    Parameters:
    -----------
    data : pandas DataFrame with all possible predictors and response

    response: string, name of response column in data

    Returns:
    --------
    model: an "optimal" fitted statsmodels linear model
           with an intercept
           selected by forward selection
           evaluated by adjusted R-squared
    """
    remaining = set(data.columns)
    remaining.remove(response)
    selected = [str]
    current_score, best_new_score = 0.0, 0.0
    while remaining and current_score == best_new_score:
        scores_with_candidates = [str]
        for candidate in remaining:
            formula = "{} ~ {} + 1".format(response,
                                           ' + '.join(selected + [candidate]))
            score = smf.ols(formula, data).fit().rsquared_adj
            scores_with_candidates.append((score, candidate))
        scores_with_candidates.sort()
        best_new_score, best_candidate = scores_with_candidates.pop()
        if current_score < best_new_score:
            remaining.remove(best_candidate)
            selected.append(best_candidate)
            current_score = best_new_score
    formula = "{} ~ {} + 1".format(response,
                                   ' + '.join(selected))
    model = smf.ols(formula, data).fit()
    return model

model = forward_selected(df, col)
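But it returns the following error:
TypeError: sequence item 0: expected str instance, type found
Please tell me how to fix it. If my question is unclear, I am happy to provide more information in the comments.
Traceback (most recent call last):
  File "~/PycharmProjects/anacondaenv/touhu_1.py", line 164, in <module>
    submit = forecast(col)
  File "~/PycharmProjects/anacondaenv/touhu_1.py", line 75, in forecast
    model = forward_selected(df, col)
TypeError: sequence item 0: expected str instance, type found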
Answer 0 (score: 2)
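I think one of the best ways to see whether numba is really doing anything is to try the njit decorator instead of jit. njit forces nopython mode and raises an error if anything would fall back to Python (which gives no speed benefit at all). Short answer: don't use anything other than np.ndarrays. So no strings, no tuples, no lists and NO calls to non-jitted functions.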
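So I fixed the error: numba does not seem to allow empty lists in the main function body... I am not sure why (maybe a bug?!), but it works if you move the list creation into the while block.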
import statsmodels.formula.api as smf
import numba as nb

@nb.jit
def forward_selected_nojit(data, response):
    """Linear model designed by forward selection.

    Parameters:
    -----------
    data : pandas DataFrame with all possible predictors and response

    response: string, name of response column in data

    Returns:
    --------
    model: an "optimal" fitted statsmodels linear model
           with an intercept
           selected by forward selection
           evaluated by adjusted R-squared
    """
    remaining = set(data.columns)
    remaining.remove(response)
    selected = None  # Changed this line
    current_score, best_new_score = 0.0, 0.0
    while remaining and current_score == best_new_score:
        if selected is None:  # Changed this and next line
            selected = []
        scores_with_candidates = []
        for candidate in remaining:
            formula = "{} ~ {} + 1".format(response,
                                           ' + '.join(selected + [candidate]))
            score = smf.ols(formula, data).fit().rsquared_adj
            scores_with_candidates.append((score, candidate))
        scores_with_candidates.sort()
        best_new_score, best_candidate = scores_with_candidates.pop()
        if current_score < best_new_score:
            remaining.remove(best_candidate)
            selected.append(best_candidate)
            current_score = best_new_score
    formula = "{} ~ {} + 1".format(response,
                                   ' + '.join(selected))
    model = smf.ols(formula, data).fit()
    return model
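
As a side note, the TypeError from the traceback can be reproduced in plain Python, without numba or statsmodels. The sketch below only rebuilds the formula string; `build_formula` is a hypothetical helper introduced for illustration, and the column names `sl` and `rk` are borrowed from the output shown in this thread. It suggests that swapping `[str]` (a list holding the type object `str`) for an empty list `[]` is what makes the message go away.

```python
response = "sl"   # column names borrowed from the formula printed in this thread
candidate = "rk"


def build_formula(selected):
    """Hypothetical helper: build the formula string the same way forward_selected does."""
    return "{} ~ {} + 1".format(response, " + ".join(selected + [candidate]))


# An empty list of selected predictors gives a valid formula.
ok_formula = build_formula([])
print(ok_formula)

# A list that contains the type object `str` fails inside str.join.
try:
    build_formula([str])
    error_message = None
except TypeError as exc:
    error_message = str(exc)
print(error_message)
```

The first call prints a usable formula, while the second raises the same TypeError that appears in the question.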
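The workaround above could probably be written more cleanly, but what matters is the timing. First, though, check that numba does not do anything strange to the result: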
# With numba
sl ~ rk + yr + 1
0.835190760538
# Without numba
sl ~ rk + yr + 1
0.835190760538
So the results are the same. Now let's see how the two versions perform:
# with numba
10 loops, best of 3: 264 ms per loop
# without numba
10 loops, best of 3: 252 ms per loop
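So it is exactly as I expected. When you use Python types and call external functions that are not jitted, you get no speed gain at all. You can speed things up with numba, but be sure to read the numba documentation carefully and check what is supported: Python types and Numpy Types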
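
For anyone who wants to repeat a timing comparison like the one above outside IPython, here is a minimal sketch using the standard library's `timeit` module. The `workload` function is a hypothetical stand-in for a call such as `forward_selected(df, col)`, and `best_per_loop` is a helper name made up for this example. When timing a jitted function, it is probably worth calling it once beforehand so that compilation time is not counted.

```python
import timeit


def workload():
    # Hypothetical stand-in for the function being timed,
    # e.g. a call such as forward_selected(df, col).
    return sum(i * i for i in range(10_000))


def best_per_loop(func, loops=10, repeats=3):
    """Return the best average time per call in seconds.

    Runs `func` `loops` times per measurement and takes
    `repeats` measurements, keeping the fastest one.
    """
    totals = timeit.repeat(func, number=loops, repeat=repeats)
    return min(totals) / loops


seconds = best_per_loop(workload)
print("10 loops, best of 3: {:.3g} ms per loop".format(seconds * 1e3))
```

Taking the minimum over several repeats reduces the influence of background noise on the machine, which is why the fastest measurement is reported rather than the mean.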