Question

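I want to select variables for a multiple regression analysis. I tried the code at http://planspace.org/20150423-forward_selection_with_statsmodels/. The problem is that I need to choose from 50 variables and it takes too long. I tried Numba to speed it up and wrote the following code: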

import statsmodels.formula.api as smf
from numba import jit

@jit
def forward_selected(data, response):
    """Linear model designed by forward selection.

    Parameters:
    -----------
    data : pandas DataFrame with all possible predictors and response

    response: string, name of response column in data

    Returns:
    --------
    model: an "optimal" fitted statsmodels linear model
           with an intercept
           selected by forward selection
           evaluated by adjusted R-squared
    """
    remaining = set(data.columns)
    remaining.remove(response)
    selected = [str]
    current_score, best_new_score = 0.0, 0.0
    while remaining and current_score == best_new_score:
        scores_with_candidates = [str]
        for candidate in remaining:
            formula = "{} ~ {} + 1".format(response,
                                           ' + '.join(selected + [candidate]))
            score = smf.ols(formula, data).fit().rsquared_adj
            scores_with_candidates.append((score, candidate))
        scores_with_candidates.sort()
        best_new_score, best_candidate = scores_with_candidates.pop()
        if current_score < best_new_score:
            remaining.remove(best_candidate)
            selected.append(best_candidate)
            current_score = best_new_score
    formula = "{} ~ {} + 1".format(response,
                                   ' + '.join(selected))
    model = smf.ols(formula, data).fit()
    return model

model = forward_selected(df, col)

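But it returns the following error:

TypeError: sequence item 0: expected str instance, type found

Please tell me how to fix it. If anything about my question is unclear, I am happy to provide more information in the comments.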

Traceback (most recent call last):
  File "~/PycharmProjects/anacondaenv/touhu_1.py", line 164, in <module>
    submit = forecast(col)
  File "~/PycharmProjects/anacondaenv/touhu_1.py", line 75, in forecast
    model = forward_selected(df, col)
TypeError: sequence item 0: expected str instance, type found

Answer 1

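I think one of the best ways to see whether numba is really doing anything is to try the njit decorator instead of jit. njit forces nopython mode and fails if anything would fall back to Python (object mode gives no speed benefit at all). Short answer: don't use anything other than np.ndarrays. So no strings, no tuples, no lists and NO calls to non-jitted functions.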

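So I fixed the error: numba does not seem to allow an empty list in the main function body... I'm not sure why (maybe a bug?!), but it works if you move it into the while block.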

import statsmodels.formula.api as smf
import numba as nb

@nb.jit
def forward_selected_nojit(data, response):
    """Linear model designed by forward selection.

    Parameters:
    -----------
    data : pandas DataFrame with all possible predictors and response

    response: string, name of response column in data

    Returns:
    --------
    model: an "optimal" fitted statsmodels linear model
           with an intercept
           selected by forward selection
           evaluated by adjusted R-squared
    """
    remaining = set(data.columns)
    remaining.remove(response)
    selected = None  # Changed this line
    current_score, best_new_score = 0.0, 0.0
    while remaining and current_score == best_new_score:
        if selected is None:  # Changed this and next line
            selected = []
        scores_with_candidates = []
        for candidate in remaining:
            formula = "{} ~ {} + 1".format(response,
                                           ' + '.join(selected + [candidate]))
            score = smf.ols(formula, data).fit().rsquared_adj
            scores_with_candidates.append((score, candidate))
        scores_with_candidates.sort()
        best_new_score, best_candidate = scores_with_candidates.pop()
        if current_score < best_new_score:
            remaining.remove(best_candidate)
            selected.append(best_candidate)
            current_score = best_new_score
    formula = "{} ~ {} + 1".format(response,
                                   ' + '.join(selected))
    model = smf.ols(formula, data).fit()
    return model
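As a side note, the exact message from the question can be reproduced in plain Python, without numba or statsmodels. The question's code initialises `selected = [str]`, which is a list containing the built-in type `str` rather than an empty list, and `str.join` rejects any item that is not a string. The snippet below is only a minimal sketch of that behaviour; the column name `"rk"` is borrowed from the output shown in this answer purely for illustration.

```python
# A list holding the built-in type `str` itself, as in the question's code.
selected = [str]
candidate = "rk"  # illustrative column name

try:
    " + ".join(selected + [candidate])
except TypeError as exc:
    # join() refuses the type object at position 0.
    message = str(exc)
    print(message)

# Starting from an empty list, as the code above does, avoids the problem.
selected = []
formula_rhs = " + ".join(selected + [candidate])
print(formula_rhs)
```

The same applies to `scores_with_candidates = [str]` in the question: once real `(score, candidate)` tuples are appended, sorting the list would have to compare a tuple with a type object, which Python 3 also refuses.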

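This could probably be solved in a nicer way, but what matters here is the timings. First, though, a check that numba is not doing anything weird: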

# With numba
sl ~ rk + yr + 1
0.835190760538

# Without numba
sl ~ rk + yr + 1
0.835190760538

So the results are the same. Now let's see how they perform:

# with numba
10 loops, best of 3: 264 ms per loop

# without numba
10 loops, best of 3: 252 ms per loop

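So it is exactly as I expected. When you use Python types and call non-jitted external functions, you get no speed gain. You can speed things up with numba, but be sure to read the numba documentation carefully and check what is supported: Python types and NumPy types.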
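To make the "only np.ndarrays" advice concrete, here is a rough sketch of the same greedy procedure written against plain NumPy arrays. It is not the code from the linked post and not a drop-in replacement for it: it swaps `smf.ols` for `np.linalg.lstsq`, works with column indices instead of column names, and the function names `adjusted_r2` and `forward_select` are made up for this illustration. It is not timed here, so treat any speed benefit as something to measure on your own data.

```python
import numpy as np


def adjusted_r2(X, y):
    """Adjusted R-squared of an OLS fit of y on X plus an intercept."""
    n = y.shape[0]
    design = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    resid = y - design @ coef
    ss_res = resid @ resid
    ss_tot = ((y - y.mean()) ** 2).sum()
    p = design.shape[1] - 1  # predictors, excluding the intercept
    r2 = 1.0 - ss_res / ss_tot
    return 1.0 - (1.0 - r2) * (n - 1) / (n - p - 1)


def forward_select(X, y):
    """Greedily add the column that most improves adjusted R-squared."""
    remaining = list(range(X.shape[1]))
    selected = []
    current = 0.0
    while remaining:
        scores = [(adjusted_r2(X[:, selected + [j]], y), j) for j in remaining]
        best_score, best_j = max(scores)
        if best_score <= current:
            break  # no candidate improves the score any further
        selected.append(best_j)
        remaining.remove(best_j)
        current = best_score
    return selected, current


# Synthetic data: only columns 0 and 3 actually drive the response.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 3] + rng.normal(scale=0.1, size=200)

selected, score = forward_select(X, y)
print(selected, score)
```

Working on arrays skips formula parsing and DataFrame handling on every fit, and it leaves the numerical core in a form that is closer to what numba's nopython mode supports, although whether `np.linalg.lstsq` compiles in your numba version is something to check in its documentation.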

Numba error "sequence item 0: expected str instance, type found"
