Stepwise Regression in Python

Date: 2013-03-15 13:10:25

Tags: python scipy regression

How can I perform stepwise regression in Python? SciPy has a method for OLS, but I can't do it stepwise. Any help in this regard would be greatly appreciated. Thanks.

Edit: I am trying to build a linear regression model. I have 5 independent variables and I'm using forward stepwise regression; my goal is to select the variables such that my model has the lowest p-values. The following link explains the goal:

https://www.google.co.in/url?sa=t&rct=j&q=&esrc=s&source=web&cd=4&ved=0CEAQFjAD&url=http%3A%2F%2Fbusiness.fullerton.edu%2Fisds%2Fjlawrence%2FStat-On-Line%2FExcel%2520Notes%2FExcel%2520Notes%2520-%2520STEPWISE%2520REGRESSION.doc&ei=YjKsUZzXHoPwrQfGs4GQCg&usg=AFQjCNGDaQ7qRhyBaQCmLeO4OD2RVkUhzw&bvm=bv.47244034,d.bmk

Thanks again.

8 Answers:

Answer 0 (score: 9)

Trevor Smith and I wrote a little forward selection function for linear regression with statsmodels: http://planspace.org/20150423-forward_selection_with_statsmodels/ You could easily modify it to minimize a p-value, or to select based on beta p-values with just a little more work.
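A sketch along the lines of that post (forward selection maximizing adjusted R-squared; this follows the post's approach but is not a verbatim copy):

import statsmodels.formula.api as smf

def forward_selected(data, response):
    """Forward selection by adjusted R-squared.
    data: pandas DataFrame with all columns; response: name of the response column."""
    remaining = set(data.columns) - {response}
    selected = []
    current_score, best_new_score = 0.0, 0.0
    while remaining and current_score == best_new_score:
        scores_with_candidates = []
        for candidate in remaining:
            formula = "{} ~ {} + 1".format(response, ' + '.join(selected + [candidate]))
            score = smf.ols(formula, data).fit().rsquared_adj
            scores_with_candidates.append((score, candidate))
        scores_with_candidates.sort()
        best_new_score, best_candidate = scores_with_candidates.pop()
        if current_score < best_new_score:
            remaining.remove(best_candidate)
            selected.append(best_candidate)
            current_score = best_new_score
    formula = "{} ~ {} + 1".format(response, ' + '.join(selected))
    return smf.ols(formula, data).fit()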

Answer 1 (score: 3)

You can do forward-backward selection based on a statsmodels OLS model, as shown in this answer.
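For reference, that style of selection looks roughly like this (a minimal sketch assuming pandas input; the function and threshold parameter names here are illustrative, not taken from the linked answer):

import pandas as pd
import statsmodels.api as sm

def pvalue_forward_backward(X, y, threshold_in=0.05, threshold_out=0.10):
    # X: pandas DataFrame of candidate predictors; y: response
    included = []
    while True:
        changed = False
        # forward step: add the candidate with the smallest p-value below threshold_in
        excluded = [c for c in X.columns if c not in included]
        new_pvals = pd.Series(index=excluded, dtype=float)
        for col in excluded:
            model = sm.OLS(y, sm.add_constant(X[included + [col]])).fit()
            new_pvals[col] = model.pvalues[col]
        if len(new_pvals) > 0 and new_pvals.min() < threshold_in:
            included.append(new_pvals.idxmin())
            changed = True
        # backward step: drop the predictor with the largest p-value above threshold_out
        if included:
            pvals = sm.OLS(y, sm.add_constant(X[included])).fit().pvalues.iloc[1:]
            if pvals.max() > threshold_out:
                included.remove(pvals.idxmax())
                changed = True
        if not changed:
            break
    return included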

However, this answer describes why you should not use stepwise selection for econometric models in the first place.

Answer 2 (score: 2)

Statsmodels has additional methods for regression: http://statsmodels.sourceforge.net/devel/examples/generated/example_ols.html. I think it will help you implement stepwise regression.
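For example, a basic OLS fit with statsmodels, whose summary shows the per-coefficient p-values you would inspect at each step (the data here is illustrative):

import numpy as np
import statsmodels.api as sm

# illustrative data; replace with your own X and y
X = np.random.randn(100, 5)
y = X[:, 0] + np.random.randn(100)

model = sm.OLS(y, sm.add_constant(X)).fit()
print(model.summary())  # p-values appear in the 'P>|t|' column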

Answer 3 (score: 1)

You can try mlxtend, which has a variety of selection methods.

from mlxtend.feature_selection import SequentialFeatureSelector as sfs
from sklearn.linear_model import LinearRegression

clf = LinearRegression()

# Build step forward feature selection
sfs1 = sfs(clf, k_features=10, forward=True, floating=False, scoring='r2', cv=5)

# Perform sequential forward selection on the training data
sfs1 = sfs1.fit(X_train, y_train)
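After fitting, the chosen feature subset and its cross-validated score can be read off the fitted selector:

print(sfs1.k_feature_idx_)  # indices of the selected features
print(sfs1.k_score_)        # cross-validated score of the selected subset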

Answer 4 (score: 0)

"""Importing the api class from statsmodels"""
import statsmodels.formula.api as sm

"""X_opt variable has all the columns of independent variables of matrix X 
in this case we have 5 independent variables"""
X_opt = X[:,[0,1,2,3,4]]

"""Running the OLS method on X_opt and storing results in regressor_OLS"""
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()

Using the summary method, you can inspect the p-values of your variables in the console, shown in the column written as 'P>|t|'. Then find the variable with the highest p-value. Suppose x3 has the highest value, e.g. 0.956. Then remove this column from your array and repeat all the steps.

X_opt = X[:, [0, 1, 3, 4]]
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()

Repeat these steps until you have removed all columns whose p-value is above the significance level (e.g. 0.05). In the end, the variable X_opt will hold all the optimal variables, with p-values below the significance level. A loop that automates these rounds is sketched below.
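A sketch of that loop (the 0.05 cutoff and the helper name backward_elimination are illustrative):

import numpy as np
import statsmodels.api as sm

def backward_elimination(X, y, sl=0.05):
    cols = list(range(X.shape[1]))
    while True:
        regressor_OLS = sm.OLS(endog=y, exog=X[:, cols]).fit()
        pvalues = regressor_OLS.pvalues
        worst = int(np.argmax(pvalues))
        if pvalues[worst] > sl:
            cols.pop(worst)  # drop the column with the highest p-value and refit
        else:
            return cols, regressor_OLS

kept_cols, regressor_OLS = backward_elimination(X, y)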

Answer 5 (score: 0)

Here's a method I just wrote that uses "mixed selection" as described in An Introduction to Statistical Learning. As input, it takes:

  • lm, a statsmodels.OLS.fit(Y, X), where X is an array of n ones (n being the number of data points) and Y is the response in the training data

  • curr_preds, a list containing ['const']

  • potential_preds, a list of all potential predictors. There also needs to be a pandas DataFrame X_mix containing all of the data (including 'const') as well as all of the data corresponding to the potential predictors

  • tol, optional. The maximum p-value; .05 if not specified

import statsmodels.api as sm

def mixed_selection(lm, curr_preds, potential_preds, tol=.05):
  # NOTE: y and X_mix are assumed to be in scope, as described above
  while len(potential_preds) > 0:
    index_best = -1  # this will record the index of the best predictor
    curr = -1  # this will record the current index
    best_r_squared = lm.rsquared_adj  # record the adjusted r-squared of the current model
    # loop to determine if any of the predictors can better the adjusted r-squared
    for pred in potential_preds:
      curr += 1  # increment current
      preds = curr_preds.copy()  # grab the current predictors
      preds.append(pred)
      # create a model with the current predictors plus one additional potential predictor
      lm_new = sm.OLS(y, X_mix[preds]).fit()
      new_r_sq = lm_new.rsquared_adj  # record the adjusted r-squared of the new model
      if new_r_sq > best_r_squared:
        best_r_squared = new_r_sq
        index_best = curr

    if index_best != -1:
      # a potential predictor improved the adjusted r-squared;
      # remove it from potential_preds and add it to curr_preds
      curr_preds.append(potential_preds.pop(index_best))
    else:
      # none of the remaining potential predictors improved the adjusted r-squared; exit the loop
      break

    # refit using the new predictors and look at the p-values
    lm = sm.OLS(y, X_mix[curr_preds]).fit()
    pvals = lm.pvalues
    # make a list of all the predictors whose p-value is greater than the tolerance
    pval_too_big = [feat for feat in pvals.index
                    if pvals[feat] > tol and feat != 'const']

    # now remove from curr_preds all the features whose p-value is too large,
    # then refit so the next iteration compares against the updated model
    for feat in pval_too_big:
      curr_preds.remove(feat)
    if pval_too_big:
      lm = sm.OLS(y, X_mix[curr_preds]).fit()

  return curr_preds
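A quick usage sketch (the data here is made up; per the description above, y and X_mix need to be in scope when the function runs):

import numpy as np
import pandas as pd
import statsmodels.api as sm

# made-up data: X_mix holds 'const' plus all candidate predictors
rng = np.random.default_rng(0)
X_mix = pd.DataFrame(rng.standard_normal((100, 3)), columns=['x1', 'x2', 'x3'])
X_mix['const'] = 1.0
y = 2 * X_mix['x1'] + rng.standard_normal(100)

lm = sm.OLS(y, X_mix[['const']]).fit()  # start from the intercept-only model
print(mixed_selection(lm, ['const'], ['x1', 'x2', 'x3']))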

Answer 6 (score: 0)

Perhaps toad.selection.stepwise from the toad package can solve your problem.

Here is the GitHub link: https://github.com/amphibian-dev/toad

And an example from the GitHub page:

toad.selection.stepwise(data_woe, target='target', estimator='ols', direction='both', criterion='aic', exclude=to_drop)
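If it behaves as in the toad README, the call returns the DataFrame restricted to the selected features, so the result is typically captured:

final_data = toad.selection.stepwise(data_woe, target='target', estimator='ols', direction='both', criterion='aic', exclude=to_drop)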

Hope this works!

Answer 7 (score: -1)

I developed this repository: https://github.com/xinhe97/StepwiseSelectionOLS

My stepwise selection classes (best subset, forward stepwise, backward stepwise) are compatible with sklearn, so you can use them with Pipeline and GridSearchCV.

The essential part of my code is as follows:

################### Criteria ###################
# assumes: import statsmodels.api as sm; import pandas as pd
def processSubset(self, X, y, feature_index):
    # Fit model on feature_set and calculate rsq_adj
    regr = sm.OLS(y,X[:,feature_index]).fit()
    rsq_adj = regr.rsquared_adj
    bic = self.myBic(X.shape[0], regr.mse_resid, len(feature_index))
    rsq = regr.rsquared
    return {"model":regr, "rsq_adj":rsq_adj, "bic":bic, "rsq":rsq, "predictors_index":feature_index}

################### Forward Stepwise ###################
def forward(self,predictors_index,X,y):
    # Pull out predictors we still need to process
    remaining_predictors_index = [p for p in range(X.shape[1])
                            if p not in predictors_index]

    results = []
    for p in remaining_predictors_index:
        new_predictors_index = predictors_index+[p]
        new_predictors_index.sort()
        results.append(self.processSubset(X,y,new_predictors_index))
    # Wrap everything up in a nice dataframe
    models = pd.DataFrame(results)
    # Choose the model with the highest rsq
    # best_model = models.loc[models['bic'].idxmin()]
    best_model = models.loc[models['rsq'].idxmax()]
    # Return the best model, along with the model's other information
    return best_model

def forwardK(self,X_est,y_est, fK):
    models_fwd = pd.DataFrame(columns=["model", "rsq_adj", "bic", "rsq", "predictors_index"])
    predictors_index = []

    M = min(fK,X_est.shape[1])

    for i in range(1,M+1):
        print(i)
        models_fwd.loc[i] = self.forward(predictors_index,X_est,y_est)
        predictors_index = models_fwd.loc[i,'predictors_index']

    print(models_fwd)
    # best_model_fwd = models_fwd.loc[models_fwd['bic'].idxmin(),'model']
    best_model_fwd = models_fwd.loc[models_fwd['rsq'].idxmax(),'model']
    # best_predictors = models_fwd.loc[models_fwd['bic'].idxmin(),'predictors_index']
    best_predictors = models_fwd.loc[models_fwd['rsq'].idxmax(),'predictors_index']
    return best_model_fwd, best_predictors
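Given the sklearn compatibility mentioned above, usage would look something like the following sketch (the import path, the class name ForwardStepwiseOLS, and the fK parameter are hypothetical; check the repository for the actual names):

from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from StepwiseSelectionOLS import ForwardStepwiseOLS  # hypothetical import; see the repository

pipe = Pipeline([('selector', ForwardStepwiseOLS())])
grid = GridSearchCV(pipe, param_grid={'selector__fK': [3, 5, 10]}, cv=5)
grid.fit(X, y)  # X, y: your design matrix and response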