How do I perform stepwise regression in Python? There is an OLS method in SciPy, but I can't make it stepwise. Any help here would be greatly appreciated. Thanks.
Edit: I am trying to build a linear regression model. I have 5 independent variables and am using forward stepwise regression; my goal is to select the variables so that my model has the lowest p-values. The following link explains the goal:
Thanks again.
Answer 0 (score: 9)
Answer 1 (score: 3)
You can perform forward-backward selection based on a statsmodels.api.OLS model, as shown in this answer.
However, this answer describes why you should not use stepwise selection for econometric models in the first place.
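For concreteness, here is a minimal sketch of the kind of p-value-based forward selection the linked answer describes (this is not the linked code itself; the function name, the 0.05 threshold, and the DataFrame inputs are all assumptions):

import pandas as pd
import statsmodels.api as sm

def forward_select(X, y, threshold_in=0.05):
    # Greedily add the candidate whose coefficient has the smallest
    # p-value, as long as that p-value stays below threshold_in.
    # X is a pandas DataFrame of predictors, y is the response.
    included = []
    while True:
        candidates = [c for c in X.columns if c not in included]
        if not candidates:
            break
        pvals = pd.Series(index=candidates, dtype=float)
        for c in candidates:
            model = sm.OLS(y, sm.add_constant(X[included + [c]])).fit()
            pvals[c] = model.pvalues[c]
        if pvals.min() < threshold_in:
            included.append(pvals.idxmin())  # accept the best candidate
        else:
            break  # no remaining candidate is significant
    return included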
答案 2 :(得分:2)
Statsmodels has additional methods for regression: http://statsmodels.sourceforge.net/devel/examples/generated/example_ols.html. I think it will help you implement stepwise regression.
答案 3 :(得分:1)
You can try mlxtend, which has a variety of selection methods.
from mlxtend.feature_selection import SequentialFeatureSelector as sfs
from sklearn.linear_model import LinearRegression

clf = LinearRegression()

# Build step forward feature selection
sfs1 = sfs(clf, k_features=10, forward=True, floating=False, scoring='r2', cv=5)

# Perform SFFS
sfs1 = sfs1.fit(X_train, y_train)
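After fitting, the chosen columns and the cross-validated score can be read off the fitted selector; per the mlxtend API these live in the k_feature_idx_ and k_score_ attributes:

# Inspect the result of the search
print(sfs1.k_feature_idx_)  # tuple of selected column indices
print(sfs1.k_score_)        # cross-validated r2 of that subset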
Answer 4 (score: 0)
"""Importing the api class from statsmodels"""
import statsmodels.formula.api as sm
"""X_opt variable has all the columns of independent variables of matrix X
in this case we have 5 independent variables"""
X_opt = X[:,[0,1,2,3,4]]
"""Running the OLS method on X_opt and storing results in regressor_OLS"""
regressor_OLS = sm.OLS(endog = y, exog = X_opt).fit()
regressor_OLS.summary()
With the summary method you can check the p-values of your variables, written as 'P>|t|'. Then look for the variable with the highest p-value. Suppose x3 has the highest value, e.g. 0.956. Then remove this column from your array and repeat all the steps.
X_opt = X[:, [0, 1, 3, 4]]
regressor_OLS = sm.OLS(endog=y, exog=X_opt).fit()
regressor_OLS.summary()
Repeat these steps until all columns with a p-value above the significance level (e.g. 0.05) have been removed. In the end, X_opt will contain only the optimal variables, i.e. those whose p-values are below the significance level.
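The manual procedure above is easy to automate. A minimal sketch under the same assumptions (X is a NumPy array whose columns include the intercept, y is the response; the function name and the 0.05 default are mine):

import numpy as np
import statsmodels.api as sm

def backward_elimination(X, y, sl=0.05):
    # Repeatedly drop the column with the largest p-value until
    # every remaining column is significant at level sl.
    cols = list(range(X.shape[1]))
    model = sm.OLS(endog=y, exog=X[:, cols]).fit()
    while len(cols) > 1:
        worst = int(np.argmax(model.pvalues))
        if model.pvalues[worst] <= sl:
            break
        del cols[worst]  # remove the least significant column
        model = sm.OLS(endog=y, exog=X[:, cols]).fit()
    return X[:, cols], model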
Answer 5 (score: 0)
Here is a method I just wrote that uses "mixed selection", as described in An Introduction to Statistical Learning. As input, it takes:
- lm, a statsmodels OLS fit, i.e. sm.OLS(Y, X).fit(), where X is an array of n data points and Y is the response in the training data
- curr_preds, a list that starts as ['const']
- potential_preds, a list of all the potential predictors. There also needs to be a pandas DataFrame X_mix that holds all of the data, including 'const' and the columns corresponding to the potential predictors
- tol, optional. The maximum p-value; .05 if not specified
def mixed_selection(lm, curr_preds, potential_preds, tol=.05):
    while len(potential_preds) > 0:
        index_best = -1  # this will record the index of the best predictor
        curr = -1  # this will record the current index
        best_r_squared = lm.rsquared_adj  # record the adjusted r-squared of the current model
        # loop to determine if any of the predictors can better the r-squared
        for pred in potential_preds:
            curr += 1  # increment current
            preds = curr_preds.copy()  # grab the current predictors
            preds.append(pred)
            # create a model with the current predictors plus one additional potential predictor
            lm_new = sm.OLS(y, X_mix[preds]).fit()
            new_r_sq = lm_new.rsquared_adj  # record r-squared for the new model
            if new_r_sq > best_r_squared:
                best_r_squared = new_r_sq
                index_best = curr
        if index_best != -1:  # a potential predictor improved the r-squared; move it from potential_preds to curr_preds
            curr_preds.append(potential_preds.pop(index_best))
        else:  # none of the remaining potential predictors improved the adjusted r-squared; exit loop
            break
        # refit using the new predictors and look at the p-values
        lm = sm.OLS(y, X_mix[curr_preds]).fit()
        pvals = lm.pvalues
        pval_too_big = []
        # make a list of all the p-values that are greater than the tolerance
        for feat in pvals.index:
            if pvals[feat] > tol and feat != 'const':  # if the p-value is too large, record the feature
                pval_too_big.append(feat)
        # now remove all the features from curr_preds whose p-value is too large
        for feat in pval_too_big:
            curr_preds.pop(curr_preds.index(feat))
        if pval_too_big:  # refit so the next iteration compares against the pruned model
            lm = sm.OLS(y, X_mix[curr_preds]).fit()
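A hypothetical call, assuming (as described above) a DataFrame X_mix whose columns are 'const' plus all potential predictors, and a response y:

# Start from the intercept-only model and let mixed selection grow it
curr = ['const']
potential = [c for c in X_mix.columns if c != 'const']
lm0 = sm.OLS(y, X_mix[curr]).fit()
mixed_selection(lm0, curr, potential, tol=0.05)
print(curr)  # mutated in place; now holds the selected predictors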
Answer 6 (score: 0)
Perhaps toad.selection.stepwise from the toad package can solve your problem.
Here is the GitHub link: https://github.com/amphibian-dev/toad
And an example from the GitHub page:
toad.selection.stepwise(data_woe, target='target', estimator='ols', direction='both', criterion='aic', exclude=to_drop)
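A sketch of how this might be wired up end to end (the CSV file name and 'target' column are made up; that stepwise returns the frame with only the surviving columns is my reading of the README):

# pip install toad
import pandas as pd
import toad

data = pd.read_csv('train.csv')  # hypothetical training data with a 'target' column
selected = toad.selection.stepwise(data, target='target', estimator='ols',
                                   direction='both', criterion='aic')
print(selected.columns)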
Hope this works!
Answer 7 (score: -1)
I developed this repository: https://github.com/xinhe97/StepwiseSelectionOLS
My stepwise-selection classes (best subset, forward stepwise, backward stepwise) are compatible with sklearn. You can run Pipeline and GridSearchCV with my classes.
The essential part of my code is as follows:
################### Criteria ###################
def processSubset(self, X, y, feature_index):
    # Fit model on feature_set and calculate rsq_adj
    regr = sm.OLS(y, X[:, feature_index]).fit()
    rsq_adj = regr.rsquared_adj
    bic = self.myBic(X.shape[0], regr.mse_resid, len(feature_index))
    rsq = regr.rsquared
    return {"model": regr, "rsq_adj": rsq_adj, "bic": bic, "rsq": rsq, "predictors_index": feature_index}
################### Forward Stepwise ###################
def forward(self, predictors_index, X, y):
    # Pull out predictors we still need to process
    remaining_predictors_index = [p for p in range(X.shape[1])
                                  if p not in predictors_index]
    results = []
    for p in remaining_predictors_index:
        new_predictors_index = predictors_index + [p]
        new_predictors_index.sort()
        results.append(self.processSubset(X, y, new_predictors_index))
    # Wrap everything up in a nice dataframe
    models = pd.DataFrame(results)
    # Choose the model with the highest rsq
    # best_model = models.loc[models['bic'].idxmin()]
    best_model = models.loc[models['rsq'].idxmax()]
    # Return the best model, along with the model's other information
    return best_model
def forwardK(self, X_est, y_est, fK):
    models_fwd = pd.DataFrame(columns=["model", "rsq_adj", "bic", "rsq", "predictors_index"])
    predictors_index = []

    M = min(fK, X_est.shape[1])
    for i in range(1, M + 1):
        print(i)
        models_fwd.loc[i] = self.forward(predictors_index, X_est, y_est)
        predictors_index = models_fwd.loc[i, 'predictors_index']

    print(models_fwd)
    # best_model_fwd = models_fwd.loc[models_fwd['bic'].idxmin(), 'model']
    best_model_fwd = models_fwd.loc[models_fwd['rsq'].idxmax(), 'model']
    # best_predictors = models_fwd.loc[models_fwd['bic'].idxmin(), 'predictors_index']
    best_predictors = models_fwd.loc[models_fwd['rsq'].idxmax(), 'predictors_index']
    return best_model_fwd, best_predictors
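Hypothetical usage of the forward pass, assuming selector is an instance of the repo's forward-stepwise class (the instance name is made up; forwardK's signature and return values are as shown above):

# Search up to 5 predictors and inspect the winner
best_model, best_idx = selector.forwardK(X_train, y_train, fK=5)
print(best_idx)              # column indices chosen by forward selection
print(best_model.summary())  # the fitted statsmodels OLS on those columns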