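I have a loop that increases the degree of the polynomial features on each iteration. Currently the loop overwrites the model variable, so when the loop finishes I can only access the last model object I created: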
logitCV = LogisticRegressionCV(class_weight='balanced', random_state=42, cv=5, scoring='accuracy')
comparisons = pd.DataFrame(columns = 'model data accuracy'.split())
dims = np.arange(1,4,1)
for i in dims:
    poly = PolynomialFeatures(degree=i,include_bias=False)
    X_poly_train = poly.fit_transform(X_train)
    X_poly_test = poly.fit_transform(X_test)
    model = logitCV.fit(X_poly_train, y_train)
    train_score = model.score(X_poly_train,y_train)
    test_score = model.score(X_poly_test,y_test)
    model_name = 'dims_{}'.format(i)
    add_train = [model_name,'train',train_score]
    comparisons.loc[len(comparisons)] = add_train
    add_test = [model_name,'test',test_score]
    comparisons.loc[len(comparisons)] = add_test
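Ideally, this would return one model object for each set of features I use. In the case above there are three models (y = X; y = X+X^2; y = X+X^2+X^3), so three model objects (model_1; model_2; model_3) should be accessible when the loop ends.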
Thanks for your help!
Answer 0 (score: 2)
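What you are trying to do is called a grid search. Sklearn has a built-in class, GridSearchCV, for exactly this purpose. Although you won't get back a list of models, you can inspect the results for each candidate and access the best-performing model. To combine it with PolynomialFeatures, I would also recommend using a Pipeline. For example: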
import numpy as np
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures
iris = load_iris()
X_train, X_test, y_train, y_test = train_test_split(iris.data, iris.target)
# Chain the feature expansion and the classifier so both are refit per candidate
pipe = Pipeline(steps=[('poly', PolynomialFeatures()), ('lr', LogisticRegression())])
# '<step name>__<parameter>' addresses a parameter of a pipeline step
params = {'poly__degree': np.arange(1, 4)}
gs = GridSearchCV(pipe, params, return_train_score=True)
gs.fit(X_train, y_train)
GridSearchCV(cv=None, error_score='raise',
estimator=Pipeline(memory=None,
steps=[('poly', PolynomialFeatures(degree=2, include_bias=True, interaction_only=False)), ('lr', LinearRegression(copy_X=True, fit_intercept=True, n_jobs=1, normalize=False))]),
fit_params=None, iid=True, n_jobs=1,
param_grid={'poly__degree': [1, 2, 3, 4]}, pre_dispatch='2*n_jobs',
refit=True, return_train_score='warn', scoring=None, verbose=0)
gs.cv_results_
{'mean_fit_time': array([0.00133387, 0.00099603, 0.00133324, 0.00199993]),
'mean_score_time': array([0.00066773, 0.00099413, 0.00100025, 0.00100017]),
'mean_test_score': array([ 0.90775274, 0.91685398, 0.80601582, -40.54378954]),
'mean_train_score': array([0.92144066, 0.95029226, 0.95571164, 0.98727079]),
'param_poly__degree': masked_array(data=[1, 2, 3, 4],
mask=[False, False, False, False],
fill_value='?',
dtype=object),
'params': [{'poly__degree': 1},
{'poly__degree': 2},
{'poly__degree': 3},
{'poly__degree': 4}],
'rank_test_score': array([2, 1, 3, 4]),
'split0_test_score': array([ 0.88284837, 0.88510265, 0.73325603, -10.01478051]),
'split0_train_score': array([0.93086987, 0.96444943, 0.98005722, 0.99820903]),
'split1_test_score': array([ 0.92250837, 0.9227331 , 0.88028476, -12.49501116]),
'split1_train_score': array([0.91665687, 0.94718893, 0.96290854, 0.99867128]),
'split2_test_score': array([ 0.91857458, 0.94358434, 0.80647314, -99.9466853 ]),
'split2_train_score': array([0.91679523, 0.93923843, 0.92416916, 0.96493206]),
'std_fit_time': array([4.70942072e-04, 5.50718821e-06, 4.71538951e-04, 1.12391596e-07]),
'std_score_time': array([4.72159663e-04, 7.86741172e-06, 1.12391596e-07, 1.94667955e-07]),
'std_test_score': array([1.79179093e-02, 2.42798791e-02, 6.01535692e-02, 4.17355600e+01]),
'std_train_score': array([0.0066677 , 0.01052367, 0.02337684, 0.01579699])}
gs.best_estimator_
Pipeline(memory=None,
steps=[('poly', PolynomialFeatures(degree=2, include_bias=True, interaction_only=False)), ('lr', LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
intercept_scaling=1, max_iter=100, multi_class='ovr', n_jobs=1,
penalty='l2', random_state=None, solver='liblinear', tol=0.0001,
verbose=0, warm_start=False))])
gs.best_estimator_.score(X_test, y_test)
0.9736842105263158
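Since the grid search does not hand back one model per degree, the per-degree comparison has to come from cv_results_. A minimal sketch of that, assuming pandas is available; the fixed random_state, cv=5, max_iter=5000 and include_bias=False are my own choices for the illustration and are not part of the example above:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import PolynomialFeatures

X, y = load_iris(return_X_y=True)
# random_state is fixed only so the split is repeatable (an assumption)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

pipe = Pipeline(steps=[
    ("poly", PolynomialFeatures(include_bias=False)),
    # max_iter is raised to give the solver room on the expanded features
    ("lr", LogisticRegression(max_iter=5000)),
])
gs = GridSearchCV(pipe, {"poly__degree": [1, 2, 3]}, cv=5,
                  return_train_score=True)
gs.fit(X_train, y_train)

# cv_results_ is a dict of equal-length arrays, one entry per candidate,
# so it loads directly into a DataFrame with one row per degree
results = pd.DataFrame(gs.cv_results_)
comparison = results[["param_poly__degree", "mean_train_score",
                      "mean_test_score", "rank_test_score"]]
print(comparison)
```

The row with rank_test_score equal to 1 corresponds to gs.best_params_, and gs.best_estimator_ is that pipeline refit on the whole training set.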
Answer 1 (score: 0)
I suggest creating a list of models and appending the model from each loop iteration to it.
Example:
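from sklearn.base import clone

models = []
# ...
for i in dims:
    # ...
    # clone() gives each iteration its own unfitted copy. logitCV.fit returns
    # logitCV itself, so without the copy every list entry would be the same
    # object, refit on the last degree.
    model = clone(logitCV).fit(X_poly_train, y_train)
    models.append(model)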
Then, once the loop has finished, you will be able to access every model in that list.
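For a version that can be run end to end, here is a rough sketch of the same idea. It assumes the iris data in place of the asker's X_train and y_train, stores the models in a dict keyed by the names from the question (model_1, model_2, model_3), and wraps each degree in a pipeline so the matching PolynomialFeatures step travels with its classifier. The max_iter value and the use of make_pipeline are my own additions:

```python
from sklearn.base import clone
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Stand-in data; replace with your own train/test split
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Unfitted template; max_iter is raised only to help the solver converge
base = LogisticRegressionCV(class_weight="balanced", random_state=42,
                            cv=5, scoring="accuracy", max_iter=2000)

models = {}
for degree in range(1, 4):
    # clone() yields an independent estimator, so each entry keeps its own fit
    pipe = make_pipeline(
        PolynomialFeatures(degree=degree, include_bias=False),
        clone(base),
    )
    models[f"model_{degree}"] = pipe.fit(X_train, y_train)

for name, fitted in models.items():
    # The pipeline applies the right polynomial expansion before scoring
    print(name, fitted.score(X_train, y_train), fitted.score(X_test, y_test))
```

Because each pipeline expands the features itself, models["model_2"].predict(X_test) can be called on the raw test data without repeating the transform by hand.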