Python Sklearn Pipelines with GridSearchCV

Asked: 2016-04-12 00:40:41

Tags: python-3.x machine-learning scikit-learn classification pipeline

I'm playing around with GridSearchCV and Pipeline in sklearn. I'm finding inconsistencies, and I'm wondering if I'm just misunderstanding something.

The following two code blocks yield very different predictions (y_pred), and I was hoping someone could clarify why.

Code block 1:

# Note: cust_regression_vals, cust_txt_col, and the RMSE scorer are custom
# objects defined elsewhere in my script; sklearn imports shown for clarity.
from sklearn import grid_search, pipeline   # grid_search: pre-0.18 sklearn API
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion

rfr = RandomForestRegressor(n_estimators=1, n_jobs=-1, random_state=2016, verbose=1)
tfidf = TfidfVectorizer(ngram_range=(1, 1), stop_words='english')
tsvd = TruncatedSVD(n_components=10, random_state=2016)
clf = pipeline.Pipeline([
        ('union', FeatureUnion(
            transformer_list=[
                ('cst', cust_regression_vals()),
                ('txt1', pipeline.Pipeline([('s1', cust_txt_col(key='search_term')),
                                            ('tfidf1', tfidf), ('tsvd1', tsvd)]))
            ],
            transformer_weights={'cst': 1.0, 'txt1': 0.5},
        )),
        ('rfr', rfr)])  # <-- this step is removed in code block 2
param_grid = {'rfr__max_features': [10], 'rfr__max_depth': [20]}
model = grid_search.GridSearchCV(estimator=clf, param_grid=param_grid,
                                 cv=2, verbose=20, scoring=RMSE)
model.fit(X_train, y_train)
y_pred_orig = model.predict(X_test)

Code block 2:

# Same imports and custom objects as in code block 1; here the FeatureUnion
# is fit once on the full training set before the grid search runs.
rfr = RandomForestRegressor(n_estimators=1, n_jobs=-1, random_state=2016, verbose=1)
tfidf = TfidfVectorizer(ngram_range=(1, 1), stop_words='english')
tsvd = TruncatedSVD(n_components=10, random_state=2016)
clf = pipeline.Pipeline([
        ('union', FeatureUnion(
            transformer_list=[
                ('cst', cust_regression_vals()),
                ('txt1', pipeline.Pipeline([('s1', cust_txt_col(key='search_term')),
                                            ('tfidf1', tfidf), ('tsvd1', tsvd)]))
            ],
            transformer_weights={'cst': 1.0, 'txt1': 0.5},
        ))
    ])

X_train_trans = clf.fit_transform(X_train, y_train)
X_test_trans = clf.transform(X_test)

param_grid = {'max_features': [10], 'max_depth': [20]}
model = grid_search.GridSearchCV(estimator=rfr, param_grid=param_grid,
                                 cv=2, verbose=20, scoring=RMSE)
model.fit(X_train_trans, y_train)
y_pred_mod = model.predict(X_test_trans)

In the first code block, my RandomForestRegressor is built into the pipeline, whereas in the second code block I pulled it out and ran the transforms separately. Any ideas why y_pred_orig is vastly different from y_pred_mod, when, as far as I can tell, I'm tracing the same steps the pipeline would trace?
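One difference I can reproduce on toy data (this is my own synthetic sketch, not the code above, and it uses the modern sklearn.model_selection API rather than the old sklearn.grid_search): when the transformer sits inside GridSearchCV, it is re-fit on each CV training fold, whereas pre-transforming fits it once on all of X_train, so the held-out folds have already influenced the features. The fold-level scores therefore differ between the two setups:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

rng = np.random.RandomState(2016)
X = rng.rand(60, 20)
y = rng.rand(60)

# Setup 1: SVD inside the pipeline -> re-fit on each CV training fold.
pipe = Pipeline([('tsvd', TruncatedSVD(n_components=5, random_state=2016)),
                 ('rfr', RandomForestRegressor(n_estimators=1, random_state=2016))])
gs1 = GridSearchCV(pipe, {'rfr__max_depth': [5]}, cv=2,
                   scoring='neg_mean_squared_error').fit(X, y)

# Setup 2: SVD fit once on ALL of X before the search -> the CV folds
# are scored on features that already saw the held-out rows.
X_trans = TruncatedSVD(n_components=5, random_state=2016).fit_transform(X)
gs2 = GridSearchCV(RandomForestRegressor(n_estimators=1, random_state=2016),
                   {'max_depth': [5]}, cv=2,
                   scoring='neg_mean_squared_error').fit(X_trans, y)

print(gs1.best_score_, gs2.best_score_)  # the CV scores differ
```

This only explains differing cross-validation scores, though, not differing final predictions, since both setups refit on the full training set in the end.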

EDIT: For what it's worth, code block 1 yields a much lower test RMSE, whereas code block 2 yields a much lower train RMSE. This confuses me, since I expected both code blocks to yield identical results given that I'm using the same random_state everywhere.
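To check that expectation, here is a toy sketch (my own synthetic data, standard sklearn components only, modern import paths): with a fixed random_state, fitting the regressor inside a pipeline versus fitting it on pre-transformed features produces the same final predictions, which is exactly why the RMSE gap between my two real code blocks surprises me.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline

rng = np.random.RandomState(2016)
X_tr, X_te = rng.rand(50, 20), rng.rand(10, 20)
y_tr = rng.rand(50)

# Route 1: transformer and regressor fit together inside a Pipeline.
pipe = Pipeline([('tsvd', TruncatedSVD(n_components=5, random_state=2016)),
                 ('rfr', RandomForestRegressor(n_estimators=1, random_state=2016))])
pred_pipe = pipe.fit(X_tr, y_tr).predict(X_te)

# Route 2: transform first, then fit the regressor on the transformed data.
svd = TruncatedSVD(n_components=5, random_state=2016)
rfr = RandomForestRegressor(n_estimators=1, random_state=2016)
rfr.fit(svd.fit_transform(X_tr), y_tr)
pred_manual = rfr.predict(svd.transform(X_te))

print(np.allclose(pred_pipe, pred_manual))  # True on this toy data
```

So on clean toy data the two routes agree, which makes me suspect the discrepancy in my real code comes from the custom transformers or from how GridSearchCV clones and refits the pipeline.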

0 Answers:

No answers yet