Python Sklearn Pipelines with GridSearchCV

Asked: 2016-04-12 00:40:41

Tags: python-3.x machine-learning scikit-learn classification pipeline

I'm playing around with GridSearchCV and Pipeline in sklearn. I'm finding inconsistencies, and I'm wondering if I'm just misunderstanding something.

The following two code blocks yield very different predictions (y_pred), and I was hoping someone could clarify why.

Code block 1:

# Note: cust_regression_vals, cust_txt_col, and the RMSE scorer are custom
# objects defined elsewhere in my script; sklearn imports shown for clarity.
from sklearn import grid_search, pipeline   # grid_search: pre-0.18 sklearn API
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion

rfr = RandomForestRegressor(n_estimators=1, n_jobs=-1, random_state=2016, verbose=1)
tfidf = TfidfVectorizer(ngram_range=(1, 1), stop_words='english')
tsvd = TruncatedSVD(n_components=10, random_state=2016)
clf = pipeline.Pipeline([
        ('union', FeatureUnion(
            transformer_list=[
                ('cst', cust_regression_vals()),
                ('txt1', pipeline.Pipeline([('s1', cust_txt_col(key='search_term')),
                                            ('tfidf1', tfidf), ('tsvd1', tsvd)]))
            ],
            transformer_weights={'cst': 1.0, 'txt1': 0.5},
        )),
        ('rfr', rfr)])  # <-- this step is removed in code block 2
param_grid = {'rfr__max_features': [10], 'rfr__max_depth': [20]}
model = grid_search.GridSearchCV(estimator=clf, param_grid=param_grid,
                                 cv=2, verbose=20, scoring=RMSE)
model.fit(X_train, y_train)
y_pred_orig = model.predict(X_test)

Code block 2:

# Same imports and custom objects as in code block 1; here the FeatureUnion
# is fit once on the full training set before the grid search runs.
rfr = RandomForestRegressor(n_estimators=1, n_jobs=-1, random_state=2016, verbose=1)
tfidf = TfidfVectorizer(ngram_range=(1, 1), stop_words='english')
tsvd = TruncatedSVD(n_components=10, random_state=2016)
clf = pipeline.Pipeline([
        ('union', FeatureUnion(
            transformer_list=[
                ('cst', cust_regression_vals()),
                ('txt1', pipeline.Pipeline([('s1', cust_txt_col(key='search_term')),
                                            ('tfidf1', tfidf), ('tsvd1', tsvd)]))
            ],
            transformer_weights={'cst': 1.0, 'txt1': 0.5},
        ))
    ])

X_train_trans = clf.fit_transform(X_train, y_train)
X_test_trans = clf.transform(X_test)

param_grid = {'max_features': [10], 'max_depth': [20]}
model = grid_search.GridSearchCV(estimator=rfr, param_grid=param_grid,
                                 cv=2, verbose=20, scoring=RMSE)
model.fit(X_train_trans, y_train)
y_pred_mod = model.predict(X_test_trans)

In the first code block, my RandomForestRegressor is built into the pipeline, whereas in the second code block I pulled it out and ran the transforms separately. Any ideas why y_pred_orig is vastly different from y_pred_mod, when, as far as I can tell, I'm tracing the same steps the pipeline would trace?
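One difference I can reproduce on toy data (this is my own synthetic sketch, not the code above, and it uses the modern sklearn.model_selection API rather than the old sklearn.grid_search): when the transformer sits inside GridSearchCV, it is re-fit on each CV training fold, whereas pre-transforming fits it once on all of X_train, so the held-out folds have already influenced the features. The fold-level scores therefore differ between the two setups:

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

rng = np.random.RandomState(2016)
X = rng.rand(60, 20)
y = rng.rand(60)

# Setup 1: SVD inside the pipeline -> re-fit on each CV training fold.
pipe = Pipeline([('tsvd', TruncatedSVD(n_components=5, random_state=2016)),
                 ('rfr', RandomForestRegressor(n_estimators=1, random_state=2016))])
gs1 = GridSearchCV(pipe, {'rfr__max_depth': [5]}, cv=2,
                   scoring='neg_mean_squared_error').fit(X, y)

# Setup 2: SVD fit once on ALL of X before the search -> the CV folds
# are scored on features that already saw the held-out rows.
X_trans = TruncatedSVD(n_components=5, random_state=2016).fit_transform(X)
gs2 = GridSearchCV(RandomForestRegressor(n_estimators=1, random_state=2016),
                   {'max_depth': [5]}, cv=2,
                   scoring='neg_mean_squared_error').fit(X_trans, y)

print(gs1.best_score_, gs2.best_score_)  # the CV scores differ
```

This only explains differing cross-validation scores, though, not differing final predictions, since both setups refit on the full training set in the end.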

EDIT: For what it's worth, code block 1 yields a much lower test RMSE, whereas code block 2 yields a much lower train RMSE. This confuses me, since I expected both code blocks to yield identical results given that I'm using the same random_state everywhere.
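To check that expectation, here is a toy sketch (my own synthetic data, standard sklearn components only, modern import paths): with a fixed random_state, fitting the regressor inside a pipeline versus fitting it on pre-transformed features produces the same final predictions, which is exactly why the RMSE gap between my two real code blocks surprises me.

```python
import numpy as np
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline

rng = np.random.RandomState(2016)
X_tr, X_te = rng.rand(50, 20), rng.rand(10, 20)
y_tr = rng.rand(50)

# Route 1: transformer and regressor fit together inside a Pipeline.
pipe = Pipeline([('tsvd', TruncatedSVD(n_components=5, random_state=2016)),
                 ('rfr', RandomForestRegressor(n_estimators=1, random_state=2016))])
pred_pipe = pipe.fit(X_tr, y_tr).predict(X_te)

# Route 2: transform first, then fit the regressor on the transformed data.
svd = TruncatedSVD(n_components=5, random_state=2016)
rfr = RandomForestRegressor(n_estimators=1, random_state=2016)
rfr.fit(svd.fit_transform(X_tr), y_tr)
pred_manual = rfr.predict(svd.transform(X_te))

print(np.allclose(pred_pipe, pred_manual))  # True on this toy data
```

So on clean toy data the two routes agree, which makes me suspect the discrepancy in my real code comes from the custom transformers or from how GridSearchCV clones and refits the pipeline.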

0 Answers:

No answers yet