I'm playing around with GridSearchCV and Pipeline in sklearn. I'm seeing inconsistent results, and I'm wondering whether I'm just misunderstanding something. The following two code blocks yield very different predictions (y_pred), and I was hoping someone could clarify why.
Code block 1:
# Shared imports (old sklearn API, where GridSearchCV lives in sklearn.grid_search).
from sklearn import pipeline, grid_search
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.decomposition import TruncatedSVD
from sklearn.pipeline import FeatureUnion

# cust_regression_vals, cust_txt_col and the RMSE scorer are my own custom
# transformers/scorer; X_train, y_train, X_test are defined earlier (all omitted for brevity).

rfr = RandomForestRegressor(n_estimators=1, n_jobs=-1, random_state=2016, verbose=1)
tfidf = TfidfVectorizer(ngram_range=(1, 1), stop_words='english')
tsvd = TruncatedSVD(n_components=10, random_state=2016)

clf = pipeline.Pipeline([
    ('union', FeatureUnion(
        transformer_list=[
            ('cst', cust_regression_vals()),
            ('txt1', pipeline.Pipeline([('s1', cust_txt_col(key='search_term')),
                                        ('tfidf1', tfidf), ('tsvd1', tsvd)]))
        ],
        transformer_weights={
            'cst': 1.0,
            'txt1': 0.5
        },
    )),
    ('rfr', rfr)])  # <-- this final step is removed in code block 2

param_grid = {'rfr__max_features': [10], 'rfr__max_depth': [20]}
model = grid_search.GridSearchCV(estimator=clf, param_grid=param_grid, cv=2, verbose=20, scoring=RMSE)
model.fit(X_train, y_train)
y_pred_orig = model.predict(X_test)
Code block 2:
rfr = RandomForestRegressor(n_estimators=1, n_jobs=-1, random_state=2016, verbose=1)
tfidf = TfidfVectorizer(ngram_range=(1, 1), stop_words='english')
tsvd = TruncatedSVD(n_components=10, random_state=2016)

# Same FeatureUnion as above, but the pipeline stops at the transformation step.
clf = pipeline.Pipeline([
    ('union', FeatureUnion(
        transformer_list=[
            ('cst', cust_regression_vals()),
            ('txt1', pipeline.Pipeline([('s1', cust_txt_col(key='search_term')),
                                        ('tfidf1', tfidf), ('tsvd1', tsvd)]))
        ],
        transformer_weights={
            'cst': 1.0,
            'txt1': 0.5
        },
    ))
])

# Transform the data up front, then grid-search over the bare regressor.
X_train_trans = clf.fit_transform(X_train, y_train)
X_test_trans = clf.transform(X_test)

param_grid = {'max_features': [10], 'max_depth': [20]}
model = grid_search.GridSearchCV(estimator=rfr, param_grid=param_grid, cv=2, verbose=20, scoring=RMSE)
model.fit(X_train_trans, y_train)
y_pred_mod = model.predict(X_test_trans)
In the first code block, my RandomForestRegressor is built into the pipeline, whereas in the second code block I pulled it out of the pipeline and applied the transformation steps separately. Any ideas why y_pred_orig is vastly different from y_pred_mod, when, as far as I can tell, I'm tracing the same steps the pipeline would trace?
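To quantify what I mean by "vastly different", here is roughly how I'm comparing the two prediction vectors (a minimal sketch; it assumes y_pred_orig and y_pred_mod are 1-D numpy arrays of equal length):

import numpy as np

# Sketch: summarize the discrepancy between the two sets of predictions.
print(np.allclose(y_pred_orig, y_pred_mod))      # whether they agree within tolerance
print(np.abs(y_pred_orig - y_pred_mod).max())    # largest per-sample difference
print(np.abs(y_pred_orig - y_pred_mod).mean())   # mean per-sample difference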
EDIT: For what it's worth, code block 1 yields a much lower test RMSE, whereas code block 2 yields a much lower train RMSE. These results confuse me, as I thought both code blocks should yield identical results given that I'm using the same random_state.
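For reference, RMSE above is a custom scorer object defined elsewhere in my script, not a built-in sklearn scoring string. My actual definition is omitted, but it is along these lines (a sketch using make_scorer; greater_is_better=False because GridSearchCV maximizes the score):

from sklearn.metrics import mean_squared_error, make_scorer

# Sketch of the kind of RMSE scorer passed to GridSearchCV (my exact definition is omitted above).
def rmse(y_true, y_pred):
    return mean_squared_error(y_true, y_pred) ** 0.5

RMSE = make_scorer(rmse, greater_is_better=False)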