Tuning a model using grid search

Asked: 2019-10-13 19:37:03

Tags: python-3.x machine-learning scikit-learn

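I worked through an example of parameter tuning with grid search on text data, using TfidfVectorizer() inside a pipeline. As I understand it, when we call grid_search.fit(X_train, y_train), it transforms the data and then fits the model for each combination of parameters listed in the dictionary. However, I am a bit confused about the test set during evaluation: when we call grid_search.predict(X_test), I don't know whether, or how, TfidfVectorizer() is applied to this test chunk.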

Thanks

David

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.pipeline import Pipeline
from sklearn.metrics import precision_score, recall_score, accuracy_score

# Step 'vect' turns raw text into TF-IDF features; step 'clf' classifies them.
# solver='liblinear' is set because it supports both the 'l1' and 'l2'
# penalties searched below (the default solver rejects 'l1').
pipeline = Pipeline([
    ('vect', TfidfVectorizer(stop_words='english')),
    ('clf', LogisticRegression(solver='liblinear'))
])
# Keys follow the <step name>__<parameter name> convention.
parameters = {
    'vect__max_df': (0.25, 0.5, 0.75),
    'vect__stop_words': ('english', None),
    'vect__max_features': (2500, 5000, 10000, None),
    'vect__ngram_range': ((1, 1), (1, 2)),
    'vect__use_idf': (True, False),
    'vect__norm': ('l1', 'l2'),
    'clf__penalty': ('l1', 'l2'),
    'clf__C': (0.01, 0.1, 1, 10),
}

if __name__ == "__main__":
    grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1,
                               verbose=1, scoring='accuracy', cv=3)
    df = pd.read_csv('data/sms.csv')
    X, y = df['message'], df['label']
    X_train, X_test, y_train, y_test = train_test_split(X, y)
    grid_search.fit(X_train, y_train)
    print('Best score: %0.3f' % grid_search.best_score_)
    print('Best parameters set:')
    best_parameters = grid_search.best_estimator_.get_params()
    for param_name in sorted(parameters.keys()):
        print('\t%s: %r' % (param_name, best_parameters[param_name]))
    # predict() sends the raw test messages through the fitted pipeline.
    predictions = grid_search.predict(X_test)
    print('Accuracy:', accuracy_score(y_test, predictions))
    # precision_score and recall_score default to pos_label=1, so this
    # assumes the labels are 0/1; pass pos_label for string labels.
    print('Precision:', precision_score(y_test, predictions))
    print('Recall:', recall_score(y_test, predictions))

1 answer:

Answer 0 (score: 1)

This is an example of scikit-learn's pipeline magic. It works like this:

  1. First, you define the elements of the pipeline with the Pipeline constructor. All data, whether in the training (fit) or testing (predict) stage, is processed by every defined step: in this case by TfidfVectorizer, whose output is then passed to the LogisticRegression model.
  2. Passing the defined pipeline to the GridSearchCV constructor lets you call its fit method, which not only performs the grid search but also internally sets TfidfVectorizer and LogisticRegression to the best parameters found, so running predict later uses the best model.
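
The two steps above can be checked on a small scale. The following is a minimal sketch, not the asker's setup: the toy messages, the 0/1 labels, and the reduced grid over clf__C are all made up for illustration. It relies on GridSearchCV's default refit=True, which refits the whole pipeline with the best parameters, and compares grid_search.predict on raw text against manually transforming the text with the fitted vectorizer and then calling the classifier.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Hypothetical toy data, invented for this sketch (1 = spam, 0 = ham).
train_texts = [
    "win a free prize now", "free cash offer, call now",
    "claim your free reward today", "urgent: you won a prize",
    "are we meeting for lunch", "see you at home tonight",
    "can you call me later", "thanks for the lunch today",
]
train_labels = [1, 1, 1, 1, 0, 0, 0, 0]
test_texts = ["free prize call now", "lunch at home later"]

pipeline = Pipeline([
    ("vect", TfidfVectorizer()),
    ("clf", LogisticRegression()),
])

# A deliberately tiny grid; cv=2 because the toy data set is so small.
grid_search = GridSearchCV(pipeline, {"clf__C": [0.1, 1, 10]}, cv=2)
grid_search.fit(train_texts, train_labels)

# With refit=True (the default), best_estimator_ is the full pipeline
# refitted on all training data with the best parameters.
best = grid_search.best_estimator_
vect = best.named_steps["vect"]
clf = best.named_steps["clf"]

# Manual route: transform (not fit_transform) the test text with the
# already-fitted vectorizer, then classify the resulting features.
manual = clf.predict(vect.transform(test_texts))

# Route from the question: hand raw text straight to the grid search.
via_grid = grid_search.predict(test_texts)

same_predictions = bool(np.array_equal(manual, via_grid))
print("Identical predictions:", same_predictions)
```

If both routes agree, the vectorizer fitted on the training data is being applied to the test text inside predict, with its vocabulary and IDF weights left unchanged.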

You can find more information about creating pipelines in the scikit-learn documentation.