I have a question about a loop that builds pipelines around vectorizer models:
%%time
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.sklearn_api import D2VTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from gensim.utils import simple_preprocess
np.random.seed(0)
data = pd.read_csv('https://pastebin.com/raw/dqKFZ12m')
X_train, X_test, y_train, y_test = train_test_split(
    [simple_preprocess(doc) for doc in data.text],
    data.label, random_state=0)
classifiers = [
    LogisticRegression(random_state=0),
    LinearSVC(random_state=0),
    KNeighborsClassifier()
]
models = [
    CountVectorizer(preprocessor=' '.join, tokenizer=None),
    TfidfVectorizer(preprocessor=' '.join, tokenizer=None),
    D2VTransformer(dm=1, size=50, window=3, min_count=2, iter=10, seed=123),
]
for model in models:
    mdl_name = str(model.__class__.__name__)
    for classifier in classifiers:
        clf_name = str(classifier.__class__.__name__)
        pipeline = Pipeline([
            ('vec', model),
            ('clf', classifier)
        ])
        cval = cross_val_score(pipeline, X_train, y_train, scoring='accuracy', cv=5)
        print("Cross-Validation Score for %s on %s on %s" % (mdl_name, clf_name, np.mean(cval)))
print("Done.")
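If I am right, the current pipeline fits a new vectorizer for each of the three classifiers in the loop, correct? That is not a problem for the sample data given here, but the real dataset is large.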
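To check that assumption, here is a minimal sketch that counts how often the vectorizer is fit. `CountingVectorizer` and the toy corpus are hypothetical, made up only for this illustration; the counter is kept on the class because `cross_val_score` clones the pipeline for every fold.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline


class CountingVectorizer(CountVectorizer):
    """Hypothetical helper: a CountVectorizer that counts its fits."""
    n_fits = 0  # class-level, so it survives the cloning done per fold

    def fit_transform(self, raw_documents, y=None):
        # Pipeline.fit calls fit_transform on every step except the last.
        CountingVectorizer.n_fits += 1
        return super().fit_transform(raw_documents, y)


# Toy corpus (made up for illustration): 10 documents per class.
docs = ["good great fine film"] * 10 + ["bad awful poor film"] * 10
labels = [1] * 10 + [0] * 10

classifiers = [LogisticRegression(random_state=0), KNeighborsClassifier()]
for classifier in classifiers:
    pipeline = Pipeline([("vec", CountingVectorizer()), ("clf", classifier)])
    cross_val_score(pipeline, docs, labels, scoring="accuracy", cv=5)

# 2 classifiers x 5 folds = 10 fits of the vectorizer
print(CountingVectorizer.n_fits)
```

So with `cv=5` the vectorizer is refit not just once per classifier but once per classifier and fold.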
How can I fit the model only once and then plug it into the pipelines for several classifiers?
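One direction I have been considering, as a rough sketch and not necessarily the recommended approach, is the `memory` parameter of `Pipeline`, which caches fitted transformers on disk. The toy corpus and the temporary cache directory below are assumptions for illustration. Since `cross_val_score` with an integer `cv` produces the same folds on every call, the vectorizer fitted for a given fold should be reused by the later classifiers.

```python
import shutil
import tempfile

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Toy corpus (made up for illustration): 10 documents per class.
docs = ["good great fine film"] * 10 + ["bad awful poor film"] * 10
labels = [1] * 10 + [0] * 10

cache_dir = tempfile.mkdtemp()  # temporary directory for cached transformers
scores = {}
try:
    for classifier in [LogisticRegression(random_state=0),
                       LinearSVC(random_state=0)]:
        # Same vectorizer parameters and same fold data: the fitted
        # vectorizer can be loaded from the cache instead of being refit.
        pipeline = Pipeline(
            [("vec", TfidfVectorizer()), ("clf", classifier)],
            memory=cache_dir,
        )
        cval = cross_val_score(pipeline, docs, labels,
                               scoring="accuracy", cv=5)
        scores[type(classifier).__name__] = cval.mean()
finally:
    shutil.rmtree(cache_dir)  # remove the cache when done

print(scores)
```

This would still fit the vectorizer once per fold rather than once overall, and I am not sure how well the caching behaves with `D2VTransformer` on a large dataset.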
Thanks!