Loops: avoid computing the vectorizer multiple times

Asked: 2019-01-15 10:18:02

Tags: python loops scikit-learn

I have a question about a loop that builds pipelines, each containing a vectorizer model:

%%time
import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from gensim.sklearn_api import D2VTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.pipeline import Pipeline
from gensim.utils import simple_preprocess
np.random.seed(0)

data = pd.read_csv('https://pastebin.com/raw/dqKFZ12m')
X_train, X_test, y_train, y_test = train_test_split([simple_preprocess(doc) for doc in data.text],
                                                    data.label, random_state=0)

classifiers = [
    LogisticRegression(random_state=0),
    LinearSVC(random_state=0),
    KNeighborsClassifier(),
]

models = [
    CountVectorizer(preprocessor=' '.join, tokenizer=None),
    TfidfVectorizer(preprocessor=' '.join, tokenizer=None),
    D2VTransformer(dm=1, size=50, window=3, min_count=2, iter=10, seed=123),
]

for model in models:
    mdl_name = model.__class__.__name__

    for classifier in classifiers:
        clf_name = classifier.__class__.__name__

        # Each pipeline pairs one vectorizer with one classifier.
        pipeline = Pipeline([
            ('vec', model),
            ('clf', classifier),
        ])

        cval = cross_val_score(pipeline, X_train, y_train, scoring='accuracy', cv=5)
        print("Cross-validation score for %s with %s: %s" % (mdl_name, clf_name, np.mean(cval)))

print("Done.")

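If I am right, the current pipeline fits a new vectorizer for each of the three classifiers in the loop, correct? That is not a problem for the sample data given here, but my real dataset is large.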

How can I compute the model only once and then plug it into the pipelines for several classifiers?

Thanks!
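One possible direction, offered as a sketch rather than a tested solution for the full dataset: scikit-learn's Pipeline accepts a `memory` argument that caches fitted transformers on disk. As far as I understand it, when a later pipeline asks for the same transformer parameters on the same training fold, the cached result is loaded instead of refitting. Note that the vectorizer would still be fitted once per cross-validation fold, not literally once overall. The toy documents, labels, and the temporary cache directory below are assumptions for illustration and are not part of the original data.

```python
import shutil
import tempfile

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus standing in for the real dataset.
docs = ["good movie", "great film", "nice plot", "good acting", "great story",
        "bad movie", "awful film", "poor plot", "bad acting", "awful story"] * 3
labels = np.array(([1] * 5 + [0] * 5) * 3)

classifiers = [
    LogisticRegression(random_state=0),
    LinearSVC(random_state=0),
    KNeighborsClassifier(),
]

# Temporary directory used as the on-disk cache for fitted transformers.
cache_dir = tempfile.mkdtemp()
results = {}
try:
    for classifier in classifiers:
        clf_name = classifier.__class__.__name__
        # memory=cache_dir asks the pipeline to cache the fitted 'vec' step.
        # The first classifier populates the cache for each CV fold; later
        # classifiers should hit the cache because the vectorizer parameters
        # and the training folds are identical.
        pipeline = Pipeline([
            ('vec', CountVectorizer()),
            ('clf', classifier),
        ], memory=cache_dir)
        cval = cross_val_score(pipeline, docs, labels, scoring='accuracy', cv=5)
        results[clf_name] = np.mean(cval)
        print("Cross-validation score for %s: %s" % (clf_name, results[clf_name]))
finally:
    # Remove the cache once it is no longer needed.
    shutil.rmtree(cache_dir)
```

The cache only pays off when the vectorizer is expensive compared with hashing its input, so on a small sample it may not be faster. The same `memory` argument should in principle work with the other vectorizers in the `models` list, though that is not exercised here.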

0 answers:

No answers