NotFittedError: TfidfVectorizer - Vocabulary wasn't fitted (Python)

Time: 2018-02-28 16:30:21

Tags: python-3.x machine-learning svm tf-idf predict

Goal: predict labels on the original data

Background: I built an SVM classifier

I use the following code:

0) Import modules

    import numpy as np
    from sklearn import datasets
    from sklearn import svm
    from sklearn.model_selection import StratifiedKFold
    from sklearn.metrics import precision_score, recall_score,accuracy_score
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics import precision_recall_fscore_support

1) X_list and y

type(X_list) #list, strings
len(X_list)  #2163
type(y) #numpy.ndarray
len(y)  #2163

2) Convert X_list from strings to floats using tfidf

tfidf = TfidfVectorizer()
X_vec = tfidf.fit_transform(X_list) 
X = X_vec.toarray()

3) Shape of X

X.shape  # (2163, 8753)

4) 10-fold cross-validation and SVM

skf = StratifiedKFold(n_splits=10) 
clf = svm.SVC(kernel='linear', C=1)

5) Loop over the 10 folds

precision_scores = []
recall_scores = []
f_scores = [] 

for train_index, test_index in skf.split(X, y): 
    X_train = X[train_index]
    X_test =  X[test_index]
    y_train = y[train_index]
    y_test =  y[test_index]

    clf.fit(X_train, y_train) 
    y_pred = clf.predict(X_test)

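    # Assumed: `scores` is never defined in the post; presumably it came from
    # the imported precision_recall_fscore_support (pass average=... for one value per fold)
    scores = precision_recall_fscore_support(y_test, y_pred)
    precision_scores.append(scores[0])
    recall_scores.append(scores[1])
    f_scores.append(scores[2])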

6) Predict on the original dataset X_original

type(X_original) #list, strings
len(X_original)  #2163

7) Convert X_original from strings to floats

tfidf = TfidfVectorizer()
X_original_transform = tfidf.transform(X_original) 

But when I do this, I get the following error:

`NotFittedError: TfidfVectorizer - Vocabulary wasn't fitted.`

There is a similar question on SO, "NotFittedError: TfidfVectorizer - Vocabulary wasn't fitted", but it seems to be different from my problem.

8) How can I fix this error?

1 answer:

Answer 0 (score: 0)

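In point (7) above, you can see that you are initializing tfidf again, which creates a new TfidfVectorizer instance with no data or information in it. You then never fit it, hence the error. You need to call fit() the same way you did in point (2).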

Change point (7) to:

tfidf = TfidfVectorizer()
# fit_transform should be used here.
X_original_transform = tfidf.fit_transform(X_original) 
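
One caveat worth checking: a freshly fitted vectorizer learns its own vocabulary, so its columns line up with the features the classifier was trained on only if the new texts produce exactly the same vocabulary. A minimal sketch of the alternative, reusing the already fitted instance and calling only transform(), is below. The toy corpora `train_texts` and `new_texts` are hypothetical stand-ins for X_list and X_original.

```python
from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpora standing in for X_list and X_original.
train_texts = ["the cat sat", "the dog barked", "a cat and a dog"]
new_texts = ["the bird sang", "a dog sat"]

tfidf = TfidfVectorizer()
X_train = tfidf.fit_transform(train_texts)  # learns vocabulary and IDF weights

# An unfitted instance has no vocabulary, so transform() raises NotFittedError.
try:
    TfidfVectorizer().transform(new_texts)
    raised = False
except NotFittedError:
    raised = True

# Reusing the fitted instance keeps the column layout the classifier saw.
X_new = tfidf.transform(new_texts)

# A separately fitted instance learns its own vocabulary from the new texts.
X_refit = TfidfVectorizer().fit_transform(new_texts)

print(raised, X_train.shape[1], X_new.shape[1], X_refit.shape[1])
```

If the column counts of `X_new` and `X_refit` differ, a classifier trained on `X_train` can only consume `X_new`.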

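Also, in point (2) you first fit the TfidfVectorizer on the entire dataset and then split it into training and test sets. This is not recommended, because it leaks information about the test data to the model during training. Consider how this works in the real world. Do you have all the information about the data you want to predict in advance? No. You train the model on the data that is available and then use it on unseen data. The current code in point (2) breaks this.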

Always split into train and test first, then fit (fit()) only on the training data and use that information to transform (transform()) the test data.
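
To make the leak concrete, here is a small sketch comparing a vectorizer fitted on everything with one fitted on the training documents only. The documents are made up for illustration; the point is only that terms appearing solely in the test data end up in the vocabulary (and influence the IDF weights) when you fit before splitting.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical documents; the test document shares no words with the training ones.
train_docs = ["cheap pills online", "meeting at noon"]
test_docs = ["lottery winner announced"]

# Fitting before the split: the vocabulary already contains test-only words.
leaky = TfidfVectorizer().fit(train_docs + test_docs)

# Fitting on the training documents only: test-only words stay unknown.
clean = TfidfVectorizer().fit(train_docs)

leaked_terms = set(leaky.vocabulary_) - set(clean.vocabulary_)
print(sorted(leaked_terms))
```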

Change it like this:

1) First remove the code in point (2). We will do that inside the fold iteration.

2) Change point (5) to:

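for train_index, test_index in skf.split(X_list, y):
    # X_list is a plain Python list, so pick the documents one index at a time
    X_train = [X_list[i] for i in train_index]
    X_test = [X_list[i] for i in test_index]
    y_train = y[train_index]
    y_test = y[test_index]

    tfidf = TfidfVectorizer()

    # Fit the vectorizer on the training fold only
    X_train = tfidf.fit_transform(X_train)
    clf.fit(X_train, y_train)

    # Only call transform() here
    X_test = tfidf.transform(X_test)
    y_pred = clf.predict(X_test)

    # Assumed: `scores` is never defined in the post; presumably it came from
    # the imported precision_recall_fscore_support
    scores = precision_recall_fscore_support(y_test, y_pred)
    precision_scores.append(scores[0])
    recall_scores.append(scores[1])
    f_scores.append(scores[2])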
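
As a possible alternative to the manual loop, the same fit-on-train, transform-on-test pattern can be sketched with a scikit-learn pipeline, which refits the vectorizer inside every fold for you. The toy data, the two-fold split, and the macro-averaged scorers below are assumptions for illustration, not part of the question.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy data standing in for X_list and y.
X_list = ["good movie", "great film", "nice plot", "fine acting",
          "bad movie", "awful film", "poor plot", "weak acting"]
y = np.array([1, 1, 1, 1, 0, 0, 0, 0])

# The vectorizer and the classifier are fitted together on each training fold.
model = make_pipeline(TfidfVectorizer(), SVC(kernel="linear", C=1))

skf = StratifiedKFold(n_splits=2)  # 2 folds only because the toy data is tiny
results = cross_validate(
    model, X_list, y, cv=skf,
    scoring=("precision_macro", "recall_macro", "f1_macro"),
)

# For the final predictions, fit once on all labelled data and pass raw strings.
model.fit(X_list, y)
labels = model.predict(["great plot", "awful acting"])
print(results["test_f1_macro"], labels)
```

Because the pipeline takes raw strings, there is no separate vectorizer to re-create at prediction time, which avoids the NotFittedError from the question.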