Loading pickle NotFittedError: TfidfVectorizer - Vocabulary wasn't fitted

Asked: 2019-07-26 04:21:05

标签: python-3.x machine-learning nlp pickle tfidfvectorizer

Multi-label classification

I am trying to predict multi-label classifications using scikit-learn / pandas / OneVsRestClassifier / logistic regression. Building and evaluating the model works, but trying to classify new sample text does not.

Scenario 1:

Once I built a model, I saved it under the name sample.pkl and restarted the kernel. But when I loaded the saved model (sample.pkl) while predicting on sample text, I got this error:

 NotFittedError: TfidfVectorizer - Vocabulary wasn't fitted.

I built and evaluated the model, then saved it as sample.pkl. After restarting the kernel, loading the saved model and predicting on sample text raises the NotFittedError above.

Inference

import os, re, csv, json
import pickle
import collections
from collections import Counter
import numpy as np
import pandas as pd
import seaborn as sns
import nltk
from nltk.corpus import stopwords
from tqdm import tqdm
import matplotlib.pyplot as plt
from sklearn.metrics import f1_score # performance metric
from sklearn.multiclass import OneVsRestClassifier # binary relevance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import MultiLabelBinarizer

stop_words = set(stopwords.words('english'))

def cleanHtml(sentence):
    ''' remove the tags '''
    cleanr = re.compile('<.*?>')
    cleantext = re.sub(cleanr, ' ', str(sentence))
    return cleantext


def cleanPunc(sentence):
    ''' function to clean the word of any
    punctuation or special characters '''
    cleaned = re.sub(r'[?|!|\'|"|#]',r'',sentence)
    cleaned = re.sub(r'[.|,|)|(|\|/]',r' ',cleaned)
    cleaned = cleaned.strip()
    cleaned = cleaned.replace("\n"," ")
    return cleaned

def keepAlpha(sentence):
    """ keep the alpha sentences """
    alpha_sent = ""
    for word in sentence.split():
        alpha_word = re.sub('[^a-z A-Z]+', ' ', word)
        alpha_sent += alpha_word
        alpha_sent += " "
    alpha_sent = alpha_sent.strip()
    return alpha_sent

def remove_stopwords(text):
    """ remove stop words """
    no_stopword_text = [w for w in text.split() if not w in stop_words]
    return ' '.join(no_stopword_text)
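For reference, the four helpers above can be collapsed into one chained cleaning step. This is a minimal sketch: the regexes are the same as in the question, but `stop_words` here is a tiny stand-in set rather than the full NLTK English list.

```python
import re

stop_words = {"the", "a", "is", "with"}  # stand-in for set(stopwords.words('english'))

def clean_text(sentence):
    """Apply the same steps as cleanHtml -> cleanPunc -> keepAlpha -> remove_stopwords."""
    text = re.sub(re.compile('<.*?>'), ' ', str(sentence))   # strip HTML tags
    text = re.sub(r'[?|!|\'|"|#]', r'', text)                # drop punctuation
    text = re.sub(r'[.|,|)|(|\|/]', r' ', text).strip().replace("\n", " ")
    text = re.sub('[^a-z A-Z]+', ' ', text)                  # keep letters only
    return ' '.join(w for w in text.split() if w not in stop_words)

print(clean_text("<p>The story begins with Hannah, a 12-year-old girl!</p>"))
# → "The story begins Hannah year old girl"
```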

test1 = pd.read_csv("C:\\Users\\abc\\Downloads\\test1.csv")
test1.columns

test1.head()
siNo  plot                              movie_name       genre_new
1     The story begins with Hannah...   sing             [drama,teen]
2     Debbie's favorite band is Dream.. the bigeest fan  [drama]
3     This story of a Zulu family is .. come back,africa [drama,Documentary]

Error encountered: the error occurs when I run inference on sample text.

def infer_tags(q):
    q = cleanHtml(q)
    q = cleanPunc(q)
    q = keepAlpha(q)
    q = remove_stopwords(q)
    multilabel_binarizer = MultiLabelBinarizer()
    tfidf_vectorizer = TfidfVectorizer()
    q_vec = tfidf_vectorizer.transform([q])
    q_pred = clf.predict(q_vec)
    return multilabel_binarizer.inverse_transform(q_pred)


for i in range(5):
    print(i)
    k = test1.sample(1).index[0]
    print("Movie: ", test1['movie_name'][k], "\nPredicted genre: ", infer_tags(test1['plot'][k]))
    print("Actual genre: ", test1['genre_new'][k], "\n")


Solved

I solved it by also saving the tfidf vectorizer and the multilabel binarizer to pickle files:

from sklearn.externals import joblib
pickle.dump(tfidf_vectorizer, open("tfidf_vectorizer.pickle", "wb"))
pickle.dump(multilabel_binarizer, open("multibinirizer_vectorizer.pickle", "wb"))
vectorizer = joblib.load('/abc/downloads/tfidf_vectorizer.pickle')
multilabel_binarizer = joblib.load('/abc/downloads/multibinirizer_vectorizer.pickle')


def infer_tags(q):
    q = cleanHtml(q)
    q = cleanPunc(q)
    q = keepAlpha(q)      
    q = remove_stopwords(q)
    q_vec = vectorizer.transform([q])
    q_pred = rf_model.predict(q_vec)
    return multilabel_binarizer.inverse_transform(q_pred)
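An alternative that avoids juggling separate pickle files is to wrap the vectorizer and classifier in a scikit-learn Pipeline and persist that single fitted object. This is a sketch with toy data; the corpus, labels, and the file name `genre_pipeline.pkl` are made up for illustration.

```python
import pickle
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.multiclass import OneVsRestClassifier
from sklearn.linear_model import LogisticRegression

docs = ["a drama about a family", "teen comedy with friends", "a documentary on africa"]
labels = [[1, 0], [0, 1], [1, 0]]  # toy binarized genre matrix

# The pipeline carries the fitted tf-idf vocabulary along with the classifier.
pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("clf", OneVsRestClassifier(LogisticRegression())),
])
pipe.fit(docs, labels)

with open("genre_pipeline.pkl", "wb") as f:
    pickle.dump(pipe, f)

# Later session: one load restores vectorizer and classifier together.
with open("genre_pipeline.pkl", "rb") as f:
    loaded = pickle.load(f)

print(loaded.predict(["a new family drama"]))
```

One dump and one load cover everything, so the inference code can never pick up an unfitted vectorizer by mistake.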

I found the solution via this link: How do I store a TfidfVectorizer for future use in scikit-learn?

1 answer:

Answer 0 (score: 1)

This happens because you only dumped the classifier to pickle, not the vectorizer.

During inference, when you call

 tfidf_vectorizer = TfidfVectorizer()

your vectorizer is not fitted on the training vocabulary, which causes the error.

What you should do is dump both the classifier and the vectorizer to pickle, and load both of them during inference.
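A minimal sketch of that fix, with toy training data standing in for the real corpus; the names `tfidf_vectorizer` and `clf` mirror the question's code, and the file names are illustrative.

```python
import pickle
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.multiclass import OneVsRestClassifier

# --- training session ---
docs = ["family drama story", "teen romance film", "war documentary footage"]
y = [[1, 0], [0, 1], [1, 0]]  # toy binarized genre labels

tfidf_vectorizer = TfidfVectorizer()
X = tfidf_vectorizer.fit_transform(docs)
clf = OneVsRestClassifier(LogisticRegression()).fit(X, y)

# Dump BOTH fitted objects, not just the classifier.
pickle.dump(tfidf_vectorizer, open("tfidf_vectorizer.pickle", "wb"))
pickle.dump(clf, open("classifier.pickle", "wb"))

# --- inference session (e.g. after a kernel restart) ---
vectorizer = pickle.load(open("tfidf_vectorizer.pickle", "rb"))
model = pickle.load(open("classifier.pickle", "rb"))

q_vec = vectorizer.transform(["a new family drama"])  # fitted vocabulary: no NotFittedError
print(model.predict(q_vec))
```

The key point is that `transform` is called on the loaded, already-fitted vectorizer, never on a freshly constructed `TfidfVectorizer()`.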