ValueError: Found input variables with inconsistent numbers of samples on a binary SVM

Asked: 2019-09-16 14:29:16

Tags: python pandas scikit-learn nlp valueerror

I'm trying to run a binary SVM on the 20_newsgroups dataset, but I keep hitting ValueError: Found input variables with inconsistent numbers of samples: [783, 1177]. Can anyone explain why this happens?

from sklearn.datasets import fetch_20newsgroups
from nltk.corpus import names
from nltk.stem import WordNetLemmatizer
# from sklearn.feature_extraction.text import CountVectorizer
import numpy as np
import pandas as pd
categories = ["comp.graphics", 'sci.space']
data_train = fetch_20newsgroups(subset='train', categories=categories, random_state=42)
data_test = fetch_20newsgroups(subset='test', categories=categories, random_state=42)

def is_letter_only(word) : 
    return word.isalpha()
all_names = set (names.words())
lemmatizer = WordNetLemmatizer()
def clean_text(docs) : 
    docs_cleaned = []
    for doc in docs:
        doc = doc.lower()
        doc_cleaned = ' '.join(lemmatizer.lemmatize(word)
                for word in doc.split() if is_letter_only(word)
                and word not in all_names)
        docs_cleaned.append(doc_cleaned)
    return docs_cleaned

cleaned_train = clean_text(data_train.data)
label_train = data_train.target
cleaned_test = clean_text(data_train.data)
label_test = data_test.target
len(label_train),len(label_test)

from sklearn.feature_extraction.text import TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_features=None)
term_docs_train = tfidf_vectorizer.fit_transform(cleaned_train)
term_docs_test = tfidf_vectorizer.transform(cleaned_test)

from sklearn.svm import SVC
svm = SVC(kernel='linear', C=1.0, random_state=42)

svm.fit(term_docs_train, label_train)

accuracy = svm.score(term_docs_test, label_test)
print(accuracy)

1 Answer:

Answer 0 (score: 0)

The error is simply telling you that the number of samples you are trying to predict labels for differs from the number of target labels. It happens because you build the test features from the same data as the training set, but then try to score them against the labels of the actual test set, which has a different size.
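
As a quick sanity check, you can compare the number of rows in each feature matrix with the length of the corresponding label array before fitting or scoring. This is a minimal diagnostic sketch that reuses the question's variable names:

# Every X passed to fit/score must have as many rows as its y has labels.
print(term_docs_train.shape[0], len(label_train))  # 1177 vs 1177 -> consistent
print(term_docs_test.shape[0], len(label_test))    # 1177 vs 783 -> triggers the ValueError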

Just fix this line:

cleaned_test = clean_text(data_test.data)

and the script will then output:

0.966794380587484