ValueError in scikit-learn target (y parameter)

Time: 2015-10-24 07:19:12

Tags: python nlp scikit-learn

I am trying to classify my text data using tf-idf and a naive Bayes classifier:

cls = MultinomialNB()
vec = TfidfVectorizer(input='file', analyzer=word_tokenize, stop_words=stop_w, use_idf=False)
for i, filename in enumerate(files):

    with codecs.open(filename, encoding='utf8') as f:
        bow = vec.fit_transform(f)

        # I have one target for this bow (each file has a unique subject)
        y = np.array([repeat(i, times=41253)])
        cls.fit(bow, y)

The output of bow.shape looks like this:

(41253, 15987)

But I get this exception:

Traceback (most recent call last):
  File "/home/x/PycharmProjects/PWC/naiive.py", line 35, in <module>
    cls.fit(bow, y)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/naive_bayes.py", line 522, in fit
    X, y = check_X_y(X, y, 'csr')
  File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 516, in check_X_y
    check_consistent_length(X, y)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 176, in check_consistent_length
    "%s" % str(uniques))
ValueError: Found arrays with inconsistent numbers of samples: [    1 41253]

I know something is wrong with the size/shape of my y, but I don't know how to fix it. Also, is my construction of y correct in the first place?

1 Answer:

Answer 0: (score: 0)

That line should read:

y = repeat(i, times=41253)

Remove the extra square brackets and the np.array() call.
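
To see why the extra brackets matter, here is a minimal sketch comparing the two shapes. The question does not show where repeat is imported from, so this sketch assumes NumPy's np.repeat (which takes the count positionally rather than as times=); the label value 0 is a hypothetical stand-in for the loop index i.

```python
import numpy as np

n_samples = 41253  # number of rows in bow, taken from the question
label = 0          # hypothetical class index standing in for i

# Wrapping the labels in an extra list adds a leading axis of length 1,
# so the array looks like a single sample with 41253 values.
y_wrapped = np.array([np.repeat(label, n_samples)])

# A flat 1-D array carries one label per row of the feature matrix.
y_flat = np.repeat(label, n_samples)

print(y_wrapped.shape)  # (1, 41253)
print(y_flat.shape)     # (41253,)
```

The first dimension of y_wrapped is 1 while bow has 41253 rows, which matches the [1 41253] pair reported in the ValueError. The flat array has the same first dimension as bow, so the length check passes.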