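I am trying to classify my text data using tf-idf and a naive Bayes classifier:

# Imports inferred from the names used below; files and stop_w are defined elsewhere.
import codecs
from itertools import repeat

import numpy as np
from nltk.tokenize import word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

cls = MultinomialNB()
vec = TfidfVectorizer(input='file', analyzer=word_tokenize, stop_words=stop_w, use_idf=False)
for i, filename in enumerate(files):
    with codecs.open(filename, encoding='utf8') as f:
        bow = vec.fit_transform(f)
        # There is one target for this bow (each file has a unique subject).
        y = np.array([repeat(i, times=41253)])
        cls.fit(bow, y)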
The output of bow.shape looks like this:
(41253, 15987)
but I get this exception:
Traceback (most recent call last):
  File "/home/x/PycharmProjects/PWC/naiive.py", line 35, in <module>
    cls.fit(bow, y)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/naive_bayes.py", line 522, in fit
    X, y = check_X_y(X, y, 'csr')
  File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 516, in check_X_y
    check_consistent_length(X, y)
  File "/usr/local/lib/python2.7/dist-packages/sklearn/utils/validation.py", line 176, in check_consistent_length
    "%s" % str(uniques))
ValueError: Found arrays with inconsistent numbers of samples: [ 1 41253]
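I know something is wrong with the size/shape of my y, but I don't know how to fix it. Also, is the way I build y correct in the first place?

Answer 0 (score: 0)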
That line should be:
y = repeat(i, times=41253)
Remove the extra square brackets and the np.array() call.
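
To make the shape problem concrete, here is a minimal sketch. It assumes `repeat` is `itertools.repeat` (the `times=` keyword suggests so). The names `n_rows`, `label`, and `rows_per_file` are hypothetical stand-ins, not from the original post. Wrapping the iterator in a list gives NumPy a single element, which is where the `1` in the error message comes from. Since `itertools.repeat` returns a lazy iterator, a concrete 1-D array such as `np.full(n_rows, label)` is probably the safer thing to hand to scikit-learn.

```python
from itertools import repeat

import numpy as np

n_rows = 5  # hypothetical stand-in for bow.shape[0] (41253 in the question)
label = 2   # hypothetical stand-in for the file index i

# What the question builds: a list holding one iterator, so only one "sample".
y_bad = np.array([repeat(label, times=n_rows)])

# A concrete 1-D array with one label per row of the feature matrix.
y_good = np.full(n_rows, label)

# Labels for several files, concatenated so that a single fit sees every class.
rows_per_file = [3, 2, 4]  # hypothetical row counts, one entry per file
y_all = np.concatenate([np.full(n, i) for i, n in enumerate(rows_per_file)])

print(len(y_bad), y_good.shape, y_all.tolist())
```

On whether building y per file is right in the first place: calling `cls.fit` inside the loop refits the model on one file (one class) each time, and `vec.fit_transform` relearns the vocabulary per file. It may work better to collect the rows and labels from all files, vectorize once, and call `fit` once.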