Question

I'm using the scikit library (for the first time) and I got this error:

ValueError: empty vocabulary; perhaps the documents only contain stop words
File "C:\Users\A605563\Desktop\velibProjetPreso\TraitementTwitterDico.py", line 33, in <module>
X_train_counts = count_vect.fit_transform(FileTweets)
File "C:\Python27\Lib\site-packages\sklearn\feature_extraction\text.py", line 804, in fit_transform
self.fixed_vocabulary_)
File "C:\Python27\Lib\site-packages\sklearn\feature_extraction\text.py", line 751, in _count_vocab
raise ValueError("empty vocabulary; perhaps the documents only contain stop words

But I don't understand why this happens.

import sklearn
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
import numpy
import unicodedata
import nltk


TweetsFile = open('tweets2015-08-13.csv', 'r+')
f2 = open('analyzer.txt', 'a')
print TweetsFile.readline()
count_vect = CountVectorizer(strip_accents='ascii')
FileTweets =  TweetsFile.read()
FileTweets = FileTweets.decode('latin1')
FileTweets = unicodedata.normalize('NFKD', FileTweets).encode('ascii','ignore')
print FileTweets
for line in TweetsFile:
    f2.write(line.replace('\n', ' '))
TweetsFile = f2
print type(FileTweets)
X_train_counts = count_vect.fit_transform(FileTweets)
print X_train_counts.shape
TweetsFile.close()

My data is raw tweets:

11/8/2015 @ Paris Marriott Champs Elysees Hotel "
2015-08-11 21:27:15,"I'm at Paris Marriott Hotel Champs-Elysees in Paris, FR <https://t.co/gAFspVw6FC>"
2015-08-11 21:24:08,"I'm at Four Seasons Hotel George V in Paris, Ile-de-France <https://t.co/dtPALvziWy>"
2015-08-11 21:22:11,    . @ Avenue des Champs-Elysees <https://t.co/8b7U05OAxG>
2015-08-11 20:54:18,Her pistol go @ Raspoutine Paris (Official) <https://t.co/le9l3dtdgM>
2015-08-11 20:50:14,"Desde Paris, con amor. @ Avenue des Champs-Elysees <https://t.co/R68JV3NT1z>"

Does anyone know what is going on here?

Answer 1

I found a solution; here is the code:

import sklearn
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd
import numpy as np
import unicodedata
import nltk 
import StringIO


TweetsFile = open('tweets2015-08-13.csv','r+')
yourResult = [line.split(',') for line in TweetsFile.readlines()]
count_vect = CountVectorizer(input="file")
docs_new = [ StringIO.StringIO(x) for x in yourResult ]
X_train_counts = count_vect.fit_transform(docs_new)
vocab = count_vect.get_feature_names()
print X_train_counts.shape
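
For context on why the call in the question fails: there, fit_transform receives one big string rather than a list of tweets. A likely explanation is that the string is iterated character by character, so every "document" is a single character, and the default token pattern only keeps tokens of two or more word characters. The sketch below imitates just that tokenization step with the standard `re` module (Python 3); it is not scikit-learn itself, and `tokens_per_document` is a hypothetical helper written for this illustration. Newer scikit-learn releases, as far as I know, reject a bare string with a different error message.

```python
import re

# Default token_pattern documented for CountVectorizer:
# it keeps only tokens made of two or more word characters.
TOKEN_PATTERN = re.compile(r"(?u)\b\w\w+\b")


def tokens_per_document(raw_documents):
    """Tokenize each item of an iterable with the default pattern (lowercased)."""
    return [TOKEN_PATTERN.findall(doc.lower()) for doc in raw_documents]


text = "Her pistol go @ Raspoutine Paris"

# Iterating over a str yields single characters, so no "document" has a token.
as_string = tokens_per_document(text)

# A list with one str per tweet yields real tokens.
as_list = tokens_per_document([text])

print(any(as_string))
print(as_list)
```

With the bare string, no item produces a token, which matches the "empty vocabulary" message; with a list holding one string per tweet, the words come through.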

Answer 2

Here is a simpler solution:

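from sklearn.feature_extraction.text import CountVectorizer

# Iterating over an open file yields one line at a time, so each line is one
# document. The default input='content' handles that; no input argument needed.
with open('bad_words_train.txt', 'r') as x:
    count_vect = CountVectorizer()
    X_train = count_vect.fit_transform(x)
print(X_train)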
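
Both answers come down to giving fit_transform one string per tweet. Below is a minimal Python 3 sketch of that idea for the CSV layout shown in the question, assuming the tweet text is the second column. `SAMPLE_CSV` is an in-memory stand-in adapted from the question's sample rows (URLs trimmed), and `load_tweets` is a hypothetical helper, not part of scikit-learn.

```python
import csv
import io

from sklearn.feature_extraction.text import CountVectorizer

# Two rows adapted from the question's sample data; a real script would
# open the CSV file instead of using this in-memory stand-in.
SAMPLE_CSV = (
    '2015-08-11 21:27:15,"I\'m at Paris Marriott Hotel Champs-Elysees in Paris, FR"\n'
    '2015-08-11 20:54:18,Her pistol go @ Raspoutine Paris (Official)\n'
)


def load_tweets(handle):
    """Return the text column (second field) of each CSV row as a list of str."""
    return [row[1] for row in csv.reader(handle) if len(row) >= 2]


tweets = load_tweets(io.StringIO(SAMPLE_CSV))

# One str per tweet: each list item is treated as a separate document.
count_vect = CountVectorizer(strip_accents='ascii')
X_train_counts = count_vect.fit_transform(tweets)
print(X_train_counts.shape)
```

Using `csv.reader` rather than `line.split(',')` keeps quoted tweets that contain commas in one piece.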
