Sklearn CountVectorizer "empty vocabulary" error on a DataFrame when computing nGrams

Date: 2017-05-12 15:34:36

Tags: python pandas dataframe scikit-learn countvectorizer

I have a DataFrame (data) with 3 records in it:

id    text
0001  The farmer plants grain
0002  The fisher catches tuna
0003  The police officer fights crime

I group this DataFrame by id:

data_grouped = data.groupby('id')

Describing the resulting groupby object shows that all records are retained.
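
(For reference, one sketch of such a check, assuming the three-record frame above; size() is just one common way to count rows per group:

print(data_grouped.size())

# id
# 0001    1
# 0002    1
# 0003    1
# dtype: int64
)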

I then run this code to find the nGrams in text and join them back to the id:

word_vectorizer = CountVectorizer(stop_words=None, ngram_range=(2,2),
                                  analyzer='word')

for id, group in data_grouped:
    X = word_vectorizer.fit_transform(group['text'])
    frequencies = sum(X).toarray()[0]
    results = pd.DataFrame(frequencies, columns=['frequency'])
    dfinner = pd.DataFrame(word_vectorizer.get_feature_names())
    dfinner['id'] = id
    final = results.join(dfinner)

When I run all of this code at once, word_vectorizer raises an error saying "empty vocabulary; perhaps the documents only contain stop words". I know this error has come up in many other questions, but I can't find one that deals with a DataFrame.

To complicate the issue further, the error doesn't always appear. I'm pulling the data from a SQL DB, and depending on how many records I pull in, the error may or may not show up. For example, pulling in the TOP 10 records throws the error, but the TOP 5 does not.
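
For context, the same ValueError is easy to trigger in isolation whenever a document yields no 2-grams at all, e.g. a record whose text is a single word (the one-word record below is hypothetical, standing in for whatever a problem row from the SQL DB might contain):

from sklearn.feature_extraction.text import CountVectorizer

vec = CountVectorizer(stop_words=None, ngram_range=(2,2), analyzer='word')
vec.fit_transform(['grain'])  # one token, so no 2-grams can be formed
# ValueError: empty vocabulary; perhaps the documents only contain stop words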

Edit:

Full traceback:

Traceback (most recent call last):

  File "<ipython-input-63-d261e44b8cce>", line 1, in <module>
    runfile('C:/Users/taca/Documents/Work/Python/Text Analytics/owccomments.py', wdir='C:/Users/taca/Documents/Work/Python/Text Analytics')

  File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 866, in runfile
    execfile(filename, namespace)

  File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)

  File "C:/Users/taca/Documents/Work/Python/Text Analytics/owccomments.py", line 38, in <module>
    X = word_vectorizer.fit_transform(group['cleanComments'])

  File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 839, in fit_transform
    self.fixed_vocabulary_)

  File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 781, in _count_vocab
    raise ValueError("empty vocabulary; perhaps the documents only"

ValueError: empty vocabulary; perhaps the documents only contain stop words

1 Answer:

Answer 0 (score: 2):

I see what's going on here, but as I work through it I have a nagging question: why are you doing this? I'm not clear on the value of fitting the CountVectorizer to each individual document in your document collection. Generally the idea is to fit it to the whole corpus and then do your analysis from there. I get that you might want to see which grams exist in each document, but there are other, easier and better-optimized ways to do that. For example:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({'id': [1, 2, 3],
                   'text': ['The farmer plants grain',
                            'The fisher catches tuna',
                            'The police officer fights crime']})
cv = CountVectorizer(stop_words=None, ngram_range=(2,2), analyzer='word')
dt_mat = cv.fit_transform(df.text)
print(cv.get_feature_names())
['catches tuna',
 'farmer plants',
 'fights crime',
 'fisher catches',
 'officer fights',
 'plants grain',
 'police officer',
 'the farmer',
 'the fisher',
 'the police']
print(dt_mat.todense())
[[0 1 0 0 0 1 0 1 0 0]
 [1 0 0 1 0 0 0 0 1 0]
 [0 0 1 0 1 0 1 0 0 1]]

Great, so you can see the features the CountVectorizer extracted, plus a matrix representation of which features exist in each document. dt_mat is the document-term matrix: it holds the count (frequency) of each gram in the vocabulary (the features) for every document. To map this back to the grams, and even put it into a DataFrame, you can do the following:

df['grams'] = cv.inverse_transform(dt_mat)
print(df)
   id                             text  \
0   1          The farmer plants grain
1   2          The fisher catches tuna
2   3  The police officer fights crime

                                               grams
0          [plants grain, farmer plants, the farmer]
1         [catches tuna, fisher catches, the fisher]
2  [fights crime, officer fights, police officer,...

Personally, this feels like it makes more sense, because you're fitting the CountVectorizer to the entire corpus rather than to one document at a time. You can still extract the same information (frequencies and grams), and it will be much faster as you scale up in documents.
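
If you still want the per-id frequency table the original loop was building, one sketch on top of the df, cv and dt_mat objects above (the column names 'gram' and 'frequency' here are just my choices) is to reshape the document-term matrix into long format:

import pandas as pd

# one row per document, one column per gram, indexed by id
counts = pd.DataFrame(dt_mat.toarray(),
                      columns=cv.get_feature_names(),
                      index=df['id'])

# stack to long format: one row per (id, gram) pair with its count
final = (counts.stack()
               .rename_axis(['id', 'gram'])
               .reset_index(name='frequency'))
final = final[final['frequency'] > 0]  # keep only grams present in each document

Each row of final then ties one id to one gram and its count, which is the same join the loop in the question was after, but from a single corpus-wide fit.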