I have a data frame (data) with 3 records:
id text
0001 The farmer plants grain
0002 The fisher catches tuna
0003 The police officer fights crime
I group the data frame by id:
data_grouped = data.groupby('id')
Describing the resulting groupby object shows that all the records are retained.
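(For a self-contained reproduction, the frame above can be built as follows; the string dtype for id is my assumption, since the real data comes from SQL:)

import pandas as pd

# Minimal reproduction of the frame above; the id dtype is assumed.
data = pd.DataFrame({'id': ['0001', '0002', '0003'],
                     'text': ['The farmer plants grain',
                              'The fisher catches tuna',
                              'The police officer fights crime']})
data_grouped = data.groupby('id')
print(data_grouped.size())   # one record per id, nothing dropped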
I then run this code to find the n-grams in text and join them back to id:
from sklearn.feature_extraction.text import CountVectorizer
import pandas as pd

word_vectorizer = CountVectorizer(stop_words=None, ngram_range=(2, 2),
                                  analyzer='word')
for id, group in data_grouped:
    # Re-fit the vectorizer on each group's text and tally bigram counts.
    X = word_vectorizer.fit_transform(group['text'])
    frequencies = sum(X).toarray()[0]
    results = pd.DataFrame(frequencies, columns=['frequency'])
    dfinner = pd.DataFrame(word_vectorizer.get_feature_names())
    dfinner['id'] = id
    final = results.join(dfinner)
When I run all of this code at once, word_vectorizer raises an error saying "empty vocabulary; perhaps the documents only contain stop words". I know this error has come up in many other questions, but I couldn't find one that deals with a DataFrame.
To complicate matters further, the error doesn't always appear. I pull the data from a SQL DB, and depending on how many records I pull in, the error may or may not show up. For example, pulling in the Top 10 records triggers the error, but the Top 5 does not.
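One way to narrow this down (a diagnostic sketch, not part of my original code; the column name follows the example above) is to wrap the per-group fit in a try/except and print the offending group. A group whose text tokenizes to fewer than two words cannot produce any (2, 2) n-grams, which would raise exactly this ValueError and would explain why the failure depends on which records come back:

# Diagnostic sketch: find which group raises the empty-vocabulary error.
for id, group in data_grouped:
    try:
        word_vectorizer.fit_transform(group['text'])
    except ValueError as err:
        print(id, 'failed:', err)
        print(group['text'].tolist())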
Edit:
Full traceback:
Traceback (most recent call last):
  File "<ipython-input-63-d261e44b8cce>", line 1, in <module>
    runfile('C:/Users/taca/Documents/Work/Python/Text Analytics/owccomments.py', wdir='C:/Users/taca/Documents/Work/Python/Text Analytics')
  File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 866, in runfile
    execfile(filename, namespace)
  File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)
  File "C:/Users/taca/Documents/Work/Python/Text Analytics/owccomments.py", line 38, in <module>
    X = word_vectorizer.fit_transform(group['cleanComments'])
  File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 839, in fit_transform
    self.fixed_vocabulary_)
  File "C:\Users\taca\AppData\Local\Continuum\Anaconda3\lib\site-packages\sklearn\feature_extraction\text.py", line 781, in _count_vocab
    raise ValueError("empty vocabulary; perhaps the documents only"
ValueError: empty vocabulary; perhaps the documents only contain stop words
Answer (score 2):
I see what's going on here, but while working through it I had a nagging question: why are you doing it this way? I'm not sure I see the value of fitting a CountVectorizer to each individual document in the collection. Generally, the idea is to fit it to the entire corpus and run the analysis from there. I understand you may want to see which grams occur in each document, but there are easier, better-optimized ways to get that. For example:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({'id': [1, 2, 3],
                   'text': ['The farmer plants grain', 'The fisher catches tuna',
                            'The police officer fights crime']})
cv = CountVectorizer(stop_words=None, ngram_range=(2, 2), analyzer='word')
dt_mat = cv.fit_transform(df.text)  # fit once on the whole corpus
print(cv.get_feature_names())
['catches tuna',
'farmer plants',
'fights crime',
'fisher catches',
'officer fights',
'plants grain',
'police officer',
'the farmer',
'the fisher',
'the police']
print(dt_mat.todense())
[[0 1 0 0 0 1 0 1 0 0]
[1 0 0 1 0 0 0 0 1 0]
[0 0 1 0 1 0 1 0 0 1]]
Great: you can see the features CountVectorizer extracted, and a matrix representation of which features occur in each document. dt_mat is the document-term matrix, holding the count (frequency) of each gram in the vocabulary (the features) for every document. To map this back to the grams, and even put it in a DataFrame, you can do the following:
df['grams'] = cv.inverse_transform(dt_mat)
print(df)
id text \
0 1 The farmer plants grain
1 2 The fisher catches tuna
2 3 The police officer fights crime
grams
0 [plants grain, farmer plants, the farmer]
1 [catches tuna, fisher catches, the fisher]
2 [fights crime, officer fights, police officer,...
Personally, this feels like it makes more sense, because you fit the CountVectorizer to the entire corpus rather than to one document at a time. You can still extract the same information (the frequencies and the grams), and it will be much faster as you scale up the number of documents.
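(If you still want the per-id frequency table the question's loop was building, it can be recovered from this single corpus-wide fit. A sketch; the long-format layout is my own choice, not from the original answer:)

import pandas as pd

# Long-format (id, gram, frequency) table from the document-term matrix.
grams = cv.get_feature_names()
freq = pd.DataFrame(dt_mat.toarray(), columns=grams, index=df['id'])
final = (freq.stack()
             .rename_axis(['id', 'gram'])
             .reset_index(name='frequency'))
final = final[final['frequency'] > 0]   # keep only the grams each document contains
print(final)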