I have a text column in a pandas DataFrame:
df['TEXT_COL']
Then I apply CountVectorizer to it:
vectorizer = CountVectorizer()
v = vectorizer.fit_transform(df['TEXT_COL'])
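I get the set of words/features:
ft = list(vectorizer.get_feature_names_out())  # on scikit-learn < 1.0, use vectorizer.get_feature_names()
and the term-document matrix (TDM):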
m = vectorizer.transform(df['TEXT_COL'])
What I need: a slice of df that contains only the rows having a particular feature from the feature set ft.
How can I get it?
Setup:
import pandas as pd
data = [('Word'), ('Word Sea Ocean'), ('Tree'), ('Forest Tree')]
df = pd.DataFrame(data)
df.columns = ['TEXT_COL']
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer()
v = vectorizer.fit_transform(df['TEXT_COL'])
ft = list(vectorizer.get_feature_names_out())  # on scikit-learn < 1.0, use vectorizer.get_feature_names()
m = vectorizer.transform(df['TEXT_COL'])
for f in ft:
    ???  # how do I select the rows of df that contain feature f?
Answer 0 (score: 1)
Here is a small demo:
# execute your setup script ...
In [48]: vectorizer.vocabulary_
Out[48]: {'forest': 0, 'ocean': 1, 'sea': 2, 'tree': 3, 'word': 4}
m is a sparse matrix:
In [49]: m
Out[49]:
<4x5 sparse matrix of type '<class 'numpy.int64'>'
with 7 stored elements in Compressed Sparse Row format>
We can convert it to a regular NumPy array:
In [50]: m.toarray()
Out[50]:
array([[0, 0, 0, 0, 1],
[0, 1, 1, 0, 1],
[0, 0, 0, 1, 0],
[1, 0, 0, 1, 0]], dtype=int64)
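For a small corpus it can help to label the dense array with the feature names. The following is a minimal sketch, assuming the same four-row data as the setup; the names `features` and `tdm` are illustrative and not part of the original answer. Densifying is only reasonable for small matrices.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({'TEXT_COL': ['Word', 'Word Sea Ocean', 'Tree', 'Forest Tree']})
vectorizer = CountVectorizer()
m = vectorizer.fit_transform(df['TEXT_COL'])

# Order the feature names by their column index in the fitted vocabulary.
features = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Dense, labeled view of the term-document matrix (small data only).
tdm = pd.DataFrame(m.toarray(), index=df.index, columns=features)
print(tdm)
```

Each row of `tdm` lines up with the same row of `df`, so a column such as `tdm['sea']` can be read as the per-row count of that word.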
To select the column for a specific feature:
In [51]: m[:, vectorizer.vocabulary_['sea']].toarray()
Out[51]:
array([[0],
[1],
[0],
[0]], dtype=int64)
Or using ft:
In [57]: m[:, ft.index('sea')].toarray()
Out[57]:
array([[0],
[1],
[0],
[0]], dtype=int64)
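The column slice above has shape (n_rows, 1). For row selection it is usually more convenient to flatten it into a 1-D boolean mask. A minimal sketch, assuming the same data as the setup; `counts` and `mask` are illustrative names:

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({'TEXT_COL': ['Word', 'Word Sea Ocean', 'Tree', 'Forest Tree']})
vectorizer = CountVectorizer()
m = vectorizer.fit_transform(df['TEXT_COL'])

col = vectorizer.vocabulary_['sea']

# Per-row counts of the word 'sea', flattened to one dimension.
counts = m[:, col].toarray().ravel()

# True wherever the word occurs at least once in the row.
mask = counts > 0
print(counts)
print(mask)
```

Comparing with `> 0` rather than `== 1` keeps rows where the word appears more than once, since CountVectorizer stores counts, not presence flags.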
In [52]: df
Out[52]:
TEXT_COL
0 Word
1 Word Sea Ocean
2 Tree
3 Forest Tree
Let's show all rows that contain the feature 'tree':
In [71]: idx = m[:, ft.index('tree')] == 1
In [72]: df[idx.toarray().ravel()]  # flatten the (n, 1) mask to 1-D for row selection
Out[72]:
TEXT_COL
2 Tree
3 Forest Tree
Or like this, which also matches rows where the word occurs more than once:
In [77]: df[m[:, ft.index('tree')].astype(bool).toarray().ravel()]
Out[77]:
TEXT_COL
2 Tree
3 Forest Tree
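To cover the loop from the question (one slice per feature), the same mask can be built for every column. This is a sketch under the assumption that a dict of DataFrames keyed by feature is an acceptable result; `slices` and `m_csc` are illustrative names, not part of the original answer.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({'TEXT_COL': ['Word', 'Word Sea Ocean', 'Tree', 'Forest Tree']})
vectorizer = CountVectorizer()
m = vectorizer.fit_transform(df['TEXT_COL'])

# Column-oriented (CSC) storage makes repeated column slicing cheaper than CSR.
m_csc = m.tocsc()

slices = {}
for feature, col in vectorizer.vocabulary_.items():
    # 1-D boolean mask: True for rows where this feature has a nonzero count.
    mask = m_csc[:, col].toarray().ravel() > 0
    slices[feature] = df[mask]

print(slices['tree'])
```

Iterating over `vectorizer.vocabulary_` gives both the feature name and its column index, so no lookup with `ft.index(...)` is needed inside the loop.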