Pandas + CountVectorizer: how to quickly filter rows

Asked: 2017-03-21 15:48:15

Tags: python pandas scikit-learn

I have a text column in a Pandas DataFrame:

df['TEXT_COL']

Then I apply CountVectorizer to it:

vectorizer = CountVectorizer()
v = vectorizer.fit_transform(df['TEXT_COL'])

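I get the set of words/features (the names come from the vectorizer, not from the matrix it returns):

ft = vectorizer.get_feature_names()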

and the term-document matrix (TDM):

m = vectorizer.transform(df['TEXT_COL'])

What I need: a slice of df that contains only the rows in which a specific feature from the feature set ft occurs.

How can I get it?

Pandas setup:

import pandas as pd

data = ['Word', 'Word Sea Ocean', 'Tree', 'Forest Tree']

df = pd.DataFrame(data)
df.columns = ['TEXT_COL']

from sklearn.feature_extraction.text import CountVectorizer

vectorizer = CountVectorizer()
v = vectorizer.fit_transform(df['TEXT_COL'])

ft = vectorizer.get_feature_names()
m = vectorizer.transform(df['TEXT_COL'])

for f in ft:
    ???

1 Answer:

Answer 0 (score: 1)

Here is a small demo:

# execute your setup script ...

In [48]: vectorizer.vocabulary_
Out[48]: {'forest': 0, 'ocean': 1, 'sea': 2, 'tree': 3, 'word': 4}

m is a sparse matrix:

In [49]: m
Out[49]:
<4x5 sparse matrix of type '<class 'numpy.int64'>'
        with 7 stored elements in Compressed Sparse Row format>

We can convert it to a regular NumPy array:

In [50]: m.toarray()
Out[50]:
array([[0, 0, 0, 0, 1],
       [0, 1, 1, 0, 1],
       [0, 0, 0, 1, 0],
       [1, 0, 0, 1, 0]], dtype=int64)
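
For a vocabulary as small as this one, it can help to label the dense array with the feature names before filtering on it. The following is a minimal sketch rather than part of the original answer: the names `terms` and `dtm` are introduced here for illustration, and the setup is repeated so that the block runs on its own.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({"TEXT_COL": ["Word", "Word Sea Ocean", "Tree", "Forest Tree"]})
vectorizer = CountVectorizer()
m = vectorizer.fit_transform(df["TEXT_COL"])

# Order the terms by their column index so the labels line up with the matrix columns.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)

# Dense, labeled copy of the matrix (illustrative name: dtm).
# Only sensible when the vocabulary is small enough to hold in memory.
dtm = pd.DataFrame(m.toarray(), columns=terms, index=df.index)
print(dtm)
```

With a real corpus the dense copy may not fit in memory, so this is mainly useful for inspecting small examples like the one in the question.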

To select the column for a specific feature:

In [51]: m[:, vectorizer.vocabulary_['sea']].toarray()
Out[51]:
array([[0],
       [1],
       [0],
       [0]], dtype=int64)

Or using ft:

In [57]: m[:, ft.index('sea')].toarray()
Out[57]:
array([[0],
       [1],
       [0],
       [0]], dtype=int64)
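
A note on versions: `get_feature_names()` was removed in scikit-learn 1.2, and its replacement `get_feature_names_out()` returns a NumPy array, which has no `.index()` method. Looking the column up through `vocabulary_` sidesteps that difference. The sketch below repeats the setup so it is self-contained; `col` and `sea_counts` are illustrative names, not from the original answer.

```python
from sklearn.feature_extraction.text import CountVectorizer

docs = ["Word", "Word Sea Ocean", "Tree", "Forest Tree"]
vectorizer = CountVectorizer()
m = vectorizer.fit_transform(docs)

# vocabulary_ maps each (lowercased) term to its column index in the matrix.
col = vectorizer.vocabulary_["sea"]

# Slice one column, densify just that column, and flatten it to 1-D.
sea_counts = m[:, col].toarray().ravel()
print(col, sea_counts)
```

Only the selected column is converted to a dense array here, so the full matrix stays sparse.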

In [52]: df
Out[52]:
         TEXT_COL
0            Word
1  Word Sea Ocean
2            Tree
3     Forest Tree

Let's display all rows that contain the feature 'tree':

In [71]: idx = m[:, ft.index('tree')] == 1

In [72]: df[idx.toarray()]
Out[72]:
      TEXT_COL
2         Tree
3  Forest Tree

Or like this, which also matches rows where the word occurs more than once:

In [77]: df[m[:, ft.index('tree')].astype(bool).toarray()]
Out[77]:
      TEXT_COL
2         Tree
3  Forest Tree
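
To cover the `for f in ft:` loop from the question, the same idea can be wrapped in a small helper and applied to every feature. This is a sketch under a few assumptions: `rows_with_feature` and `slices` are hypothetical names introduced here, the mask is flattened to one dimension so that it does not depend on how pandas treats a two-dimensional boolean indexer, and `> 0` is used instead of `== 1` so that rows with repeated occurrences of a word are kept.

```python
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

df = pd.DataFrame({"TEXT_COL": ["Word", "Word Sea Ocean", "Tree", "Forest Tree"]})
vectorizer = CountVectorizer()
m = vectorizer.fit_transform(df["TEXT_COL"])


def rows_with_feature(frame, matrix, vocabulary, feature):
    """Return the rows of `frame` whose document contains `feature` at least once."""
    col = vocabulary[feature]                      # column index of the term
    mask = matrix[:, col].toarray().ravel() > 0    # 1-D boolean mask, one entry per row
    return frame[mask]


# One DataFrame slice per feature in the vocabulary.
slices = {
    feature: rows_with_feature(df, m, vectorizer.vocabulary_, feature)
    for feature in vectorizer.vocabulary_
}

print(slices["tree"])
```

Each lookup densifies a single column only. For a very large vocabulary it may be worth converting the matrix to CSC format once (`m.tocsc()`) before looping, since column slicing is cheaper in that layout.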