Question

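Using sklearn, I created a BOW with 200 features in Python, and the features are easy to extract. But how do I reverse it? That is, how do I go from a vector of 200 0s or 1s to the corresponding words? Since the vocabulary is a dictionary, it has no order, so I am not sure which word each element of the feature list corresponds to. Also, if the first element of my 200-dimensional vector corresponds to the first word in the dictionary, how do I extract a word from the dictionary by index?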

The BOW is created like this:

vec = CountVectorizer(stop_words = sw, strip_accents="unicode", analyzer = "word", max_features = 200)
features = vec.fit_transform(data.loc[:,"description"]).todense()

So "features" is an (n, 200) matrix, where n is the number of sentences.

Answer 1

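I'm not sure exactly what you're after, but it seems you just want to figure out which column represents which word. For that, there is the handy get_feature_names method (renamed get_feature_names_out in scikit-learn 1.0, with the old name removed in 1.2).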

Let's look at the example corpus provided in the docs:

import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer

corpus = [
     'This is the first document.',
     'This document is the second document.',
     'And this is the third one.',
     'Is this the first document?' ]

# Put into a dataframe
data = pd.DataFrame(corpus,columns=['description'])
# Take a look:
>>> data
                             description
0            This is the first document.
1  This document is the second document.
2             And this is the third one.
3            Is this the first document?

# Initialize CountVectorizer (you can put in your arguments, but for the sake of example, I'm keeping it simple):
vec = CountVectorizer()

# Fit it as you had before:
features = vec.fit_transform(data.loc[:,"description"]).todense()

>>> features
matrix([[0, 1, 1, 1, 0, 0, 1, 0, 1],
        [0, 2, 0, 1, 0, 1, 1, 0, 1],
        [1, 0, 0, 1, 1, 0, 1, 1, 1],
        [0, 1, 1, 1, 0, 0, 1, 0, 1]], dtype=int64)

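To see which column represents which word, use get_feature_names_out:

# On scikit-learn versions before 1.0, call vec.get_feature_names() instead
>>> vec.get_feature_names_out()
array(['and', 'document', 'first', 'is', 'one', 'second', 'the', 'third',
       'this'], dtype=object)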

So your first column is and, the second is document, and so on. For readability, you can put it in a dataframe:

>>> pd.DataFrame(features, columns = vec.get_feature_names_out())
   and  document  first  is  one  second  the  third  this
0    0         1      1   1    0       0    1      0     1
1    0         2      0   1    0       1    1      0     1
2    1         0      0   1    1       0    1      1     1
3    0         1      1   1    0       0    1      0     1
