How can I vectorize and de-vectorize text with sklearn's CountVectorizer?

Time: 2017-01-14 12:26:25

Tags: python scikit-learn sklearn-pandas

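I want to vectorize some text into corresponding integers, convert the text into its mapped integers, and then build a new sentence from a new sequence of input integers such as [2,9,39,46,56,12,89,9].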

I have seen custom functions that can do this, but I would like to know whether sklearn itself provides this functionality.

from sklearn.feature_extraction.text import CountVectorizer

a=["""Lorem ipsum dolor sit amet, consectetur adipiscing elit. 
Morbi imperdiet mauris posuere, condimentum odio et, volutpat orci.
Curabitur sodales vulputate eros eu gravida. Sed pharetra imperdiet nunc et tempor.
Nullam lectus est, rhoncus vitae lacus at, fermentum aliquam metus.
Phasellus a sollicitudin tortor, non tempor nulla.
Etiam mattis felis enim, a malesuada ligula dignissim at.
Integer congue dolor ut magna blandit, lobortis consequat ante aliquam.
Nulla imperdiet libero eget lorem sagittis, eget iaculis orci dignissim. 
Phasellus sit amet sodales odio. Pellentesque commodo tempor risus, et tincidunt neque. 
Praesent et sem velit. Maecenas id risus sit amet ex convallis ultrices vel sed purus. 
Sed fringilla, leo quis congue sollicitudin, mauris nunc vehicula mi, et laoreet ligula 
urna et nulla. Nam sollicitudin urna sed dolor vehicula euismod. Mauris bibendum pulvinar
ornare. In suscipit sed mi ut posuere.
Proin egestas, nibh ut egestas mattis, ipsum nulla bibendum enim, ac suscipit nisl justo 
id metus. Nam est dui, elementum eget suscipit nec, aliquam in mi. Integer tortor erat,
aliquet at sapien et, fringilla posuere leo. Praesent non congue est. Vivamus tincidunt
tellus eu placerat tincidunt. Phasellus convallis lacus vitae ex congue efficitur.
Sed ut bibendum massa, vitae molestie ligula. Phasellus purus felis, fermentum vitae 
hendrerit vel, vulputate quis metus."""]


vec = CountVectorizer()
dtm = vec.fit_transform(a)
print(vec.vocabulary_)

# convert text to corresponding vectors
mapped_a = ...  # placeholder: this is the part being asked about

# new sentence using the mapped values below
# input [2,9,39,46,56,12,89,9]
# creating sentence using specific sequence

new_sentence = ...  # placeholder: this is the part being asked about

2 answers:

Answer 0 (score: 5)

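To vectorize a sentence into integers, you can use the transform function. Its output is a vector holding the count of each term, i.e. the feature vector.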

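from sklearn.feature_extraction.text import CountVectorizer

# `a` is the list of documents defined in the question
vec = CountVectorizer()
vec.fit(a)
print(vec.vocabulary_)

new_sentence = "dolor nulla enim"
mapped_a = vec.transform([new_sentence])
print(mapped_a.toarray())  # dense view of the sparse feature vector

tokenizer = vec.build_tokenizer()
# ids of the words, in sentence order (the tokenizer itself does not lowercase)
for token in tokenizer(new_sentence):
    print(vec.vocabulary_.get(token))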
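
As a self-contained variant of the lookup above, the sketch below wraps it in a small helper. The one-sentence corpus and the `encode` function are assumptions made for illustration and are not part of the original answer. It uses `build_analyzer()` rather than `build_tokenizer()` so that input is lowercased the same way as during `fit`.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Small stand-in corpus (assumption; the question uses a longer Lorem ipsum text).
corpus = ["Lorem ipsum dolor sit amet, consectetur adipiscing elit."]

vec = CountVectorizer()
vec.fit(corpus)

# The analyzer applies the vectorizer's preprocessing (lowercasing) and tokenization.
analyzer = vec.build_analyzer()


def encode(sentence):
    """Hypothetical helper: vocabulary ids in sentence order, unknown words skipped."""
    return [vec.vocabulary_[tok] for tok in analyzer(sentence) if tok in vec.vocabulary_]


ids = encode("Dolor sit dolor unknownword")
print(ids)
```

Unlike `transform`, which returns term counts, this keeps word order and repeated words, which is what the question needs for the later reconstruction step.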

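The second part of the question is less straightforward. CountVectorizer has an inverse_transform function for this purpose, which takes a sparse feature vector as input. In your example, however, you want to build a sentence in which the same term can occur more than once, and that cannot be done with this function, because it only reports which terms are present.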

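A solution, however, is to take the vocabulary (word to id) and build an inverse vocabulary (id to word) from it. CountVectorizer has no inverse_vocabulary attribute by default; you have to create it from vocabulary_.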

import numpy as np

ids = [2, 9, 9]  # ids to turn back into words

# 1. inverse_transform function
# create an indicator vector; inverse_transform expects a 2D input (one row per document)
sparse_input = np.array([[1 if i in ids else 0 for i in range(len(vec.vocabulary_))]])
print(vec.inverse_transform(sparse_input))
# > the terms 'aliquam' and 'commodo'; the repeated id 9 is lost


# 2. Inverse vocabulary - custom solution
terms = np.array(list(vec.vocabulary_.keys()))
indices = np.array(list(vec.vocabulary_.values()))
inverse_vocabulary = terms[np.argsort(indices)]

for i in ids:
    print(inverse_vocabulary[i])
# > aliquam, commodo, commodo (one per line)

Answer 1 (score: -1)

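Take a look at sklearn's preprocessing module: LabelEncoder and OneHotEncoder are commonly used to encode categorical variables. Encoding an entire text this way is not recommended, though!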
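
For reference, a minimal sketch of what LabelEncoder does when applied to a list of tokens. The whitespace split and the sample sentence are assumptions made for illustration, and the caveat above about encoding whole texts still applies.

```python
from sklearn.preprocessing import LabelEncoder

# Naive whitespace tokenization (assumption; no lowercasing or punctuation handling).
tokens = "lorem ipsum dolor sit amet dolor".split()

encoder = LabelEncoder()
ids = encoder.fit_transform(tokens)        # one integer per token, order preserved
restored = encoder.inverse_transform(ids)  # maps each id back to its token

print(list(ids))
print(" ".join(restored))
```

Note that `transform` raises an error for tokens not seen during `fit`, which is one practical reason this approach fits free text poorly.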