Using scikit-learn vectorizers and vocabularies with gensim

Asked: 2014-02-04 12:25:16

Tags: python scikit-learn topic-modeling gensim

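I am trying to recycle scikit-learn vectorizer objects with gensim topic models. The reasons are simple: first, I already have a great deal of vectorized data; second, I prefer the interface and flexibility of scikit-learn vectorizers; third, even though topic modelling with gensim is very fast, computing its dictionaries (Dictionary()) is relatively slow in my experience.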

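Similar questions have been asked before, especially here and here, and the bridging solution is gensim's Sparse2Corpus() function, which transforms a Scipy sparse matrix into a gensim corpus object.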

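However, this conversion does not make use of the vocabulary_ attribute of sklearn vectorizers, which holds the mapping between words and feature ids. This mapping is necessary in order to print the discriminant words for each topic (id2word in gensim topic models, described as "a mapping from word ids (integers) to words (strings)").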

I am aware that gensim's Dictionary objects are much more complex (and slower to compute) than scikit's vect.vocabulary_ (a simple Python dict)...

Any ideas on how to use vect.vocabulary_ as id2word in gensim models?

Some example code:

# our data
documents = [u'Human machine interface for lab abc computer applications',
        u'A survey of user opinion of computer system response time',
        u'The EPS user interface management system',
        u'System and human system engineering testing of EPS',
        u'Relation of user perceived response time to error measurement',
        u'The generation of random binary unordered trees',
        u'The intersection graph of paths in trees',
        u'Graph minors IV Widths of trees and well quasi ordering',
        u'Graph minors A survey']

from sklearn.feature_extraction.text import CountVectorizer
# compute vector space with sklearn
vect = CountVectorizer(min_df=1, ngram_range=(1, 1), max_features=25000)
corpus_vect = vect.fit_transform(documents)
# corpus_vect is a scipy sparse matrix with one row per document
print(vect.vocabulary_)
#{u'and': 1, u'minors': 20, u'generation': 9, u'testing': 32, u'iv': 15, u'engineering': 5, u'computer': 4, u'relation': 28, u'human': 11, u'measurement': 19, u'unordered': 37, u'binary': 3, u'abc': 0, u'for': 8, u'ordering': 23, u'graph': 10, u'system': 31, u'machine': 17, u'to': 35, u'quasi': 26, u'time': 34, u'random': 27, u'paths': 24, u'of': 21, u'trees': 36, u'applications': 2, u'management': 18, u'lab': 16, u'interface': 13, u'intersection': 14, u'response': 29, u'perceived': 25, u'in': 12, u'widths': 40, u'well': 39, u'eps': 6, u'survey': 30, u'error': 7, u'opinion': 22, u'the': 33, u'user': 38}

import gensim
# transform sparse matrix into gensim corpus
corpus_vect_gensim = gensim.matutils.Sparse2Corpus(corpus_vect, documents_columns=False)
lsi = gensim.models.LsiModel(corpus_vect_gensim, num_topics=4)
# I instead would like something like this line below
# lsi = gensim.models.LsiModel(corpus_vect_gensim, id2word=vect.vocabulary_, num_topics=2)
print(lsi.print_topics(2))
#['0.622*"21" + 0.359*"31" + 0.256*"38" + 0.206*"29" + 0.206*"34" + 0.197*"36" + 0.170*"33" + 0.168*"1" + 0.158*"10" + 0.147*"4"', '0.399*"36" + 0.364*"10" + -0.295*"31" + 0.245*"20" + -0.226*"38" + 0.194*"26" + 0.194*"15" + 0.194*"39" + 0.194*"23" + 0.194*"40"']
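To make the goal concrete: the quoted numbers in the topic strings above are feature ids, and what I would like to see are the corresponding words. A rough manual decoding might look like the sketch below. It assumes only a handful of entries copied from the vocabulary printed earlier, and `decode_topic` is a hypothetical helper written for this illustration, not part of gensim or scikit-learn.

```python
import re

# A few entries copied from the vect.vocabulary_ output above (word -> feature id).
vocabulary = {"of": 21, "system": 31, "user": 38, "response": 29, "time": 34}

# Invert the mapping so that it goes feature id -> word.
id2word = {idx: word for word, idx in vocabulary.items()}


def decode_topic(topic, id2word):
    """Replace each quoted feature id in a topic string with its word, if known."""
    def swap(match):
        idx = int(match.group(1))
        # Unknown ids are left as they are.
        return '"%s"' % id2word.get(idx, match.group(1))
    return re.sub(r'"(\d+)"', swap, topic)


topic = '0.622*"21" + 0.359*"31" + 0.256*"38" + 0.206*"29" + 0.206*"34"'
print(decode_topic(topic, id2word))
# 0.622*"of" + 0.359*"system" + 0.256*"user" + 0.206*"response" + 0.206*"time"
```

This only post-processes the printed strings; it does not make the model itself aware of the words.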

5 answers:

Answer 0 (score: 12)

Gensim doesn't require Dictionary objects. You can pass a plain dict as id2word directly, as long as it maps ids (integers) to words (strings).

In fact, anything that behaves like a dict will do (including dict, Dictionary, SqliteDict...).

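(By the way, gensim's Dictionary is a simple Python dict underneath. I'm not sure where your remarks on Dictionary performance come from; you can't get a mapping much faster than a plain dict in Python. Maybe you are confusing it with text preprocessing (not part of gensim), which can indeed be slow.)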
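The direction of the mapping is the part that trips people up: vect.vocabulary_ goes from word to id, while id2word must go from id to word. A small sanity check, as a sketch only: `looks_like_id2word` is a hypothetical helper made up for this illustration (it is not a gensim function), and the three-word vocabulary is a toy stand-in for vect.vocabulary_.

```python
from numbers import Integral


def looks_like_id2word(mapping):
    """Return True if every key is an integer and every value is a string."""
    return all(
        isinstance(key, Integral) and isinstance(value, str)
        for key, value in mapping.items()
    )


# Toy stand-in with the same shape as vect.vocabulary_ (word -> feature id).
vocabulary = {"abc": 0, "and": 1, "applications": 2}

# Swapping keys and values gives the id -> word direction.
id2word = {idx: word for word, idx in vocabulary.items()}

print(looks_like_id2word(vocabulary))  # False: keys are words, not ids
print(looks_like_id2word(id2word))     # True
```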

Answer 1 (score: 6)

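To provide a final example: the output of scikit-learn's vectorizer can be transformed into gensim's corpus format with Sparse2Corpus, while the vocabulary dict can be recycled by simply swapping its keys and values: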

# transform sparse matrix into gensim corpus
corpus_vect_gensim = gensim.matutils.Sparse2Corpus(corpus_vect, documents_columns=False)

# transform scikit vocabulary into gensim dictionary
vocabulary_gensim = {}
for key, val in vect.vocabulary_.items():
    vocabulary_gensim[val] = key

Answer 2 (score: 4)

I am also running some code experiments with these two libraries. Apparently there is a way to construct the dictionary from a corpus:

from gensim.corpora.dictionary import Dictionary
dictionary = Dictionary.from_corpus(corpus_vect_gensim,
                                    id2word=dict((id, word) for word, id in vect.vocabulary_.items()))

You can then use this dictionary for tfidf, LSI or LDA models.

Answer 3 (score: 2)

Posting this as an answer because I don't have 50 reputation yet.

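Using vect.vocabulary_ directly (with keys and values swapped) will not work on Python 3, because dict.keys() now returns an iterable view rather than a list. The relevant error is: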

TypeError: can only concatenate list (not "dict_keys") to list

To make this work on Python 3, change line 301 in lsimodel.py to

self.num_terms = 1 + max([-1] + list(self.id2word.keys()))
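The failure can be reproduced without gensim. The sketch below assumes Python 3 and a toy `id2word` dict; `num_terms` here is just a local variable mirroring the expression in the line above.

```python
# Toy id -> word mapping standing in for the swapped vect.vocabulary_.
id2word = {0: "abc", 1: "and", 2: "applications"}

# On Python 3, dict.keys() returns a view, so list + view raises TypeError.
try:
    num_terms = 1 + max([-1] + id2word.keys())
except TypeError as err:
    print("fails on Python 3:", err)

# Wrapping the view in list() works on both Python 2 and Python 3.
num_terms = 1 + max([-1] + list(id2word.keys()))
print(num_terms)  # 3
```

The leading [-1] keeps the expression valid for an empty mapping, where it evaluates to 0.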

Hope this helps.

Answer 4 (score: 1)

A solution that works with Python 3 code.

import gensim
from gensim.corpora.dictionary import Dictionary
from sklearn.feature_extraction.text import CountVectorizer

def vect2gensim(vectorizer, dtmatrix):
    # transform sparse matrix into gensim corpus and dictionary
    corpus_vect_gensim = gensim.matutils.Sparse2Corpus(dtmatrix, documents_columns=False)
    dictionary = Dictionary.from_corpus(corpus_vect_gensim,
        id2word=dict((id, word) for word, id in vectorizer.vocabulary_.items()))

    return (corpus_vect_gensim, dictionary)

documents = [u'Human machine interface for lab abc computer applications',
        u'A survey of user opinion of computer system response time',
        u'The EPS user interface management system',
        u'System and human system engineering testing of EPS',
        u'Relation of user perceived response time to error measurement',
        u'The generation of random binary unordered trees',
        u'The intersection graph of paths in trees',
        u'Graph minors IV Widths of trees and well quasi ordering',
        u'Graph minors A survey']


# compute vector space with sklearn
vect = CountVectorizer(min_df=1, ngram_range=(1, 1), max_features=25000)
corpus_vect = vect.fit_transform(documents)

# transport to gensim
(gensim_corpus, gensim_dict) = vect2gensim(vect, corpus_vect)