Question

I use gensim's LdaModel to extract topics from customer reviews, as follows:

import gensim
from gensim import corpora

# clean_reviews is a list of tokenized reviews (one list of tokens per review)
dictionary = corpora.Dictionary(clean_reviews)
dictionary.filter_extremes(keep_n=11000)  # change filters
dictionary.compactify()
dictionary_path = "dictionary.dict"
dictionary.save(dictionary_path)

# convert tokenized documents to vectors
corpus = [dictionary.doc2bow(doc) for doc in clean_reviews]
# vocab = lda.datasets.load_reuters_vocab()  # needs the separate `lda` package; unused below

# Train LDA with num_topics = 20 (which can be changed)
lda = gensim.models.LdaModel(corpus, id2word=dictionary,
                        num_topics=20,
                        passes=20,
                        random_state=1,
                        alpha="auto")

This returns unigrams in the topics:

topic1 - delivery, parcel, location

topic2 - app, login, access

But I am looking for n-grams. I came across sklearn's LatentDirichletAllocation, which is used with a TfidfVectorizer, where we can specify the n-gram range in the vectorizer. Is it possible to do the same in the gensim LDA model?

Sorry, I am new to all of these models, so I don't know much about them.

Answer 1

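I know this is an old thread, but I thought I'd share what I did to get k-grams in my topics. I wanted to include bigrams, trigrams, and four-grams in the vocabulary. To do that, I used gensim's Phrases class before running the LDA model. Here is a very good resource: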

https://www.machinelearningplus.com/nlp/topic-modeling-gensim-python/#15visualizethetopicskeywords

I did something similar. Hope this helps!
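To make the idea concrete, here is a minimal pure-Python sketch of what the phrase step does to the tokenized reviews before they reach the dictionary. It is not gensim's implementation: Phrases scores candidate pairs with a collocation statistic, whereas this stand-in uses a plain count threshold. The function name merge_frequent_bigrams, its min_count parameter, and the sample reviews are all made up for illustration.

```python
from collections import Counter


def merge_frequent_bigrams(docs, min_count=2, delimiter="_"):
    """Join adjacent token pairs that occur at least min_count times in docs.

    Simplified, count-based stand-in for phrase detection (an assumption for
    illustration, not gensim's scoring function).
    """
    # Count every adjacent pair of tokens across all documents
    pair_counts = Counter(pair for doc in docs for pair in zip(doc, doc[1:]))
    frequent = {pair for pair, n in pair_counts.items() if n >= min_count}

    merged_docs = []
    for doc in docs:
        merged, i = [], 0
        while i < len(doc):
            if i + 1 < len(doc) and (doc[i], doc[i + 1]) in frequent:
                merged.append(doc[i] + delimiter + doc[i + 1])
                i += 2  # skip the second token of the merged pair
            else:
                merged.append(doc[i])
                i += 1
        merged_docs.append(merged)
    return merged_docs


# Hypothetical tokenized reviews standing in for clean_reviews
clean_reviews = [
    ["parcel", "delivery", "was", "late"],
    ["late", "parcel", "delivery", "again"],
    ["app", "login", "failed"],
    ["app", "login", "works"],
]

bigram_reviews = merge_frequent_bigrams(clean_reviews, min_count=2)
# A second pass can join a merged bigram with a neighbour (trigrams and longer)
trigram_reviews = merge_frequent_bigrams(bigram_reviews, min_count=2)

print(bigram_reviews[0])  # ['parcel_delivery', 'was', 'late']
```

The merged token lists can then be passed to corpora.Dictionary and doc2bow exactly as the unigram lists were in the question, so tokens such as parcel_delivery become vocabulary entries and can show up in the topics.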

How to implement Latent Dirichlet Allocation to give bigrams/trigrams in topics instead of unigrams
