What should the input corpus for gensim LDA look like?

Time: 2018-12-28 03:31:52

Tags: python-3.x gensim lda topic-modeling

I tried feeding two different forms of input corpus into a gensim LDA model. My documents are:

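# Assumed stop-word list; the original post does not show how stop_words was defined.
stop_words = {"is", "a"}

documents = ["Apple is releasing a new product",
             "Amazon sells many things",
             "Microsoft announces Nokia acquisition"]
# One token list per document, lowercased, with stop words removed.
texts = [[word for word in document.lower().split() if word not in stop_words]
         for document in documents]
# Alternative form: every token becomes its own one-word "document".
texts1 = [[word] for text in texts for word in text]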

I then used gensim to turn them into corpora (`corpus` from `texts`, `corpus1` from `texts1`):

corpus = [[(0, 1), (1, 1), (2, 1), (3, 1)], [(4, 1), (5, 1), (6, 1), (7, 1)], [(8, 1), (9, 1), (10, 1), (11, 1)]]
corpus1 = [[(0, 1)], [(1, 1)], [(2, 1)], [(3, 1)], [(4, 1)], [(5, 1)], [(6, 1)], [(7, 1)], [(8, 1)], [(9, 1)], [(10, 1)], [(11, 1)]]

Does it make a big difference which of these two forms I feed into the LDA model?

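When I tried both, the difference was in the probability distribution of the words within each topic: the probabilities from corpus1 are much smaller than those from corpus.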

I also tried LDA on larger documents, and corpus1 always gives me extremely low probabilities, e.g. 0.0001.

Is there a better way to feed a corpus into an LDA model?

0 answers:

No answers