I am trying to train an LDA model with the gensim library, using the following function:
import time

from gensim import corpora
from gensim.models import LdaModel


def train_lda(data):
    """Train the LDA model.

    We set up parameters such as the number of topics and the chunksize used by Hoffman's method.
    We also make multiple passes over the data (passes=20 below) since this is a small dataset,
    so we want the distributions to stabilize.
    """
    num_topics = 5
    chunksize = 20
    dictionary = corpora.Dictionary(data['tokenized'])
    corpus = [dictionary.doc2bow(doc) for doc in data['tokenized']]
    t1 = time.time()
    # low alpha means each document is only represented by a small number of topics, and vice versa
    # low eta means each topic is only represented by a small number of words, and vice versa: alpha=1e-2, eta=0.5e-2 (original values)
    lda = LdaModel(corpus=corpus, num_topics=num_topics, id2word=dictionary,
                   alpha=1e-5, eta='auto', chunksize=chunksize, minimum_probability=0.0, passes=20, iterations=500)
    t2 = time.time()
    # report the number of articles passed in and the training time in minutes
    print("Time to train LDA model on ", len(data), "articles: ", (t2-t1)/60, "min")
    return dictionary, corpus, lda
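The code runs without any exceptions. Because of my use case, I have fixed num_topics at 5. But I am seeing something strange; for example, the topics for one of the documents look like this: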
2018-09-06 21:53:33,435 : INFO : topic #0 (0.000): 0.399*"person" + 0.200*"address" + 0.009*"joshua" + 0.008*"braiden" + 0.008*"jessica" + 0.007*"wilson" + 0.007*"jake" + 0.007*"cameron" + 0.006*"fullwood" + 0.005*"bethani"
2018-09-06 21:53:33,436 : INFO : topic #1 (0.000): 0.400*"person" + 0.200*"address" + 0.010*"matthew" + 0.008*"mitchel" + 0.007*"emiili" + 0.006*"ami" + 0.005*"jacob" + 0.005*"cooper" + 0.005*"samuel" + 0.005*"bartlett"
2018-09-06 21:53:33,436 : INFO : topic #2 (0.000): 0.399*"person" + 0.200*"address" + 0.014*"dylan" + 0.011*"jone" + 0.008*"harrison" + 0.007*"ali" + 0.007*"jed" + 0.006*"ethan" + 0.006*"nguyen" + 0.006*"kazuki"
2018-09-06 21:53:33,436 : INFO : topic #3 (0.000): 0.401*"person" + 0.201*"address" + 0.014*"sophi" + 0.009*"nichola" + 0.006*"jack" + 0.005*"ella" + 0.005*"piri" + 0.005*"gregov" + 0.005*"preyser" + 0.005*"mclellan"
2018-09-06 21:53:33,437 : INFO : topic #4 (0.000): 0.405*"person" + 0.203*"address" + 0.011*"thoma" + 0.010*"jame" + 0.007*"harri" + 0.007*"alexand" + 0.006*"william" + 0.006*"lachlan" + 0.006*"benjamin" + 0.005*"nathan"
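As you can see, "person" and "address" are the top contributors in every topic (and it is similar for all other documents). As a result, when I look for the top 5 most similar documents for an unseen document, I get wrong results.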
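To make the similarity lookup concrete, here is a minimal sketch of one common way to rank documents by their topic distributions, assuming Jensen-Shannon divergence as the distance. The helper names `jensen_shannon` and `most_similar` and the toy distributions are hypothetical, not taken from my actual code. With gensim, the distribution for an unseen document would typically come from `lda[dictionary.doc2bow(tokens)]`.

```python
import math


def jensen_shannon(p, q):
    """Jensen-Shannon divergence (natural log) between two discrete distributions."""
    m = [(a + b) / 2 for a, b in zip(p, q)]

    def kl(x, y):
        # Terms where x_i == 0 contribute nothing to the KL sum.
        return sum(a * math.log(a / b) for a, b in zip(x, y) if a > 0)

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)


def most_similar(query, doc_topics, k=5):
    """Return indices of the k documents whose topic mix is closest to `query`."""
    ranked = sorted(range(len(doc_topics)),
                    key=lambda i: jensen_shannon(query, doc_topics[i]))
    return ranked[:k]


# Hypothetical topic distributions over 5 topics (each row sums to 1).
doc_topics = [
    [0.90, 0.025, 0.025, 0.025, 0.025],
    [0.025, 0.90, 0.025, 0.025, 0.025],
    [0.80, 0.05, 0.05, 0.05, 0.05],
    [0.20, 0.20, 0.20, 0.20, 0.20],
]
query = [0.85, 0.0375, 0.0375, 0.0375, 0.0375]
print(most_similar(query, doc_topics, k=2))
```

If every topic is dominated by the same two words, the per-document topic distributions carry little information, so a ranking like this cannot separate documents well.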
How can I fix this? What would be the optimal values of alpha and eta for a small number of topics?
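One thing that may be worth checking alongside alpha and eta is whether "person" and "address" simply occur in almost every document, since such tokens tend to dominate every topic regardless of the priors. The sketch below is a plain-Python illustration under that assumption. The function names, the 0.5 threshold, and the toy corpus are hypothetical. gensim's `Dictionary.filter_extremes(no_above=...)` applies a similar document-frequency cutoff to the dictionary itself.

```python
from collections import Counter


def document_frequency(tokenized_docs):
    """Fraction of documents in which each token appears at least once."""
    counts = Counter()
    for doc in tokenized_docs:
        counts.update(set(doc))  # count each token once per document
    n_docs = len(tokenized_docs)
    return {token: c / n_docs for token, c in counts.items()}


def drop_common_tokens(tokenized_docs, max_doc_fraction=0.5):
    """Remove tokens that occur in more than `max_doc_fraction` of the documents."""
    doc_freq = document_frequency(tokenized_docs)
    return [[t for t in doc if doc_freq[t] <= max_doc_fraction]
            for doc in tokenized_docs]


# Hypothetical toy corpus mimicking the pattern in the log output.
docs = [
    ["person", "address", "joshua", "wilson"],
    ["person", "address", "matthew", "cooper"],
    ["person", "address", "dylan", "nguyen"],
]
filtered = drop_common_tokens(docs)
print(filtered)
```

On this toy corpus, "person" and "address" appear in every document and are removed, while the names, each present in one document out of three, are kept.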