I am new to topic modeling with Gensim's LDA and am experimenting with it to find the topics discussed in my dataset.
import gensim

nr_topics = 75  # how many topics would you expect
# LdaModel = basic version
ldamodel = gensim.models.LdaModel(
    corpus,                    # the term frequency matrix
    id2word=dictionary,        # the id -> term dictionary
    num_topics=nr_topics,      # the nr of topics we want
    update_every=1,
    chunksize=250,
    passes=20,                 # increase this for added precision
    alpha=0.1,                 # a low alpha = few topics per document
    eta=0.2,                   # a low eta = few word combinations per topic
    minimum_probability=0.02,
)
ldamodel.print_topics(num_topics=20, num_words=10)
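However, changing all these parameters does not yield good results, and running the model takes about an hour, so experimenting with these parameters is very time-consuming. I think the output would be better if I applied tf-idf to my tokenized input data, and I have 2 questions about this idea: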
from gensim import corpora

dictionary = corpora.Dictionary(sentences.content)
corpus = [dictionary.doc2bow(i, allow_update=True) for i in sentences.content]
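For illustration, here is a minimal standard-library sketch of what tf-idf weighting would do to a corpus in the `doc2bow` format above. The function name `tfidf_weight` and the toy corpus are hypothetical. The weighting (raw count times log2 inverse document frequency, then scaling each document to unit length) is an assumption chosen to resemble what I understand to be the defaults of `gensim.models.TfidfModel`, which is the library class that would normally be used for this step.

```python
import math
from collections import Counter


def tfidf_weight(bow_corpus):
    """Re-weight a bag-of-words corpus of (term_id, count) pairs with tf-idf."""
    n_docs = len(bow_corpus)
    # Document frequency: in how many documents each term id occurs.
    doc_freq = Counter(term_id for doc in bow_corpus for term_id, _ in doc)
    weighted = []
    for doc in bow_corpus:
        # Raw count times log2 inverse document frequency (assumed scheme).
        vec = [(t, c * math.log2(n_docs / doc_freq[t])) for t, c in doc]
        # Terms that occur in every document get weight 0 and are dropped.
        vec = [(t, w) for t, w in vec if w > 0.0]
        # Scale each document vector to unit L2 length.
        norm = math.sqrt(sum(w * w for _, w in vec))
        weighted.append([(t, w / norm) for t, w in vec] if norm else [])
    return weighted


# Hypothetical toy corpus in the same format as dictionary.doc2bow output.
toy_corpus = [
    [(0, 2), (1, 1)],
    [(0, 1), (2, 3)],
    [(0, 1), (1, 1), (3, 1)],
]
toy_tfidf = tfidf_weight(toy_corpus)
```

In the toy corpus, term 0 appears in every document and is therefore removed, which is the effect I am hoping for with very common words. Note that LDA is defined over word counts, so passing tf-idf weights to it is a heuristic rather than something the model formally assumes.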