Training LDA on a Wikipedia corpus to label arbitrary articles?

Posted: 2018-09-01 05:26:58

Tags: python nltk gensim

I trained an LDA model on Wikipedia following the gensim tutorial at https://radimrehurek.com/gensim/wiki.html. Now I would like to compare an arbitrary article from cnn.com against the trained model. What is the next step, assuming the article is in a txt file?

1 answer:

Answer 0 (score: 0)

Use the example from here:

# Create a new corpus, made of previously unseen documents.
cnn_article = [
    ['This', 'is', 'my', 'cnn', 'article'],
]
# common_dictionary and lda are the Dictionary and LdaModel from the training step.
other_corpus = [common_dictionary.doc2bow(text) for text in cnn_article]
unseen_doc = other_corpus[0]
vector = lda[unseen_doc]  # get the topic probability distribution for the document
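The `doc2bow` call maps each token to an integer id and counts occurrences, silently dropping tokens the dictionary has never seen. A minimal pure-Python sketch of that idea (an illustration, not gensim's actual implementation):

```python
from collections import Counter

def build_dictionary(texts):
    """Assign an integer id to every unique token, like gensim's Dictionary."""
    ids = {}
    for text in texts:
        for token in text:
            ids.setdefault(token, len(ids))
    return ids

def doc2bow(ids, text):
    """Return sorted (token_id, count) pairs for tokens known to the dictionary."""
    counts = Counter(token for token in text if token in ids)
    return sorted((ids[t], n) for t, n in counts.items())

ids = build_dictionary([['this', 'is', 'my', 'cnn', 'article']])
print(doc2bow(ids, ['my', 'cnn', 'cnn', 'story']))  # 'story' is unknown and dropped
# → [(2, 1), (3, 2)]
```

This is why preprocessing of the unseen article must match the preprocessing used when building the dictionary: otherwise most tokens fail the lookup and the bag-of-words comes back nearly empty.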

Then use gensim's Similarity class to compute the similarities.
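Comparisons of this kind ultimately come down to cosine similarity between vectors. A hedged pure-Python sketch of that computation, assuming the two topic distributions have been densified into equal-length lists (hypothetical values, not gensim's API):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two dense vectors: 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Two hypothetical 3-topic distributions for two articles.
article_a = [0.9, 0.05, 0.05]
article_b = [0.8, 0.1, 0.1]
print(cosine_similarity(article_a, article_b))  # close to 1.0: similar topic mix
```

In practice, letting gensim's similarity utilities handle this over a whole indexed corpus is both faster and less error-prone than rolling your own loop.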

Update

To follow the tutorial more precisely with your text file, do the following:

from gensim.corpora import Dictionary
from gensim.models import LdaModel
from gensim.test.utils import common_texts

# Create a corpus from a list of texts
common_dictionary = Dictionary(common_texts)
common_corpus = [common_dictionary.doc2bow(text) for text in common_texts]

# Train the model on the corpus.
lda = LdaModel(common_corpus, id2word=common_dictionary, num_topics=10)

# optional: print topics of your model
for topic in lda.print_topics(10):
    print(topic)

# load your CNN article from file
with open("cnn.txt", "r") as file:
    cnn = file.read()

# split article into list of words and make this list an element of a list
cnn = [cnn.split(" ")]
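Splitting on single spaces leaves punctuation stuck to words ("article." vs "article") and preserves case, so many tokens will not match the dictionary. A slightly more robust tokenizer sketch (my own addition, not part of the answer's code):

```python
import re

def tokenize(text):
    """Lowercase the text and extract alphanumeric word tokens, dropping punctuation."""
    return re.findall(r"[a-z0-9]+", text.lower())

print(tokenize("Breaking News: CNN's top story, explained."))
# → ['breaking', 'news', 'cnn', 's', 'top', 'story', 'explained']
```

Whatever tokenization you choose, apply the same one to the training texts and the CNN article, or the dictionary lookups will fail.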

cnn_corpus = [common_dictionary.doc2bow(text) for text in cnn]

unseen_doc = cnn_corpus[0]
vector = lda[unseen_doc] # get topic probability distribution for a document

# print the topic probability distribution of the cnn article
# a larger probability means the article is more strongly associated with that topic
print(vector)
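`lda[unseen_doc]` returns a list of (topic_id, probability) pairs. To label the article with its single best-matching topic, take the pair with the highest probability; a minimal sketch on a hypothetical output:

```python
# Hypothetical output of lda[unseen_doc]: (topic_id, probability) pairs.
vector = [(0, 0.05), (3, 0.72), (7, 0.23)]

# Pick the pair with the largest probability.
topic_id, prob = max(vector, key=lambda pair: pair[1])
print(f"dominant topic: {topic_id} (p={prob:.2f})")
# → dominant topic: 3 (p=0.72)
```

Note that gensim only includes topics above a probability threshold in this list, so the pairs will usually not sum exactly to 1.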