I have written a helper function named get_document_topics_for_corpus that returns lists of tuples. However, it is not efficient. It is called like this:
topics = lda.get_document_topics(corpus, per_word_topics=True)
doc_topics, word_topics, word_phis = get_document_topics_for_corpus(topics)
print "Document topics: ", doc_topics
print "Word topics: ", word_topics
print "Word phis:", word_phis
The result it returns is:
Document topics: [(96, 0.75250000000000006), (34, 0.80200000000000227), (70, 0.80200000000000093), (60, 0.75250000000000161), (80, 0.85857142857136792), (58, 0.7525000000000015), (91, 0.75250000000000017), (28, 0.50499999999999268), (62, 0.66999998118978443)]
Word topics: [(0, [96, 70]), (1, [96, 80]), (2, [96, 34]), (3, [80, 58]), (4, [80, 58]), (5, [80, 91]), (6, [80, 70, 34]), (7, [80, 70, 58]), (8, [70, 34]), (9, [28, 62, 60]), (10, [62, 60, 91]), (11, [60, 91])]
Word phis: [(0, [(96, 0.99999999999999989), (70, 0.99999999999999989)]), (1, [(96, 0.99999999999999989), (80, 1.0)]), (2, [(96, 0.99999999999999989), (34, 1.0)]), (3, [(80, 1.0), (58, 1.0)]), (4, [(80, 1.0), (58, 1.0000000000000002)]), (5, [(80, 1.0), (91, 1.0)]), (6, [(80, 1.0), (70, 1.0), (34, 2.0)]), (7, [(80, 1.0), (70, 1.0), (58, 1.0000000000000002)]), (8, [(70, 1.0), (34, 1.0)]), (9, [(28, 1.0), (62, 1.0), (60, 1.0)]), (10, [(62, 1.0), (60, 1.0), (91, 0.99999999999999989)]), (11, [(60, 1.0), (91, 1.0)])]
The helper function that performs this task is as follows:
def get_document_topics_for_corpus(topics):
    document_topics = dict()
    word_topics = dict()
    word_phis = dict()
    doc_topics = list()
    word_top = list()
    word_ph = list()
    for doc_topic, word_topic, word_phi in topics:
        # Document_topics aggregation
        key_doc = doc_topic[0][0]
        value_doc = doc_topic[0][1]
        document_topics.setdefault(key_doc, value_doc)
        # Word_topics aggregation
        for key in word_topic:
            word_topics.setdefault(key[0], [])
            word_topics[key[0]].append(key_doc)
        # Word_phis aggregation
        for key in word_phi:
            word_phis.setdefault(key[0], [])
            word_phis[key[0]].append(key[1][0])
    for key, value in document_topics.iteritems():
        temp = (key, value)
        doc_topics.append(temp)
    for key, value in word_topics.iteritems():
        temp = (key, value)
        word_top.append(temp)
    for key, value in word_phis.iteritems():
        temp = (key, value)
        word_ph.append(temp)
    return (doc_topics, word_top, word_ph)
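I am aggregating this result from the list of topics, where each entry (one per document) is a tuple of document topics, word topics, and word phis. To make this concrete, the entries look like the following, with each one separated by a line of dashes:
new doc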
Document topics: [(79, 0.75250000000000072)]
Word topics: [(0, [79]), (1, [79]), (2, [79])]
Word phis: [(0, [(79, 1.0)]), (1, [(79, 1.0)]), (2, [(79, 1.0)])]
--------------
new doc
Document topics: [(23, 0.85857142857143054)]
Word topics: [(1, [23]), (3, [23]), (4, [23]), (5, [23]), (6, [23]), (7, [23])]
Word phis: [(1, [(23, 1.0)]), (3, [(23, 1.0)]), (4, [(23, 1.0)]), (5, [(23, 1.0)]), (6, [(23, 1.0)]), (7, [(23, 1.0)])]
--------------
new doc
Document topics: [(28, 0.80199999993851401)]
Word topics: [(0, [28]), (6, [28]), (7, [28]), (8, [28])]
Word phis: [(0, [(28, 1.0)]), (6, [(28, 1.0)]), (7, [(28, 1.0000000000000002)]), (8, [(28, 1.0)])]
--------------
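For reference, the aggregation can be reproduced without gensim by writing the three sample documents above as literals. The sketch below is only a more compact restatement of the same logic: the names aggregate_topics and sample_topics are illustrative assumptions, and it has not been timed, so it should not be read as a proven speed-up. It uses .items() so that it runs on both Python 2 and Python 3, and the order of the returned tuples follows dictionary iteration order, as in the original helper.

```python
from collections import defaultdict

# The three sample documents shown above, written out as literals.
# Each entry is (document_topics, word_topics, word_phis).
sample_topics = [
    ([(79, 0.75250000000000072)],
     [(0, [79]), (1, [79]), (2, [79])],
     [(0, [(79, 1.0)]), (1, [(79, 1.0)]), (2, [(79, 1.0)])]),
    ([(23, 0.85857142857143054)],
     [(1, [23]), (3, [23]), (4, [23]), (5, [23]), (6, [23]), (7, [23])],
     [(1, [(23, 1.0)]), (3, [(23, 1.0)]), (4, [(23, 1.0)]),
      (5, [(23, 1.0)]), (6, [(23, 1.0)]), (7, [(23, 1.0)])]),
    ([(28, 0.80199999993851401)],
     [(0, [28]), (6, [28]), (7, [28]), (8, [28])],
     [(0, [(28, 1.0)]), (6, [(28, 1.0)]),
      (7, [(28, 1.0000000000000002)]), (8, [(28, 1.0)])]),
]


def aggregate_topics(topics):
    """Hypothetical compact variant of get_document_topics_for_corpus."""
    document_topics = {}
    word_topics = defaultdict(list)
    word_phis = defaultdict(list)
    for doc_topic, word_topic, word_phi in topics:
        # Only the first (topic_id, probability) pair of each document is used.
        key_doc, value_doc = doc_topic[0]
        document_topics.setdefault(key_doc, value_doc)
        # Record the document's first topic against every word id it contains.
        for word_id, _ in word_topic:
            word_topics[word_id].append(key_doc)
        # Keep the first (topic_id, phi) pair of every word.
        for word_id, phis in word_phi:
            word_phis[word_id].append(phis[0])
    # Convert each dictionary to a list of (key, value) tuples in one step.
    return (list(document_topics.items()),
            list(word_topics.items()),
            list(word_phis.items()))


doc_topics, word_top, word_ph = aggregate_topics(sample_topics)
```

With this sample, word 0 appears in the first and third documents, so its entry in word_top collects topics 79 and 28.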
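Can anyone help me rewrite this function so that it is better optimized and as fast as possible, while producing the same output? That would be very helpful. Thanks.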