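With an LDA model, there are two ways to infer topics for a new document from an existing model: indexing the model (ldamodel[doc_bow]) and calling ldamodel.inference. What is the difference between the two?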
Answer 0 (score: 0)
I ran some tests. My ldamodel has 8 topics, and here are my results. The two documents to predict topics for:
list_unseenTw=[['hope', 'miley', 'blow', 'peopl', 'mind', 'tonight', 'gain', 'million', 'fan'],['@mileycyrustour', "we'r", 'think', "it'", 'pretti', 'cool', 'miley', 'saturday', 'night', 'live', 'tonight', '#prettycool']]
Prediction with ldamodel[doc_bow] (this already returns the topic proportions, which sum to 1 for each document):
doc_bow = [dictionary.doc2bow(text) for text in list_unseenTw]
pred = ldamodel[doc_bow]
pred[0]: [(0, 0.02509002728802024), (1, 0.0250114373070437), (2, 0.025040162139306051), (3, 0.82462688228515812), (4, 0.025150924341817767), (5, 0.025000027675139792), (6, 0.025000024127660267), (7, 0.025080514835853926)]
pred[1]: [(0, 0.031250011319462589), (1, 0.031250013721820222), (2, 0.031250015639505598), (3, 0.031250015093378707), (4, 0.031250019670816337), (5, 0.031250024860739675), (6, 0.78124988084026048), (7, 0.031250014854016454)]
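Prediction with ldamodel.inference (the results are returned as unnormalized weights rather than proportions; the second element of the returned tuple is None because sufficient statistics are not collected by default):
pred = ldamodel.inference(doc_bow)
print(pred)
(array([[0.12545023, 0.1250572, 0.12520085, 4.12309694, 0.12579184, 0.12500014, 0.12500012, 0.12540268], [0.12500005, 0.12500005, 0.12500008, 0.12500006, 0.12500008, 0.1250001, 3.12499952, 0.12500006]]), None)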
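As you can see, both methods give the same result for the first prediction (doc1, topic 3) once the weights are divided by their sum: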
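total = 0
for i in pred[0][0]:  # pred[0] is the weight array; pred[0][0] is the row for doc1
    total += i
# total is about 5.0, so 4.12309694 / total = 0.82462 (about 82.5%),
# which matches the 0.8246 reported for topic 3 by ldamodel[doc_bow]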
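The same check can be applied to both documents at once. The following is a minimal sketch that hard-codes the weights printed above instead of calling the model; the names gamma, topic_dist and dominant are illustrative and do not come from the original post.

```python
# Weights copied from the ldamodel.inference output above (2 documents x 8 topics).
gamma = [
    [0.12545023, 0.1250572, 0.12520085, 4.12309694,
     0.12579184, 0.12500014, 0.12500012, 0.12540268],
    [0.12500005, 0.12500005, 0.12500008, 0.12500006,
     0.12500008, 0.1250001, 3.12499952, 0.12500006],
]

# Divide each weight by its row sum so every row becomes proportions summing to 1.
topic_dist = [[w / sum(row) for w in row] for row in gamma]

# Index of the largest proportion in each row, i.e. the dominant topic per document.
dominant = [max(range(len(row)), key=row.__getitem__) for row in topic_dist]

for doc_id, (row, topic) in enumerate(zip(topic_dist, dominant)):
    print(doc_id, topic, round(row[topic], 5))
```

The normalized rows agree with the proportions returned by ldamodel[doc_bow] above: topic 3 for the first document and topic 6 for the second.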