What is the difference between lda[doc_bow] and lda.inference(corpus)?

Asked: 2014-11-26 09:26:51

Tags: python lda gensim

With a trained LDA model, these are two ways to infer topics for new documents. What is the difference between the two?

1 answer:

Answer 0: (score: 0)

I ran some tests. My ldamodel has 8 topics, and here are my results for 2 unseen documents whose topics I want to predict:

    list_unseenTw = [
        ['hope', 'miley', 'blow', 'peopl', 'mind', 'tonight', 'gain', 'million', 'fan'],
        ['@mileycyrustour', "we'r", 'think', "it'", 'pretti', 'cool', 'miley', 'saturday', 'night', 'live', 'tonight', '#prettycool'],
    ]

  1. Predicting with lda[doc_bow] (this already gives the share of each topic as a proportion that sums to 1 per document)

    doc_bow = [dictionary.doc2bow(text) for text in list_unseenTw]
    pred = ldamodel[doc_bow]

    pred[0]:
    [(0, 0.02509002728802024), (1, 0.0250114373070437), (2, 0.025040162139306051), (3, 0.82462688228515812), (4, 0.025150924341817767), (5, 0.025000027675139792), (6, 0.025000024127660267), (7, 0.025080514835853926)]

    pred[1]:
    [(0, 0.031250011319462589), (1, 0.031250013721820222), (2, 0.031250015639505598), (3, 0.031250015093378707), (4, 0.031250019670816337), (5, 0.031250024860739675), (6, 0.78124988084026048), (7, 0.031250014854016454)]

  2. Predicting with ldamodel.inference (the result is given as unnormalized weights rather than proportions)

    pred = ldamodel.inference(doc_bow)

    print(pred)

    (array([[0.12545023, 0.1250572, 0.12520085, 4.12309694, 0.12579184, 0.12500014, 0.12500012, 0.12540268],
            [0.12500005, 0.12500005, 0.12500008, 0.12500006, 0.12500008, 0.1250001, 3.12499952, 0.12500006]]), None)

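  3. As you can see, both methods give the same result for the first document (doc1): topic 3 dominates. Dividing the topic 3 weight by the sum of all weights in that row recovers the proportion from step 1: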

    total = 0
    for i in pred[0][0]:
        total += i

    print(4.12309694 / total)  # about 0.82462, i.e. roughly 82.46%
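The same normalization can be applied to every document at once. The following is only a minimal sketch using NumPy: the `gamma` array is copied by hand from the inference() output above rather than produced by a live model, and the names `gamma`, `topic_dist` and `top_topics` are illustrative, not part of gensim.

```python
import numpy as np

# Weights copied from the inference() output above, one row per document.
gamma = np.array([
    [0.12545023, 0.1250572, 0.12520085, 4.12309694,
     0.12579184, 0.12500014, 0.12500012, 0.12540268],
    [0.12500005, 0.12500005, 0.12500008, 0.12500006,
     0.12500008, 0.1250001, 3.12499952, 0.12500006],
])

# Divide each row by its own sum so that every row adds up to 1.
topic_dist = gamma / gamma.sum(axis=1, keepdims=True)

# Index of the highest-weighted topic for each document.
top_topics = topic_dist.argmax(axis=1)

print(topic_dist.round(5))
print(top_topics)
```

With a real model, the array would presumably come from the first element of the tuple returned by ldamodel.inference(doc_bow). The normalized rows should agree with ldamodel[doc_bow] up to small numerical differences.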