Is there a way to infer the topic distribution of an unseen document from a pretrained gensim LDA model using matrix multiplication?

Asked: 2020-06-04 17:00:29

Tags: gensim lda topic-modeling

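Is there a way to get the topic distribution of an unseen document from a pretrained LDA model without using the lda_model[unseenDoc] syntax? I am trying to integrate my LDA model into a web application, and if matrix multiplication can produce similar results, I could use the model in JavaScript.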


import numpy as np
import gensim
from gensim.corpora import Dictionary
from gensim import models
import nltk
from nltk.stem import WordNetLemmatizer, SnowballStemmer

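def Preprocesser(text):
    """Tokenize `text`, dropping stop words and short tokens."""
    smallestWordSize = 3
    processedList = []

    for token in gensim.utils.simple_preprocess(text):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > smallestWordSize:
            # The loop body is missing from the post; appending the raw token is the minimal fix.
            # Any lemmatizing/stemming used at training time would be applied to the token here.
            processedList.append(token)

    return processedList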

lda_model = models.LdaModel.load(r'LDAModel\GoldModel')  # Load pretrained LDA model (raw string keeps the backslash literal)
dictionary = Dictionary.load(r"ModelTrain\ManDict")      # Load the dictionary the model was trained on

#Sample Unseen Doc to Analyze
doc = "I am going to write a string about how I can't get my task executor \
to travel properly. I am trying to use the \
AGV navigator, but it doesn't seem to be working network. I have been trying\
to use the AGV Process flow but that isn't working either speed\
trailer offset I am now going to change this so I can see how fast it runs"

termTopicMatrix = lda_model.get_topics()    # Topic-term matrix, shape (num_topics, vocabulary_size)
cleanDoc = Preprocesser(doc)                # Tokenize, lemmatize, clean and stem words
bowDoc = dictionary.doc2bow(cleanDoc)       # Bag of words: list of (term_id, count) pairs
dictSize = termTopicMatrix.shape[1]         # Number of terms in the dictionary
fullDict = np.zeros(dictSize)               # Dense count vector over the whole dictionary
termIds = [termId for termId, _ in bowDoc]  # Indices of the terms present in the document
counts = [count for _, count in bowDoc]     # Frequencies of those terms
fullDict[termIds] = counts                  # Write the frequencies into the dense vector

print('Matrix Multiplication: \n', np.dot(termTopicMatrix, fullDict))
print('Conventional Syntax: \n', lda_model[bowDoc])

Matrix Multiplication: 
 [0.0283254  0.01574513 0.03669142 0.01671816 0.03742738 0.01989461
 0.01558603 0.0370233  0.04648389 0.02887623 0.00776652 0.02147539
 0.10045133 0.01084273 0.01229849 0.00743788 0.03747379 0.00345913
 0.03086953 0.00628912 0.29406082 0.10656977 0.00618827 0.00406316
 0.08775404 0.00785408 0.02722744 0.09957815 0.01669402 0.00744392
 0.31177135 0.03063149 0.07211428 0.01192056 0.03228589]
Conventional Syntax: 
 [(0, 0.070313625), (2, 0.056414187), (18, 0.2016589), (20, 0.46500313), (24, 0.1589748)]



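For example, lda_model[unseenDoc] reports a probability of 0.07 for topic 0, whereas the matrix multiplication approach gives 0.028 for that topic. Am I missing a step here?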

1 answer:
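
One likely reason the two results differ is that gensim does not compute document topics with a single matrix product. `lda_model[bowDoc]` runs an iterative variational update on a per-document parameter (gamma), normalizes it, and then drops topics below `minimum_probability`, which would explain why only five topics are listed. The sketch below is a dependency-free approximation of that loop, written in plain Python so that it can be ported to JavaScript almost line by line. The names `digamma` and `infer_topics`, the toy matrix, and the fixed starting gamma are assumptions for illustration; gensim itself starts from a random gamma and uses its own stored parameters.

```python
import math


def digamma(x):
    """Digamma function for x > 0 (upward recurrence plus asymptotic series)."""
    result = 0.0
    while x < 6.0:          # shift x upward until the series is accurate
        result -= 1.0 / x
        x += 1.0
    inv = 1.0 / x
    inv2 = inv * inv
    # psi(x) ~ ln(x) - 1/(2x) - 1/(12x^2) + 1/(120x^4) - 1/(252x^6)
    result += math.log(x) - 0.5 * inv
    result -= inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result


def infer_topics(bow, exp_elog_beta, alpha, iterations=50, threshold=1e-3):
    """Approximate per-document variational inference for LDA.

    bow:            list of (term_id, count) pairs
    exp_elog_beta:  K rows of V topic-word weights (assumed input)
    alpha:          K Dirichlet prior values (assumed input)
    """
    num_topics = len(exp_elog_beta)
    ids = [term_id for term_id, _ in bow]
    cts = [count for _, count in bow]
    # Keep only the columns for the terms that occur in this document.
    beta_d = [[row[i] for i in ids] for row in exp_elog_beta]
    gamma = [1.0] * num_topics   # assumption: fixed start for reproducibility

    for _ in range(iterations):
        psi_sum = digamma(sum(gamma))
        exp_elog_theta = [math.exp(digamma(g) - psi_sum) for g in gamma]
        # Normalizer for each term: sum over topics of theta_k * beta_kn.
        phinorm = [
            sum(exp_elog_theta[k] * beta_d[k][n] for k in range(num_topics)) + 1e-100
            for n in range(len(ids))
        ]
        new_gamma = [
            alpha[k] + exp_elog_theta[k]
            * sum(cts[n] / phinorm[n] * beta_d[k][n] for n in range(len(ids)))
            for k in range(num_topics)
        ]
        change = sum(abs(a - b) for a, b in zip(new_gamma, gamma)) / num_topics
        gamma = new_gamma
        if change < threshold:   # stop once gamma has settled
            break

    total = sum(gamma)
    return [g / total for g in gamma]


# Toy model: 2 topics over a 4-word vocabulary (made-up numbers).
toy_topics = [[0.70, 0.20, 0.05, 0.05],
              [0.05, 0.05, 0.20, 0.70]]
toy_alpha = [0.5, 0.5]
toy_bow = [(0, 3), (1, 1)]       # word 0 three times, word 1 once

dist = infer_topics(toy_bow, toy_topics, toy_alpha)
print(dist)
```

With a real model, the inputs would presumably be `lda_model.expElogbeta` (rather than the normalized `get_topics()` matrix) and `lda_model.alpha`, exported once, for example as JSON. The result should be close to gensim's output but not identical, because gensim initializes gamma randomly.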

