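Is there a way to get the topic distribution of an unseen document from a pretrained LDA model without using the lda_model[unseenDoc] syntax? I am trying to build my LDA model into a web application, and if matrix multiplication gave a similar result I could use the model in JavaScript.
For example, I tried the following: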
import numpy as np
import gensim
from gensim.corpora import Dictionary
from gensim import models
import nltk
from nltk.stem import WordNetLemmatizer, SnowballStemmer
nltk.download('wordnet')

def StemmAndLemmatize(token):
    # This helper was not shown in the post; assumed here to lemmatize (as a verb), then stem.
    return SnowballStemmer('english').stem(WordNetLemmatizer().lemmatize(token, pos='v'))

def Preprocesser(text_list):
    smallestWordSize = 3
    processedList = []
    for token in gensim.utils.simple_preprocess(text_list):
        if token not in gensim.parsing.preprocessing.STOPWORDS and len(token) > smallestWordSize:
            processedList.append(StemmAndLemmatize(token))
    return processedList

lda_model = models.LdaModel.load(r'LDAModel\GoldModel') #Load pretrained LDA model (raw string keeps the backslash literal)
dictionary = Dictionary.load(r"ModelTrain\ManDict") #Load dictionary model was trained on
#Sample Unseen Doc to Analyze
doc = "I am going to write a string about how I can't get my task executor \
to travel properly. I am trying to use the \
AGV navigator, but it doesn't seem to be working network. I have been trying\
to use the AGV Process flow but that isn't working either speed\
trailer offset I am now going to change this so I can see how fast it runs"
termTopicMatrix = lda_model.get_topics() #Get Term-topic Matrix from pretrained LDA model
cleanDoc = Preprocesser(doc) #Tokenize, lemmatize, clean and stem words
bowDoc = dictionary.doc2bow(cleanDoc) #Create bow using dictionary
dictSize = len(termTopicMatrix[0]) #Get length of terms in dictionary
fullDict = np.zeros(dictSize) #Initialize array which is length of dictionary size
First = [first[0] for first in bowDoc] #Get index of terms in bag of words
Second = [second[1] for second in bowDoc] #Get frequency of term in bag of words
fullDict[First] = Second #Add word frequency to full dictionary
print('Matrix Multiplication: \n', np.dot(termTopicMatrix,fullDict))
print('Conventional Syntax: \n', lda_model[bowDoc])
Output:
Matrix Multiplication:
[0.0283254 0.01574513 0.03669142 0.01671816 0.03742738 0.01989461
0.01558603 0.0370233 0.04648389 0.02887623 0.00776652 0.02147539
0.10045133 0.01084273 0.01229849 0.00743788 0.03747379 0.00345913
0.03086953 0.00628912 0.29406082 0.10656977 0.00618827 0.00406316
0.08775404 0.00785408 0.02722744 0.09957815 0.01669402 0.00744392
0.31177135 0.03063149 0.07211428 0.01192056 0.03228589]
Conventional Syntax:
[(0, 0.070313625), (2, 0.056414187), (18, 0.2016589), (20, 0.46500313), (24, 0.1589748)]
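The pretrained model has 35 topics and 1155 words.
In the "Conventional Syntax" output, the first element of each tuple is the topic index and the second element is that topic's probability. In the "Matrix Multiplication" version, the array index is the topic index and the value is the probability. Clearly, the two do not match.
For example, lda_model[unseenDoc] gives topic 0 a probability of 0.07, but the matrix multiplication approach gives that topic 0.028. Am I missing a step here?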
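To compare the two outputs position by position, it helps to put them in the same shape first. The snippet below is a small sketch using the numbers printed above; `to_dense` is a hypothetical helper name (gensim's `matutils.sparse2full` does a similar job if gensim is available).

```python
# Sparse (topic_id, probability) pairs copied from the "Conventional Syntax" output.
conventional = [(0, 0.070313625), (2, 0.056414187), (18, 0.2016589),
                (20, 0.46500313), (24, 0.1589748)]

def to_dense(sparse_topics, num_topics):
    """Expand (topic_id, probability) pairs into a list indexed by topic id."""
    dense = [0.0] * num_topics
    for topic_id, prob in sparse_topics:
        dense[topic_id] = prob
    return dense

dense = to_dense(conventional, 35)  # the model has 35 topics

# Topic 0 from the matrix product (first value printed above) vs. gensim's value.
matrix_product_topic0 = 0.0283254
print(matrix_product_topic0, dense[0])
```

Topics missing from the sparse list show up as 0.0 in the dense list. In gensim these are the topics whose probability fell below the model's minimum probability cutoff.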
Answer 0 (score: 0)
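You can review the full source code used by LdaModel's get_document_topics() method in your installation (gensim/models/ldamodel.py) or online.
(It also uses the inference() method in the same file.)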
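It does far more scaling, normalization, and clipping than your code, which is most likely the cause of the discrepancy. But you should be able to go through it line by line against your own process to see where they differ and which steps you need to match.
It should also be straightforward to use the steps of the gensim code as a guide for writing parallel JavaScript code that, given the right parts of the model's state, reproduces its results.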
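As a rough illustration of what such a port might look like, here is a simplified sketch of the iterative update for a single document, in plain Python so that it maps almost line for line onto JavaScript. It is not gensim's code. `digamma` and `infer_topics` are hypothetical names, and the sketch assumes you have exported the model's `expElogbeta` matrix (topics x terms) and `alpha` vector. As far as I can tell, inference() works from `expElogbeta` rather than from the normalized matrix returned by get_topics(). The default arguments mirror what I believe are gensim's defaults, so check them against your model. Gensim starts from a random gamma, while this sketch starts from a fixed one, so expect close rather than identical numbers.

```python
import math

def digamma(x):
    """Approximate the digamma function for x > 0 (recurrence + asymptotic series)."""
    result = 0.0
    while x < 6.0:            # shift x upward, where the series is accurate
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12 - inv2 * (1.0 / 120 - inv2 / 252))
    return result + math.log(x) - 0.5 / x - series

def infer_topics(bow, exp_elog_beta, alpha, iterations=50,
                 gamma_threshold=0.001, minimum_probability=0.01):
    """Estimate the topic distribution of one bag-of-words document."""
    num_topics = len(alpha)
    ids = [term_id for term_id, _ in bow]
    counts = [float(count) for _, count in bow]
    # Keep only the columns for terms that occur in this document.
    beta = [[row[i] for i in ids] for row in exp_elog_beta]
    terms = range(len(ids))

    def exp_elog_theta(gamma):
        total = digamma(sum(gamma))
        return [math.exp(digamma(g) - total) for g in gamma]

    def phi_norm(theta):
        # Small constant avoids division by zero (gensim uses a machine epsilon).
        return [sum(theta[k] * beta[k][j] for k in range(num_topics)) + 1e-100
                for j in terms]

    gamma = [1.0] * num_topics          # assumption: fixed start, not random
    theta = exp_elog_theta(gamma)
    norm = phi_norm(theta)
    for _ in range(iterations):
        last_gamma = gamma
        gamma = [alpha[k] + theta[k] *
                 sum(counts[j] / norm[j] * beta[k][j] for j in terms)
                 for k in range(num_topics)]
        theta = exp_elog_theta(gamma)
        norm = phi_norm(theta)
        change = sum(abs(a - b) for a, b in zip(gamma, last_gamma)) / num_topics
        if change < gamma_threshold:    # converged
            break

    total = sum(gamma)                  # normalize gamma into probabilities
    return [(k, g / total) for k, g in enumerate(gamma)
            if g / total >= minimum_probability]

# Toy model with made-up numbers: 2 topics, 3 terms.
toy_beta = [[0.80, 0.15, 0.05],
            [0.05, 0.15, 0.80]]
toy_result = infer_topics([(0, 3), (1, 1)], toy_beta, [0.5, 0.5],
                          minimum_probability=0.0)
print(toy_result)
```

The final normalization and the minimum probability filter correspond to what get_document_topics() does after inference. A single dot product with the term counts skips the iteration entirely, which would explain why the numbers in the question disagree.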