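I created an LDA model and a method to evaluate its performance. I started adding filters for very infrequent terms to my LDA model and fixed some other performance problems. But now the model sometimes crashes when I call get_coherence(), and other times it doesn't. I really don't know why this happens... I need to know exactly what the error message means.
In my case, I am building a topic model for a dataset of 198,000 tweets. Here is the code for the evaluation method: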
import gensim
from gensim import corpora
from gensim.models import CoherenceModel
from pprint import pprint as pp  # assumption: pp is an alias for pprint


def evaluate_LDA_model(name_of_lda, dict_num):
    model_path = "models/" + name_of_lda
    dict_path = "dictionary_files/clean_doc_dict_" + str(dict_num) + ".dict"
    lda_model = gensim.models.ldamodel.LdaModel.load(model_path)
    doc_dict = corpora.Dictionary.load(dict_path)
    # getDocuments returns a dataframe with a column called 'cleaned_tokens'
    # that has all of the processed tokens with their document/tweet
    doc_clean_lst = getDocuments(clean=True)
    # Additional method for processing the cleaned tokens
    docs_of_tokens = convert_cleaned_tokens_entries(doc_clean_lst['cleaned_tokens'])
    doc_term_matrix = [doc_dict.doc2bow(doc) for doc in docs_of_tokens]
    log_file_path = "txt_files/" + name_of_lda + "evaluation.txt"
    # I wouldn't be surprised if I was passing in a parameter incorrectly here...
    # I had trouble figuring out which parameters were supposed to be where.
    coherence_model_var = CoherenceModel(model=lda_model, texts=docs_of_tokens,
                                         dictionary=doc_dict, coherence='c_v')
    p = lda_model.log_perplexity(doc_term_matrix, total_docs=len(docs_of_tokens))
    coherence_lda = coherence_model_var.get_coherence()  # crashes here.
    print("LOGGING RESULTS")
    with open(log_file_path, 'a') as log:
        # log_perplexity returns a per-word likelihood bound; perplexity is
        # 2**(-bound), and a lower perplexity is better.
        log.write('Perplexity: \t' + str(p) + '\n')
        log.write("Coherence Score: " + str(coherence_lda) + '\n')
        log.write("Topics\n")
    pp(lda_model.print_topics(num_topics=10, num_words=10))
    print("FINISHED EVALUATION")
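For clarity (I know the variable names aren't intuitive, partly because I was confused about the parameters): doc_dict is created by passing docs_of_tokens to the corpora.Dictionary() constructor.
docs_of_tokens is what I normally load from a CSV file, but it should be a list of lists or a pandas Series of lists. Each list/entry in docs_of_tokens is one tweet, i.e. a single document, and the contents of each list are all the tokens for that particular tweet.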
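To make the shapes described above concrete, here is a minimal sketch in plain Python. The sample tweets and the helpers build_token2id and to_bow are hypothetical stand-ins for what corpora.Dictionary and doc2bow do conceptually; they are not gensim's implementation, and the IDs gensim assigns may differ.

```python
# Hypothetical sample data: each inner list holds the tokens of one tweet.
docs_of_tokens = [
    ["lda", "topic", "model"],
    ["topic", "coherence", "score"],
    ["model", "crash", "coherence"],
]


def build_token2id(docs):
    """Assign a unique integer ID to each distinct token (stand-in for a dictionary)."""
    token2id = {}
    for doc in docs:
        for token in doc:
            if token not in token2id:
                token2id[token] = len(token2id)
    return token2id


def to_bow(doc, token2id):
    """Count the known tokens of one document as sorted (token_id, count) pairs.

    Tokens that are not in the mapping are skipped.
    """
    counts = {}
    for token in doc:
        if token in token2id:
            token_id = token2id[token]
            counts[token_id] = counts.get(token_id, 0) + 1
    return sorted(counts.items())


token2id = build_token2id(docs_of_tokens)
doc_term_matrix = [to_bow(doc, token2id) for doc in docs_of_tokens]
```

The point of the sketch is that every ID in the bag-of-words corpus only has meaning relative to the one mapping that produced it.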
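The error message I get is long, but I'm posting the entire trace. See the last line.
My questions are simply:
What does the KeyError mean? Which token and/or key is causing the problem, and how can I fix it?
What is being evaluated when get_coherence() is called? I'm trying to read up on coherence scores, but honestly I don't really know what is happening with my corpus (represented as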
Traceback (most recent call last):
  File "text_mining.py", line 394, in <module>
    main()
  File "text_mining.py", line 388, in main
    evaluate_LDA_model("lda_topic_model5", 3)
  File "text_mining.py", line 351, in evaluate_LDA_model
    coherence_lda = coherence_model_var.get_coherence()
  File "C:\Users\biney\Miniconda3\lib\site-packages\gensim\models\coherencemodel.py", line 603, in get_coherence
    confirmed_measures = self.get_coherence_per_topic()
  File "C:\Users\biney\Miniconda3\lib\site-packages\gensim\models\coherencemodel.py", line 563, in get_coherence_per_topic
    self.estimate_probabilities(segmented_topics)
  File "C:\Users\biney\Miniconda3\lib\site-packages\gensim\models\coherencemodel.py", line 535, in estimate_probabilities
    self._accumulator = self.measure.prob(**kwargs)
  File "C:\Users\biney\Miniconda3\lib\site-packages\gensim\topic_coherence\probability_estimation.py", line 138, in p_boolean_sliding_window
    accumulator = ParallelWordOccurrenceAccumulator(processes, top_ids, dictionary)
  File "C:\Users\biney\Miniconda3\lib\site-packages\gensim\topic_coherence\text_analysis.py", line 424, in __init__
    super(ParallelWordOccurrenceAccumulator, self).__init__(*args)
  File "C:\Users\biney\Miniconda3\lib\site-packages\gensim\topic_coherence\text_analysis.py", line 280, in __init__
    super(WindowedTextsAnalyzer, self).__init__(relevant_ids, dictionary)
  File "C:\Users\biney\Miniconda3\lib\site-packages\gensim\topic_coherence\text_analysis.py", line 185, in __init__
    self.relevant_words = _ids_to_words(self.relevant_ids, dictionary)
  File "C:\Users\biney\Miniconda3\lib\site-packages\gensim\topic_coherence\text_analysis.py", line 60, in _ids_to_words
    word = dictionary.id2token[word_id]
KeyError: 9212
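
The last frame of the trace looks up a word ID in dictionary.id2token, so the KeyError says that ID 9212 is not a key in that mapping. A minimal plain-Python sketch of a check for such IDs follows. The sample data and the helpers build_id2token and find_missing_ids are hypothetical, and the underlying idea (that the model's topics reference IDs which the loaded dictionary does not contain) is an assumption about the cause, not something the trace confirms.

```python
def build_id2token(token2id):
    """Invert a token -> id mapping into an id -> token mapping."""
    return {token_id: token for token, token_id in token2id.items()}


def find_missing_ids(top_ids, token2id):
    """Return, sorted, the word IDs that have no token in the dictionary mapping."""
    id2token = build_id2token(token2id)
    return sorted(word_id for word_id in set(top_ids) if word_id not in id2token)


# Hypothetical data: the dictionary knows three IDs, but the topics mention 9212.
token2id = {"lda": 0, "topic": 1, "model": 2}
top_ids = [0, 2, 9212]

missing = find_missing_ids(top_ids, token2id)
print("IDs missing from the dictionary:", missing)
```

With the real objects, top_ids could plausibly be gathered from lda_model.get_topic_terms(topic_id) for each topic in range(lda_model.num_topics), and the mapping taken from doc_dict.token2id. If any ID is reported missing, one possible explanation is that the dictionary file being loaded is not the one the model was trained with, for example because rare terms were filtered after training.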