Interpreting LDA coherence: what does the KeyError about id2token mean when calling coherence.get_coherence()?

Asked: 2018-07-18 20:49:04

Tags: python text-mining gensim topic-modeling

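I created an LDA model and a method to evaluate its performance. I started filtering very infrequent terms out of my LDA model and fixed a few other performance issues. But now the model seems to crash sometimes when I call get_coherence(), and other times it does not. I really don't know why this happens... I need to know exactly what the error message means. In my case, I am building a topic model for a dataset of 198,000 tweets. Here is the code for the evaluation method: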

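import gensim
from gensim import corpora
from gensim.models import CoherenceModel
from pprint import pprint as pp  # assumption: pp below is an alias for pprint

def evaluate_LDA_model(name_of_lda, dict_num):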
    model_path = "models/" + name_of_lda
    dict_path = "dictionary_files/clean_doc_dict_" + str(dict_num) + ".dict"
    lda_model = gensim.models.ldamodel.LdaModel.load(model_path)
    doc_dict = corpora.Dictionary.load(dict_path)
    doc_clean_lst = getDocuments(clean=True) # returns dataframe with a column called 'cleaned_tokens' that has all of the processed tokens with their document/tweet
    docs_of_tokens = convert_cleaned_tokens_entries(doc_clean_lst['cleaned_tokens']) # Additional method for processing the cleaned tokens
    doc_term_matrix = [doc_dict.doc2bow(doc) for doc in docs_of_tokens]
    log_file_path = "txt_files/" + name_of_lda + "evaluation.txt"
    coherence_model_var = CoherenceModel(model=lda_model, texts=docs_of_tokens, dictionary=doc_dict, coherence='c_v') # I wouldn't be surprised if I was passing in a parameter incorrectly here... I had trouble figuring out which parameters were supposed to be where.
    p = lda_model.log_perplexity(doc_term_matrix, total_docs = len(docs_of_tokens))
    coherence_lda = coherence_model_var.get_coherence() # crashes here.
    print("LOGGING RESULTS")
    with open(log_file_path, 'a') as log:
        log.write('Perplexity: \t' +  str(p) + '\n') # a measure of how good the model is. lower the better.
        log.write("Coherence Score: " +  str(coherence_lda) + '\n')
        log.write("Topics\n")
    pp(lda_model.print_topics(num_topics=10, num_words=10))
    print("FINISHED EVALUATION")

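For clarity (I know the variable names are not intuitive, partly because I was confused about the parameters), doc_dict is created by passing docs_of_tokens to the corpora.Dictionary() constructor.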

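docs_of_tokens is what I usually extract from a csv file. It should be a list of lists, or a pandas Series of lists. Each list/entry in docs_of_tokens is one tweet, i.e. a single document, and it contains all the tokens of that particular tweet.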
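To make these shapes concrete, here is a minimal plain-Python sketch of the idea behind a token-to-id dictionary and a bag-of-words conversion. It is a stand-in for illustration only, not gensim's implementation: `build_token2id`, `to_bow`, the `min_count` parameter, and the sample tweets are all hypothetical. It also shows that rebuilding the mapping with a rare-term filter can yield a smaller vocabulary with different ids.

```python
from collections import Counter

def build_token2id(docs, min_count=1):
    """Assign consecutive ids to tokens that occur in at least min_count documents."""
    doc_freq = Counter(token for doc in docs for token in set(doc))
    kept = sorted(tok for tok, df in doc_freq.items() if df >= min_count)
    return {tok: i for i, tok in enumerate(kept)}

def to_bow(doc, token2id):
    """Turn one tokenized document into sorted (token_id, count) pairs, skipping unknown tokens."""
    counts = Counter(token2id[tok] for tok in doc if tok in token2id)
    return sorted(counts.items())

# Hypothetical sample data: one inner list per tweet/document.
docs_of_tokens = [
    ["lda", "topic", "model"],
    ["topic", "coherence", "score"],
    ["lda", "topic", "topic"],
]

full = build_token2id(docs_of_tokens)                   # no filtering
filtered = build_token2id(docs_of_tokens, min_count=2)  # drop tokens seen in fewer than 2 documents

print(full)
print(filtered)
print(to_bow(docs_of_tokens[2], full))
```

In this toy data, id 4 exists in the unfiltered mapping but not in the filtered one, so ids produced against one mapping cannot safely be looked up in the other.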

The error message I get is long, but I am posting the whole trace. See the last line.

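My questions are simply:

What does the KeyError mean? Which token and/or key is causing the problem, and how do I fix it?

What is actually being evaluated when get_coherence() is called? I am trying to read up on coherence scores, but honestly I don't really know what is happening with my corpus (represented as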

Traceback (most recent call last):
  File "text_mining.py", line 394, in <module>
    main()
  File "text_mining.py", line 388, in main
    evaluate_LDA_model("lda_topic_model5", 3)
  File "text_mining.py", line 351, in evaluate_LDA_model
    coherence_lda = coherence_model_var.get_coherence()
  File "C:\Users\biney\Miniconda3\lib\site-packages\gensim\models\coherencemodel.py", line 603, in get_coherence
    confirmed_measures = self.get_coherence_per_topic()
  File "C:\Users\biney\Miniconda3\lib\site-packages\gensim\models\coherencemodel.py", line 563, in get_coherence_per_topic
    self.estimate_probabilities(segmented_topics)
  File "C:\Users\biney\Miniconda3\lib\site-packages\gensim\models\coherencemodel.py", line 535, in estimate_probabilities
    self._accumulator = self.measure.prob(**kwargs)
  File "C:\Users\biney\Miniconda3\lib\site-packages\gensim\topic_coherence\probability_estimation.py", line 138, in p_boolean_sliding_window
    accumulator = ParallelWordOccurrenceAccumulator(processes, top_ids, dictionary)
  File "C:\Users\biney\Miniconda3\lib\site-packages\gensim\topic_coherence\text_analysis.py", line 424, in __init__
    super(ParallelWordOccurrenceAccumulator, self).__init__(*args)
  File "C:\Users\biney\Miniconda3\lib\site-packages\gensim\topic_coherence\text_analysis.py", line 280, in __init__
    super(WindowedTextsAnalyzer, self).__init__(relevant_ids, dictionary)
  File "C:\Users\biney\Miniconda3\lib\site-packages\gensim\topic_coherence\text_analysis.py", line 185, in __init__
    self.relevant_words = _ids_to_words(self.relevant_ids, dictionary)
  File "C:\Users\biney\Miniconda3\lib\site-packages\gensim\topic_coherence\text_analysis.py", line 60, in _ids_to_words
    word = dictionary.id2token[word_id]
KeyError: 9212
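
The last frame is a plain mapping lookup, `dictionary.id2token[word_id]`, failing for id 9212, which means that id is not a key of the mapping being consulted. One possible cause (an assumption, not something the post confirms) is that the saved model was trained against a different dictionary than the one loaded from disk, for example one built before the rare-term filtering was added. The sketch below uses plain dicts and lists as stand-ins for the gensim objects; `find_missing_ids` and the sample data are hypothetical.

```python
def find_missing_ids(topic_word_ids, id2token):
    """Return the sorted word ids used by any topic that the id->token mapping lacks."""
    used = {word_id for topic in topic_word_ids for word_id in topic}
    return sorted(used - set(id2token))

# Stand-in for the dictionary loaded from disk (hypothetical, tiny vocabulary).
id2token = {0: "lda", 1: "topic", 2: "model"}

# Stand-in for the top word ids per topic of a model trained on a larger vocabulary.
topic_word_ids = [[0, 1], [2, 9212]]

missing = find_missing_ids(topic_word_ids, id2token)
print("ids missing from dictionary:", missing)

try:
    words = [id2token[i] for i in topic_word_ids[1]]
except KeyError as err:
    # Same kind of failure as in the traceback: the key is the unknown word id.
    print("KeyError:", err)
```

With the real objects, a comparable sanity check might be to compare the model's vocabulary size (`lda_model.num_terms`) against `len(doc_dict)`, and to make sure the dictionary file being loaded is the one the model was actually trained with.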

0 answers:

No answers yet.