Question

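I have a set of 3,000 text documents, and I want to extract the top 300 keywords (which may be single words or multi-word phrases).

I have tried the following approaches:

RAKE: a Python-based keyword extraction library. It failed.

Tf-Idf: it gave me good keywords for each document, but we cannot aggregate them to find keywords that represent the whole set of documents. Also, simply picking the top k words from each document by Tf-Idf score won't help, right?

Word2vec: I was able to do some cool things such as finding similar words, but I don't know how to use it to find important keywords.

Can you suggest some good approaches (or explain in detail how to improve any of the above) to solve this problem? Thanks :)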

Answer 1

You are better off choosing those 300 words manually (it's not that many, and you only do it once). Code written in Python 3:

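import os

# Only regular files; directories returned by os.listdir() cannot be opened
files = [f for f in os.listdir() if os.path.isfile(f)]
topWords = ["word1", "word2.... etc"]
wordsCount = 0
for file in files:
    # Split the text into words so the membership test matches whole words
    with open(file, "r", errors="ignore") as file_opened:
        words = set(file_opened.read().split())
    for word in topWords:
        if word in words and wordsCount < 300:
            print("I found %s" % word)
            wordsCount += 1
    # Stop scanning files once 300 matches have been reported
    if wordsCount >= 300:
        break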

Answer 2

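The simplest and most effective way is to apply a tf-idf implementation to find the most important words. If you have stop words, you can filter them out before applying this code. Hope this works for you.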

import java.util.List;

/**
 * Class to calculate TfIdf of term.
 * @author Mubin Shrestha
 */
public class TfIdf {

    /**
     * Calculates the tf of term termToCheck
     * @param totalterms : Array of all the words under processing document
     * @param termToCheck : term of which tf is to be calculated.
     * @return tf(term frequency) of term termToCheck
     */
    public double tfCalculator(String[] totalterms, String termToCheck) {
        double count = 0;  //to count the overall occurrence of the term termToCheck
        for (String s : totalterms) {
            if (s.equalsIgnoreCase(termToCheck)) {
                count++;
            }
        }
        return count / totalterms.length;
    }

    /**
     * Calculates idf of term termToCheck
     * @param allTerms : all the terms of all the documents, one String[] per document
     * @param termToCheck : term of which idf is to be calculated.
     * @return idf(inverse document frequency) score
     */
    public double idfCalculator(List<String[]> allTerms, String termToCheck) {
        double count = 0;
        for (String[] ss : allTerms) {
            for (String s : ss) {
                if (s.equalsIgnoreCase(termToCheck)) {
                    count++;
                    break;
                }
            }
        }
        return 1 + Math.log(allTerms.size() / count);
    }
}
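
The two methods above score one term at a time. As a rough sketch of how they might be combined to rank terms for the whole collection, here is a small Python version. Summing the per-document tf-idf scores is an assumption of this sketch, not something this answer prescribes, and `top_keywords` is a hypothetical helper name.

```python
import math
from collections import Counter

def top_keywords(docs, k):
    """Rank terms by the sum of their per-document tf-idf scores.

    docs: list of token lists. Summing across documents is one possible
    aggregation (an assumption); the mean or the max are alternatives.
    """
    n_docs = len(docs)
    # Document frequency: number of documents containing each term
    df = Counter(term for doc in docs for term in set(doc))
    scores = Counter()
    for doc in docs:
        if not doc:
            continue
        for term, count in Counter(doc).items():
            tf = count / len(doc)
            idf = 1 + math.log(n_docs / df[term])  # same formula as idfCalculator
            scores[term] += tf * idf
    return [term for term, _ in scores.most_common(k)]

docs = [
    ["keyword", "extraction", "from", "text"],
    ["text", "mining", "and", "keyword", "ranking"],
    ["topic", "models", "for", "text"],
]
result = top_keywords(docs, 3)
print(result)
```

Because this idf never drops below 1, words that occur in every document can still rank highly, so filtering stop words first matters.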

Answer 3

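Although Latent Dirichlet allocation and Hierarchical Dirichlet Process are typically used to derive topics within a text corpus and then to classify individual entries with those topics, a method for deriving keywords for the entire corpus can also be developed. This method has the advantage of not relying on another text corpus. A basic workflow would be to compare these Dirichlet keywords with the most common words, to see whether LDA or HDP can pick up important words that are not among the most common ones.

Before using the following code, it is generally suggested that the text be preprocessed as follows:

Remove punctuation from the texts (see string.punctuation)
Convert the string texts to "tokens" (str.lower().split() to get individual words)
Remove numbers and stop words (see stopwordsiso or stop_words)
Create bigrams: combinations of words that frequently appear together in the text (see gensim.models.Phrases)
Lemmatize the tokens: convert words to their base forms (see spacy or NLTK)
Remove tokens that are not frequent enough (or that are too frequent, although in this case skip removing the too-frequent ones, as they would make good keywords)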
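
As a rough illustration of some of these steps using only the standard library, here is a minimal sketch. The tiny `STOP_WORDS` set, the `preprocess` helper and its `min_count` parameter are assumptions made for this example; bigram creation and lemmatization are left out because they need the libraries mentioned above.

```python
import string
from collections import Counter

# Tiny illustrative stop list (an assumption); use a real list in practice
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}

def preprocess(texts, min_count=2):
    """Turn raw strings into token lists: strip punctuation, lowercase,
    split, drop numbers and stop words, then drop rare tokens."""
    table = str.maketrans("", "", string.punctuation)
    tokenized = []
    for text in texts:
        tokens = text.translate(table).lower().split()
        tokens = [t for t in tokens if not t.isdigit() and t not in STOP_WORDS]
        tokenized.append(tokens)
    # Count tokens over all texts and keep those seen at least min_count times
    counts = Counter(t for tokens in tokenized for t in tokens)
    return [[t for t in tokens if counts[t] >= min_count] for tokens in tokenized]

texts = [
    "The model extracts keywords in 2 steps.",
    "Keywords summarize the model, and the corpus.",
]
toy_corpus = preprocess(texts)
print(toy_corpus)
```

The result is a list of token lists, which is the shape the variable corpus is expected to have.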

These steps create the variable corpus used in what follows. A good overview of LDA can be found here.

Now for LDA and HDP with gensim:

from gensim.models import LdaModel, HdpModel
from gensim import corpora

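First create a dirichlet dictionary that maps the words in corpus to indexes, then use it to create a bag of words in which the tokens in corpus are replaced by their indexes. This is done via: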

dirichlet_dict = corpora.Dictionary(corpus)
bow_corpus = [dirichlet_dict.doc2bow(text) for text in corpus]
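
To make the bag-of-words step concrete, here is a dependency-free sketch of the kind of structure it produces: one list of (id, count) tuples per document. It only mimics the shape of the output and is not gensim's implementation; `build_vocab` and `to_bow` are hypothetical names, and the ids gensim assigns may differ.

```python
from collections import Counter

def build_vocab(corpus):
    """Assign each distinct token an integer id in order of first appearance."""
    vocab = {}
    for tokens in corpus:
        for token in tokens:
            vocab.setdefault(token, len(vocab))
    return vocab

def to_bow(tokens, vocab):
    """Represent one document as sorted (token_id, count) tuples."""
    counts = Counter(vocab[t] for t in tokens if t in vocab)
    return sorted(counts.items())

toy_corpus = [["topic", "model", "topic"], ["model", "keyword"]]
vocab = build_vocab(toy_corpus)
toy_bow = [to_bow(tokens, vocab) for tokens in toy_corpus]
print(vocab)
print(toy_bow)
```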

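For LDA, the optimal number of topics needs to be derived, which can be done heuristically using the method in this answer. Let's assume that our optimal number of topics is 10, and that, as in the question, we want 300 keywords: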

num_topics = 10
num_keywords = 300

Creating the LDA model:

dirichlet_model = LdaModel(corpus=bow_corpus,
                           id2word=dirichlet_dict,
                           num_topics=num_topics,
                           update_every=1,
                           chunksize=len(bow_corpus),
                           passes=20,
                           alpha='auto')

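Next is a function that derives the best topics based on their average coherence across the corpus. First, an ordered list of the most important words in each topic is produced; then the average coherence of each topic with the whole corpus is found; finally, the topics are ordered by this average coherence and returned together with the list of averages, which is used later. The code for all of this is as follows (it includes the option of using HDP, described below):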

def order_subset_by_coherence(dirichlet_model, bow_corpus, num_topics=10, num_keywords=10):
    """
    Orders topics based on their average coherence across the corpus

    Parameters
    ----------
        dirichlet_model : gensim.models.type_of_model
        bow_corpus : list of lists (contains (id, freq) tuples)
        num_topics : int (default=10)
        num_keywords : int (default=10)

    Returns
    -------
        ordered_topics, ordered_topic_averages: list of lists and list
    """
    # isinstance with the imported classes avoids needing a separate `import gensim`
    if isinstance(dirichlet_model, LdaModel):
        shown_topics = dirichlet_model.show_topics(num_topics=num_topics,
                                                   num_words=num_keywords,
                                                   formatted=False)
    elif isinstance(dirichlet_model, HdpModel):
        shown_topics = dirichlet_model.show_topics(num_topics=150, # return all topics
                                                   num_words=num_keywords,
                                                   formatted=False)
    else:
        raise TypeError("dirichlet_model must be an LdaModel or HdpModel")
    model_topics = [[word[0] for word in topic[1]] for topic in shown_topics]
    topic_corpus = dirichlet_model.__getitem__(bow=bow_corpus, eps=0) # cutoff probability to 0 

    topics_per_response = [response for response in topic_corpus]
    flat_topic_coherences = [item for sublist in topics_per_response for item in sublist]

    significant_topics = list(set([t_c[0] for t_c in flat_topic_coherences])) # those that appear
    topic_averages = [sum([t_c[1] for t_c in flat_topic_coherences if t_c[0] == topic_num]) / len(bow_corpus) \
                      for topic_num in significant_topics]

    topic_indexes_by_avg_coherence = [tup[0] for tup in sorted(enumerate(topic_averages), key=lambda i:i[1])[::-1]]

    significant_topics_by_avg_coherence = [significant_topics[i] for i in topic_indexes_by_avg_coherence]
    ordered_topics = [model_topics[i] for i in significant_topics_by_avg_coherence][:num_topics] # limit for HDP

    ordered_topic_averages = [topic_averages[i] for i in topic_indexes_by_avg_coherence][:num_topics] # limit for HDP
    ordered_topic_averages = [a/sum(ordered_topic_averages) for a in ordered_topic_averages] # normalize HDP values

    return ordered_topics, ordered_topic_averages
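
The averaging and ordering at the heart of this function can be seen in isolation with toy data. This is a simplified sketch with no gensim involved: `average_topic_weights` is a hypothetical helper, and the per-document (topic_id, weight) lists are made up for illustration.

```python
from collections import defaultdict

def average_topic_weights(doc_topics):
    """Average each topic's weight over all documents and sort descending.

    doc_topics: one list of (topic_id, weight) tuples per document.
    """
    totals = defaultdict(float)
    for topics in doc_topics:
        for topic_id, weight in topics:
            totals[topic_id] += weight
    # Divide by the number of documents, so a topic missing from a
    # document contributes zero to its average
    averages = {t: total / len(doc_topics) for t, total in totals.items()}
    return sorted(averages.items(), key=lambda item: item[1], reverse=True)

toy_doc_topics = [
    [(0, 0.25), (1, 0.75)],
    [(0, 0.5), (1, 0.5)],
    [(1, 1.0)],
]
ranked = average_topic_weights(toy_doc_topics)
print(ranked)
```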

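Now to get the list of keywords, that is, the most important words across the topics. This is done by taking a subset of the words from each ordered topic (the words are, by default, again ordered by importance), sized according to that topic's average coherence with the whole corpus. To be explicit, assume there are just two topics, and the texts are 70% coherent with the first topic and 30% with the second. Then 70% of the keywords could be the top words of the first topic, and 30% the top words of the second topic that have not already been selected. This is achieved via the following: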

ordered_topics, ordered_topic_averages = \
    order_subset_by_coherence(dirichlet_model=dirichlet_model,
                              bow_corpus=bow_corpus, 
                              num_topics=num_topics,
                              num_keywords=num_keywords)

ignore_words = []  # words that should not appear in the results

keywords = []
for i in range(len(ordered_topics)):
    # Find the number of indexes to select, which can later be extended if the word has already been selected
    selection_indexes = list(range(int(round(num_keywords * ordered_topic_averages[i]))))
    if selection_indexes == [] and len(keywords) < num_keywords:
        # Fix potential rounding error by giving this topic one selection
        selection_indexes = [0]

    for s_i in selection_indexes:
        if s_i >= len(ordered_topics[i]):
            break  # this topic has no more words to offer
        if ordered_topics[i][s_i] not in keywords and ordered_topics[i][s_i] not in ignore_words:
            keywords.append(ordered_topics[i][s_i])
        else:
            selection_indexes.append(selection_indexes[-1] + 1)

# Fix for if too many were selected
keywords = keywords[:num_keywords]

The above also includes the variable ignore_words, which is a list of words that should not be included in the results.
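
To see the 70%/30% split in action, here is a simplified restatement of the selection loop run on two made-up topics. It is a sketch rather than the exact code above: `allocate_keywords` is a hypothetical helper, and the topic word lists are invented for illustration.

```python
def allocate_keywords(ordered_topics, ordered_topic_averages, num_keywords):
    """Pick words topic by topic, giving each topic a share of the
    keyword budget proportional to its average weight."""
    keywords = []
    for words, weight in zip(ordered_topics, ordered_topic_averages):
        # Every topic gets at least one pick, mirroring the rounding fix above
        quota = max(1, int(round(num_keywords * weight)))
        taken = 0
        for word in words:
            if taken == quota or len(keywords) == num_keywords:
                break
            if word not in keywords:  # skip words an earlier topic supplied
                keywords.append(word)
                taken += 1
    return keywords

toy_topics = [
    ["data", "model", "train", "loss", "layer", "batch", "epoch", "grad", "node", "unit"],
    ["model", "data", "query", "index", "table", "join", "row", "key", "scan", "plan"],
]
toy_keywords = allocate_keywords(toy_topics, [0.7, 0.3], num_keywords=10)
print(toy_keywords)
```

The first topic supplies its top seven words; the second skips "model" and "data", which were already taken, and supplies its next three.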

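For HDP, the model follows a similar process to the above, except that num_topics and the other arguments do not need to be passed when creating the model. HDP derives the optimal topics itself, but these topics then need to be ordered and subsetted with order_subset_by_coherence to ensure that the best topics are used for a finite selection. The model is created via: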

dirichlet_model = HdpModel(corpus=bow_corpus, 
                           id2word=dirichlet_dict,
                           chunksize=len(bow_corpus))

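It is best to test both LDA and HDP, because LDA can outperform HDP, depending on the needs of the problem, if a suitable number of topics can be found (it is still the standard over HDP). Compare the Dirichlet keywords with word frequencies alone; hopefully what is generated is a list of keywords that relate more to the overall theme of the text than simply the most common words do.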
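
One simple way to run that comparison is to list the keywords that do not appear among the most frequent tokens. This is a minimal sketch with made-up data; `not_among_most_common` and its `top_n` parameter are hypothetical names chosen for this example.

```python
from collections import Counter

def not_among_most_common(keywords, corpus, top_n):
    """Return the keywords that are absent from the top_n most frequent tokens."""
    counts = Counter(token for tokens in corpus for token in tokens)
    most_common = {token for token, _ in counts.most_common(top_n)}
    return [word for word in keywords if word not in most_common]

toy_corpus = [
    ["model", "data", "model", "topic"],
    ["data", "model", "keyword"],
    ["data", "topic", "model"],
]
toy_keywords = ["model", "keyword", "topic"]
extra = not_among_most_common(toy_keywords, toy_corpus, top_n=2)
print(extra)
```

If this list is empty for the real corpus, the topic model has added nothing beyond plain frequency counts.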

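Obviously, selecting ordered words from topics according to percentage text coherence does not order the keywords by overall importance, because some words that are very important in topics with weaker overall coherence will only be selected later.

The process for using LDA to generate keywords for the individual texts within the corpus can be found in this answer.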

Answer 4

import os
import operator
from collections import defaultdict

# Only regular files; directories returned by os.listdir() cannot be opened
files = [f for f in os.listdir() if os.path.isfile(f)]
words = defaultdict(int)
for file in files:
    with open(file, "r", errors="ignore") as open_file:
        for line in open_file:
            for word in line.split():
                words[word] += 1
# Sort by count, most frequent first
sorted_words = sorted(words.items(), key=operator.itemgetter(1), reverse=True)

Now take the top 300 from the sorted words; those are the words you want.
