Retrieving the most important terms of a document

Date: 2014-03-13 17:20:21

Tags: search solr lucene tf-idf

I'm looking for a simple way to get a list of the 5-10 most important terms that describe a particular document. It could even be based on a specific field, say the item description.

I figured this should be easy. After all, Solr scores every term based on its relative frequency within the document versus its overall frequency across all documents (tf-idf).

However, I can't find a way to pass a document to Solr and get back the list of terms I want.

3 Answers:

Answer 0 (Score: 2)

If you just need the top terms in a document, you can use the Term Vector Component, assuming your field has termVectors="true". You can request tv.tf_idf and take the top n terms with the highest scores.
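For illustration, a sketch of what the schema entry and request could look like. The field name `description`, the collection name, and the handler path are assumptions; the Term Vector Component must be registered on the request handler you use:

```text
<!-- schema.xml: enable term vectors on the field -->
<field name="description" type="text_general"
       indexed="true" stored="true" termVectors="true"/>

# request: ask the term vector component for tf-idf per term
http://localhost:8983/solr/collection1/tvrh?q=id:mydoc&tv.fl=description&tv.tf_idf=true
```

The response lists every term in the field together with its tf-idf value; sorting those client-side gives you the top n terms.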

Answer 1 (Score: 0)

You might be looking for the MoreLikeThis component, in particular with the mlt.interestingTerms flag enabled.
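As a hedged sketch, a MoreLikeThis request with interesting terms enabled could look like this (the collection name, field name, and document id are placeholders for illustration):

```text
http://localhost:8983/solr/collection1/mlt?q=id:mydoc&mlt.fl=description&mlt.interestingTerms=details&mlt.mintf=1
```

With mlt.interestingTerms=details, the response includes the terms MLT considered most significant for the matched document, along with their boost values.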

Answer 2 (Score: 0)

I suspect you may want to go after certain types of words, typically nouns. I once did something similar for a clustering routine, where I used the OpenNLP part-of-speech tagger to extract all noun phrases (using the chunker or the POS tagger) and simply put each term into a HashMap.

Below is some code that uses sentence chunking, but adapting it to plain parts of speech should be a simple change (let me know if you need help, though).

What the code does is extract the part of speech of every word, chunk the tagged words, loop over the chunks to pull out the noun phrases, and then add them to a term-frequency hash map. Really simple. You could skip all the OpenNLP stuff entirely, but then you'll want to do a lot of noise removal and so on. Anyway... take a look.

import java.io.File;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.util.HashMap;
import opennlp.tools.chunker.ChunkerME;
import opennlp.tools.chunker.ChunkerModel;
import opennlp.tools.postag.POSModel;
import opennlp.tools.postag.POSTaggerME;
import opennlp.tools.tokenize.TokenizerME;
import opennlp.tools.tokenize.TokenizerModel;
import opennlp.tools.util.Span;

/**
 *
 * Extracts noun phrases from a sentence. To create sentences using OpenNLP use
 * the SentenceDetector classes.
 */
public class OpenNLPNounPhraseExtractor {


  public static void main(String[] args) {

    try {
      HashMap<String, Integer> termFrequencies = new HashMap<>();
      String modelPath = "c:\\temp\\opennlpmodels\\";
      TokenizerModel tm = new TokenizerModel(new FileInputStream(new File(modelPath + "en-token.zip")));
      TokenizerME wordBreaker = new TokenizerME(tm);
      POSModel pm = new POSModel(new FileInputStream(new File(modelPath + "en-pos-maxent.zip")));
      POSTaggerME posme = new POSTaggerME(pm);
      InputStream modelIn = new FileInputStream(modelPath + "en-chunker.zip");
      ChunkerModel chunkerModel = new ChunkerModel(modelIn);
      ChunkerME chunkerME = new ChunkerME(chunkerModel);
      //this is your sentence
      String sentence = "Barack Hussein Obama II  is the 44th awesome President of the United States, and the first African American to hold the office.";
      //words is the tokenized sentence
      String[] words = wordBreaker.tokenize(sentence);
      //posTags are the parts of speech of every word in the sentence (The chunker needs this info of course)
      String[] posTags = posme.tag(words);
      //chunks are the start end "spans" indices to the chunks in the words array
      Span[] chunks = chunkerME.chunkAsSpans(words, posTags);
      //chunkStrings are the actual chunks
      String[] chunkStrings = Span.spansToStrings(chunks, words);
      for (int i = 0; i < chunks.length; i++) {
        String np = chunkStrings[i];
        if (chunks[i].getType().equals("NP")) {
          if (termFrequencies.containsKey(np)) {
            termFrequencies.put(np, termFrequencies.get(np) + 1);
          } else {
            termFrequencies.put(np, 1);
          }
        }
      }
      System.out.println(termFrequencies);

    } catch (IOException e) {
      // don't swallow the exception silently; report what failed
      e.printStackTrace();
    }
  }

}
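The answer above notes that you could skip OpenNLP entirely at the cost of more noise. As a minimal sketch of that route (the class name, the tiny stop list, and the split-on-non-letters tokenization are my own illustrative choices, not part of the original answer):

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/**
 * Counts term frequencies without any NLP tooling: lowercase the text,
 * split on non-letter characters, and drop a small stop list.
 */
public class SimpleTermCounter {

  // tiny illustrative stop list; a real one would be much larger
  private static final Set<String> STOP_WORDS = Set.of(
      "the", "a", "an", "and", "of", "to", "is", "in", "on");

  public static Map<String, Integer> termFrequencies(String text) {
    Map<String, Integer> tf = new HashMap<>();
    for (String token : text.toLowerCase().split("[^a-z]+")) {
      if (token.isEmpty() || STOP_WORDS.contains(token)) {
        continue; // skip empty fragments and stop words
      }
      tf.merge(token, 1, Integer::sum); // increment, defaulting to 1
    }
    return tf;
  }

  public static void main(String[] args) {
    System.out.println(termFrequencies(
        "Barack Hussein Obama II is the 44th President of the United States."));
  }
}
```

Sorting the resulting map entries by value then gives you the top terms; the tf-idf weighting from the question would still have to come from the index, since a single document has no document-frequency information.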