Accessing words around a positional match in Lucene

Date: 2014-09-12 18:20:35

Tags: java lucene position term posting

Given a term match in a document, what is the best way to access the words around that match? I have read this article, http://searchhub.org//2009/05/26/accessing-words-around-a-positional-match-in-lucene/, but the problem is that the Lucene API has changed completely since it was written (2009). Can someone show me how to do this in a newer version of Lucene, such as Lucene 4.6.1?

Edit

I have figured it out now. The postings APIs (TermEnum, TermDocsEnum, TermPositionsEnum) were removed in favor of the new flexible indexing (flex) APIs (Fields, FieldsEnum, Terms, TermsEnum, DocsEnum, DocsAndPositionsEnum). One big difference is that fields and terms are now enumerated separately: a TermsEnum provides a BytesRef (wrapping a byte[]) per term within a single field, rather than a Term. Another is that when you ask for a Docs/AndPositionsEnum, you now specify the skipDocs explicitly (usually these will be the deleted documents, but in general you can pass any Bits).
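As a minimal sketch of that enumeration pattern (assuming a Lucene 4.x IndexReader named reader and the "content" field from the example below; MultiFields is the convenience entry point for a merged view over all segments):

Bits liveDocs = MultiFields.getLiveDocs(reader); // null when the index has no deletions
Terms terms = MultiFields.getTerms(reader, "content");
if (terms != null) {
  TermsEnum termsEnum = terms.iterator(null);
  BytesRef term;
  while ((term = termsEnum.next()) != null) { // terms arrive as BytesRef, not Term
    DocsAndPositionsEnum postings = termsEnum.docsAndPositions(liveDocs, null);
    while (postings != null && postings.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
      for (int j = 0; j < postings.freq(); j++) {
        System.out.println(term.utf8ToString() + " doc=" + postings.docID() + " pos=" + postings.nextPosition());
      }
    }
  }
}

The full working example: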

import java.io.IOException;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.FieldType;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.DocsAndPositionsEnum;
import org.apache.lucene.index.Fields;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermContext;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.search.DocIdSetIterator;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.spans.SpanTermQuery;
import org.apache.lucene.search.spans.Spans;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.BytesRef;
import org.apache.lucene.util.Version;

public class TermVectorFun {
  public static String[] DOCS = {
    "The quick red fox jumped over the lazy brown dogs.",
    "Mary had a little lamb whose fleece was white as snow.",
    "Moby Dick is a story of a whale and a man obsessed.",
    "The robber wore a black fleece jacket and a baseball cap.",
    "The English Springer Spaniel is the best of all dogs.",
    "The fleece was green and red",
        "History looks fondly upon the story of the golden fleece, but most people don't agree"
  };

  public static void main(String[] args) throws IOException {
    RAMDirectory ramDir = new RAMDirectory();
    IndexWriterConfig config = new IndexWriterConfig(Version.LUCENE_46, new StandardAnalyzer(Version.LUCENE_46));
    config.setOpenMode(IndexWriterConfig.OpenMode.CREATE);
    //Index some made up content
    IndexWriter writer = new IndexWriter(ramDir, config);
    for (int i = 0; i < DOCS.length; i++) {
      Document doc = new Document();
      Field id = new StringField("id", "doc_" + i, Field.Store.YES); // indexed as a single token, norms omitted
      doc.add(id);
      //Store term vectors with both position and offset information
      FieldType contentType = new FieldType(TextField.TYPE_NOT_STORED);
      contentType.setStoreTermVectors(true);
      contentType.setStoreTermVectorPositions(true);
      contentType.setStoreTermVectorOffsets(true);
      Field text = new Field("content", DOCS[i], contentType);
      doc.add(text);
      writer.addDocument(doc);
    }
    writer.close();
    //Get a searcher

    DirectoryReader dirReader = DirectoryReader.open(ramDir);
    IndexSearcher searcher = new IndexSearcher(dirReader);
    // Do a search using SpanQuery
    SpanTermQuery fleeceQ = new SpanTermQuery(new Term("content", "fleece"));
    TopDocs results = searcher.search(fleeceQ, 10);
    for (int i = 0; i < results.scoreDocs.length; i++) {
      ScoreDoc scoreDoc = results.scoreDocs[i];
      System.out.println("Score Doc: " + scoreDoc);
    }
    IndexReader reader = searcher.getIndexReader();
    // getSpans runs against a single segment; this example assumes the freshly
    // built index has only one, so spans.doc() is usable as a top-level doc ID.
    Spans spans = fleeceQ.getSpans(reader.leaves().get(0), null, new LinkedHashMap<Term, TermContext>());
    int window = 2;//get the words within two of the match
    while (spans.next()) {
      int start = spans.start() - window;
      int end = spans.end() + window;
      Map<Integer, String> entries = new TreeMap<Integer, String>();

      System.out.println("Doc: " + spans.doc() + " Start: " + start + " End: " + end);
      Fields fields = reader.getTermVectors(spans.doc());
      Terms terms = fields.terms("content");

      TermsEnum termsEnum = terms.iterator(null);
      BytesRef text;
      while((text = termsEnum.next()) != null) {        
        //could store the BytesRef here, but String is easier for this example
        String s = text.utf8ToString(); // term bytes are UTF-8; decode them properly
        DocsAndPositionsEnum positionsEnum = termsEnum.docsAndPositions(null, null); // null liveDocs: nothing deleted to skip
        if (positionsEnum.nextDoc() != DocIdSetIterator.NO_MORE_DOCS) {
          for (int j = 0; j < positionsEnum.freq(); j++) {
            int position = positionsEnum.nextPosition();
            if (position >= start && position <= end) {
              entries.put(position, s);
            }
          }
        }
      }
      System.out.println("Entries:" + entries);
    }
  }
}
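One caveat about the code above: getSpans is evaluated per segment, so spans.doc() is a segment-local document ID. That works here only because a freshly built index like this one has a single segment. A hedged sketch of the multi-segment variant (same reader, fleeceQ, and term-vector walk as above; AtomicReaderContext.docBase rebases local IDs to top-level ones):

for (AtomicReaderContext leaf : reader.leaves()) {
  Spans spans = fleeceQ.getSpans(leaf, leaf.reader().getLiveDocs(), new LinkedHashMap<Term, TermContext>());
  while (spans.next()) {
    int globalDoc = leaf.docBase + spans.doc(); // top-level doc ID
    Fields fields = reader.getTermVectors(globalDoc);
    // ...same term-vector walk as in the example above...
  }
}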

1 Answer:

Answer 0 (score: 0)

Use a Highlighter. Highlighter.getBestFragment can be used to get a portion of the text containing the best match. Something like:

TopDocs docs = searcher.search(query, maxdocs);
Document firstDoc = searcher.doc(docs.scoreDocs[0].doc);

Scorer scorer = new QueryScorer(query);
Highlighter highlighter = new Highlighter(scorer);
highlighter.getBestFragment(myAnalyzer, fieldName, firstDoc.get(fieldName));
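
For completeness, a self-contained sketch of that approach (the lucene-highlighter module is assumed on the classpath; getBestFragment throws IOException and InvalidTokenOffsetsException, and returns null when nothing matches). Note that the question's example does not store the content field, so firstDoc.get(fieldName) would return null there; the field has to be stored, or the original text supplied from elsewhere:

StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
Query query = new TermQuery(new Term("content", "fleece"));
TopDocs docs = searcher.search(query, 10);
for (ScoreDoc sd : docs.scoreDocs) {
  String text = searcher.doc(sd.doc).get("content"); // requires Field.Store.YES on "content"
  Highlighter highlighter = new Highlighter(new QueryScorer(query));
  highlighter.setTextFragmenter(new SimpleFragmenter(40)); // ~40-char fragment around the hit
  System.out.println(sd.doc + ": " + highlighter.getBestFragment(analyzer, "content", text));
}

Unlike the term-vector approach in the question, the surrounding-text window here is measured in characters by the fragmenter, not in term positions.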