Question

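Hello, I am a Java developer learning Lucene. I have one Java class that indexes a PDF file (lucene_in_action_2nd_edition.pdf) and a search class that searches the index for some text. IndexSearcher returns a Document, which shows that the string is present in the index (lucene_in_action_2nd_edition.pdf).

But now I want to get the searched data or metadata, i.e. I want to know on which page the string was matched, or some of the text around the matched string, etc. How can I do this?

Here is my LuceneSearcher.java class: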

public static void main(String[] args) throws Exception {
    File indexDir = new File("D:\\index");

    String querystr = "Advantages of FastVectorHighlighter";
    Query q = new QueryParser(Version.LUCENE_40, "contents",
            new StandardAnalyzer(Version.LUCENE_40)).parse(querystr);

    int hitsPerPage = 100;
    IndexReader reader = DirectoryReader.open(FSDirectory.open(indexDir));
    IndexSearcher searcher = new IndexSearcher(reader);
    TopScoreDocCollector collector = TopScoreDocCollector.create(
            hitsPerPage, true);
    searcher.search(q, collector);
    ScoreDoc[] hits = collector.topDocs().scoreDocs;

    System.out.println("Found " + hits.length + " hits.");
    for (int i = 0; i < hits.length; i++) {
        int docId = hits[i].doc;
        Document d = searcher.doc(docId);
        System.out.println((i + 1) + "... " + d.get("filename"));
        System.out.println("=====================================================");
        System.out.println(d.get("contents"));


    }

    // reader can only be closed when there
    // is no need to access the documents any more.
    reader.close();
}

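Here d.get("contents") returns the whole text of the .pdf file (generated by Tika), which was stored at index time.

I want some information about the searched text so that I can show it on my web page or highlight the matched text properly (like Google search output). How can this be achieved? Do we need to write some logic ourselves, or does Lucene do it internally?

Any kind of help would be appreciated. Thanks in advance.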

Answer 1

The org.apache.lucene.search.highlight package provides this functionality.

For example:

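// Use the same kind of analyzer that was used for indexing and query parsing.
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
SimpleHTMLFormatter htmlFormatter = new SimpleHTMLFormatter();
// q is the parsed Query from the question's code.
Highlighter highlighter = new Highlighter(htmlFormatter, new QueryScorer(q));
for (int i = 0; i < hits.length; i++) {
    int docId = hits[i].doc;
    Document d = searcher.doc(docId);
    // "contents" must be a stored field for d.get to return its text.
    String text = d.get("contents");
    // Best-scoring fragment with matches marked up, or null if nothing matched.
    String bestFrag = highlighter.getBestFragment(analyzer, "contents", text);
    // Output it however you like.
}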

If you prefer, you can also get a list of the best fragments from the Highlighter rather than just one; see the Highlighter API.
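
As a rough sketch of the multi-fragment variant, the helper below asks the Highlighter for several snippets at once via getBestFragments. The class name SnippetSketch, the method name bestSnippets, the <em> tags and the 150-character fragment size are all illustrative assumptions, not part of the original answer. It also assumes the text comes from a stored field and that the analyzer matches the one used at index time.

```java
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.highlight.Highlighter;
import org.apache.lucene.search.highlight.InvalidTokenOffsetsException;
import org.apache.lucene.search.highlight.QueryScorer;
import org.apache.lucene.search.highlight.SimpleFragmenter;
import org.apache.lucene.search.highlight.SimpleHTMLFormatter;

public class SnippetSketch {

    /**
     * Returns up to maxSnippets highlighted fragments of the given text.
     * Hypothetical helper: names and defaults here are assumptions.
     */
    static String[] bestSnippets(Query q, Analyzer analyzer, String field,
                                 String text, int maxSnippets)
            throws IOException, InvalidTokenOffsetsException {
        // Wrap each matched term in <em>...</em> (the default formatter uses <B>).
        SimpleHTMLFormatter formatter = new SimpleHTMLFormatter("<em>", "</em>");
        // Score fragments only against query terms that target this field.
        Highlighter highlighter = new Highlighter(formatter, new QueryScorer(q, field));
        // Split the text into fragments of roughly 150 characters (assumed size).
        highlighter.setTextFragmenter(new SimpleFragmenter(150));
        // Only fragments that actually contain a match are returned.
        return highlighter.getBestFragments(analyzer, field, text, maxSnippets);
    }
}
```

Inside the loop from the answer, this could be called as SnippetSketch.bestSnippets(q, analyzer, "contents", d.get("contents"), 3), and each returned string written to the page as one search-result snippet. Note that this yields text around the match, not a page number; the Highlighter works on the stored text only.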

Getting searched data/metadata in Lucene
