Question

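I have implemented Latent Semantic Analysis on top of Lucene.

The result of the algorithm is a two-column matrix, where the first column is the index of the document and the second column is the similarity.

I want to write this response into the org.apache.lucene.search.Collector that is passed to the Searcher's search method, but I don't know how to set the results in the collector object.

The code of the search method is: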

    public void search(Weight weight, Filter filter, Collector collector) throws IOException
    {
        String textQuery = weight.getQuery().toString("contents");
        System.out.println(textQuery);
        // ind holds one row per hit: {document index, similarity}
        double[][] ind = lsa.searchOnDoc(textQuery);
        if (ind != null)
        {
            // construct the collector object
            for (int i = 0; i < ind.length; i++)
            {
                int doc = (int) ind[i][0];
                double simi = ind[i][1];
                //collector.collect(doc);
                //collector.setScorer(simi);
                // This is the problem
            }
        }
        else
        {
            collector = null;
        }
    }

I don't know the correct steps for copying the values of ind into the Collector object.

Can you help me?

Answer 1

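I don't understand why you decided to push LSI into the Searcher. Getting the text query from the Weight looks especially shady - why not use the original query and skip all the (broken) conversions?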

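But handling the Collector works as follows. For each segment in the index:

1. Supply the corresponding SegmentReader with collector.setNextReader(reader, base). You can get these from the top-level reader with ir.getSequentialSubReaders() and ir.getSubReaderStarts(). Here:
   - base is the number added to segment/local docIDs (they start from 0 in each segment) to convert them to index/global docIDs.
2. Supply a Scorer implementation with collector.setScorer(scorer). The collector may use it in the next stage to get the score of each document. If the collector only counts results, sorts on some stored field, or just feels like it, the scorer is ignored. The only method a collector calls on the Scorer instance is scorer.score(), which should return the score of the document currently being collected (I kid you not).
3. Call collector.collect(id) repeatedly with a monotonically increasing sequence of segment/local docIDs that match the query.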
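
To make the call order concrete, here is a minimal, dependency-free sketch of that per-segment protocol. It does not use Lucene: SketchCollector and SketchScorer are hypothetical stand-ins that only mirror the shape of the calls described above (the real setNextReader also receives the segment's reader), and the segment starts, matches, and scores are made-up sample data.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the shape of a Lucene Collector; NOT the real class.
abstract class SketchCollector {
    abstract void setNextReader(int docBase); // the real call also passes the segment reader
    abstract void setScorer(SketchScorer scorer);
    abstract void collect(int localDoc);
}

// Hypothetical stand-in for Scorer: the collector only asks for the current score.
interface SketchScorer {
    float score();
}

// Records "globalDocID:score" for every collected document.
class RecordingCollector extends SketchCollector {
    final List<String> hits = new ArrayList<>();
    private int docBase;
    private SketchScorer scorer;

    @Override void setNextReader(int docBase) { this.docBase = docBase; }
    @Override void setScorer(SketchScorer scorer) { this.scorer = scorer; }
    @Override void collect(int localDoc) {
        // local docID + base = index-wide docID
        hits.add((docBase + localDoc) + ":" + scorer.score());
    }
}

class CollectorProtocolSketch {
    static List<String> run() {
        int[] segmentStarts = {0, 100};             // sample data: two segments
        int[][] localMatches = {{3, 7}, {0, 42}};   // ascending local docIDs per segment
        float[][] scores = {{0.9f, 0.5f}, {0.8f, 0.1f}};

        RecordingCollector collector = new RecordingCollector();
        for (int seg = 0; seg < segmentStarts.length; seg++) {
            collector.setNextReader(segmentStarts[seg]);   // step 1
            final float[] current = new float[1];
            collector.setScorer(() -> current[0]);         // step 2
            for (int i = 0; i < localMatches[seg].length; i++) {
                current[0] = scores[seg][i];               // score of the doc about to be collected
                collector.collect(localMatches[seg][i]);   // step 3
            }
        }
        return collector.hits;
    }

    public static void main(String[] args) {
        System.out.println(run());
    }
}
```

The point of the sketch is the ordering: the base is set once per segment, the scorer is handed over before any collect() call, and the score is updated just before each document is collected.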

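Back to your code - make a wrapper that extends Scorer (it is an abstract class), use a single instance with a field that you update with simi on each iteration, have the wrapper's score() method return that field, and push this instance into the collector with setScorer() before the loop.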
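
A rough sketch of that wrapper idea, again without Lucene on the classpath. ScoreSource is a hypothetical stand-in for the single Scorer method the collector calls; the real abstract class also carries iterator methods and a constructor argument that depend on your Lucene version, so treat this only as the shape of the trick.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the one Scorer method a collector calls.
abstract class ScoreSource {
    abstract float score();
}

// One reusable instance; the loop overwrites `current` before each collect().
class FixedScore extends ScoreSource {
    float current;

    @Override
    float score() {
        return current;
    }
}

class ScorerWrapperSketch {
    // Walks the LSA result matrix ({docID, similarity} rows) and returns
    // what a collector would read back through score() for each row.
    static List<Float> replay(double[][] ind) {
        FixedScore scorer = new FixedScore(); // would go to setScorer() once, before the loop
        List<Float> seen = new ArrayList<>();
        for (double[] row : ind) {
            scorer.current = (float) row[1];  // update the field before collecting the doc
            seen.add(scorer.score());         // stands in for the collector calling score()
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(replay(new double[][] {{4, 0.75}, {9, 0.25}}));
    }
}
```

Reusing one mutable instance avoids allocating a scorer per hit, which matters little here but matches how the collector expects to be driven: one setScorer() call, then many collect() calls.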

You will also need lsa.searchOnDoc to return its results per segment.
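
If changing searchOnDoc itself is inconvenient, one possible alternative is to regroup its output afterwards. The sketch below assumes (this is not stated in the question) that the first column of ind holds index-wide docIDs and that you have the segment start offsets in ascending order; PerSegmentSplitSketch and split are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class PerSegmentSplitSketch {
    // ind:    rows of {global docID, similarity}, in any order
    // starts: global docID of the first document of each segment, ascending, starts[0] == 0
    // Returns, for each segment, rows of {local docID, similarity} in ascending local docID order.
    static List<List<double[]>> split(double[][] ind, int[] starts) {
        double[][] sorted = ind.clone();
        // collect() expects increasing docIDs, so order the hits by docID first
        Arrays.sort(sorted, Comparator.comparingDouble(row -> row[0]));

        List<List<double[]>> perSegment = new ArrayList<>();
        for (int i = 0; i < starts.length; i++) {
            perSegment.add(new ArrayList<>());
        }

        int seg = 0;
        for (double[] row : sorted) {
            int doc = (int) row[0];
            // advance to the segment whose range contains this global docID
            while (seg + 1 < starts.length && doc >= starts[seg + 1]) {
                seg++;
            }
            perSegment.get(seg).add(new double[] {doc - starts[seg], row[1]});
        }
        return perSegment;
    }

    public static void main(String[] args) {
        double[][] ind = {{105, 0.2}, {3, 0.9}, {100, 0.4}};
        for (List<double[]> segment : split(ind, new int[] {0, 100})) {
            for (double[] row : segment) {
                System.out.println(Arrays.toString(row));
            }
        }
    }
}
```

Each inner list can then be fed to the collector after the matching setNextReader call, subtracting nothing further since the rows already carry local docIDs.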
