Lucene DuplicateFilter question

Date: 2010-05-30 21:25:43

Tags: lucene

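Why can't DuplicateFilter be used together with other filters? For example, if the DuplicateFilterTest test is modified slightly, it looks as if the filter is not applied to the output of the other filters but instead trims the results first: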

    public void testKeepsLastFilter()
            throws Throwable {
        DuplicateFilter df = new DuplicateFilter(KEY_FIELD);
        df.setKeepMode(DuplicateFilter.KM_USE_LAST_OCCURRENCE);

        Query q = new ConstantScoreQuery(new ChainedFilter(new Filter[]{
                new QueryWrapperFilter(tq),
                // new QueryWrapperFilter(new TermQuery(new Term("text", "out"))), // works: this matches the last document.
                new QueryWrapperFilter(new TermQuery(new Term("text", "now"))) // why doesn't this work? It matches the third document, but the hit count is 0.
        }, ChainedFilter.AND));

        // these variants return no hits either:
        // ScoreDoc[] hits = searcher.search(new FilteredQuery(tq, df), new QueryWrapperFilter(new TermQuery(new Term("text", "now"))), 1000).scoreDocs;
        // ScoreDoc[] hits = searcher.search(new FilteredQuery(tq, new QueryWrapperFilter(new TermQuery(new Term("text", "now")))), df, 1000).scoreDocs;

        ScoreDoc[] hits = searcher.search(q, df, 1000).scoreDocs;

        assertTrue("Filtered searching should have found some matches", hits.length > 0);
        for (int i = 0; i < hits.length; i++) {
            Document d = searcher.doc(hits[i].doc);
            String url = d.get(KEY_FIELD);
            TermDocs td = reader.termDocs(new Term(KEY_FIELD, url));
            int lastDoc = 0;
            while (td.next()) {
                lastDoc = td.doc();
            }
            assertEquals("Duplicate urls should return last doc", lastDoc, hits[i].doc);
        }
    }

1 answer:

Answer 0 (score: 2)

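DuplicateFilter builds its filter independently: for each key, it selects either the first or the last occurrence among all documents containing that key. The result can be cached with minimal memory overhead.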

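Your second filter independently selects some other set of documents, and the two selections may not coincide. Filtering duplicates within an arbitrary subset of all documents would probably need a field cache to perform well, and that is where things get costly in terms of RAM.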
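
To make the mismatch concrete, here is a minimal sketch in plain Java that does not use Lucene at all. It assumes a hypothetical four-document index in which every document shares one key and only document 2 contains "now". The class `DuplicateFilterSketch`, its helper methods, and the data are illustrative assumptions, not Lucene API. It compares choosing the last occurrence over all documents and then intersecting, against choosing the last occurrence within the already filtered subset.

```java
import java.util.BitSet;
import java.util.HashMap;
import java.util.Map;

public class DuplicateFilterSketch {

    // For each key, keep only the highest doc id among the candidate docs.
    // keys[i] stands in for the KEY_FIELD value of document i (hypothetical data).
    static BitSet lastOccurrencePerKey(String[] keys, BitSet candidates) {
        Map<String, Integer> lastDoc = new HashMap<>();
        for (int doc = candidates.nextSetBit(0); doc >= 0; doc = candidates.nextSetBit(doc + 1)) {
            lastDoc.put(keys[doc], doc); // later docs overwrite earlier ones
        }
        BitSet result = new BitSet(keys.length);
        for (int doc : lastDoc.values()) {
            result.set(doc);
        }
        return result;
    }

    // Bit set with docs 0..n-1 all set, i.e. "no other filter applied".
    static BitSet allDocs(int n) {
        BitSet all = new BitSet(n);
        all.set(0, n);
        return all;
    }

    public static void main(String[] args) {
        String[] keys = {"url1", "url1", "url1", "url1"};

        // Hypothetical result of the second filter: only doc 2 contains "now".
        BitSet now = new BitSet(keys.length);
        now.set(2);

        // Independent selection: last occurrence over ALL docs is doc 3,
        // and intersecting that with {2} leaves nothing.
        BitSet independent = lastOccurrencePerKey(keys, allDocs(keys.length));
        independent.and(now);
        System.out.println("independent AND: " + independent);

        // Deduplicating within the subset keeps doc 2, but needs a key lookup
        // for every candidate document.
        BitSet withinSubset = lastOccurrencePerKey(keys, now);
        System.out.println("within subset:   " + withinSubset);
    }
}
```

The second variant is what the question expects, but it has to know the key of every candidate document at search time. In this sketch that is the `keys` array, which plays roughly the role a field cache would, and that per-document lookup is where the RAM cost mentioned above comes from.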