从Lucene Index查询时,同一文档被返回两次

时间:2012-11-14 10:56:05

标签: lucene

我是Lucene的新手。我有一些新闻文本,我使用以下字段编制索引:

doc.add(new Field("url", article.getUrl(), TextField.TYPE_STORED));
doc.add(new Field("source", article.getSource(), TextField.TYPE_STORED));
doc.add(new Field("title", article.getArticleTitle(), TextField.TYPE_STORED));
doc.add(new Field("content", article.getArticleContent(), TextField.TYPE_STORED));
doc.add(new Field("date", DateTools.dateToString(article.getArticleDate(), DateTools.Resolution.DAY), TextField.TYPE_STORED));
doc.add(new Field("type", article.getType().getName(), TextField.TYPE_STORED));

当我根据这些字段进行查询时,在某些情况下,同一文档会被返回两次。我使用以下代码查询索引:

StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);

String contentQueryString = buildQuery(contentKeywords);
Query contentKeywordsQuery = null;
if (!StringUtils.isEmpty(contentQueryString)) {
    contentKeywordsQuery = new QueryParser(Version.LUCENE_40, "content", analyzer).parse(contentQueryString);
}

Query titleKeywordsQuery = null;
String titleQueryString = buildQuery(titleKeywords);
if (!StringUtils.isEmpty(titleQueryString)) {
    titleKeywordsQuery = new QueryParser(Version.LUCENE_40, "title", analyzer).parse(titleQueryString);
}

String sFrom = DateTools.dateToString(from, DateTools.Resolution.DAY);
String sTo = DateTools.dateToString(to, DateTools.Resolution.DAY);
Term lowerTerm = new Term("date", sFrom);
Term upperTerm = new Term("date", sTo);
Query dateQuery = new TermRangeQuery("date", lowerTerm.bytes(), upperTerm.bytes(), true, true);

Term sourceTerm = new Term("source", source);
Query sourceQuery = new TermQuery(sourceTerm);

Term typeTerm = new Term("type", type);
Query typeQuery = new TermQuery(typeTerm);

BooleanQuery q = new BooleanQuery();
q.add(dateQuery, BooleanClause.Occur.MUST);
q.add(sourceQuery, BooleanClause.Occur.MUST);
q.add(typeQuery, BooleanClause.Occur.MUST);
if (null != titleKeywordsQuery) {
    q.add(titleKeywordsQuery, BooleanClause.Occur.MUST);
}
if (null != contentKeywordsQuery) {
    q.add(contentKeywordsQuery, BooleanClause.Occur.MUST);
}
Directory index = new SimpleFSDirectory(new File("resources/lucene_index"));
int hitsPerPage = 5;
IndexReader reader = DirectoryReader.open(index);
IndexSearcher searcher = new IndexSearcher(reader);
TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
searcher.search(q, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;

for (int i = 0; i < hits.length; ++i) {
    int docId = hits[i].doc;
    Document d = searcher.doc(docId);
    QueryResult r = new QueryResult(d.get("url"), d.get("title"), d.get("date"), hits[i].score);
    results.add(r);
}

这一行特别为同一份文件提供了两次点击:

ScoreDoc[] hits = collector.topDocs().scoreDocs;

索引中不应该有任何重复的文件,我已经检查过了。

0 个答案:

没有答案