我是Lucene的新手。我有一些新闻文本,我使用以下字段编制索引:
doc.add(new Field("url", article.getUrl(), TextField.TYPE_STORED));
doc.add(new Field("source", article.getSource(), TextField.TYPE_STORED));
doc.add(new Field("title", article.getArticleTitle(), TextField.TYPE_STORED));
doc.add(new Field("content", article.getArticleContent(), TextField.TYPE_STORED));
doc.add(new Field("date", DateTools.dateToString(article.getArticleDate(), DateTools.Resolution.DAY), TextField.TYPE_STORED));
doc.add(new Field("type", article.getType().getName(), TextField.TYPE_STORED));
当我根据这些字段进行查询时,在某些情况下,同一文档会被返回两次。我使用以下代码查询索引:
StandardAnalyzer analyzer = new StandardAnalyzer(Version.LUCENE_40);
String contentQueryString = buildQuery(contentKeywords);
Query contentKeywordsQuery = null;
if (!StringUtils.isEmpty(contentQueryString)) {
contentKeywordsQuery = new QueryParser(Version.LUCENE_40, "content", analyzer).parse(contentQueryString);
}
Query titleKeywordsQuery = null;
String titleQueryString = buildQuery(titleKeywords);
if (!StringUtils.isEmpty(titleQueryString)) {
titleKeywordsQuery = new QueryParser(Version.LUCENE_40, "title", analyzer).parse(titleQueryString);
}
String sFrom = DateTools.dateToString(from, DateTools.Resolution.DAY);
String sTo = DateTools.dateToString(to, DateTools.Resolution.DAY);
Term lowerTerm = new Term("date", sFrom);
Term upperTerm = new Term("date", sTo);
Query dateQuery = new TermRangeQuery("date", lowerTerm.bytes(), upperTerm.bytes(), true, true);
Term sourceTerm = new Term("source", source);
Query sourceQuery = new TermQuery(sourceTerm);
Term typeTerm = new Term("type", type);
Query typeQuery = new TermQuery(typeTerm);
BooleanQuery q = new BooleanQuery();
q.add(dateQuery, BooleanClause.Occur.MUST);
q.add(sourceQuery, BooleanClause.Occur.MUST);
q.add(typeQuery, BooleanClause.Occur.MUST);
if (null != titleKeywordsQuery) {
q.add(titleKeywordsQuery, BooleanClause.Occur.MUST);
}
if (null != contentKeywordsQuery) {
q.add(contentKeywordsQuery, BooleanClause.Occur.MUST);
}
Directory index = new SimpleFSDirectory(new File("resources/lucene_index"));
int hitsPerPage = 5;
IndexReader reader = DirectoryReader.open(index);
IndexSearcher searcher = new IndexSearcher(reader);
TopScoreDocCollector collector = TopScoreDocCollector.create(hitsPerPage, true);
searcher.search(q, collector);
ScoreDoc[] hits = collector.topDocs().scoreDocs;
for (int i = 0; i < hits.length; ++i) {
int docId = hits[i].doc;
Document d = searcher.doc(docId);
QueryResult r = new QueryResult(d.get("url"), d.get("title"), d.get("date"), hits[i].score);
results.add(r);
}
这一行特别为同一份文件提供了两次点击:
ScoreDoc[] hits = collector.topDocs().scoreDocs;
索引中不应该有任何重复的文件,我已经检查过了。