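I have an index of news articles in which I store the title, link, and description of each item. Sometimes the same news from the same link is published by different news sources under different titles. I don't want an article with exactly the same description to be added twice. How can I find out whether a document already exists?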
Answer 0 (score: 4)
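I'm assuming you are using Java. Assuming your link is saved in the index as a StringField (so whatever analyzer you use won't break the link into multiple terms), you can use a TermQuery.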
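// Look up the exact link; a StringField value is indexed as one single term
TopDocs results = searcher.search(new TermQuery(new Term("link", "http://example.com")), 1);
// Note: Lucene 8 and later expose the hit count through a TotalHits object rather than a plain number
if (results.totalHits == 0) {
    Document doc = new Document();
    // create your document here with your fields
    // link field should be stored as a StringField
    doc.add(new StringField("link", "http://example.com", Field.Store.YES));
    writer.addDocument(doc);
}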
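Note that StringFields are indexed verbatim, so matching is exact and case-sensitive. You may want to convert the value to lowercase both when indexing and when searching.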
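One way to keep the indexing and searching sides consistent is to route every value through the same normalization step. The sketch below is only an illustration: `ExactMatchKey` and `normalize` are hypothetical names, not part of Lucene, and trimming whitespace is an extra assumption on top of the lowercasing suggested above.

```java
import java.util.Locale;

/** Hypothetical helper (not a Lucene class) for building exact-match keys. */
final class ExactMatchKey {
    private ExactMatchKey() {}

    /**
     * Trims surrounding whitespace and lowercases the value.
     * Locale.ROOT keeps the result independent of the JVM's default locale.
     */
    static String normalize(String raw) {
        return raw.trim().toLowerCase(Locale.ROOT);
    }
}
```

The same call would then feed both `new StringField("link", ExactMatchKey.normalize(link), Field.Store.YES)` and `new Term("link", ExactMatchKey.normalize(link))`. Keep in mind that lowercasing a whole URL treats two links whose paths differ only in case as the same link, which may or may not be what you want.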
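If you want to check more than one field, you can run the lookup as a BooleanQuery with Occur.SHOULD clauses, so that a match on either field counts as an existing document: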
// Note: from Lucene 5.3 on, BooleanQuery is built through BooleanQuery.Builder instead of this constructor
BooleanQuery matchingQuery = new BooleanQuery();
// SHOULD clauses: a document matches if the link OR the description is identical
matchingQuery.add(new TermQuery(new Term("link", "http://example.com")), Occur.SHOULD);
matchingQuery.add(new TermQuery(new Term("description", "the unique description of the article")), Occur.SHOULD);
TopDocs results = searcher.search(matchingQuery, 1);
if (results.totalHits == 0) {
    Document doc = new Document();
    // create your document here with your fields
    // link field should be stored as a StringField
    doc.add(new StringField("link", "http://example.com", Field.Store.YES));
    doc.add(new StringField("description", "the unique description of the article", Field.Store.YES));
    // note: if you need the description to be tokenized, add another TextField to the document under a different field name
    doc.add(new TextField("descriptionText", "the unique description of the article", Field.Store.NO));
    writer.addDocument(doc);
}