Question

I want to delete only the documents in Apache Lucene that match exactly. For example, I have documents containing this text:

  Document1: Bilal
  Document2: Bilal Ahmed
  Document3: Bilal Ahmed - 54

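When I try to delete documents with the query 'Bilal', all three documents are deleted, whereas only the first one, the exact match, should be deleted.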

The code I use is:

    String query = "bilal";
    String field = "userNames";

    Term term = new Term(field, query);

    IndexWriter indexWriter = null;

    File indexDir = new File(idexedDirectory);
    Directory directory = FSDirectory.open(indexDir);

    Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
    IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_46, analyzer);

    indexWriter = new IndexWriter(directory, iwc);        

    indexWriter.deleteDocuments(term);
    indexWriter.close();

This is how I index the documents:

    File indexDir = new File("C:\\Local DB\\TextFiled");
    Directory directory = FSDirectory.open(indexDir);

    Analyzer  analyzer = new StandardAnalyzer(Version.LUCENE_46);
    IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_46, analyzer);              

    // Thirdly, we tell the IndexWriter which documents to index
    indexWriter = new IndexWriter(directory, iwc);

    int i = 0;

    try (DataSource db = DataSource.getInstance()) {

        PreparedStatement ps = db.getPreparedStatement(
                "SELECT user_id, username FROM " + TABLE_NAME + " as au" + User_CONDITION);

        try (ResultSet resultSet = ps.executeQuery()) {

            while (resultSet.next()) {
                i++;
                doc = new Document();

                text = resultSet.getString("username");                    
                doc.add(new StringField("userNames", text, Field.Store.YES));

                indexWriter.addDocument(doc);
                System.out.println("User Name : " + text + " : " + userID);
            }
        }

Answer 1

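You did not show how you index these documents. If you indexed them with StandardAnalyzer and tokenization enabled, these results are to be expected: StandardAnalyzer splits the text into one lowercased token per word, and since every document contains the token Bilal, the delete hits all of them.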
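
To make the effect concrete, here is a minimal plain-Java sketch that does not use Lucene at all. The `tokenize` method is only a rough stand-in for what an analyzer does (lowercase, split on non-alphanumeric characters); the class and method names are made up for illustration. It shows why a single-term delete matches every document when the field is tokenized, and only an identical value when it is not.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;

class TokenizationSketch {

    // Rough stand-in for an analyzer (an assumption, not Lucene's real tokenizer):
    // lowercase the text and split on anything that is not a letter or digit.
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String t : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!t.isEmpty()) {
                tokens.add(t);
            }
        }
        return tokens;
    }

    // Tokenized field: a document matches if ANY of its tokens equals the term.
    static List<String> matchTokenized(List<String> docs, String term) {
        List<String> hits = new ArrayList<>();
        for (String doc : docs) {
            if (tokenize(doc).contains(term)) {
                hits.add(doc);
            }
        }
        return hits;
    }

    // Untokenized field: the whole value is the only term, compared as-is.
    static List<String> matchUntokenized(List<String> docs, String term) {
        List<String> hits = new ArrayList<>();
        for (String doc : docs) {
            if (doc.equals(term)) {
                hits.add(doc);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        List<String> docs = Arrays.asList("Bilal", "Bilal Ahmed", "Bilal Ahmed - 54");
        System.out.println(matchTokenized(docs, "bilal"));   // all three documents
        System.out.println(matchUntokenized(docs, "Bilal")); // only "Bilal"
        System.out.println(matchUntokenized(docs, "bilal")); // nothing: case differs
    }
}
```

Note the last case: with an untokenized field the term has to match the stored value exactly, including case.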

As a general recommendation, you should always add a unique ID field and query or delete by that field.

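If you cannot do that, index the same text in a separate field, without tokenization, and use a phrase query to find the exact match, though that sounds like a horrible hack to me.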
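
Both suggestions could look roughly like the sketch below. It assumes Lucene 4.6 on the classpath, as in the question. The field names `userId` and `userNameExact`, the class name, and the use of an in-memory `RAMDirectory` are assumptions made for illustration. It also deletes by a plain `Term` rather than a phrase query, because an untokenized `StringField` stores the whole value as a single term.

```java
import java.io.IOException;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.RAMDirectory;
import org.apache.lucene.util.Version;

class ExactDeleteSketch {

    // Builds one document with a unique ID, an untokenized copy of the name
    // for exact matching, and a tokenized copy for ordinary full-text search.
    static Document userDoc(String userId, String userName) {
        Document doc = new Document();
        doc.add(new StringField("userId", userId, Field.Store.YES));          // hypothetical field
        doc.add(new StringField("userNameExact", userName, Field.Store.YES)); // hypothetical field
        doc.add(new TextField("userNames", userName, Field.Store.YES));
        return doc;
    }

    // Indexes the three example documents, deletes the exact match "Bilal",
    // and returns how many live documents remain in the index.
    static int remainingAfterExactDelete() {
        try {
            Directory directory = new RAMDirectory(); // in-memory index, for the sketch only
            Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_46);
            IndexWriterConfig iwc = new IndexWriterConfig(Version.LUCENE_46, analyzer);

            try (IndexWriter writer = new IndexWriter(directory, iwc)) {
                writer.addDocument(userDoc("1", "Bilal"));
                writer.addDocument(userDoc("2", "Bilal Ahmed"));
                writer.addDocument(userDoc("3", "Bilal Ahmed - 54"));

                // The term must equal the stored value exactly, including case.
                writer.deleteDocuments(new Term("userNameExact", "Bilal"));
                // Deleting by ID would work the same way:
                // writer.deleteDocuments(new Term("userId", "1"));
            }

            try (DirectoryReader reader = DirectoryReader.open(directory)) {
                return reader.numDocs();
            }
        } catch (IOException e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("Documents left: " + remainingAfterExactDelete());
    }
}
```

With this layout the tokenized `userNames` field stays available for searching, while deletes go through one of the two untokenized fields and so cannot hit partial matches.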
