Question

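We have a program that runs continuously, performs various operations, and changes some records in our database. Those records are indexed using Lucene. So each time we change an entity, we do the following: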

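Open a db transaction and open a Lucene IndexWriter.
Make the changes to the db in the transaction, and update that entity in Lucene by calling indexWriter.deleteDocuments(..) and then indexWriter.addDocument(..).
If all went well, commit the db transaction and commit the IndexWriter.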

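This works fine, but over time indexWriter.commit() takes more and more time. Initially it takes about 0.5 seconds, but after a few hundred such transactions it takes more than 3 seconds. I don't doubt it would take even longer if the script ran longer.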

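My solution so far has been to comment out indexWriter.addDocument(..) and indexWriter.commit(), and instead recreate the entire index every now and again by first calling indexWriter.deleteAll() and then re-adding all documents within one Lucene transaction / IndexWriter (about 250k documents in about 14 seconds). But this obviously goes against the transactional approach offered by the database and Lucene, which keeps the two in sync and keeps database updates visible to users of the tools that search with Lucene.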

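It seems strange that I can add 250k documents in 14 seconds, yet adding a single document takes 3 seconds. What am I doing wrong, and how can I improve the situation?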

Answer 1

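What you are doing wrong is assuming that Lucene's built-in transactional capabilities offer performance and guarantees comparable to a typical relational database, when they really don't. More specifically, in your case the commit syncs all index files with the disk, which makes commit times proportional to the index size. That is why indexWriter.commit() takes more and more time as you go. The Javadoc for IndexWriter.commit() even warns: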

This may be a costly operation, so you should test the cost in your application and do it only when really necessary.

Can you imagine database documentation telling you to avoid committing?

Since your main goal seems to be keeping database updates searchable through Lucene in a timely manner, do the following to improve the situation:

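Trigger indexWriter.deleteDocuments(..) and indexWriter.addDocument(..) after a successful database commit, rather than before.
Perform indexWriter.commit() periodically rather than with every transaction, just to make sure your changes are eventually written to disk.
Use a SearcherManager for searching and invoke maybeRefresh() periodically to see updated documents within a reasonable time frame.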
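
The first recommendation (touch the index only after the database commit has succeeded) can be sketched without any Lucene dependency. This is a minimal illustration, not the answer's own code: AfterCommitIndexer, IndexSink and DbCommit are hypothetical names introduced here, and IndexSink.update(..) stands in for the real indexWriter.deleteDocuments(..) / indexWriter.addDocument(..) calls.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper: buffers index changes during a transaction and applies
// them only after the database commit has succeeded.
class AfterCommitIndexer {
    private final IndexSink index;
    private final List<String[]> staged = new ArrayList<>();

    AfterCommitIndexer(IndexSink index) {
        this.index = index;
    }

    // Remember a change; nothing touches the index yet.
    void stage(String id, String text) {
        staged.add(new String[] {id, text});
    }

    // Commit the database first; apply the staged index changes only on success.
    boolean commit(DbCommit dbCommit) {
        try {
            dbCommit.run();
        } catch (Exception e) {
            staged.clear(); // the database change did not happen, so discard the index changes
            return false;
        }
        for (String[] change : staged) {
            index.update(change[0], change[1]);
        }
        staged.clear();
        return true;
    }

    public static void main(String[] args) {
        List<String> applied = new ArrayList<>();
        AfterCommitIndexer indexer =
                new AfterCommitIndexer((id, text) -> applied.add(id + "=" + text));

        indexer.stage("1", "first");
        indexer.commit(() -> { throw new Exception("simulated db failure"); });
        System.out.println("after failed commit: " + applied);     // prints []

        indexer.stage("2", "second");
        indexer.commit(() -> { });
        System.out.println("after successful commit: " + applied); // prints [2=second]
    }
}

// Assumed stand-in for the Lucene side. In real code, update() would call
// indexWriter.deleteDocuments(..) followed by indexWriter.addDocument(..).
interface IndexSink {
    void update(String id, String text);
}

// Assumed stand-in for committing a database transaction.
interface DbCommit {
    void run() throws Exception;
}
```

With this ordering, a failed database transaction never leaves a stray update in the index. The opposite failure (the database commits but the index update throws) still has to be handled separately, for example by re-indexing the affected entity.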

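The following example program demonstrates how to retrieve document updates by periodically calling maybeRefresh(). It builds an index of 100000 documents, uses a ScheduledExecutorService to set up periodic invocations of commit() and maybeRefresh(), prompts you to update a single document, and then searches repeatedly until the update is visible. All resources are properly cleaned up when the program terminates. Note that the controlling factor for when the update becomes visible is when maybeRefresh() is invoked, not when commit() is invoked.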

import java.io.IOException;
import java.nio.file.Paths;
import java.util.Scanner;
import java.util.concurrent.*;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.search.*;
import org.apache.lucene.store.FSDirectory;

public class LucenePeriodicCommitRefreshExample {
    ScheduledExecutorService scheduledExecutor;
    MyIndexer indexer;
    MySearcher searcher;

    void init() throws IOException {
        scheduledExecutor = Executors.newScheduledThreadPool(3);
        indexer = new MyIndexer();
        indexer.init();
        searcher = new MySearcher(indexer.indexWriter);
        searcher.init();
    }

    void destroy() throws IOException {
        searcher.destroy();
        indexer.destroy();
        scheduledExecutor.shutdown();
    }

    class MyIndexer {
        IndexWriter indexWriter;
        Future<?> commitFuture; // handle used to cancel the periodic commit

        void init() throws IOException {
            indexWriter = new IndexWriter(FSDirectory.open(Paths.get("C:\\Temp\\lucene-example")), new IndexWriterConfig(new StandardAnalyzer()));
            indexWriter.deleteAll();
            for (int i = 1; i <= 100000; i++) {
                add(String.valueOf(i), "whatever " + i);
            }
            indexWriter.commit();
            commitFuture = scheduledExecutor.scheduleWithFixedDelay(() -> {
                try {
                    indexWriter.commit();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }, 5, 5, TimeUnit.MINUTES);
        }

        void add(String id, String text) throws IOException {
            Document doc = new Document();
            doc.add(new StringField("id", id, Field.Store.YES));
            doc.add(new StringField("text", text, Field.Store.YES));
            indexWriter.addDocument(doc);
        }

        void update(String id, String text) throws IOException {
            indexWriter.deleteDocuments(new Term("id", id));
            add(id, text);
        }

        void destroy() throws IOException {
            commitFuture.cancel(false);
            indexWriter.close();
        }
    }

    class MySearcher {
        IndexWriter indexWriter;
        SearcherManager searcherManager;
        Future<?> maybeRefreshFuture; // handle used to cancel the periodic refresh

        public MySearcher(IndexWriter indexWriter) {
            this.indexWriter = indexWriter;
        }

        void init() throws IOException {
            searcherManager = new SearcherManager(indexWriter, true, null);
            maybeRefreshFuture = scheduledExecutor.scheduleWithFixedDelay(() -> {
                try {
                    searcherManager.maybeRefresh();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }, 0, 5, TimeUnit.SECONDS);
        }

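        String findText(String id) throws IOException {
            IndexSearcher searcher = null;
            try {
                searcher = searcherManager.acquire();
                TopDocs topDocs = searcher.search(new TermQuery(new Term("id", id)), 1);
                // No hit (e.g. a refresh ran between the delete and the re-add): report "not found".
                if (topDocs.scoreDocs.length == 0) {
                    return null;
                }
                return searcher.doc(topDocs.scoreDocs[0].doc).getField("text").stringValue();
            } finally {
                // Always hand the searcher back so the SearcherManager can release old readers.
                if (searcher != null) {
                    searcherManager.release(searcher);
                }
            }
        }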

        void destroy() throws IOException {
            maybeRefreshFuture.cancel(false);
            searcherManager.close();
        }
    }

    public static void main(String[] args) throws IOException {
        LucenePeriodicCommitRefreshExample example = new LucenePeriodicCommitRefreshExample();
        example.init();
        Runtime.getRuntime().addShutdownHook(new Thread() {
            @Override
            public void run() {
                try {
                    example.destroy();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        });

        try (Scanner scanner = new Scanner(System.in)) {
            System.out.print("Enter a document id to update (from 1 to 100000): ");
            String id = scanner.nextLine();
            System.out.print("Enter what you want the document text to be: ");
            String text = scanner.nextLine();
            example.indexer.update(id, text);
            long startTime = System.nanoTime();
            String foundText;
            do {
                foundText = example.searcher.findText(id);
            } while (!text.equals(foundText));
            long elapsedTimeMillis = TimeUnit.NANOSECONDS.toMillis(System.nanoTime() - startTime);
            System.out.format("it took %d milliseconds for the searcher to see that document %s is now '%s'\n", elapsedTimeMillis, id, text);
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            System.exit(0);
        }
    }
}

This example was tested successfully with Lucene 5.3.1 and JDK 1.8.0_66.

Answer 2

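My first approach: don't commit that often. When you delete and re-add a document, you will probably trigger a merge. Merges are somewhat slow.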

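If you use a near-real-time IndexReader, you can still search as you did before (it does not show deleted documents), but you don't pay the commit penalty. You can still commit every now and then to make sure the file system is in sync with the index. You can do this while the index is in use, so you don't have to block all other operations.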
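
One way to decide when such an occasional commit is due is to count uncommitted updates and track the time since the last commit. The sketch below is only an illustration under assumptions: CommitPolicy and its thresholds are hypothetical names invented here, and the caller is expected to invoke the real indexWriter.commit() whenever commitDue() returns true.

```java
import java.util.function.LongSupplier;

// Hypothetical helper: decides when an occasional index commit is due, either
// after a number of uncommitted updates or once enough time has passed since
// the last commit, whichever comes first.
class CommitPolicy {
    private final int maxPendingUpdates;
    private final long maxAgeMillis;
    private final LongSupplier clockMillis;
    private int pendingUpdates = 0;
    private long lastCommitMillis;

    CommitPolicy(int maxPendingUpdates, long maxAgeMillis, LongSupplier clockMillis) {
        this.maxPendingUpdates = maxPendingUpdates;
        this.maxAgeMillis = maxAgeMillis;
        this.clockMillis = clockMillis;
        this.lastCommitMillis = clockMillis.getAsLong();
    }

    // Call after each index update (for example after addDocument).
    void recordUpdate() {
        pendingUpdates++;
    }

    // True when there is something to commit and one of the thresholds is reached.
    boolean commitDue() {
        if (pendingUpdates == 0) {
            return false;
        }
        long sinceLastCommit = clockMillis.getAsLong() - lastCommitMillis;
        return pendingUpdates >= maxPendingUpdates || sinceLastCommit >= maxAgeMillis;
    }

    // Call after the real commit (for example indexWriter.commit()) has succeeded.
    void recordCommit() {
        pendingUpdates = 0;
        lastCommitMillis = clockMillis.getAsLong();
    }

    public static void main(String[] args) {
        long[] now = {0}; // fake clock so the demo is deterministic
        CommitPolicy policy = new CommitPolicy(3, 1000, () -> now[0]);

        policy.recordUpdate();
        System.out.println(policy.commitDue()); // false: one update, no time elapsed
        policy.recordUpdate();
        policy.recordUpdate();
        System.out.println(policy.commitDue()); // true: three pending updates
        policy.recordCommit();

        policy.recordUpdate();
        now[0] = 1500;
        System.out.println(policy.commitDue()); // true: 1500 ms since the last commit
    }
}
```

In production code the clock would typically be System::currentTimeMillis; the injected clock is there only to keep the demo and any tests deterministic.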

See also this interesting blog post (and read the other posts there as well; they provide great information).

Writing to a Lucene index, one document at a time, slows down over time
