Question

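I have a continuously running Azure webjob that is triggered by a queue trigger. The queue contains lists of items that need to be written to a Lucene index. I currently have a lot of items in the queue (over 500k line items) and I am looking for the most efficient way to process them. Whenever I try to "scale out" the webjob, I keep getting an IndexWriter lock exception.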

Current setup:

JobHostConfiguration config = new JobHostConfiguration();
config.Queues.BatchSize = 1;

var host = new JobHost(config);
host.RunAndBlock();

The webjob function:

public static void AddToSearchIndex([QueueTrigger("indexsearchadd")] List<ListingItem> items, TextWriter log)
{
    var azureDirectory = new AzureDirectory(CloudStorageAccount.Parse(ConfigurationManager.ConnectionStrings["StorageConnectionString"].ConnectionString), "megadata");
    var findexExists = IndexReader.IndexExists(azureDirectory);
    var count = items.Count;
    IndexWriter indexWriter = null;
    int errors = 0;
    // Try up to 10 times to open the writer; a failure means write.lock is held elsewhere.
    while (indexWriter == null && errors < 10)
    {
        try
        {
            indexWriter = new IndexWriter(azureDirectory, new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30), !IndexReader.IndexExists(azureDirectory), new Lucene.Net.Index.IndexWriter.MaxFieldLength(IndexWriter.DEFAULT_MAX_FIELD_LENGTH));
        }
        catch (LockObtainFailedException)
        {
            log.WriteLine("Lock is taken, Hit 'Y' to clear the lock, or anything else to try again");
            errors++;
        }
    }
    // After 10 failed attempts, force-clear the lock and open the writer.
    if (errors >= 10)
    {
        azureDirectory.ClearLock("write.lock");
        indexWriter = new IndexWriter(azureDirectory, new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30), !IndexReader.IndexExists(azureDirectory), new Lucene.Net.Index.IndexWriter.MaxFieldLength(IndexWriter.DEFAULT_MAX_FIELD_LENGTH));
    }
    log.WriteLine("IndexWriter lock obtained, this process has exclusive write access to index");
    indexWriter.SetRAMBufferSizeMB(10.0);
    // Parallel.ForEach(items, (itm) =>
    //{
    foreach (var itm in items)
    {
        AddtoIndex(itm, indexWriter);
    }
    //});
}

The method that updates an index entry basically looks like this:

private static void AddtoIndex(ListingItem item, IndexWriter indexWriter)
{
    var doc = new Document();
    // "id" is not analyzed so it can be matched exactly by the Term below.
    doc.Add(new Field("id", item.URL, Field.Store.NO, Field.Index.NOT_ANALYZED, Field.TermVector.NO));
    var title = new Field("Title", item.Title, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES);
    doc.Add(title);
    // Replaces any existing document with the same id, or adds a new one.
    indexWriter.UpdateDocument(new Term("id", item.URL), doc);
}

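Things I have tried:

Setting the Azure config batch size to the maximum of 32
Making the method async and using Task.WhenAll
Using a parallel for loop

When I try any of the above, it usually fails with: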

Lucene.Net.Store.LockObtainFailedException: Lucene.Net.Store.LockObtainFailedException: Lock obtain timed out: AzureLock@write.lock.
 at Lucene.Net.Store.Lock.Obtain(Int64 lockWaitTimeout) in d:\Lucene.Net\FullRepo\trunk\src\core\Store\Lock.cs:line 97
 at Lucene.Net.Index.IndexWriter.Init(Directory d, Analyzer

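Any suggestions on how to set up this webjob architecturally so that it can process more items from the queue instead of handling them one at a time? Do they need to write to the same index? Thanks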

Answer 1

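You are running into Lucene's locking semantics: multiple processes are trying to write to the Lucene index at the same time. Scaling the Azure app, using Tasks, or using a parallel for loop will only cause problems, because only one process should write to a Lucene index at a time.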
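The locking behaviour described above can be reproduced in isolation. The following is a minimal sketch, assuming Lucene.Net 3.0.3 and an in-memory RAMDirectory in place of the AzureDirectory from the question; the class name LockDemo is hypothetical. It opens a second IndexWriter while the first one is still open, which should fail with LockObtainFailedException once the lock wait times out.

```csharp
using System;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Index;
using Lucene.Net.Store;

// Hypothetical demo class; not part of the original question or answer.
public static class LockDemo
{
    public static void Main()
    {
        // Assumption: an in-memory directory stands in for AzureDirectory.
        var directory = new RAMDirectory();
        var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);

        // The first writer takes write.lock and holds it until it is disposed.
        using (var first = new IndexWriter(directory, analyzer, true, IndexWriter.MaxFieldLength.UNLIMITED))
        {
            try
            {
                // A second writer on the same directory must wait for the lock.
                using (var second = new IndexWriter(directory, analyzer, false, IndexWriter.MaxFieldLength.UNLIMITED))
                {
                    Console.WriteLine("Unexpected: second writer was opened");
                }
            }
            catch (LockObtainFailedException ex)
            {
                Console.WriteLine("Second writer rejected: " + ex.Message);
            }
        }
    }
}
```

The same thing happens across webjob instances or parallel invocations: each one constructs its own IndexWriter over the same directory, and all but one wait on write.lock until they time out.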

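Architecturally, this is what you should do:

Make sure only one instance of the webjob is running at any time, even if the Web App scales out (e.g. through autoscale)
Use the maximum webjob batch size (32)
Commit the Lucene index after each batch to minimize I/O

To make sure only a single webjob instance runs, add a settings.job file to the webjob project. Set its build action to Content and set it to copy to the output directory. Add the following JSON to the file: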

{ "is_singleton": true }

Configure the webjob batch size to the maximum:

JobHostConfiguration config = new JobHostConfiguration();
config.Queues.BatchSize = 32; // 32 is the maximum batch size
var host = new JobHost(config);
host.RunAndBlock();

Commit the Lucene index after each batch:

public static void AddToSearchIndex([QueueTrigger("indexsearchadd")] List<ListingItem> items, TextWriter log)
{
    ...
    indexWriter = new IndexWriter(azureDirectory, …);

    foreach (var itm in items)
    {
        AddtoIndex(itm, indexWriter);
    }
    indexWriter.Commit();
}

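This writes to the storage account only when the Lucene index is committed, which speeds up the indexing process. In addition, webjob batching speeds up message processing (the number of messages processed over a period of time, not the processing time of a single message).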
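For reference, here is one way the commit-per-batch function could look when written out in full. This is a sketch under assumptions, not the answer's exact code: it reuses the constructor arguments and the AddtoIndex helper from the question, assumes Lucene.Net 3.0.3 (where IndexWriter implements IDisposable), and guesses the shape of ListingItem from the two properties the question uses. The using block is an addition that disposes the writer, which releases write.lock, when the batch ends.

```csharp
using System.Collections.Generic;
using System.Configuration;
using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store.Azure;
using Microsoft.Azure.WebJobs;
using Microsoft.WindowsAzure.Storage;

// Assumed shape: only URL and Title appear in the question.
public class ListingItem
{
    public string URL { get; set; }
    public string Title { get; set; }
}

public static class Functions
{
    public static void AddToSearchIndex(
        [QueueTrigger("indexsearchadd")] List<ListingItem> items, TextWriter log)
    {
        var account = CloudStorageAccount.Parse(
            ConfigurationManager.ConnectionStrings["StorageConnectionString"].ConnectionString);
        var azureDirectory = new AzureDirectory(account, "megadata");

        // Create a new index only if none exists yet.
        bool create = !IndexReader.IndexExists(azureDirectory);

        // Disposing the writer at the end of the block releases write.lock.
        using (var indexWriter = new IndexWriter(
            azureDirectory,
            new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30),
            create,
            new IndexWriter.MaxFieldLength(IndexWriter.DEFAULT_MAX_FIELD_LENGTH)))
        {
            foreach (var itm in items)
            {
                AddtoIndex(itm, indexWriter);
            }

            // One commit per batch instead of one per item.
            indexWriter.Commit();
        }

        log.WriteLine("Indexed " + items.Count + " items");
    }

    // Same logic as the helper in the question.
    private static void AddtoIndex(ListingItem item, IndexWriter indexWriter)
    {
        var doc = new Document();
        doc.Add(new Field("id", item.URL, Field.Store.NO, Field.Index.NOT_ANALYZED, Field.TermVector.NO));
        doc.Add(new Field("Title", item.Title, Field.Store.YES, Field.Index.ANALYZED, Field.TermVector.YES));
        indexWriter.UpdateDocument(new Term("id", item.URL), doc);
    }
}
```

The explicit Commit call mirrors the answer's code. Disposing the writer should also flush pending changes, so the call is likely redundant here, but it makes the commit point visible.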

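You could add a check to see whether the Lucene index is locked (whether the write.lock file exists) and unlock the index at the start of the batch. This should never happen, but anything can happen, so I would add it to be sure.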
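A sketch of such a check, assuming Lucene.Net 3.0.3 and the AzureDirectory library used in the question. IndexWriter.IsLocked is the standard Lucene call for testing the write lock, and ClearLock("write.lock") is the call the question already uses; the helper name ClearStaleLock is hypothetical. How reliably IsLocked reports a lock held by another machine depends on the AzureDirectory lock implementation, so treat this as a starting point to test.

```csharp
using Lucene.Net.Index;
using Lucene.Net.Store.Azure;

// Hypothetical helper; not part of the original answer.
public static class IndexLockHelper
{
    // Returns true if a lock was found and cleared.
    // Only safe when no other writer can be active, e.g. with
    // "is_singleton": true, because clearing a lock that a live
    // writer still holds can corrupt the index.
    public static bool ClearStaleLock(AzureDirectory azureDirectory)
    {
        if (!IndexWriter.IsLocked(azureDirectory))
        {
            return false;
        }

        // Same call the question uses after ten failed attempts.
        azureDirectory.ClearLock("write.lock");
        return true;
    }
}
```

Calling IndexLockHelper.ClearStaleLock(azureDirectory) before constructing the IndexWriter would replace the retry loop from the question with a single check at the start of each batch.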

You can speed up the indexing process further by using a larger Web App instance (mileage may vary) and by using faster storage such as Azure Premium Storage.

You can read more about the internals of Lucene indexes on Azure on my blog.

Lucene.NET IndexWriter and Azure WebJob
