Question

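I am using the code below (a skeleton) to ingest various documents such as emails (possibly with attachments), PDFs, Word documents, and so on. The documents list below is populated by another piece of code (not shown) that checks whether a document has already been ingested; if it has, it is not added to the list. The list has a maximum size of 10. I am running this as a POC on a VM with Windows 10 Enterprise, 16 GB of RAM, and an Intel Xeon E5-2680 2.8 GHz CPU. It has now been running for about a week and I have ingested only about 300,000 documents, which is very slow.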

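Where is the bottleneck? Should I increase the maximum size of the list? Do you think my POC VM is not powerful enough (i.e. do I need a proper cluster, and does Azure offer something that is easy to configure)? Should I run several processes hitting Elasticsearch?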

Any feedback would be much appreciated. Thanks!

using System;
using System.Collections.Generic;
using System.Configuration;
using System.IO;
using System.Linq;
using Indexer1;
using Nest;

var indexName = "documents";
var node = new Uri(ConfigurationManager.AppSettings["Search-Uri"]);
var settings = new ConnectionSettings(node).InferMappingFor<Document>(m => m.IndexName(indexName));
settings.ThrowExceptions(alwaysThrow: true); 
settings.PrettyJson(); // Good for DEBUG
var client = new ElasticClient(settings);

CreateIndex(client, indexName);

var documents = new List<Document>();

// Some process (not shown) populates documents here, up to a maximum of 10 items

var bulkResponse = client.Bulk(b => b
    .Pipeline("attachments")
    .IndexMany(documents)
);

private static void CreateIndex(ElasticClient client, string indexName)
{
    if (!client.IndexExists(indexName).Exists)
    {
        var indexResponse = client.CreateIndex(indexName, c => c
            .Settings(s => s
                .Analysis(a => a
                    .Analyzers(ad => ad
                        .Custom("windows_path_hierarchy_analyzer", ca => ca
                            .Tokenizer("windows_path_hierarchy_tokenizer")
                        )
                    )
                    .Tokenizers(t => t
                        .PathHierarchy("windows_path_hierarchy_tokenizer", ph => ph
                            .Delimiter('\\')
                        )
                    )
                )
            )
            .Mappings(m => m
                .Map<Document>(mp => mp
                    .AutoMap()
                    .AllField(all => all
                        .Enabled(false)
                    )
                    .Properties(ps => ps
                        .Object<Attachment>(a => a
                            .Name(n => n.Attachment)
                            .AutoMap()
                        )
                    )
                )
            )
        );
    }
}
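
To put the numbers above in perspective: 300,000 documents in 7 days is roughly 0.5 documents per second, or about one 10-document bulk request every 20 seconds on average. The following is a minimal sketch, not part of the original code, of how that rate can be computed and how a larger set of pending documents could be split into bigger batches before each bulk call. The names `BatchingSketch`, `Batch`, and `DocsPerSecond` are hypothetical, and the batch size of 500 is an arbitrary assumption to experiment with, not a recommended value.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical helper class; not part of NEST or of the original code.
public static class BatchingSketch
{
    // Splits a sequence into consecutive batches of at most batchSize items.
    // This is an iterator, so the argument check runs when enumeration starts.
    public static IEnumerable<List<T>> Batch<T>(IEnumerable<T> source, int batchSize)
    {
        if (batchSize <= 0) throw new ArgumentOutOfRangeException(nameof(batchSize));

        var current = new List<T>(batchSize);
        foreach (var item in source)
        {
            current.Add(item);
            if (current.Count == batchSize)
            {
                yield return current;
                current = new List<T>(batchSize);
            }
        }

        // Emit the final, possibly smaller, batch.
        if (current.Count > 0) yield return current;
    }

    // Average ingest rate in documents per second over the elapsed time.
    public static double DocsPerSecond(long documents, TimeSpan elapsed)
    {
        return documents / elapsed.TotalSeconds;
    }

    public static void Main()
    {
        // Figures taken from the question: about 300,000 documents in about a week.
        double rate = DocsPerSecond(300000, TimeSpan.FromDays(7));
        Console.WriteLine("Average rate: {0:F2} docs/s", rate);
        Console.WriteLine("Seconds per 10-document bulk: {0:F1}", 10 / rate);

        // Stand-in for the pending documents; integers are used so the sketch runs alone.
        var pending = Enumerable.Range(0, 1200);

        // Assumed batch size of 500; tune it by measuring, it is not a known optimum.
        foreach (var batch in Batch(pending, 500))
        {
            // In the real code this is where the existing bulk call would go, e.g.
            // client.Bulk(b => b.Pipeline("attachments").IndexMany(batch));
            Console.WriteLine("Would send a bulk request with {0} documents", batch.Count);
        }
    }
}
```

Timing each bulk call separately from the code that fills the list (for example with `System.Diagnostics.Stopwatch`) would show whether the time is spent in Elasticsearch, in the attachment pipeline, or in the duplicate check that populates the list.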

POC implementation for ingesting documents into Elasticsearch is very slow

0 answers: