Cosmos inserts won't parallelize effectively

Time: 2019-10-11 21:22:56

Tags: azure azure-cosmosdb

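Meta-question:
We're pulling data from EventHub, running some logic, and saving the results to Cosmos. Currently, Cosmos inserts are our bottleneck. How do we maximize our throughput?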

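Details
We're trying to optimize our Cosmos throughput, and there seems to be some contention in the SDK that makes parallel inserts only marginally faster than serial inserts.
Logically, we're doing: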

            for (int i = 0; i < insertCount; i++)
            {
                taskList.Add(InsertCosmos(sdkContainerClient));
            }
            var parallelTimes = await Task.WhenAll(taskList);

Here are the results comparing serial inserts, parallel inserts, and "fake" inserts (using Task.Delay):

Serial took: 461ms for 20
 - Individual times 28,8,117,19,14,11,10,12,5,8,9,11,18,15,79,23,14,16,14,13

Cosmos Parallel
Parallel took: 231ms for 20
 - Individual times 17,15,23,39,45,52,72,74,80,91,96,98,108,117,123,128,139,146,147,145

Just Parallel (no cosmos)
Parallel took: 27ms for 20
 - Individual times 27,26,26,26,26,26,26,25,25,25,25,25,25,24,24,24,23,23,23,23
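  • Serial is obvious (the total is just the sum of the individual times)
  • No Cosmos (the last set of timings) is also obvious (the total is about the time of a single operation)
  • But the parallel Cosmos inserts don't parallelize nearly as well: each individual time is longer than the one before it, which suggests some contention.

We're running this on a VM in Azure (in the same datacenter as Cosmos), we have enough RUs provisioned that we aren't getting 429s, and we're using Microsoft.Azure.Cosmos 3.2.0.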

Full code sample

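    using System;
    using System.Collections.Generic;
    using System.Diagnostics;
    using System.Threading.Tasks;
    using Microsoft.Azure.Cosmos;

    class Program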
    {
        public static void Main(string[] args)
        {
            CosmosWriteTest().Wait();
        }

        public static async Task CosmosWriteTest()
        {
            var cosmosClient = new CosmosClient("todo", new CosmosClientOptions { ConnectionMode = ConnectionMode.Direct });
            var database = cosmosClient.GetDatabase("<ourcontainer>");
            var sdkContainerClient = database.GetContainer("<ourcontainer>");
            int insertCount = 25;
            //Warmup
            await sdkContainerClient.CreateItemAsync(new TestObject());

            //---Serially inserts into Cosmos---
            List<long> serialTimes = new List<long>();
            var serialTimer = Stopwatch.StartNew();
            Console.WriteLine("Cosmos Serial");
            for (int i = 0; i < insertCount; i++)
            {
                serialTimes.Add(await InsertCosmos(sdkContainerClient));
            }
            serialTimer.Stop();
            Console.WriteLine($"Serial took: {serialTimer.ElapsedMilliseconds}ms for {insertCount}");
            Console.WriteLine($" - Individual times {string.Join(",", serialTimes)}");

            //---Parallel inserts into Cosmos---
            Console.WriteLine(Environment.NewLine + "Cosmos Parallel");
            var parallelTimer = Stopwatch.StartNew();
            var taskList = new List<Task<long>>();
            for (int i = 0; i < insertCount; i++)
            {
                taskList.Add(InsertCosmos(sdkContainerClient));
            }
            var parallelTimes = await Task.WhenAll(taskList);

            parallelTimer.Stop();
            Console.WriteLine($"Parallel took: {parallelTimer.ElapsedMilliseconds}ms for {insertCount}");
            Console.WriteLine($" - Individual times {string.Join(",", parallelTimes)}");

            //---Testing parallelism minus cosmos---
            Console.WriteLine(Environment.NewLine + "Just Parallel (no cosmos)");
            var justParallelTimer = Stopwatch.StartNew();
            var noCosmosTaskList = new List<Task<long>>();
            for (int i = 0; i < insertCount; i++)
            {
                noCosmosTaskList.Add(InsertCosmos(sdkContainerClient, true));
            }
            var justParallelTimes = await Task.WhenAll(noCosmosTaskList);

            justParallelTimer.Stop();
            Console.WriteLine($"Parallel took: {justParallelTimer.ElapsedMilliseconds}ms for {insertCount}");
            Console.WriteLine($" - Individual times {string.Join(",", justParallelTimes)}");
        }

        //inserts 
        private static async Task<long> InsertCosmos(Container sdkContainerClient, bool justDelay = false)
        {
            var timer = Stopwatch.StartNew();
            if (!justDelay)
                await sdkContainerClient.CreateItemAsync(new TestObject());
            else
                await Task.Delay(20);

            timer.Stop();
            return timer.ElapsedMilliseconds;
        }

        //Test object to save to Cosmos
        public class TestObject
        {
            public string id { get; set; } = Guid.NewGuid().ToString();
            public string pKey { get; set; } = Guid.NewGuid().ToString();
            public string Field1 { get; set; } = "Testing this field";
            public double Number { get; set; } = 12345;
        }
    }

1 answer:

Answer 0 (score: 1)

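This is the scenario that Bulk was introduced for. Bulk mode is currently in preview and is available in the 3.2.0-preview2 package.

To take advantage of Bulk, you need to turn on the AllowBulkExecution flag: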

    new CosmosClient(endpoint, authKey, new CosmosClientOptions() { AllowBulkExecution = true });

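This mode was built for exactly the scenario you describe: a list of concurrent operations that need throughput.

We have a sample project here: https://github.com/Azure/azure-cosmos-dotnet-v3/tree/master/Microsoft.Azure.Cosmos.Samples/Usage/BulkSupport

We're still working on the official documentation, but the idea is this: when concurrent operations are issued, the SDK groups them by partition affinity and executes them as grouped (batch) operations, instead of executing them as individual requests as you are seeing now. That reduces the number of backend service calls and can increase throughput by between 50% and 100%, depending on the volume of operations. This mode will consume more RU/s, because it pushes a higher volume of operations per second than issuing them individually (so if you start hitting 429s, it means the bottleneck is now the provisioned RU/s).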
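To make the grouping idea concrete, here is a minimal, SDK-free sketch. It is not the SDK's internal code: `BulkGroupingSketch`, `PartitionOf`, `GroupIntoBatches`, the FNV-1a-style hash, the partition count, and the batch size are all hypothetical stand-ins chosen for illustration. It only shows how many individual writes collapse into a few per-partition batches.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Hypothetical illustration only; this is not the SDK's internal code.
public static class BulkGroupingSketch
{
    // Stand-in for routing: maps a partition key value to one of
    // partitionCount physical partitions with a deterministic FNV-1a-style hash.
    // (Cosmos DB uses its own hashing scheme; this only mimics the idea.)
    public static int PartitionOf(string partitionKey, int partitionCount)
    {
        uint hash = 2166136261;
        foreach (char c in partitionKey)
        {
            hash = unchecked((hash ^ c) * 16777619);
        }
        return (int)(hash % (uint)partitionCount);
    }

    // Groups pending writes (represented by their partition key values) by
    // target partition, then splits each group into batches of at most
    // maxBatchSize. Each batch stands in for one backend request.
    public static List<(int Partition, List<string> Keys)> GroupIntoBatches(
        IEnumerable<string> partitionKeys, int partitionCount, int maxBatchSize)
    {
        if (partitionCount <= 0)
            throw new ArgumentOutOfRangeException(nameof(partitionCount));
        if (maxBatchSize <= 0)
            throw new ArgumentOutOfRangeException(nameof(maxBatchSize));

        var batches = new List<(int Partition, List<string> Keys)>();
        foreach (var group in partitionKeys.GroupBy(k => PartitionOf(k, partitionCount)))
        {
            var keys = group.ToList();
            for (int offset = 0; offset < keys.Count; offset += maxBatchSize)
            {
                int count = Math.Min(maxBatchSize, keys.Count - offset);
                batches.Add((group.Key, keys.GetRange(offset, count)));
            }
        }
        return batches;
    }
}
```

For example, `BulkGroupingSketch.GroupIntoBatches(keys, 2, 3)` with 10 distinct keys yields 4 batches instead of 10 separate requests, however the keys happen to hash across the 2 partitions.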

    var cosmosClient = new CosmosClient("todo", new CosmosClientOptions { AllowBulkExecution = true });
    var database = cosmosClient.GetDatabase("<ourcontainer>");
    var sdkContainerClient = database.GetContainer("<ourcontainer>");
    //The more operations the better, just 25 might not yield a great difference vs non bulk
    int insertCount = 10000;
    //Don't do any warmup

    //Start every insert without awaiting it, so the SDK can group the pending operations
    List<Task> operations = new List<Task>();
    var timer = Stopwatch.StartNew();
    for (int i = 0; i < insertCount; i++)
    {
        operations.Add(sdkContainerClient.CreateItemAsync(new TestObject()));
    }

    //Wait for all inserts to complete, then report the total time
    await Task.WhenAll(operations);
    timer.Stop();
    Console.WriteLine($"Bulk took: {timer.ElapsedMilliseconds}ms for {insertCount}");

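Important: this feature is still in preview. Because this mode is optimized for throughput (not latency), any single operation you execute will not have great latency on its own.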
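The timings in the question already show the difference between throughput and latency. The sketch below just does the arithmetic on those reported numbers; the class and method names (`ThroughputVsLatency`, `OpsPerSecond`, `AverageLatencyMs`) are made up for illustration.

```csharp
using System;
using System.Linq;

// Hypothetical helper that works only on the timings reported in the question.
public static class ThroughputVsLatency
{
    // Individual times (ms) copied from the question's output.
    public static readonly long[] SerialMs =
        { 28, 8, 117, 19, 14, 11, 10, 12, 5, 8, 9, 11, 18, 15, 79, 23, 14, 16, 14, 13 };
    public static readonly long[] ParallelMs =
        { 17, 15, 23, 39, 45, 52, 72, 74, 80, 91, 96, 98, 108, 117, 123, 128, 139, 146, 147, 145 };

    // Throughput: operations completed per second of wall-clock time.
    public static double OpsPerSecond(int operationCount, double totalMs)
    {
        if (totalMs <= 0)
            throw new ArgumentOutOfRangeException(nameof(totalMs));
        return operationCount * 1000.0 / totalMs;
    }

    // Latency: average time a single operation took from start to finish.
    public static double AverageLatencyMs(long[] individualMs) => individualMs.Average();
}
```

With the question's totals (461 ms serial, 231 ms parallel, 20 inserts each), throughput goes from about 43 to about 87 operations per second, while the average per-operation time goes from about 22 ms to about 88 ms. Higher throughput and worse individual latency can coexist, and Bulk mode deliberately makes that trade.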

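If you want to optimize even further, and your data source gives you access to Streams (so you can avoid serialization), you can use the CreateItemStream SDK methods for even higher throughput.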
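A minimal sketch of the stream-based path, assuming the Microsoft.Azure.Cosmos v3 package is referenced and that you already hold the item as raw JSON together with its partition key value. `StreamInsertSketch`, `ToUtf8Stream`, and `InsertRawAsync` are hypothetical names; `CreateItemStreamAsync` is the SDK method.

```csharp
using System.IO;
using System.Text;
using System.Threading.Tasks;
using Microsoft.Azure.Cosmos;

// Hypothetical helper; assumes the Microsoft.Azure.Cosmos v3 package.
public static class StreamInsertSketch
{
    // Wraps JSON text in a UTF-8 stream. If the source already hands you a
    // Stream or a byte array, pass that along instead and skip this step.
    public static MemoryStream ToUtf8Stream(string json)
    {
        return new MemoryStream(Encoding.UTF8.GetBytes(json));
    }

    // Sends the raw payload without the SDK serializing an object first.
    // The JSON must contain the "id" property and the partition key property.
    public static async Task<bool> InsertRawAsync(
        Container container, string json, string partitionKeyValue)
    {
        using (MemoryStream payload = ToUtf8Stream(json))
        using (ResponseMessage response = await container.CreateItemStreamAsync(
            payload, new PartitionKey(partitionKeyValue)))
        {
            // Stream APIs report failures through the status code
            // instead of throwing, so check it explicitly.
            return response.IsSuccessStatusCode;
        }
    }
}
```

Because the stream methods return a ResponseMessage rather than throwing on errors such as 429, the caller has to inspect the status code and decide whether to retry.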