Question

我们有一个简单的ETL过程来从API提取数据到我们想要使用函数实现的Document DB。简而言之，过程是获取~16,500行文件，从每行提取ID（功能1），为每个ID构建URL（功能2），使用URL命中API（功能3），存储响应在文档DB（功能4）中。我们正在使用队列进行功能间通信，并且在执行此操作时会看到第一个函数中的超时问题。

功能1（index.js）

module.exports = function (context, odsDataFile) {
  context.log('JavaScript blob trigger function processed blob \n Name:', context.bindingData.odaDataFile, '\n Blob Size:', odsDataFile.length, 'Bytes');

  const odsCodes = [];

  odsDataFile.split('\n').map((line) => {
    const columns = line.split(',');

    if (columns[12] === 'A') {
      odsCodes.push({
        'odsCode': columns[0],
        'orgType': 'pharmacy',
      });
    }
  });

  context.bindings.odsCodes = odsCodes;
  context.log(`A total of: ${odsCodes.length} ods codes have been sent to the queue.`);

  context.done();
};

function.json

{
  "bindings": [
    {
      "type": "blobTrigger",
      "name": "odaDataFile",
      "path": "input-ods-data",
      "connection": "connecting-to-services_STORAGE",
      "direction": "in"
    },
    {
      "type": "queue",
      "name": "odsCodes",
      "queueName": "ods-org-codes",
      "connection": "connecting-to-services_STORAGE",
      "direction": "out"
    }
  ],
  "disabled": false
}

完整代码here

当ID的数量在100的时候，此功能正常工作，但在1000的10的时间内超时。 ID数组的构建以毫秒为单位，并且函数完成但是将项添加到队列似乎需要花费很多分钟，并最终导致超时默认为5分钟。

我很惊讶填充队列的简单行为似乎花了这么长时间，并且函数的超时似乎包括函数外部任务的时间（即队列填充）。这是预期的吗？有更多高效的方法吗？

我们正在消费（动态）计划下运作。

Answer 1

我从本地计算机上对此进行了一些测试，发现将消息插入队列需要大约200ms，这是预期的。因此，如果要插入17k消息并按顺序执行，则需要时间：

17,000条消息* 200毫秒= 3,400,000毫秒或~56分钟

从云端运行时，延迟可能会稍微快一点，但是当你插入那么多消息时，你可以看到这会如何快速地超过5分钟。

如果消息排序并不重要，您可以并行插入消息。但有些警告：

你不能用节点做这个 - 它必须是C＃。 Node不会向您公开IAsyncCollector接口，因此它可以在幕后完成所有操作。
您无法并行插入所有内容，因为“消费计划”一次限制为250个网络连接。

以下是一次对插件200进行批量处理的示例 - 使用17k消息，在我的快速测试中花了不到一分钟。

public static async Task Run(string myBlob, IAsyncCollector<string> odsCodes, TraceWriter log)
{
    List<Task> tasks = new List<Task>();
    string[] lines = myBlob.Split(Environment.NewLine.ToCharArray(), StringSplitOptions.RemoveEmptyEntries);

    int skip = 0;
    int take = 200;

    IEnumerable<string> batch = lines.Skip(skip).Take(take);

    while (batch.Count() > 0)
    {
        await AddBatch(batch, odsCodes);
        skip += take;
        batch = lines.Skip(skip).Take(take);
    }
}

public static async Task AddBatch(IEnumerable<string> lines, IAsyncCollector<string> odsCodes)
{
    List<Task> tasks = new List<Task>();    
    foreach (string line in lines)
    {
        tasks.Add(odsCodes.AddAsync(line));
    }
    await Task.WhenAll(tasks);
}

Answer 2

正如其他答案所指出的，因为Azure队列没有批处理API，所以您应该考虑使用服务总线队列等替代方法。但是如果您坚持使用Azure队列，则需要避免顺序输出队列项，即需要某种形式的约束并行性。实现此目的的一种方法是使用TPL Dataflow library。

Dataflow必须使用批量任务并执行WhenAll（..）的一个优点是，您将永远不会有批处理几乎完成的情况，并且您正在等待一个慢速执行完成，然后再开始下一批处理。

我将10,000个项目与大小为32的任务批次和数据流的并行度设置为32进行了比较。批处理方法在60秒内完成，而数据流程几乎完成了一半（32秒）。

代码看起来像这样：

    using System.Threading.Tasks.Dataflow;
    ...
    var addMessageBlock = new ActionBlock<string>(async message =>
    {
        await odscodes.AddAsync(message);
    }, new ExecutionDataflowBlockOptions { SingleProducerConstrained = true, MaxDegreeOfParallelism = 32});

    var bufferBlock = new BufferBlock<string>();
    bufferBlock.LinkTo(addMessageBlock, new DataflowLinkOptions { PropagateCompletion = true });

    foreach(string line in lines)
        bufferBlock.Post(line);

    bufferBlock.Complete();
    await addMessageBlock.Completion;

填充队列时，Azure函数超时

2 个答案: