Question

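I am trying to read a large amount of data from a CosmosDB collection (MachineCollection): 58 GB of data with a 9 GB index. Throughput is set to 1000 RU/s. The collection is partitioned by serial number, with read locations West Europe and North Europe and write location West Europe. While I attempt to read, MachineCollection receives new data every 20 seconds.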

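The problem is that I cannot query any data through Python. If I run the same query in the CosmosDB Data Explorer, I get results quickly (for example, when querying a particular serial number).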

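For troubleshooting purposes, I created a new database (TestDB) and a TestCollection. This TestCollection holds 10 datasets from MachineCollection. If I read this TestCollection via Python, either with a cross-partition query (options['enableCrossPartitionQuery'] = True) or with a partition key (options['partitionKey'] = 'certainSerialnumber'), it succeeds and I am able to save the data as CSV.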

This makes me wonder why I cannot query data from MachineCollection when TestDB and TestCollection are configured with exactly the same properties.

What I have already tried to query via Python:

int main() {
    /* Enter your code here. Read input from STDIN. Print output to STDOUT */

    int n, q;
    scanf("%d%d", &n, &q);

    int *p_arr[n];

    if (n > 0) {
        for (int i = 0; i <  n; i++) {
            int tmp;
            scanf("%d", &tmp);

            int tmp_arr[tmp];
            p_arr[i] = tmp_arr;
            for (int j = 0; j < tmp; j++) {
                int value;
                scanf("%d", &value);
                p_arr[i][j] = value;
                printf("%d ", p_arr[i][j]);
            }
            printf("\n");
        }
    }

    if (q > 0) {
        for (int i = 0; i < q; i++) {
            int row, col;
            scanf("%d%d", &row, &col);
            printf ("%d %d\n", row, col);
            int answer = p_arr[row][col];
            printf("%d\n", answer);

        }
    }
    return 0;

}

Same result as always: it works for TestCollection, but not for MachineCollection.

Any ideas on how to solve this would be appreciated!

Answer 1

First, you need to be aware that Document DB imposes limits on response page size. This link summarizes some of those limits: Azure DocumentDb Storage Limits - what exactly do they mean?

Second, if you want to query large amounts of data from Document DB, you have to consider query performance. Please refer to this article: Tuning query performance with Azure Cosmos DB.

Looking at the Document DB REST API, you can see several important parameters that have a significant impact on query operations: x-ms-max-item-count and x-ms-continuation.
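To make the role of these two headers concrete, here is a small sketch of how the paging-related headers of a query request could be assembled. `build_query_headers` is a hypothetical helper and not part of any SDK; it covers only the query and paging headers and leaves out the authorization, date, and version headers that a real request also needs.

```python
def build_query_headers(max_item_count, continuation=None):
    """Build only the query/paging headers for a Document DB REST query.

    Hypothetical helper for illustration: a real request also needs
    authorization, x-ms-date and x-ms-version headers.
    """
    headers = {
        'Content-Type': 'application/query+json',
        'x-ms-documentdb-isquery': 'True',
        # Upper bound on the number of items returned in one page.
        'x-ms-max-item-count': str(max_item_count),
    }
    if continuation is not None:
        # Token taken from the previous response; asks for the next page.
        headers['x-ms-continuation'] = continuation
    return headers
```

The first request is sent without `x-ms-continuation`; each later request passes back the token returned in the previous response headers.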

As far as I know, the Azure portal does not automatically optimize your SQL for you, so you need to handle this in the SDK or the REST API.

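You can set a value for Max Item Count and page through the data with the continuation token. The Document DB SDK supports reading paged data seamlessly. You can refer to the following Python snippet: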

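# Assumes client, collection_link and query are already defined.
q = client.QueryDocuments(collection_link, query, {'maxItemCount': 10})
# First page: results_1[0] holds the documents, results_1[1] the response headers.
results_1 = q._fetch_function({'maxItemCount': 10})
# The continuation token is a string representing a JSON object.
token = results_1[1]['x-ms-continuation']
# Second page: pass the token back to continue where the first page ended.
results_2 = q._fetch_function({'maxItemCount': 10, 'continuation': token})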
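
The two-page snippet generalizes to a loop that keeps requesting pages until no continuation token comes back. Below is a minimal sketch of that loop. `read_all_pages` and `make_fake_fetch_page` are hypothetical names: the fake fetch function stands in for a call such as `q._fetch_function` and is assumed to return a `(documents, response_headers)` pair, so the example runs without a Cosmos DB account.

```python
def read_all_pages(fetch_page, max_item_count=10):
    """Collect documents from every page by following continuation tokens."""
    options = {'maxItemCount': max_item_count}
    documents = []
    while True:
        page, headers = fetch_page(options)
        documents.extend(page)
        token = headers.get('x-ms-continuation')
        if not token:
            # No token in the response headers means this was the last page.
            break
        options = {'maxItemCount': max_item_count, 'continuation': token}
    return documents


def make_fake_fetch_page(items):
    """Hypothetical in-memory stand-in for a real SDK fetch function."""
    def fetch_page(options):
        # The fake token is simply the index of the next item to return.
        start = int(options.get('continuation', 0))
        end = start + options['maxItemCount']
        headers = {'x-ms-continuation': str(end)} if end < len(items) else {}
        return items[start:end], headers
    return fetch_page


if __name__ == '__main__':
    fake = make_fake_fetch_page([{'id': str(i)} for i in range(25)])
    docs = read_all_pages(fake, max_item_count=10)
    print(len(docs))
```

Keeping each page small and processing it before requesting the next one (for example, appending its rows to a CSV file) avoids holding the whole result set in memory for a large collection.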

For another case, you can refer to: How do I set continuation tokens for Cosmos DB queries sent by document_client objects in Python?
