Question

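I currently have a function that reads data from mongo and exports it to a JSON file (which is then forwarded elsewhere). Large files (>2 GB) are a problem, because the script has to run on a k8s pod whose memory limit is set to 2 GB.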

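import logging

from bson import ObjectId
from bson.json_util import dumps  # assumed source of dumps(); it can serialize ObjectId values


def export_data(mongo, database, collection, metadata_id, file_path):
    # Fetch all documents that belong to the given metadata_id.
    cursor = mongo_find(mongo.client[database][collection], match={"metadata_id": ObjectId(metadata_id)})
    # Target directory layout: <file_path>/<database>/<collection>/
    path = file_path + '/' + database + '/' + collection + '/'
    create_directory(path)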

    ok = False  # becomes True after the first document, so commas go only between entries

    try:
        with open(path + metadata_id + '.json', 'w') as file:
            file.write('{"documents":[')
            for document in cursor:
                if ok:
                    file.write(',')
                ok = True
                file.write(dumps(document))
            file.write(']}')
    except IOError as e:
        logging.error("Failed exporting %s to json. Error: %s", metadata_id, e.strerror)
        return False
    logging.info("%s was successfully saved at temp location", metadata_id)
    return True


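def mongo_find(client, match=None):
    # Use None instead of a mutable default argument; an empty filter matches everything.
    # Note: list() pulls every matching document into memory at once.
    return list(client.find(match or {}))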

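Is it possible to:

Read the data from mongo in chunks? Read one chunk, append it to the archive file, then read the next one.

I know about the batch_size parameter of pymongo's find, but even though the documents are fetched in batches, it runs through all the batches before returning the output, and I end up with a single object containing all the documents. In the end, the cursor still holds the entire result in memory.

Write the chunks directly to a zip file, without writing an intermediate JSON file?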
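For illustration, here is a minimal sketch of how the two points above might be combined. It rests on several assumptions: the cursor returned by find() is iterated directly instead of being wrapped in list(), the output is a single JSON member inside a zip archive, and the names iter_documents, export_to_zip, member_name and serialize are hypothetical helpers introduced for this example only. The default serializer uses json.dumps with default=str; for real Mongo documents, bson.json_util.dumps would probably be passed as serialize instead.

```python
import json
import zipfile


def iter_documents(collection, match=None, batch_size=1000):
    """Yield documents one at a time instead of building a list."""
    # find() returns a lazy cursor. batch_size only controls how many
    # documents are fetched per round trip to the server.
    cursor = collection.find(match or {}, batch_size=batch_size)
    for document in cursor:
        yield document


def export_to_zip(documents, zip_path, member_name, serialize=None):
    """Stream documents into one JSON member of a zip archive.

    Returns the number of documents written.
    """
    if serialize is None:
        # Fallback serializer (an assumption of this sketch): unknown
        # types such as ObjectId are converted with str().
        def serialize(doc):
            return json.dumps(doc, default=str)

    count = 0
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        # force_zip64=True lets the member grow past 2 GiB even though
        # its final size is not known when writing starts.
        with archive.open(member_name, "w", force_zip64=True) as member:
            member.write(b'{"documents":[')
            for document in documents:
                if count:
                    member.write(b",")
                member.write(serialize(document).encode("utf-8"))
                count += 1
            member.write(b"]}")
    return count
```

With a real collection this could be called as export_to_zip(iter_documents(mongo.client[database][collection], {"metadata_id": ObjectId(metadata_id)}), path + metadata_id + '.zip', metadata_id + '.json', serialize=dumps). Since each document is written as soon as it is read, memory use should stay close to the size of one batch rather than the whole result set, although this has not been measured here.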

Stream data from Mongo to a zip file

0 answers: