Question

我必须在CouchDB localhost中插入1000万个文档。我使用python脚本以这种格式创建随机数据：

curl -d @db.json -H "Content-type: application/json" -X POST http://127.0.0.1:5984/new/_bulk_docs

文件大小为1.5 GB，因为每个文档中有10个键值对。

我正在使用此命令加载json文件：

<div>
<div>I'm text</div>
<div>I'm text</div>
</div>

对于100,000个文档，它可以加载10-15秒，但对于10,000,000，它甚至不能在12小时内加载。

有关如何在couchDB中批量插入的任何帮助将不胜感激。

TIA

Answer 1

最后，我将文件分成100个文件，每个文件有0.1 M记录，并通过此命令上传到数据库。

FOR /L %i IN (0,1,9) DO (
    curl -d @dbn%i.json -H "Content-type: application/json" -X POST http://127.0.0.1:5984/new4/_bulk_docs
)

Answer 2

我不熟悉CouchDB批量API，但您已经提到过有100＆＃000; 000条记录的批量请求有效，所以我怀疑10＆000; 000＆＃39; 000太多了一个人去。

考虑将10＆＃000＆＃39; 000条记录的大文件拆分为100＆＃000; 000条记录的较小JSON文件，并使用单独的请求发布每个块/批：

import json

# Batch function from: https://stackoverflow.com/a/8290508/7663649
def batch(iterable, n=1):
    l = len(iterable)
    for ndx in range(0, l, n):
        yield iterable[ndx:min(ndx + n, l)]

BATCH_SIZE = 100000
with open("db.json") as input_file:
    for batch_index, batch_list in enumerate(
            batch(json.load(input_file), BATCH_SIZE)):
        with open("chunk_{}.json".format(batch_index), "w") as chunk_file:
            json.dump(batch_list, chunk_file)

如何通过curl在couchDB中插入数百万个文档？

2 个答案: