Question

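I have two large CSV files with millions of rows each. Both CSVs come from MySQL, and I want to merge the two tables into a single document per record in CouchDB.

What is the most efficient way to do this?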

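My current method is:

Import the first CSV
Import the second CSV

To prevent duplicates, the program searches for the document using the key of each row. Once the document is found, it is updated with the data from the second CSV.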

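The problem is that searching for every row takes a really long time. While importing the second CSV, it updates 30 documents per second, and I have about 7 million rows. By a rough calculation, the whole import will take about 64 hours.

Thanks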

Answer 1

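It sounds like you have a "primary key" that you know from the row (or that you can compute from the row). That is the ideal choice for the document _id.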
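As a minimal sketch of that idea, the Python below turns each CSV row into a document whose _id is the row's primary key. The column name `id`, the helper name `row_to_doc`, and the sample data are all assumptions made for illustration; substitute whatever key your MySQL export actually contains.

```python
import csv
import io


def row_to_doc(row, key_field="id"):
    """Turn one CSV row (a dict) into a CouchDB document keyed by its primary key.

    `key_field` is a hypothetical column name; use your real key column.
    """
    # Copy every column except the key itself into the document body.
    doc = {k: v for k, v in row.items() if k != key_field}
    # CouchDB document ids must be strings.
    doc["_id"] = str(row[key_field])
    return doc


# Hypothetical sample data standing in for one of the MySQL exports.
sample = io.StringIO("id,name\n42,Alice\n43,Bob\n")
docs = [row_to_doc(row) for row in csv.DictReader(sample)]
print(docs)
```

Because both CSVs would map the same key to the same _id, rows that belong together end up targeting the same document.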

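The problem is that you get a 409 Conflict if you try to add the second CSV's data but a document with the same _id already exists. Is that correct? (If not, please correct me so I can fix the answer.)

I think there is a good answer for you:

Use _bulk_docs to import everything, then fix the conflicts.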

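Start with a clean database.

Use the bulk document API to insert all the rows from the 1st and then the 2nd CSV set, as many as possible per HTTP query, e.g. 1,000 at a time. (Bulk docs are much faster than inserting one by one.)

Always add "all_or_nothing": true to your _bulk_docs POST data. That will guarantee that every insertion succeeds (assuming no disasters such as a power loss or a full disk).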
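The batching described above might look roughly like the following Python sketch. The helper name `bulk_payloads` and the generated sample documents are assumptions for illustration, and the sketch assumes your server honors the "all_or_nothing" flag used in this answer. It only builds the JSON request bodies; sending them is left to whatever HTTP client you use.

```python
import json
from itertools import islice


def bulk_payloads(docs, batch_size=1000):
    """Yield JSON bodies for POST /db/_bulk_docs, batch_size docs at a time."""
    it = iter(docs)
    while True:
        # Pull the next batch without loading every document into memory.
        batch = list(islice(it, batch_size))
        if not batch:
            return
        yield json.dumps({"all_or_nothing": True, "docs": batch})


# Hypothetical input: 2,500 tiny documents produced lazily.
payloads = list(bulk_payloads({"_id": str(i)} for i in range(2500)))
print([len(json.loads(p)["docs"]) for p in payloads])  # -> [1000, 1000, 500]
```

Each payload string would then be POSTed to /db/_bulk_docs with a JSON content type, as in the curl transcript in this answer.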

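When you are finished, some documents will be conflicted, which means that you inserted twice for the same _id value. That is no problem. Just follow this procedure to merge the two versions:

For each _id that has conflicts, fetch it from CouchDB with GET /db/the_doc_id?conflicts=true.
Merge all the values from the conflicting versions into a new final version of the document.
Commit the final, merged document to CouchDB and delete the conflicted revisions. See the section on conflict resolution in CouchDB: The Definitive Guide. (You can use _bulk_docs to speed this up too.)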
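The merge step can be sketched in Python as below. The function name `resolve_conflicts` and the merge policy (on overlapping fields, the first revision's values win) are assumptions for illustration, not something CouchDB prescribes. It takes the full bodies of every conflicting revision and returns the "docs" list for one _bulk_docs request: the merged document on top of the kept revision, plus a deletion stub for each other revision.

```python
def resolve_conflicts(versions):
    """Build the _bulk_docs "docs" list that merges conflicting revisions.

    `versions` holds the full body of each conflicting revision of ONE
    document; the first entry is the revision to keep and update.
    """
    winner, losers = versions[0], versions[1:]
    merged = {}
    # Apply revisions in reverse so the winner's fields are written last.
    for doc in reversed(versions):
        merged.update({k: v for k, v in doc.items() if not k.startswith("_")})
    merged["_id"] = winner["_id"]
    merged["_rev"] = winner["_rev"]
    # Every other revision is removed with a deletion stub.
    stubs = [{"_id": d["_id"], "_rev": d["_rev"], "_deleted": True} for d in losers]
    return [merged] + stubs


versions = [
    {"_id": "some_id", "_rev": "1-0cb8fd1fd7801b94bcd2f365ce4812ba",
     "first_value": "This is the first value"},
    {"_id": "some_id", "_rev": "1-d1b74e67eee657f42e27614613936993",
     "second_value": "The second value is here"},
]
print(resolve_conflicts(versions))
```

The resulting list has the same shape as the "docs" array posted in the second _bulk_docs call of the example in this answer.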

Example

Hopefully this will clarify things a bit. Note that I installed the manage_couchdb couchapp from http://github.com/iriscouch/manage_couchdb. It has a simple view to show conflicts.

$ curl -XPUT -Hcontent-type:application/json localhost:5984/db
{"ok":true}

$ curl -XPOST -Hcontent-type:application/json localhost:5984/db/_bulk_docs --data-binary @-
{ "all_or_nothing": true
, "docs": [ { "_id": "some_id"
            , "first_value": "This is the first value"
            }
          , { "_id": "some_id"
            , "second_value": "The second value is here"
            }
          ]
}
[{"id":"some_id","rev":"1-d1b74e67eee657f42e27614613936993"},{"id":"some_id","rev":"1-d1b74e67eee657f42e27614613936993"}]

$ curl localhost:5984/db/_design/couchdb/_view/conflicts?reduce=false\&include_docs=true
{"total_rows":2,"offset":0,"rows":[
{"id":"some_id","key":["some_id","1-0cb8fd1fd7801b94bcd2f365ce4812ba"],"value":{"_id":"some_id","_rev":"1-0cb8fd1fd7801b94bcd2f365ce4812ba"},"doc":{"_id":"some_id","_rev":"1-0cb8fd1fd7801b94bcd2f365ce4812ba","first_value":"This is the first value"}},
{"id":"some_id","key":["some_id","1-d1b74e67eee657f42e27614613936993"],"value":{"_id":"some_id","_rev":"1-d1b74e67eee657f42e27614613936993"},"doc":{"_id":"some_id","_rev":"1-d1b74e67eee657f42e27614613936993","second_value":"The second value is here"}}
]}

$ curl -XPOST -Hcontent-type:application/json localhost:5984/db/_bulk_docs --data-binary @-
{ "all_or_nothing": true
, "docs": [ { "_id": "some_id"
            , "_rev": "1-0cb8fd1fd7801b94bcd2f365ce4812ba"
            , "first_value": "This is the first value"
            , "second_value": "The second value is here"
            }
          , { "_id": "some_id"
            , "_rev": "1-d1b74e67eee657f42e27614613936993"
            , "_deleted": true
            }
          ]
}
[{"id":"some_id","rev":"2-df5b9dc55e40805d7f74d1675af29c1a"},{"id":"some_id","rev":"2-123aab97613f9b621e154c1d5aa1371b"}]

$ curl localhost:5984/db/_design/couchdb/_view/conflicts?reduce=false\&include_docs=true
{"total_rows":0,"offset":0,"rows":[]}

$ curl localhost:5984/db/some_id?conflicts=true\&include_docs=true
{"_id":"some_id","_rev":"2-df5b9dc55e40805d7f74d1675af29c1a","first_value":"This is the first value","second_value":"The second value is here"}

The last two commands show that there are no conflicts, and the "merged" document is now served as "some_id".

Answer 2

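Another option is simply to do what you are doing already, but use the bulk document API to get a performance boost.

For each batch of documents:

POST to /db/_all_docs?include_docs=true with a body like this: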

{ "keys": [ "some_id_1"
          , "some_id_2"
          , "some_id_3"
          ]
}

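Build your _bulk_docs update depending on the results you get.
- The doc already exists, so you must update it: {"key":"some_id_1", "doc": {"existing":"data"}}
- The doc does not exist, so you must create it: {"key":"some_id_2", "error":"not_found"}

POST to /db/_bulk_docs with a body like this: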

{ "docs": [ { "_id": "some_id_1"
            , "_rev": "the _rev from the previous query"
            , "existing": "data"
            , "perhaps": "some more data I merged in"
            }
          , { "_id": "some_id_2"
            , "brand": "new data, since this is the first doc creation"
            }
          ]
}
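The two cases above can be handled in one pass over the _all_docs response. The Python below is a rough sketch under stated assumptions: the function name `build_bulk_update` is made up, `incoming` is a hypothetical dict mapping each _id to the new fields from the second CSV, and a row without a "doc" body is treated as "does not exist".

```python
def build_bulk_update(rows, incoming):
    """Build a _bulk_docs body from the "rows" of an _all_docs?include_docs=true reply.

    `incoming` maps each _id to the new fields to merge in (hypothetical input).
    """
    docs = []
    for row in rows:
        key = row["key"]
        existing = row.get("doc")
        if existing:
            # Doc already exists: keep its _id and _rev, merge the new fields in.
            doc = dict(existing)
            doc.update(incoming[key])
        else:
            # Doc not found: create it from scratch, with no _rev.
            doc = {"_id": key, **incoming[key]}
        docs.append(doc)
    return {"docs": docs}


rows = [
    {"id": "some_id_1", "key": "some_id_1", "value": {"rev": "1-abc"},
     "doc": {"_id": "some_id_1", "_rev": "1-abc", "existing": "data"}},
    {"key": "some_id_2", "error": "not_found"},
]
incoming = {
    "some_id_1": {"perhaps": "some more data I merged in"},
    "some_id_2": {"brand": "new data, since this is the first doc creation"},
}
print(build_bulk_update(rows, incoming))
```

This keeps the lookup-then-update logic of the original approach but costs two HTTP requests per batch instead of two per row.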
