Exporting data from MongoDB to CSV using Python

Date: 2016-10-25 17:13:43

Tags: python mongodb csv

I am having trouble exporting to CSV with a Python script. Some array data needs to be exported from MongoDB to CSV, but the script below does not export it correctly because the three subfields are dumped into a single column. I want to split the three fields under the answers field (order, text, answerId) into three separate columns in the CSV.

Sample MongoDB document:

"answers": [
        {
            "order": 0,
            "text": {
                "en": "Yes"
            },
            "answerId": "527d65de7563dd0fb98fa28c"
        },
        {
            "order": 1,
            "text": {
                "en": "No"
            },
            "answerId": "527d65de7563dd0fb98fa28b"
        }
    ]

Python script:

import csv
cursor = db.questions.find ({},{'_id':1, 'answers.order':1, 'answers.text':1, 'answers.answerId':1})
cursor = list(cursor)
with open('answer_2.csv', 'w') as outfile:   

    fields = ['_id','answers.order', 'answers.text', 'answers.answerid']
    write = csv.DictWriter(outfile, fieldnames=fields)
    write.writeheader()
    for x in cursor: 
        for y, v in x.iteritems():
            if y == 'answers':
                print (y, v)
                write.writerow(v)
                write.writerow(x)

1 Answer:

Answer 0 (score: 8):

So... the problem is that the csv writer does not understand "subdictionaries" the way Mongo returns them.

If I understand correctly, when you query Mongo you get a dictionary like this:

{
   "_id": "a hex ID that correspond with the record that contains several answers",
   "answers": [ ... a list with a bunch of dicts in it... ]
}

So when csv.DictWriter tries to write, it writes only one dictionary (the topmost one). It doesn't know (or care) that answers is a list containing dictionaries whose values also need to be written into columns (the dot notation for accessing fields inside a dictionary, e.g. answers.order, is understood only by Mongo, not by the csv writer).
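A minimal sketch of that failure mode: to csv.DictWriter the dotted fieldnames are just opaque strings, so a nested Mongo-style document has no matching keys and writerow rejects the unexpected answers key (with the default extrasaction='raise'):

```python
import csv
import io

buf = io.StringIO()
writer = csv.DictWriter(buf, fieldnames=['_id', 'answers.order'])
writer.writeheader()

# The dotted fieldnames do not traverse into nested structures,
# so the 'answers' key has nowhere to go and writerow raises:
try:
    writer.writerow({'_id': 1, 'answers': [{'order': 0}]})
except ValueError as exc:
    error_message = str(exc)

print(error_message)  # message names the unexpected 'answers' key
```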

What I understand you should do is "walk" the answers list and create one dictionary for each record (each dictionary) in that list. Once you have that "flattened" list of dictionaries, you can iterate over them and write them to the csv file:

cursor = client.stack_overflow.stack_039.find(
    {}, {'_id': 1, 'answers.order': 1, 'answers.text': 1, 'answers.answerId': 1})

# Step 1: Create the list of dictionaries (one dictionary per entry in the `answers` list)
flattened_records = []
for answers_record in cursor:
    answers_record_id = answers_record['_id']
    for answer_record in answers_record['answers']:
        flattened_record = {
            '_id': answers_record_id,
            'answers.order': answer_record['order'],
            'answers.text': answer_record['text'],
            'answers.answerId': answer_record['answerId']
        }
        flattened_records.append(flattened_record)

# Step 2: Iterate through the list of flattened records and write them to the csv file
with open('stack_039.csv', 'w') as outfile:
    fields = ['_id', 'answers.order', 'answers.text', 'answers.answerId']
    write = csv.DictWriter(outfile, fieldnames=fields)
    write.writeheader()
    for flattened_record in flattened_records:
        write.writerow(flattened_record)

Note the plural form: answers_record is different from answer_record.

That creates a file like this:

$ cat ./stack_039.csv
_id,answers.order,answers.text,answers.answerId
580f9aa82de54705a2520833,0,{u'en': u'Yes'},527d65de7563dd0fb98fa28c
580f9aa82de54705a2520833,1,{u'en': u'No'},527d65de7563dd0fb98fa28b
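As the output shows, the answers.text column contains the raw dict repr. A hypothetical tweak to the answer's flattening step, assuming the text subdocument always keys its value under 'en' as in the sample data, would pull out the string instead:

```python
# One entry of the 'answers' list, as in the sample document:
answer_record = {
    'order': 0,
    'text': {'en': 'Yes'},
    'answerId': '527d65de7563dd0fb98fa28c',
}

flattened_record = {
    'answers.order': answer_record['order'],
    # .get() extracts the string; falls back to '' if 'en' is missing
    'answers.text': answer_record['text'].get('en', ''),
    'answers.answerId': answer_record['answerId'],
}

print(flattened_record['answers.text'])  # Yes
```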

Edit:

Your query (the one that produces cursor = db.questions.find({}, {'_id': 1, 'answers.order': 1, 'answers.text': 1, 'answers.answerId': 1})) returns all the entries in the questions collection. If that collection is very large, you may want to use the cursor as an iterator.

As you may have realized, the first for loop in the code above puts all the records into one list (the flattened_records list). You can do lazy loading by iterating over cursor instead (rather than loading all the items in memory: fetch one, do something with it, fetch the next, do something with it...).

It is slightly slower, but much more memory efficient.

cursor = client.stack_overflow.stack_039.find(
    {}, {'_id': 1, 'answers.order': 1, 'answers.text': 1, 'answers.answerId': 1})

with open('stack_039.csv', 'w') as outfile:
    fields = ['_id', 'answers.order', 'answers.text', 'answers.answerId']
    write = csv.DictWriter(outfile, fieldnames=fields)
    write.writeheader()
    for answers_record in cursor:  # Here we are using 'cursor' as an iterator
        answers_record_id = answers_record['_id']
        for answer_record in answers_record['answers']:
            flattened_record = {
                '_id': answers_record_id,
                'answers.order': answer_record['order'],
                'answers.text': answer_record['text'],
                'answers.answerId': answer_record['answerId']
            }
            write.writerow(flattened_record)

It will produce the same .csv file shown above.
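One side note: the answer's code targets Python 2 (hence the u'en' reprs in the output). If you run this under Python 3, open the file with newline='' so the csv module controls line endings itself; without it, extra blank rows can appear on Windows. A small self-contained sketch (the file path here is just for the demo):

```python
import csv
import os
import tempfile

# Demo path in the system temp directory (illustrative only)
path = os.path.join(tempfile.gettempdir(), 'stack_039_demo.csv')

# newline='' lets the csv module handle row terminators itself
with open(path, 'w', newline='') as outfile:
    writer = csv.DictWriter(outfile, fieldnames=['_id', 'answers.order'])
    writer.writeheader()
    writer.writerow({'_id': '580f9aa82de54705a2520833', 'answers.order': 0})

with open(path, newline='') as infile:
    rows = list(csv.reader(infile))

print(rows)  # [['_id', 'answers.order'], ['580f9aa82de54705a2520833', '0']]
```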