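I am writing a Python script that reads data from multiple JSON files and writes it to a single output CSV file. I have written some code, but it does not work correctly. For readability I have formatted the JSON here; in the actual files it is all on one line. Each "requestId" contains multiple "id" values. My current code picks up only one "id" and repeats it 200 times, and I don't know why.
JSON file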
{
  "success":true,
  "errors":[
  ],
  "requestId":"3561c",
  "result":[
    {
      "id":257268,
      "name":"02 ",
      "description":"",
      "createdAt":"2017-10-06T11:29:40Z+0000",
      "updatedAt":"2017-11-07T13:38:11Z+0000",
      "url":"https",
      "subject":{
        "type":"Text",
        "value":"Are you ready"
      },
      "fromName":{
        "type":"Text",
        "value":"Centre"
      },
      "fromEmail":{
        "type":"Text",
        "value":"abc@xyz.com"
      },
      "replyEmail":{
        "type":"Text",
        "value":"noreply@xwz.com"
      },
      "folder":{
        "type":"Folder",
        "value":8041,
        "folderName":"A"
      },
      "operational":false,
      "textOnly":false,
      "publishToMSI":false,
      "webView":false,
      "status":"approved",
      "template":681,
      "workspace":"R",
      "version":1,
      "autoCopyToText":false
    },
    {
      "id":257273,
      "name":"02a",
      "description":"",
      "createdAt":"2017-10-06T11:29:46Z+0000",
      "updatedAt":"2017-11-07T13:38:19Z+0000",
      "url":"https:",
      "subject":{
        "type":"Text",
        "value":"Still have questions?"
      },
      "fromName":{
        "type":"Text",
        "value":"Centre"
      },
      "fromEmail":{
        "type":"Text",
        "value":"abc@xyz.com"
      },
      "replyEmail":{
        "type":"Text",
        "value":"noreply@xwz.com"
      },
      "folder":{
        "type":"Folder",
        "value":8041,
        "folderName":"A"
      },
      "operational":false,
      "textOnly":false,
      "publishToMSI":false,
      "webView":false,
      "status":"approved",
      "template":681,
      "workspace":"R",
      "version":1,
      "autoCopyToText":false
    },
Python code
import json
import csv
import os
import codecs
import sys
reload(sys)
sys.setdefaultencoding('utf8')
file_dir = os.path.normpath('/home/pp/jobs/staging/')
exp_dir = os.path.normpath('/home/pp/jobs/CSV/')
exp_file_name = 'emails.csv'
exp_path = os.path.join(exp_dir, exp_file_name)
my_dict_list =[]
try:
    for f in os.listdir(file_dir):
        if f.endswith('.json') and f.startswith('emails_'):
            file_path = os.path.join(file_dir, f)
            data = open(file_path, 'r')
            for line in data:
                my_dict = {}
                parsed_data = json.loads(line)
                my_dict["REQUEST_ID"] = parsed_data["requestId"]
                my_dict["SUCCESS"] = parsed_data["success"]
                for result in parsed_data["result"]:
                    my_dict["RESULT_ID"] = result["id"]
                    my_dict["NAME"] = result["name"]
                    my_dict["DESCRIPTION"] = result.get("description")
                    my_dict["STATUS"] = result["status"].encode('utf-8')
                    my_dict["FOLDER_TYPE"] = result["folder"]["type"]
                    my_dict["FOLDER_ID"] = result["folder"]["value"]
                    my_dict["FOLDER_NAME"] = result["folder"]["folderName"]
                    my_dict["FROM_EMAIL_TYPE"] = result["fromEmail"]["type"]
                    my_dict["FROM_EMAIL_VALUE"] = result["fromEmail"]["value"]
                    my_dict["FROM_NAME_TYPE"] = result["fromName"]["type"]
                    my_dict["FROM_NAME_VALUE"] = result["fromName"]["value"]
                    my_dict["REPLY_EMAIL_TYPE"] = result["replyEmail"]["type"]
                    my_dict["REPLY_EMAIL_VALUE"] = result["replyEmail"]["value"]
                    my_dict["SUBJECT_TYPE"] = result["subject"]["type"]
                    my_dict["SUBJECT_VALUE"] = result["subject"]["value"]
                    my_dict["OPERATIONAL"] = result["operational"]
                    my_dict["PUBLISH_TO_MSI"] = result["publishToMSI"]
                    my_dict["TEMPLATE"] = result["template"]
                    my_dict["TEXT_ONLY"] = result["textOnly"]
                    my_dict["URL"] = result.get("url")
                    my_dict["WEBVIEW"] = result["webView"]
                    my_dict["CREATED_AT"] = result["createdAt"]
                    my_dict["UPDATED_AT"] = result["updatedAt"]
                    my_dict["WORKSPACE"] = result["workspace"]
                    my_dict_list.append(my_dict)
    csv_columns = ["REQUEST_ID","SUCCESS","RESULT_ID","NAME","DESCRIPTION","STATUS","FOLDER_TYPE","FOLDER_ID","FOLDER_NAME","FROM_EMAIL_TYPE","FROM_EMAIL_VALUE","FROM_NAME_TYPE","FROM_NAME_VALUE","REPLY_EMAIL_TYPE","REPLY_EMAIL_VALUE","SUBJECT_TYPE","SUBJECT_VALUE","OPERATIONAL","PUBLISH_TO_MSI","TEMPLATE","TEXT_ONLY","URL","WEBVIEW","CREATED_AT","UPDATED_AT","WORKSPACE"]
    with open(exp_path,'wb') as csvfile:
        xz = csv.DictWriter(csvfile,fieldnames=csv_columns)
        headers = {}
        for n in xz.fieldnames:
            headers[n] = n
        xz.writerow(headers)
        for data in my_dict_list:
            xz.writerow(data)
except Exception as exception:
    print("Please check the logs. JSON to CSV conversion failed for Emails: ", exception)
Answer 0 (score: 1)
Look here:
my_dict_list =[]
try:
    for f in os.listdir(file_dir):
        if f.endswith('.json') and f.startswith('emails_'):
            file_path = os.path.join(file_dir, f)
            data = open(file_path, 'r')
            for line in data:
                my_dict = {}
                parsed_data = json.loads(line)
                # ...
                for result in parsed_data["result"]:
                    # ...
                    my_dict_list.append(my_dict)
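my_dict is a dictionary that is created afresh only once per line of the file, but what you seem to want is one dictionary per element of parsed_data["result"]. If you append the same dictionary to the list inside the loop and then mutate it, you are really putting several "copies" of one object into the list, and every mutation shows up in all of them. ("Copy" is a misleading word here, because the list entries are just references to the same dict.)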
To fix your problem, try replacing this:
my_dict_list.append(my_dict)
with this:
my_dict_list.append(dict(my_dict))
This makes a (shallow) copy of the dictionary before appending it to the list.
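The difference between appending the same dict and appending a copy can be seen in a small standalone sketch. It uses Python 3 syntax, and the names `shared`, `aliased`, and `copied` are made up for illustration; they do not appear in the question.

```python
# Illustration only: `shared`, `aliased`, and `copied` are hypothetical names.
results = [{"id": 257268}, {"id": 257273}]

shared = {}    # created once, outside the loop, like my_dict in the question
aliased = []   # receives the same object on every iteration
copied = []    # receives a shallow copy on every iteration

for result in results:
    shared["RESULT_ID"] = result["id"]
    aliased.append(shared)        # stores a reference to the one shared dict
    copied.append(dict(shared))   # stores an independent snapshot

aliased_ids = [row["RESULT_ID"] for row in aliased]
copied_ids = [row["RESULT_ID"] for row in copied]
print(aliased_ids)  # [257273, 257273]
print(copied_ids)   # [257268, 257273]
```

Because the copy is shallow, any nested dicts or lists stored as values would still be shared between copies. The rows built in the question hold only strings, numbers, and booleans, so a shallow copy is enough there.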
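Answer 1 (score: 1)
This is a common pitfall in Python. The important point is that my_dict is a reference (in effect, a pointer) to a dict.
What happens here is that you define my_dict (a reference to a dict), update it with a set of values, and append it to the list. Then, on the second iteration of the loop, you change the values in my_dict and append it to the second position of the list. But my_dict is also in the first position, so the new values are now visible at both index 0 and index 1.
As a result, all the values in every dict in the list are updated, not just the ID. This continues until the last iteration of the loop, at which point every entry in the list (they are all my_dict) holds the values of the last dict in result.
One way to fix this is to define a new dict on each iteration: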
for line in data:
    parsed_data = json.loads(line)
    for result in parsed_data["result"]:
        my_dict = {}
        my_dict["REQUEST_ID"] = parsed_data["requestId"]
        my_dict["SUCCESS"] = parsed_data["success"]
        my_dict["RESULT_ID"] = result["id"]
        my_dict["NAME"] = result["name"]
        my_dict["DESCRIPTION"] = result.get("description")
        my_dict["STATUS"] = result["status"].encode('utf-8')
        my_dict["FOLDER_TYPE"] = result["folder"]["type"]
        my_dict["FOLDER_ID"] = result["folder"]["value"]
        my_dict["FOLDER_NAME"] = result["folder"]["folderName"]
        my_dict["FROM_EMAIL_TYPE"] = result["fromEmail"]["type"]
        my_dict["FROM_EMAIL_VALUE"] = result["fromEmail"]["value"]
        my_dict["FROM_NAME_TYPE"] = result["fromName"]["type"]
        my_dict["FROM_NAME_VALUE"] = result["fromName"]["value"]
        my_dict["REPLY_EMAIL_TYPE"] = result["replyEmail"]["type"]
        my_dict["REPLY_EMAIL_VALUE"] = result["replyEmail"]["value"]
        my_dict["SUBJECT_TYPE"] = result["subject"]["type"]
        my_dict["SUBJECT_VALUE"] = result["subject"]["value"]
        my_dict["OPERATIONAL"] = result["operational"]
        my_dict["PUBLISH_TO_MSI"] = result["publishToMSI"]
        my_dict["TEMPLATE"] = result["template"]
        my_dict["TEXT_ONLY"] = result["textOnly"]
        my_dict["URL"] = result.get("url")
        my_dict["WEBVIEW"] = result["webView"]
        my_dict["CREATED_AT"] = result["createdAt"]
        my_dict["UPDATED_AT"] = result["updatedAt"]
        my_dict["WORKSPACE"] = result["workspace"]
        my_dict_list.append(my_dict)
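For readers on Python 3, the same "one new dict per result" idea might look like the sketch below. It is not the asker's script: `flatten_results` is a hypothetical helper, only a handful of the columns are kept to stay short, the sample data is a trimmed-down stand-in for the JSON in the question, and the CSV is written to an in-memory buffer rather than to emails.csv.

```python
import csv
import io
import json


def flatten_results(parsed_data):
    """Hypothetical helper: return one new dict per element of "result"."""
    rows = []
    for result in parsed_data["result"]:
        row = {  # a fresh dict is built on every iteration
            "REQUEST_ID": parsed_data["requestId"],
            "RESULT_ID": result["id"],
            "NAME": result["name"],
            "FOLDER_NAME": result["folder"]["folderName"],
            "SUBJECT_VALUE": result["subject"]["value"],
        }
        rows.append(row)
    return rows


# Trimmed-down sample in the shape of the JSON from the question.
sample = json.loads(
    '{"requestId": "3561c", "result": ['
    '{"id": 257268, "name": "02 ", "folder": {"folderName": "A"},'
    ' "subject": {"value": "Are you ready"}},'
    '{"id": 257273, "name": "02a", "folder": {"folderName": "A"},'
    ' "subject": {"value": "Still have questions?"}}]}'
)

rows = flatten_results(sample)

# Write to an in-memory buffer so the sketch does not touch the file system.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(rows[0].keys()))
writer.writeheader()      # replaces the manual header dict from the question
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

When writing to a real file in Python 3, the csv module documentation recommends opening it in text mode with `newline=''` instead of the `'wb'` mode used in the Python 2 code above.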
Answer 2 (score: 0)
Why read the file line by line, when each file is already a single line?
This part:
data = open(file_path, 'r')
for line in data:
    my_dict = {}
    parsed_data = json.loads(line)
can be simplified to:
my_dict = {}
parsed_data = json.loads(open(file_path, 'r').read())
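A variant of this simplification, sketched in Python 3 syntax: a `with` block closes the file promptly, and `json.load` parses the file object directly without an explicit `.read()`. The temporary directory, the file name `emails_1.json`, and its contents are stand-ins invented for the demonstration.

```python
import json
import os
import tempfile

# Stand-in for one of the emails_*.json files; the content is made up.
with tempfile.TemporaryDirectory() as tmp_dir:
    file_path = os.path.join(tmp_dir, "emails_1.json")
    with open(file_path, "w", encoding="utf-8") as handle:
        handle.write(
            '{"requestId": "3561c", "result": [{"id": 257268}, {"id": 257273}]}'
        )

    # json.load reads and parses the entire file, however many lines it spans.
    with open(file_path, "r", encoding="utf-8") as handle:
        parsed_data = json.load(handle)

result_ids = [result["id"] for result in parsed_data["result"]]
print(result_ids)  # [257268, 257273]
```

Parsing the whole file at once also keeps working if a file is ever pretty-printed across several lines, whereas calling json.loads on each line separately would fail on the first partial line.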