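I'm working with a non-nested JSON file; the data comes from reddit. I'm trying to convert it to a CSV file using Python. Not every row has the same fields, and I keep getting this error: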
JSONDecodeError: Extra data: line 2 column 1
Here is the code:
import csv
import json
import os
os.chdir('c:\\Users\\Desktop')
infile = open("data.json", "r")
outfile = open("outputfile.csv", "w")
writer = csv.writer(outfile)
for row in json.loads(infile.read()):
    writer.writerow(row)
Here are a few rows of the data:
{"author":"i_had_an_apostrophe","body":"\"It's not your fault.\"","author_flair_css_class":null,"link_id":"t3_5c0rn0","subreddit":"AskReddit","created_utc":1478736000,"subreddit_id":"t5_2qh1i","parent_id":"t1_d9t3q4d","author_flair_text":null,"id":"d9tlp0j"}
{"id":"d9tlp0k","author_flair_text":null,"parent_id":"t1_d9tame6","link_id":"t3_5c1efx","subreddit":"technology","created_utc":1478736000,"subreddit_id":"t5_2qh16","author":"willliam971","body":"9/11 inside job??","author_flair_css_class":null}
{"created_utc":1478736000,"subreddit_id":"t5_2qur2","link_id":"t3_5c44bz","subreddit":"excel","author":"excelevator","author_flair_css_class":"points","body":"Have you tried stepping through the code to analyse the values at each step?\n\n","author_flair_text":"442","id":"d9tlp0l","parent_id":"t3_5c44bz"}
{"created_utc":1478736000,"subreddit_id":"t5_2tycb","link_id":"t3_5c384j","subreddit":"OldSchoolCool","author":"10minutes_late","author_flair_css_class":null,"body":"**Thanks Hillary**","author_flair_text":null,"id":"d9tlp0m","parent_id":"t3_5c384j"}
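I'm thinking of using every field that appears anywhere in the file as the CSV header, and filling in NA wherever a row has no data for a particular field.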
Answer 0: (score: 1)
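Your question is short on information about what you're trying to accomplish, so I've guessed. Note that CSV files can't use "nulls" to represent missing fields; they only have delimiters with nothing between them, for example 1,2,,4,5 where the third field has no value.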
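To make the empty-field point concrete, here is a minimal sketch using Python's csv module. The in-memory buffer and the sample values are only for illustration; a missing value is represented here by None, which csv.writer writes as nothing at all.

```python
import csv
import io

# Illustrative in-memory buffer instead of a real file.
buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")

# None stands in for a missing third field.
writer.writerow([1, 2, None, 4, 5])
empty_field_line = buffer.getvalue()

# Reading the line back yields an empty string for the missing field.
parsed_row = next(csv.reader(io.StringIO(empty_field_line)))
print(empty_field_line, end="")
print(parsed_row)
```

Anything that reads the file back sees an empty string, not a null, so an explicit marker such as NA has to be written on purpose if you want one.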
Also, how you open a CSV file varies depending on whether you're using Python 2 or 3. The code below is for Python 3.
#!/usr/bin/env python3
import csv
import json
import os

os.chdir('c:\\Users\\Desktop')
with open('sampledata.json', 'r') as infile:
    # the sample data holds one JSON object per line, so parse it line by line
    data = [json.loads(line) for line in infile if line.strip()]

# determine all the keys present, which will each become csv fields
# (sorted so the column order is the same on every run)
fields = sorted(set(key for row in data for key in row))

with open('outputfile.csv', 'w', newline='') as outfile:
    writer = csv.DictWriter(outfile, fields)
    writer.writeheader()
    writer.writerows(data)
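For context on the error in the question: the sample data is a sequence of JSON objects, one per line, not a single JSON document, so handing the whole text to json.loads stops after the first object and reports "Extra data". A small sketch with two shortened, made-up rows in place of the real file:

```python
import json

# Two shortened sample objects, one per line (illustrative data only).
text = (
    '{"id": "d9tlp0j", "subreddit": "AskReddit"}\n'
    '{"id": "d9tlp0k", "subreddit": "technology"}\n'
)

# Parsing the whole text at once fails at the start of the second object.
try:
    json.loads(text)
    error_info = None
except json.JSONDecodeError as exc:
    error_info = (exc.msg, exc.lineno, exc.colno)

# Parsing each non-empty line separately works.
rows = [json.loads(line) for line in text.splitlines() if line.strip()]

print(error_info)
print(len(rows))
```

The error position (line 2, column 1) is where the second object starts, which matches the message in the question.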
Answer 1: (score: 0)
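You can write a small function that builds the rows for you, pulling the data where it is available and inserting None where it isn't. You call it the header, I call it the schema. Collect all the fields, remove duplicates and sort them, then build each record against the full set of fields and write those records to the CSV.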
import csv
import json

def build_record(row, schema):
    # one value per schema field; None (written as an empty cell) where the row lacks it
    values = []
    for field in schema:
        if field in row:
            values.append(row[field])
        else:
            values.append(None)
    return tuple(values)

# the input holds one JSON object per line
with open("data.json", "r") as infile:
    rows = [json.loads(line) for line in infile if line.strip()]

schema = tuple(sorted(set(k for r in rows for k in r)))
records = [build_record(r, schema) for r in rows]

# Python 3: open in text mode with newline="" (Python 2 would use "wb")
with open("outputfile.csv", "w", newline="") as outfile:
    writer = csv.writer(outfile)
    writer.writerow(schema)
    writer.writerows(records)
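If you want the literal text NA in the output, as the question asks, the same idea works with a fill value. This is only a sketch: the `fill` parameter is my own hypothetical addition, and the two shortened rows and the in-memory buffer stand in for the real files.

```python
import csv
import io
import json

def build_record(row, schema, fill="NA"):
    # `fill` is a hypothetical parameter: the value used when a row lacks a field.
    return tuple(row.get(field, fill) for field in schema)

# Shortened, made-up rows with different sets of fields.
lines = [
    '{"id": "d9tlp0j", "author": "i_had_an_apostrophe"}',
    '{"id": "d9tlp0k", "subreddit": "technology"}',
]
rows = [json.loads(line) for line in lines]
schema = tuple(sorted({key for row in rows for key in row}))

buffer = io.StringIO()
writer = csv.writer(buffer, lineterminator="\n")
writer.writerow(schema)
writer.writerows(build_record(row, schema) for row in rows)
csv_text = buffer.getvalue()
print(csv_text, end="")
```

Note that this only fills fields that are absent from a row. A field that is present with a JSON null still comes through as None and is written as an empty cell.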
Answer 2: (score: 0)
You can use Pandas to fill in the blanks for you (you may need to pip install pandas first):
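import pandas as pd
import os

# load json
os.chdir('c:\\Users\\Desktop')
with open("data.json", "r") as infile:
    # read data into a Pandas DataFrame; lines=True because the file
    # holds one JSON object per line
    df = pd.read_json(infile, lines=True)

# use Pandas to write to CSV
# (pass na_rep="NA" to write NA instead of empty cells for missing values)
df.to_csv("myfile.csv")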
Answer 3: (score: 0)
I'd suggest using the csv.DictWriter class. It takes a file to write to and a list of field names (which I worked out from your data sample).
import csv
import json
import os

fieldnames = [
    "author", "author_flair_css_class", "author_flair_text", "body",
    "created_utc", "id", "link_id", "parent_id", "subreddit",
    "subreddit_id"
]

os.chdir('c:\\Users\\Desktop')
# newline="" keeps the csv module from writing blank lines between rows on Windows
with open("data.json", "r") as infile, \
        open("outputfile.csv", "w", newline="") as outfile:
    writer = csv.DictWriter(outfile, fieldnames=fieldnames)
    writer.writeheader()
    # each line of the input is one JSON object
    for row in infile:
        row_dict = json.loads(row)
        writer.writerow(row_dict)
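Two DictWriter options are worth knowing about with a hard-coded field list. restval sets what is written when a row lacks a field, and extrasaction controls what happens when a row has a key that is not in fieldnames (the default, "raise", throws a ValueError). A sketch with made-up rows and an in-memory buffer, assuming you want NA for gaps and want unknown keys dropped:

```python
import csv
import io
import json

# Shortened, made-up rows; the second has a key ("score") that is not
# in the field list and lacks "author".
lines = [
    '{"id": "d9tlp0j", "author": "i_had_an_apostrophe"}',
    '{"id": "d9tlp0k", "score": 1}',
]
fieldnames = ["author", "id"]

buffer = io.StringIO()
writer = csv.DictWriter(
    buffer,
    fieldnames=fieldnames,
    restval="NA",           # written when a row lacks a field
    extrasaction="ignore",  # silently skip keys that are not in fieldnames
    lineterminator="\n",
)
writer.writeheader()
for line in lines:
    writer.writerow(json.loads(line))
csv_text = buffer.getvalue()
print(csv_text, end="")
```

Whether to ignore or raise on unknown keys is a judgment call: ignoring keeps the conversion running, while raising tells you the field list is out of date.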