I have a .json file in which every line is a separate object. For example, the first two lines are:
{"review_id":"x7mDIiDB3jEiPGPHOmDzyw","user_id":"msQe1u7Z_XuqjGoqhB0J5g","business_id": ...}
{"review_id":"dDl8zu1vWPdKGihJrwQbpw","user_id":"msQe1u7Z_XuqjGoqhB0J5g","business_id": ...}
I tried to process it with the ijson library as follows:
import ijson

with open(filename, 'r') as f:
    objects = ijson.items(f, 'columns.items')
    columns = list(objects)
However, I got this error:
JSONError: Additional data
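The error seems to be caused by the file containing multiple top-level objects.
What is the recommended way to analyze this kind of JSON file in Jupyter?
Thanks in advance.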
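Answer 0 (score: 2)
Each line is valid JSON on its own, but the file as a whole is not. You therefore cannot parse it in one go; you have to iterate over the lines and parse each one into an object.
You can collect these objects in a list and do your data processing from there: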
import json

object_list = []
with open(filename, 'r') as f:
    # Parse each line as its own JSON document
    for line in f:
        object_list.append(json.loads(line))
# object_list will contain all of your file's data
You can make this more Pythonic with a list comprehension:
with open(filename, 'r') as f:
    object_list = [json.loads(line) for line in f]
# object_list will contain all of your file's data
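If the file is too large to hold in memory comfortably, the same idea can be written as a generator that yields one object at a time. This is only a minimal sketch: the function name iter_json_lines is hypothetical, and skipping blank lines is an assumption about how you want to treat them.

```python
import json

def iter_json_lines(path):
    """Yield one parsed object per non-blank line of a line-delimited JSON file."""
    with open(path, "r", encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            line = line.strip()
            if not line:
                continue  # assumption: blank lines are ignored
            try:
                yield json.loads(line)
            except json.JSONDecodeError as exc:
                # Report which line is malformed instead of failing without context
                raise ValueError(f"invalid JSON on line {line_number}: {exc}") from exc
```

You would then loop with something like `for review in iter_json_lines(filename): ...`, so only one record is held in memory at a time.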
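Answer 1 (score: 2)
If this is the complete file, then it is not a single well-formed JSON document (it is in the one-object-per-line layout often called JSON Lines). To be one valid JSON document, there must be a comma between the objects, and the whole thing should start and end with square brackets, like this: [{...},{...}]. For your data, it would look like: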
[{"review_id":"x7mDIiDB3jEiPGPHOmDzyw","user_id":"msQe1u7Z_XuqjGoqhB0J5g","business_id": ...},
{"review_id":"dDl8zu1vWPdKGihJrwQbpw","user_id":"msQe1u7Z_XuqjGoqhB0J5g","business_id": ...}]
Here is some code to clean up the file:
with open("yourfile.json", "r") as f, open("cleanfile.json", "w") as g:
    # Drop blank lines and strip the trailing newline from each record
    records = [line.strip() for line in f if line.strip()]
    # Join the records with commas and wrap them in square brackets
    g.write("[" + ",\n".join(records) + "]")
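The same conversion can also be done one line at a time, validating each record along the way, so that the input never has to fit in memory. This is a rough sketch under assumptions: the helper name jsonl_to_array is made up for illustration, and blank lines are assumed to carry no data.

```python
import json

def jsonl_to_array(src_path, dst_path):
    """Rewrite a one-object-per-line file as a single JSON array; return the record count."""
    count = 0
    with open(src_path, "r", encoding="utf-8") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        dst.write("[")
        for line in src:
            line = line.strip()
            if not line:
                continue  # assumption: blank lines carry no data
            json.loads(line)  # raises json.JSONDecodeError if the line is not valid JSON
            if count:
                dst.write(",\n")  # separator goes before every record except the first
            dst.write(line)
            count += 1
        dst.write("]")
    return count
```

After calling `jsonl_to_array("yourfile.json", "cleanfile.json")`, the output file should load with a plain `json.load` and give you a list of objects.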
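To read the JSON file you could also consider the pandas library (https://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_json.html). With the cleaned file, read_json works directly; for the original one-object-per-line file, pass lines=True.
import pandas as pd
# get a pandas DataFrame from the cleaned JSON file
# (for the original line-per-object file, use pd.read_json(path, lines=True))
df = pd.read_json("path/to/your/filename.json")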
If you are not familiar with pandas, here is a quick primer on working with the DataFrame object:
df.head()  # gives you the first rows of the DataFrame
df["review_id"]  # gives you the column review_id as a Series
df.iloc[1, :]  # gives you the complete row with index 1
df.iloc[1, 2]  # gives you the item in the row with index 1 and the column with index 2
Answer 2 (score: 1)
The file contains multiple lines, each of which is a separate JSON object, and that is why it raises the error.
import json

with open(filename, 'r') as f:
    lines = f.readlines()
first = json.loads(lines[0])
second = json.loads(lines[1])
This should grab the first two lines and load each of them correctly.
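If you only want to peek at the first few records, you do not need readlines(), which loads every line of the file into a list. Here is a small sketch using itertools.islice from the standard library; the function name head_json_lines and the default n=2 are just illustrative choices, and it assumes the first n lines are not blank.

```python
import json
from itertools import islice

def head_json_lines(path, n=2):
    """Parse only the first n lines of a line-delimited JSON file."""
    with open(path, "r", encoding="utf-8") as f:
        # islice stops after n lines, so later lines are never parsed
        # (assumption: none of the first n lines is blank)
        return [json.loads(line) for line in islice(f, n)]
```

For the file in the question, `first, second = head_json_lines(filename)` would give the same two objects as the snippet above.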