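I would like to optimize the code below. It takes about 5 seconds, which is too slow for a file of only 1000 lines.
I have a large file in which every line contains valid JSON. Each JSON object looks like the following (the real data is much larger and more deeply nested, so I am using this snippet for illustration):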
{"location":{"town":"Rome","groupe":"Advanced",
"school":{"SchoolGroupe":"TrowMet", "SchoolName":"VeronM"}},
"id":"145",
"Mother":{"MotherName":"Helen","MotherAge":"46"},"NGlobalNote":2,
"Father":{"FatherName":"Peter","FatherAge":"51"},
"Teacher":["MrCrock","MrDaniel"],"Field":"Marketing",
"season":["summer","spring"]}
I need to parse this file and extract only a few key values from each JSON object, to obtain the following dataframe:
Groupe    Id  MotherName  FatherName
Advanced  56  Laure       James
Middle    11  Ann         Nicolas
Advanced  6   Helen       Franc
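However, some of the keys I need in the dataframe are missing from some of the JSON objects, so I have to check whether each key is present and, if it is not, fill the corresponding value with Null. I use the following approach: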
import json

import numpy as np
import pandas as pd

# Column names match the keys of the dicts appended below
df = pd.DataFrame(columns=['groupe', 'id', 'MotherName', 'FatherName'])
with open('path/to/file') as f:
    for chunk in f:
        jfile = json.loads(chunk)
        if 'groupe' in jfile['location']:
            groupe = jfile['location']['groupe']
        else:
            groupe = np.nan
        if 'id' in jfile:
            id = jfile['id']
        else:
            id = np.nan
        if 'MotherName' in jfile['Mother']:
            MotherName = jfile['Mother']['MotherName']
        else:
            MotherName = np.nan
        if 'FatherName' in jfile['Father']:
            FatherName = jfile['Father']['FatherName']
        else:
            FatherName = np.nan
        # Append one row per line (note: DataFrame.append was removed in pandas 2.0)
        df = df.append({"groupe": groupe, "id": id, "MotherName": MotherName, "FatherName": FatherName},
                       ignore_index=True)
I need to bring the run time for the whole 1000-line file down to <= 2 seconds. In Perl the same parsing takes < 1 second, but I need to implement it in Python.
Answer 0 (score: 1)
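The key point is not to append each row to the dataframe inside the loop. Instead, collect the values in a list or dict container and build the dataframe from them in one step at the end. You can also replace the if/else structure with a simple get, which returns a default value (e.g. np.nan) when the item is not found in the dictionary.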
with open('path/to/file') as f:
    # Keys must match the ones appended to below
    d = {'groupe': [], 'id': [], 'MotherName': [], 'FatherName': []}
    for chunk in f:
        jfile = json.loads(chunk)
        d['groupe'].append(jfile['location'].get('groupe', np.nan))
        d['id'].append(jfile.get('id', np.nan))
        d['MotherName'].append(jfile['Mother'].get('MotherName', np.nan))
        d['FatherName'].append(jfile['Father'].get('FatherName', np.nan))
# Build the dataframe once, after the loop
df = pd.DataFrame(d)
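The same idea also works with a list of row dicts instead of a dict of lists. The sketch below is only an illustration, not part of the answer's code: the helper names dig and collect_rows are made up here, and float("nan") stands in for np.nan so that the snippet runs without NumPy. Unlike jfile['Mother'].get(...), the helper also tolerates a record where the parent key ("Mother", "Father" or "location") is missing altogether.

```python
import json

NAN = float("nan")  # plain float NaN, used here in place of np.nan


def dig(record, *keys, default=NAN):
    """Follow keys through nested dicts; return default if any level is missing."""
    current = record
    for key in keys:
        if not isinstance(current, dict) or key not in current:
            return default
        current = current[key]
    return current


def collect_rows(lines):
    """Parse one JSON object per line into a list of flat row dicts."""
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        rows.append({
            "groupe": dig(record, "location", "groupe"),
            "id": dig(record, "id"),
            "MotherName": dig(record, "Mother", "MotherName"),
            "FatherName": dig(record, "Father", "FatherName"),
        })
    return rows


# Two hypothetical input lines; the second one lacks most of the keys
sample = [
    '{"location": {"groupe": "Advanced"}, "id": "145", '
    '"Mother": {"MotherName": "Helen"}, "Father": {"FatherName": "Peter"}}',
    '{"location": {"town": "Rome"}, "id": "11"}',
]
rows = collect_rows(sample)
```

With pandas imported as pd, the list can then be turned into a dataframe in a single call, for example pd.DataFrame(rows), so the frame is still built only once.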
Answer 1 (score: 1)
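You will get the best performance if you can build the dataframe in a single step at initialization. DataFrame.from_records takes a sequence of tuples, which you can supply from a generator that reads one record at a time. You can parse the data faster with get, which returns a default value when the item is not found. I created an empty dict called dummy to pass as the default for the intermediate get, so that the chained get calls always work.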
I created a 1000-record dataset, and on my crappy laptop the run time went from 18 seconds to 0.06 seconds, roughly a 300-fold speedup.
import json
import time

import numpy as np
import pandas as pd


def extract_data(data):
    """Convert one JSON line to a record tuple for import."""
    dummy = {}  # empty fallback so the chained .get() calls never fail
    jfile = json.loads(data.strip())
    return (
        jfile.get('location', dummy).get('groupe', np.nan),
        jfile.get('id', np.nan),
        jfile.get('Mother', dummy).get('MotherName', np.nan),
        jfile.get('Father', dummy).get('FatherName', np.nan))


start = time.time()
with open('file.json') as f:
    # Column order matches the tuple order returned by extract_data
    df = pd.DataFrame.from_records(map(extract_data, f),
                                   columns=['groupe', 'id', 'MotherName', 'FatherName'])
print('New algorithm', time.time() - start)
#
# The original way
#
start = time.time()
df = pd.DataFrame(columns=['groupe', 'id', 'MotherName', 'FatherName'])
with open('file.json') as f:
    for chunk in f:
        jfile = json.loads(chunk)
        if 'groupe' in jfile['location']:
            groupe = jfile['location']['groupe']
        else:
            groupe = np.nan
        if 'id' in jfile:
            id = jfile['id']
        else:
            id = np.nan
        if 'MotherName' in jfile['Mother']:
            MotherName = jfile['Mother']['MotherName']
        else:
            MotherName = np.nan
        if 'FatherName' in jfile['Father']:
            FatherName = jfile['Father']['FatherName']
        else:
            FatherName = np.nan
        # DataFrame.append was removed in pandas 2.0, so this baseline needs an older pandas
        df = df.append({"groupe": groupe, "id": id, "MotherName": MotherName, "FatherName": FatherName},
                       ignore_index=True)
print('original', time.time() - start)
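The answer does not show how the 1000-record test file was produced, so here is one possible way to generate comparable input for your own timings. Everything in this sketch is an assumption made for illustration: the function names make_lines and extract, the field values, and the choice to drop one top-level key from roughly a quarter of the records. It uses only the standard library (float("nan") in place of np.nan) and times the JSON parsing step alone, not the dataframe construction.

```python
import json
import random
import time

NAN = float("nan")  # plain float NaN, used here in place of np.nan


def make_lines(n, seed=0):
    """Build n JSON lines shaped like the question's sample, dropping some keys at random."""
    rng = random.Random(seed)  # seeded so repeated runs produce the same data
    lines = []
    for i in range(n):
        record = {
            "location": {"town": "Rome", "groupe": rng.choice(["Advanced", "Middle"])},
            "id": str(i),
            "Mother": {"MotherName": "Helen", "MotherAge": "46"},
            "Father": {"FatherName": "Peter", "FatherAge": "51"},
        }
        # Remove one random top-level key from about a quarter of the records
        if rng.random() < 0.25:
            del record[rng.choice(["location", "id", "Mother", "Father"])]
        lines.append(json.dumps(record))
    return lines


def extract(line):
    """Turn one JSON line into a (groupe, id, MotherName, FatherName) tuple."""
    dummy = {}  # empty fallback for missing parent objects
    jfile = json.loads(line)
    return (
        jfile.get("location", dummy).get("groupe", NAN),
        jfile.get("id", NAN),
        jfile.get("Mother", dummy).get("MotherName", NAN),
        jfile.get("Father", dummy).get("FatherName", NAN),
    )


lines = make_lines(1000)
start = time.perf_counter()
records = [extract(line) for line in lines]
elapsed = time.perf_counter() - start
print(f"parsed {len(records)} records in {elapsed:.4f} s")
```

To time the full pipeline, write the lines to a file with "\n".join(lines) or pass records straight to pd.DataFrame.from_records with the four column names.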