I wrote a function that parses JSON objects, extracts the useful fields, and builds a Pandas DataFrame:
```python
def parse_metrics_to_df(metrics):
    def extract_details(row):
        row['trial'] = row['agent']['trial']
        row['numerosity'] = row['agent']['numerosity']
        row['reliable'] = row['agent']['reliable']
        row['was_correct'] = row['performance']['was_correct']
        return row

    df = pd.DataFrame(metrics)
    df = df.apply(extract_details, axis=1)
    df.drop(['agent', 'environment', 'performance'], axis=1, inplace=True)
    df.set_index('trial', inplace=True)
    return df
```
`metrics` is an array of JSON documents; its first two entries look like this:
```python
[{'agent': {'fitness': 25.2375,
            'numerosity': 1,
            'population': 1,
            'reliable': 0,
            'steps': 1,
            'total_steps': 1,
            'trial': 0},
  'environment': None,
  'performance': {'was_correct': True}},
 {'agent': {'fitness': 23.975625,
            'numerosity': 1,
            'population': 1,
            'reliable': 0,
            'steps': 1,
            'total_steps': 2,
            'trial': 1},
  'environment': None,
  'performance': {'was_correct': False}}]
```
It is then called as follows:
```python
df = parse_metrics_to_df(metrics)
```
The code works as expected, but it is extremely slow: parsing an array of one million objects takes almost an hour. What is the right way to do this?
Answer 0 (score: 1)
Manipulating the `Series` object row by row is the bottleneck. Building a plain `dict` for each row and creating a new `Series` from it is much faster.
```python
import pandas as pd

metrics = [{'agent': {'fitness': 25.2375,
                      'numerosity': 1,
                      'population': 1,
                      'reliable': 0,
                      'steps': 1,
                      'total_steps': 1,
                      'trial': 0},
            'environment': None,
            'performance': {'was_correct': True}},
           {'agent': {'fitness': 23.975625,
                      'numerosity': 1,
                      'population': 1,
                      'reliable': 0,
                      'steps': 1,
                      'total_steps': 2,
                      'trial': 1},
            'environment': None,
            'performance': {'was_correct': False}}]

thousand_metrics = metrics * 1000

def parse_metrics_to_df(metrics):
    def extract_details(row):
        row['trial'] = row['agent']['trial']
        row['numerosity'] = row['agent']['numerosity']
        row['reliable'] = row['agent']['reliable']
        row['was_correct'] = row['performance']['was_correct']
        return row

    df = pd.DataFrame(metrics)
    df = df.apply(extract_details, axis=1)
    df.drop(['agent', 'environment', 'performance'], axis=1, inplace=True)
    df.set_index('trial', inplace=True)
    return df

%timeit df = parse_metrics_to_df(thousand_metrics)
# 4.06 s ± 87.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
```python
def parse_metrics_to_df2(metrics):
    def extract_details(row):
        res = {}
        res['trial'] = row['agent']['trial']
        res['numerosity'] = row['agent']['numerosity']
        res['reliable'] = row['agent']['reliable']
        res['was_correct'] = row['performance']['was_correct']
        return pd.Series(res)

    df = pd.DataFrame(metrics)
    df = df.apply(extract_details, axis=1)
    df.set_index('trial', inplace=True)
    return df

df = parse_metrics_to_df(thousand_metrics)
df2 = parse_metrics_to_df2(thousand_metrics)
df.equals(df2)  # True

%timeit df2 = parse_metrics_to_df2(thousand_metrics)
# 566 ms ± 7.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
```
That is about 7x faster.
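Not part of either answer, but worth noting: recent pandas versions (1.0+) ship `pd.json_normalize`, which flattens nested records into dotted column names in a single pass, with no per-row `apply` at all. A minimal sketch on the question's data shape:

```python
import pandas as pd

# Hypothetical sample in the same shape as the question's metrics.
metrics = [
    {'agent': {'fitness': 25.2375, 'numerosity': 1, 'population': 1,
               'reliable': 0, 'steps': 1, 'total_steps': 1, 'trial': 0},
     'environment': None,
     'performance': {'was_correct': True}},
    {'agent': {'fitness': 23.975625, 'numerosity': 1, 'population': 1,
               'reliable': 0, 'steps': 1, 'total_steps': 2, 'trial': 1},
     'environment': None,
     'performance': {'was_correct': False}},
]

# Nested dicts become columns like 'agent.trial' and 'performance.was_correct';
# rename and select just the fields of interest, then index by trial.
flat = pd.json_normalize(metrics)
df = (flat.rename(columns={'agent.trial': 'trial',
                           'agent.numerosity': 'numerosity',
                           'agent.reliable': 'reliable',
                           'performance.was_correct': 'was_correct'})
          [['trial', 'numerosity', 'reliable', 'was_correct']]
          .set_index('trial'))
```

This keeps the flattening logic declarative, though for very large inputs a plain list comprehension over the raw dicts may still be faster since it skips the columns you do not need.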
Answer 1 (score: 1)
You should see a substantial speedup (~9x for me) by using a simple list comprehension. `pd.DataFrame` operations generally carry overhead that can be avoided by doing the extraction work before the data ever reaches the dataframe.
```python
def parse_metrics_to_df(metrics):
    def extract_details(row):
        row['trial'] = row['agent']['trial']
        row['numerosity'] = row['agent']['numerosity']
        row['reliable'] = row['agent']['reliable']
        row['was_correct'] = row['performance']['was_correct']
        return row

    df = pd.DataFrame(metrics)
    df = df.apply(extract_details, axis=1)
    df.drop(['agent', 'environment', 'performance'], axis=1, inplace=True)
    df.set_index('trial', inplace=True)
    return df
```
```python
def jp(metrics):
    lst = [[d['agent']['trial'], d['agent']['numerosity'], d['agent']['reliable'],
            d['performance']['was_correct']] for d in metrics]
    df = pd.DataFrame(lst, columns=['trial', 'numerosity', 'reliable', 'was_correct'])
    df = df.set_index('trial')
    return df

%timeit parse_metrics_to_df(metrics)  # 14.4 ms
%timeit jp(metrics)                   # 1.6 ms
```
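A variant on the same idea (my sketch, not part of the answer; the function name `to_df_columnwise` is made up): build column lists instead of row lists, so pandas receives column-oriented data in one shot and no column-name list has to be kept in sync with positional row values.

```python
import pandas as pd

# Hypothetical sample in the question's shape, trimmed to the fields used.
metrics = [
    {'agent': {'numerosity': 1, 'reliable': 0, 'trial': 0},
     'environment': None,
     'performance': {'was_correct': True}},
    {'agent': {'numerosity': 1, 'reliable': 0, 'trial': 1},
     'environment': None,
     'performance': {'was_correct': False}},
]

def to_df_columnwise(metrics):
    # One comprehension per column: each dict key lands next to its data,
    # and pandas builds every column array in a single constructor call.
    agents = [d['agent'] for d in metrics]
    data = {
        'trial': [a['trial'] for a in agents],
        'numerosity': [a['numerosity'] for a in agents],
        'reliable': [a['reliable'] for a in agents],
        'was_correct': [d['performance']['was_correct'] for d in metrics],
    }
    return pd.DataFrame(data).set_index('trial')

df = to_df_columnwise(metrics)
```

Whether this beats the row-list version will depend on the data; the main gain is readability, since a mismatch between column order and value order (as in the original `jp`) cannot happen.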