Optimizing "apply" in Python Pandas

Asked: 2018-03-26 13:34:59

Tags: python pandas

I created a function responsible for parsing JSON objects, extracting the useful fields, and building a Pandas dataframe.

import pandas as pd

def parse_metrics_to_df(metrics):
    def extract_details(row):
        row['trial'] = row['agent']['trial']
        row['numerosity'] = row['agent']['numerosity']
        row['reliable'] = row['agent']['reliable']
        row['was_correct'] = row['performance']['was_correct']
        return row

    df = pd.DataFrame(metrics)
    df = df.apply(extract_details, axis=1)
    df.drop(['agent', 'environment', 'performance'], axis=1, inplace=True)
    df.set_index('trial', inplace=True)

    return df

metrics is an array of JSON documents like the following (first two entries):

[{'agent': {'fitness': 25.2375,
   'numerosity': 1,
   'population': 1,
   'reliable': 0,
   'steps': 1,
   'total_steps': 1,
   'trial': 0},
  'environment': None,
  'performance': {'was_correct': True}},
 {'agent': {'fitness': 23.975625,
   'numerosity': 1,
   'population': 1,
   'reliable': 0,
   'steps': 1,
   'total_steps': 2,
   'trial': 1},
  'environment': None,
  'performance': {'was_correct': False}}]

It is then invoked like this:

df = parse_metrics_to_df(metrics)


The code works as expected, but it is extremely slow: parsing an array of one million objects takes nearly an hour.

What is the right way to do this?

2 Answers:

Answer 0 (score: 1)

Mutating the Series object row by row is the bottleneck. Building a new dict from each row is much faster.

Setup

import pandas as pd

metrics = [{'agent': {'fitness': 25.2375,
   'numerosity': 1,
   'population': 1,
   'reliable': 0,
   'steps': 1,
   'total_steps': 1,
   'trial': 0},
  'environment': None,
  'performance': {'was_correct': True}},
 {'agent': {'fitness': 23.975625,
   'numerosity': 1,
   'population': 1,
   'reliable': 0,
   'steps': 1,
   'total_steps': 2,
   'trial': 1},
  'environment': None,
  'performance': {'was_correct': False}}]
thousand_metrics = metrics * 1000 

Original code

def parse_metrics_to_df(metrics):
    def extract_details(row):
        row['trial'] = row['agent']['trial']
        row['numerosity'] = row['agent']['numerosity']
        row['reliable'] = row['agent']['reliable']
        row['was_correct'] = row['performance']['was_correct']
        return row

    df = pd.DataFrame(metrics)
    df = df.apply(extract_details, axis=1)
    df.drop(['agent', 'environment', 'performance'], axis=1, inplace=True)
    df.set_index('trial', inplace=True)

    return df

%timeit df = parse_metrics_to_df(thousand_metrics)

# 4.06 s ± 87.3 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Modified code

def parse_metrics_to_df2(metrics):
    def extract_details(row):
        res = {}
        res['trial'] = row['agent']['trial']
        res['numerosity'] = row['agent']['numerosity']
        res['reliable'] = row['agent']['reliable']
        res['was_correct'] = row['performance']['was_correct']
        return pd.Series(res)

    df = pd.DataFrame(metrics)
    df = df.apply(extract_details, axis=1)
    df.set_index('trial', inplace=True)

    return df

df = parse_metrics_to_df2(thousand_metrics)
df2 = parse_metrics_to_df2(thousand_metrics)
df.equals(df2) # True

%timeit df2 = parse_metrics_to_df2(thousand_metrics)

# 566 ms ± 7.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Now it is about 7x faster.
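For nested JSON like this, `pd.json_normalize` (a top-level pandas function since 0.25) can also flatten the records directly, avoiding `apply` entirely. A minimal sketch on a small sample shaped like the question's `metrics` (the sample values here are made up):

```python
import pandas as pd

# Hypothetical sample mirroring the structure of `metrics` in the question
metrics = [
    {'agent': {'trial': 0, 'numerosity': 1, 'reliable': 0},
     'environment': None,
     'performance': {'was_correct': True}},
    {'agent': {'trial': 1, 'numerosity': 1, 'reliable': 0},
     'environment': None,
     'performance': {'was_correct': False}},
]

# json_normalize flattens nested dicts into dotted column names
# such as 'agent.trial' and 'performance.was_correct'
df = pd.json_normalize(metrics)
df = df.rename(columns={
    'agent.trial': 'trial',
    'agent.numerosity': 'numerosity',
    'agent.reliable': 'reliable',
    'performance.was_correct': 'was_correct',
})[['trial', 'numerosity', 'reliable', 'was_correct']].set_index('trial')
print(df)
```

This keeps the flattening in a single vectorized constructor call rather than one Python-level function call per row.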

Answer 1 (score: 1)

You should see a significant speedup (~9x for me) by using a simple list comprehension.

In general, row-wise pd.DataFrame operations incur overhead that can be avoided by doing the work before putting the data into a dataframe.

def parse_metrics_to_df(metrics):
    def extract_details(row):
        row['trial'] = row['agent']['trial']
        row['numerosity'] = row['agent']['numerosity']
        row['reliable'] = row['agent']['reliable']
        row['was_correct'] = row['performance']['was_correct']
        return row

    df = pd.DataFrame(metrics)
    df = df.apply(extract_details, axis=1)
    df.drop(['agent', 'environment', 'performance'], axis=1, inplace=True)
    df.set_index('trial', inplace=True)

    return df


def jp(metrics):

    lst = [[d['agent']['trial'], d['agent']['numerosity'], d['agent']['reliable'],
            d['performance']['was_correct']] for d in metrics]

    df = pd.DataFrame(lst, columns=['trial', 'numerosity', 'reliable', 'was_correct'])
    df = df.set_index('trial')

    return df

%timeit parse_metrics_to_df(metrics)   # 14.4 ms
%timeit jp(metrics)                    # 1.6 ms
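The comprehension above can equally build a list of dicts, letting pandas infer the column names from the keys. A sketch under the same assumptions about the `metrics` structure (sample values made up):

```python
import pandas as pd

# Hypothetical sample mirroring the structure of `metrics` in the question
metrics = [
    {'agent': {'trial': 0, 'numerosity': 1, 'reliable': 0},
     'environment': None,
     'performance': {'was_correct': True}},
    {'agent': {'trial': 1, 'numerosity': 1, 'reliable': 0},
     'environment': None,
     'performance': {'was_correct': False}},
]

# Flatten each record into a plain dict first, then build the
# DataFrame in one shot; 'trial' becomes the index directly.
records = [{'trial': d['agent']['trial'],
            'numerosity': d['agent']['numerosity'],
            'reliable': d['agent']['reliable'],
            'was_correct': d['performance']['was_correct']}
           for d in metrics]

df = pd.DataFrame.from_records(records, index='trial')
print(df)
```

Using dicts instead of positional lists makes the column mapping explicit, so a mismatch between values and column names (as in the `jp` version) cannot happen silently.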