Optimizing a Python DataFrame

Asked: 2019-03-19 16:27:25

Tags: python pandas

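I have the following code, which takes 800 ms to execute even though there is not much data: only a few columns and a few rows. Is there any way to make it faster? I really don't know where the bottleneck in this code is.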

def compute_s_t(df,
                gb=('session_time', 'trajectory_id'),
                params=('t', 's', 's_normalized', 'v_direct', 't_abs', ),
                fps=25, inplace=True):

    if not inplace:
        df = df.copy()

    orig_columns = df.columns.tolist()
    # compute travelled distance
    df['dx'] = df['x_world'].diff()
    df['dy'] = df['y_world'].diff()
    t1 = datetime.datetime.now()

    df['ds'] = np.sqrt(np.array(df['dx'] ** 2 + df['dy'] ** 2, dtype=np.float32))


    df['ds'].iloc[0] = 0  # to avoid NaN returned by .diff()
    df['s'] = df['ds'].cumsum()
    df['s'] = (df.groupby('trajectory_id')['s']
                 .transform(subtract_nanmin))

    # compute travelled time
    df['dt'] = df['frame'].diff() / fps
    df['dt'].iloc[0] = 0  # to avoid NaN returned by .diff()
    df['t'] = df['dt'].cumsum()
    df['t'] = (df.groupby('trajectory_id')['t']
                 .transform(subtract_nanmin))
    df['t_abs'] = df['frame'] / fps
    # compute velocity
    # why values[:, 0]? why duplicate column?
    df['v_direct'] = df['ds'].values / df['dt'].values
    df.loc[df['t'] == 0, 'v'] = np.NaN

    # compute normalized s
    df['s_normalized'] = (df.groupby('trajectory_id')['s']
                            .transform(divide_nanmax))

    # skip intermediate results
    cols = orig_columns + list(params)
    t2 = datetime.datetime.now()

    print((t2 - t1).microseconds / 1000)


    return df[cols]

Here is the profiler output:

     18480 function calls (18196 primitive calls) in 0.593 seconds

Ordered by: call count

  ncalls  tottime  percall  cumtime  percall filename:lineno(function)

       11    0.000    0.000    0.580    0.053 frame.py:3105(__setitem__)
       11    0.000    0.000    0.000    0.000 frame.py:3165(_ensure_valid_index)
       11    0.000    0.000    0.580    0.053 frame.py:3182(_set_item)
       11    0.000    0.000    0.000    0.000 frame.py:3324(_sanitize_column)
       11    0.000    0.000    0.003    0.000 generic.py:2599(_set_item)
       11    0.000    0.000    0.577    0.052 generic.py:2633(_check_setitem_copy)
       11    0.000    0.000    0.000    0.000 indexing.py:2321(convert_to_index_sliceable)

Following the comments, I ran a profiler and posted the function's profiling results above. The two helper functions are:

def subtract_nanmin(x):
    return x - np.nanmin(x)


def divide_nanmax(x):
    return x / np.nanmax(x)

1 Answer:

Answer 0 (score: 0)

One thing you can do is replace:

df.columns.tolist()

with

df.columns.values.tolist()

This is much faster per call. Here is an experiment with a random 100x100 DataFrame:

%timeit df.columns.values.tolist()

Output:

1.29 µs ± 19.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

And with the same df:

%timeit df.columns.tolist()

Output:

6.91 µs ± 241 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
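The comparison above can be reproduced outside IPython with the standard `timeit` module. This is only a sketch: the DataFrame contents and the repetition count are arbitrary choices, and the absolute numbers will differ by machine and pandas version. Note that both timings are in the microsecond range, while the profile in the question reports about 0.58 s under `__setitem__`, so this change alone is unlikely to explain the 800 ms.

```python
import timeit

import numpy as np
import pandas as pd

# Random 100x100 frame, as in the experiment above (the values are irrelevant).
df = pd.DataFrame(np.random.rand(100, 100))

# Time each variant over the same number of calls (10000 is an arbitrary choice).
n_calls = 10000
via_values = timeit.timeit(lambda: df.columns.values.tolist(), number=n_calls)
via_index = timeit.timeit(lambda: df.columns.tolist(), number=n_calls)

print(f"columns.values.tolist(): {via_values / n_calls * 1e6:.2f} µs per call")
print(f"columns.tolist():        {via_index / n_calls * 1e6:.2f} µs per call")
```

Both expressions return the same list of column labels, so the swap does not change the function's result.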

Update:

What are subtract_nanmin and divide_nanmax?
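For reference, the question defines these helpers as `x - np.nanmin(x)` and `x / np.nanmax(x)`. Passing a Python function to `transform` calls it once per group, so one option worth trying is the built-in `"min"` and `"max"` aggregations, which skip NaN by default. The sketch below uses made-up sample data and only checks that both approaches agree; whether it is measurably faster on the asker's data is an assumption that would need timing.

```python
import numpy as np
import pandas as pd


def subtract_nanmin(x):
    return x - np.nanmin(x)


def divide_nanmax(x):
    return x / np.nanmax(x)


# Hypothetical sample data using the column names from the question.
df = pd.DataFrame({
    "trajectory_id": [1, 1, 1, 2, 2],
    "s": [2.0, 5.0, 9.0, 10.0, 14.0],
})

grouped = df.groupby("trajectory_id")["s"]

# Original approach: one Python-level call per group.
s_custom = grouped.transform(subtract_nanmin)
s_norm_custom = grouped.transform(divide_nanmax)

# Alternative: broadcast the per-group min/max back to the rows, then do
# the arithmetic once on whole columns.
s_builtin = df["s"] - grouped.transform("min")
s_norm_builtin = df["s"] / grouped.transform("max")

print(s_builtin.tolist())
```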

Instead of

df['ds'].iloc[0] = 0  # to avoid NaN returned by .diff()
df['dt'].iloc[0] = 0  # to avoid NaN returned by .diff()

you can use df.fillna(0) or df['ds'].fillna(0) to get rid of the NaNs (assign the result back, since fillna returns a new object by default).
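A minimal sketch of that suggestion follows, using made-up sample data with the question's column names. The profile in the question attributes most of the time to `_check_setitem_copy` under `DataFrame.__setitem__`, so this sketch also builds the intermediate Series separately and attaches them with a single `assign` call. Whether that removes the overhead in the asker's setup is an assumption and would need re-profiling.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; only the column names come from the question.
df = pd.DataFrame({
    "trajectory_id": [1, 1, 1, 2, 2],
    "frame": [0, 1, 2, 10, 12],
    "x_world": [0.0, 3.0, 3.0, 0.0, 0.0],
    "y_world": [0.0, 4.0, 4.0, 0.0, 2.0],
})
fps = 25

# Work on standalone Series rather than writing each step into df.
dx = df["x_world"].diff()
dy = df["y_world"].diff()

# fillna(0) replaces the leading NaN produced by .diff(). Unlike
# .iloc[0] = 0, it would also replace any other NaN in the column.
ds = np.sqrt(dx ** 2 + dy ** 2).fillna(0)
dt = (df["frame"].diff() / fps).fillna(0)

# Attach the results in one step; assign returns a new DataFrame.
df = df.assign(ds=ds, dt=dt)
print(df[["ds", "dt"]])
```

If the input can contain NaN in places other than the first row, `fillna(0)` changes more values than the original `.iloc[0] = 0` did, so check which behavior is intended.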