Speeding up pandas aggregation

Time: 2014-05-08 15:15:57

Tags: python performance optimization pandas

I am trying to count the number of duplicate rows in a pandas DataFrame. I read the data in from a csv file that looks like this:

feature, IV, IT
early/J_result/N, True, False
early/J_result/N, True, False
early/J_result/N, True, False
excellent/J_result/N, True, True
hillsdown/N, True, False
hillsdown/N, True, False

The desired output for the sample input above is:

feature, IV, IT, count
early/J_result/N, True, False, 3
excellent/J_result/N, True, True, 1
hillsdown/N, True, False, 2

My current code is:

import pandas as pd
def sum_up_token_counts(hdf_file):
    df = pd.read_csv(hdf_file, sep=', ')
    counts = df.groupby('feature').count().feature
    assert counts.sum() == df.shape[0]  # no missing rows
    df = df.drop_duplicates()
    df.set_index('feature', inplace=True)
    df['count'] = counts
    return df

This works as expected, but it takes a very long time. I profiled it, and it looks like almost all of the time is spent in the groupby/count line.

Total time: 4.43439 s

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    28                                           def sum_up_token_counts(hdf_file):
    29         1        57567  57567.0      1.3      df = pd.read_csv(hdf_file, sep=', ')
    30         1      4368529 4368529.0     98.5      counts = df.groupby('feature').count().feature
    31         1          174    174.0      0.0      assert counts.sum() == df.shape[0]  # no missing rows
    32         1         6234   6234.0      0.1      df = df.drop_duplicates()
    33         1          501    501.0      0.0      df.set_index('feature', inplace=True)
    34         1         1377   1377.0      0.0      df['count'] = counts
    35         1            1      1.0      0.0      return df

Any ideas how to speed this code up?

1 Answer:

Answer 0 (score: 2):

With master/0.14 (coming soon), counts are sped up considerably; see here.

Here is a benchmark of master/0.14 vs 0.13.1:

Setup

In [1]: n = 10000

In [2]: offsets = np.random.randint(n, size=n).astype('timedelta64[ns]')

In [3]: dates = np.datetime64('now') + offsets

In [4]: dates[np.random.rand(n) > 0.5] = np.datetime64('nat')

In [5]: offsets[np.random.rand(n) > 0.5] = np.timedelta64('nat')

In [6]: value2 = np.random.randn(n)

In [7]: value2[np.random.rand(n) > 0.5] = np.nan

In [8]: obj = pd.util.testing.choice(['a', 'b'], size=n).astype(object)

In [9]: obj[np.random.randn(n) > 0.5] = np.nan

In [10]: df = pd.DataFrame({'key1': np.random.randint(0, 500, size=n),
   ....:                 'key2': np.random.randint(0, 100, size=n),
   ....:                 'dates': dates,
   ....:                 'value2' : value2,
   ....:                 'value3' : np.random.randn(n),
   ....:                 'obj': obj,
   ....:                 'offsets': offsets})

v0.13.1

In [11]: %timeit df.groupby(['key1', 'key2']).count()
1 loops, best of 3: 5.41 s per loop

v0.14.0

In [11]: %timeit df.groupby(['key1', 'key2']).count()
100 loops, best of 3: 6.25 ms per loop
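
If upgrading to 0.14 is not an option right away, a rough alternative (a sketch, not benchmarked here) is to replace count() with size() in the original function. size() simply returns the number of rows in each group, sidestepping the per-column missing-value bookkeeping that makes count() slow on pre-0.14 versions:

import pandas as pd

def sum_up_token_counts(hdf_file):
    df = pd.read_csv(hdf_file, sep=', ')
    # size() counts rows per group directly, avoiding the slow
    # count() path on pandas versions before 0.14
    counts = df.groupby('feature').size()
    assert counts.sum() == df.shape[0]  # no missing rows
    df = df.drop_duplicates()
    df.set_index('feature', inplace=True)
    df['count'] = counts  # aligns on the 'feature' index
    return df

Since the sample data has no missing values, size() and count() should produce the same totals here.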