Question

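I have a DataFrame of shape (R x C) 1.5M x 128. I do the following:

1. I do a groupby() on 6 columns. This creates ~8700 subgroups, each of shape 538 x 122.
2. On each subgroup, I run apply(). This function computes the % frequency of each categorical value PER column (i.e., for each of the 122 columns).

So my (pseudo) code is: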

df = <read DataFrame from file>
g = df.groupby(grp_cols)
g[nongrp_cols].apply(lambda d: d.apply(lambda s: s.value_counts()) / len(d.index))

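The code works fine for me, so now I am profiling it to improve performance. The apply() call takes about 20-25 minutes to run. I think the problem is that it iterates over every column (122 of them) for each of the 8700 subgroups, which may not be the best approach (given the way I've coded it).

Can anyone recommend ways to speed this up?

I tried using a Python multiprocessing pool (8 processes) to divide the subgroups into equal sets for processing, but ended up getting some pickling errors...

Thanks.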

Answer 1

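pd.DataFrame.groupby.apply does give us a lot of flexibility (unlike agg / filter / transform, it lets you reshape each subgroup into any shape; in your case, from 538 x 122 to N_categories x 122). But it comes at a cost: your flexible function is applied to one group at a time, with no vectorization.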
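To make the reshaping concrete, here is a minimal sketch on a toy frame. The column names (key, c1, c2) and the helper name value_frequencies are made up for illustration; the function body is the same expression used in the question.

```python
import pandas as pd

# Hypothetical toy data: one grouping column and two categorical columns.
df = pd.DataFrame({
    "key": ["a", "a", "a", "a", "b", "b"],
    "c1":  ["x", "x", "y", "y", "x", "x"],
    "c2":  ["p", "q", "q", "q", "p", "p"],
})

def value_frequencies(g):
    # For each column, count every category, then divide by the group's row count.
    return g.apply(lambda s: s.value_counts()) / len(g.index)

# Each group goes in as (n_rows x 2) and comes out as (n_categories x 2).
result = df.groupby("key")[["c1", "c2"]].apply(value_frequencies)
print(result)
```

The result is indexed by (group key, category). A category that never occurs in a given column shows up as NaN in that column.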

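I still think the way to solve this is multiprocessing. The pickle error you ran into most likely comes from defining some functions inside your multi_processing_function. The rule is that every function sent to the worker processes must be defined at the top level of the module. The code below puts this into practice.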
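As a quick illustration of that rule, the sketch below asks pickle directly whether it can serialize a function. The names top_level, make_nested and can_pickle are hypothetical helpers, not part of the original code. multiprocessing.Pool pickles the function it sends to the workers, so the same restriction applies there.

```python
import pickle

def top_level(x):
    # Defined at module level, so pickle can refer to it by its qualified name.
    return x + 1

def make_nested():
    def nested(x):
        # Defined inside another function, so pickle cannot look it up by name.
        return x + 1
    return nested

def can_pickle(obj):
    # Return True if pickle.dumps succeeds, False if it refuses the object.
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False

print("top-level function:", can_pickle(top_level))
print("nested function:   ", can_pickle(make_nested()))
print("lambda:            ", can_pickle(lambda x: x + 1))
```

When this is run as a script, the top-level function should be picklable, while the nested function and the lambda are rejected.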

import pandas as pd
import numpy as np

# simulate your data with ints 0-9 as the categorical values
df = pd.DataFrame(np.random.choice(np.arange(10), size=(538, 122)))
# simulate your groupby operation; no need to go as far as 8700 sub-groups, 800 groups are enough for illustration
sim_keys = ['ROW' + str(x) for x in np.arange(800)]
big_data = pd.concat([df] * 800, axis=0, keys=sim_keys)

big_data.shape
Out[337]: (430400, 122)

# Without multiprocessing
# ===================================================
by_keys = big_data.groupby(level=0)

# peek at one sub-group to confirm its shape (538 x 122)
sample_group = list(by_keys)[0][1]
sample_group.shape

def your_func(g):
    # per column: count of each categorical value, divided by the number of rows in the group
    return g.apply(lambda s: s.value_counts()) / len(g.index)

def test_no_multiprocessing(gb, apply_func):
    return gb.apply(apply_func)

%time result_no_multiprocessing = test_no_multiprocessing(by_keys, your_func)

CPU times: user 1min 26s, sys: 4.03 s, total: 1min 30s
Wall time: 1min 27s

This is slow. Let's use the multiprocessing module:

# multiprocessing for pandas dataframe apply
# ===================================================
# to avoid the pickle error, functions must be defined at the TOP level; if we move
# 'process' inside 'test_with_multiprocessing', it raises a pickle error
def process(df):
    return df.groupby(level=0).apply(your_func)

def test_with_multiprocessing(big_data, apply_func):
    # note: apply_func is not used here; process() calls your_func directly

    import multiprocessing as mp

    p = mp.Pool(processes=8)
    # split the rows into 8 chunks; here each chunk holds exactly 100 whole groups,
    # in general make sure that no group is cut across two chunks
    split_dfs = np.array_split(big_data, 8, axis=0)
    # run the groupby/apply on each chunk in a separate worker process
    df_pool_results = p.map(process, split_dfs)

    # no more tasks to submit; wait for the workers to exit
    p.close()
    p.join()

    # combine the per-chunk results
    result = pd.concat(df_pool_results, axis=0)

    return result


%time result_with_multiprocessing = test_with_multiprocessing(big_data, your_func)

CPU times: user 984 ms, sys: 3.46 s, total: 4.44 s
Wall time: 22.3 s

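Now it is much faster: wall time drops from 1min 27s to 22.3 s, roughly a 4x speedup. Note that the CPU times reported by %time cover only the parent process, so the work done in the worker processes does not show up there. Splitting the data and recombining the results adds some overhead, but on an 8-core processor you can expect this to run about 4 to 6 times faster than the non-multiprocessing version.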
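One caveat about the splitting step: cutting the frame by row count only gives correct results when no group straddles a chunk boundary, which holds in the simulation because every group has exactly 538 rows and each chunk gets 100 groups. With uneven groups, a safer option is to split the group keys rather than the rows. A minimal sketch, assuming the grouping key is index level 0 as in the simulated data (split_whole_groups is a hypothetical helper name):

```python
import numpy as np
import pandas as pd

def split_whole_groups(df, n_chunks):
    # Split the unique level-0 keys into n_chunks sets, then select the rows
    # belonging to each set, so every group lands in exactly one chunk.
    level0 = df.index.get_level_values(0)
    key_chunks = np.array_split(level0.unique().to_numpy(), n_chunks)
    return [df[level0.isin(keys)] for keys in key_chunks if len(keys) > 0]

# Toy frame with 5 groups of unequal size (illustrative data only).
sizes = {"g0": 3, "g1": 1, "g2": 4, "g3": 2, "g4": 5}
outer = [key for key, n in sizes.items() for _ in range(n)]
inner = list(range(len(outer)))
toy = pd.DataFrame(
    {"v": np.arange(len(outer))},
    index=pd.MultiIndex.from_arrays([outer, inner]),
)

chunks = split_whole_groups(toy, 2)
for chunk in chunks:
    print(sorted(set(chunk.index.get_level_values(0))), len(chunk))
```

The chunks can then be passed to p.map(process, chunks) in place of split_dfs. The trade-off is that chunks may differ in row count when group sizes vary a lot.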

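Finally, check that the two results are identical.

# pandas.util.testing is deprecated; the public module is pandas.testing
import pandas.testing as pdt

pdt.assert_frame_equal(result_no_multiprocessing, result_with_multiprocessing)

The test passes.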

pandas: optimizing my code (groupby() / apply())
