Question

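I want to convert my dataframe into an array of fixed-size chunks taken from each unique segment. Specifically, I want to convert df into a list of m arrays, each of shape (1, 100, 4), so that in the end I have a single (m, 1, 100, 4) array.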

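Since the chunks must have the fixed shape (1, 100, 4), and a segment's length is unlikely to be an exact multiple of 100, the last chunk of a segment usually has fewer than 100 rows and should therefore be zero-padded.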

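To do this, I start by creating an array of this shape filled with zeros, and then gradually fill in the values from the df rows. This way, whatever is left empty at the end of a segment stays zero-padded.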
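To make the padding idea concrete, here is a minimal sketch on a single segment. The 250-row `segment` array and the variable names are made up for illustration; they are not part of my real data.

```python
import numpy as np

chunk_size = 100
# Hypothetical segment with 250 rows and 4 feature columns.
segment = np.random.randn(250, 4)

# Ceiling division: 250 rows need 3 chunks of 100 rows.
n_chunks = -(-len(segment) // chunk_size)

# Allocate the zero-filled buffer once, then copy the rows in.
padded = np.zeros((n_chunks * chunk_size, 4))
padded[:len(segment)] = segment

# Reshape into (n_chunks, 1, chunk_size, 4); the tail of the last chunk stays zero.
chunks = padded.reshape(n_chunks, 1, chunk_size, 4)
print(chunks.shape)
```

The last chunk holds the remaining 50 rows followed by 50 rows of zeros.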

For this, I use the following function:

def transform(dataframe, chunk_size):
    
    grouped = dataframe.groupby('id')

    # initialize accumulators
    X, y = np.zeros([0, 1, chunk_size, 4]), np.zeros([0,])

    # loop over each group (df[df.id==1] and df[df.id==2])
    for _, group in grouped:

        inputs = group.loc[:, 'A':'D'].values
        label = group.loc[:, 'class'].values[0]

        # calculate number of splits
        N = (len(inputs)-1) // chunk_size

        if N > 0:
            inputs = np.array_split(
                 inputs, [chunk_size + (chunk_size*i) for i in range(N)])
        else:
            inputs = [inputs]

        # loop over splits
        for inpt in inputs:
            inpt = np.pad(
                inpt, [(0, chunk_size-len(inpt)),(0, 0)], 
                mode='constant')
            # add each inputs split to accumulators
            X = np.concatenate([X, inpt[np.newaxis, np.newaxis]], axis=0)
            y = np.concatenate([y, label[np.newaxis]], axis=0) 

    return X, y

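This function does produce the expected ndarray. However, it is very slow. My df has more than 21 million rows, so the function takes over 5 hours to finish, which is crazy!

I am looking for a way to refactor this function so that it runs faster.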

Steps to reproduce the problem:

Generate a large random df:

import pandas as pd
import numpy as np
import time

df = pd.DataFrame(np.random.randn(3_000_000,4), columns=list('ABCD'))
df['class'] = np.random.randint(0, 5, df.shape[0])
df.shape
(3000000, 5)

df['id'] = df.index // 650 +1
df.head()

          A         B         C         D  class  id
0 -0.696659 -0.724940  0.494385  1.469749      2   1
1 -0.440400  0.744680 -0.684663 -1.962713      4   1
2 -1.207888 -1.003556 -0.926677 -1.455632      3   1
3  1.575943 -0.453352 -0.106494  0.351674      3   1
4  0.888164  0.675754  0.254067 -0.454150      3   1

Convert df into the required ndarray, one set of chunks per unique segment:

start = time.time()
X, y = transform(df, 100)
end = time.time()

print(f"Execution time: {(end - start) / 60}")
Execution time: 6.169370893637339

For this 3M-row df, the function takes more than 6 minutes to finish. In my case (> 21M rows), it takes hours!

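How should the function be written to make it faster? Maybe the idea of building up accumulators is completely wrong.

How can I optimize this function for better performance?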
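One direction I can imagine, as a rough sketch only: every call to np.concatenate copies the whole accumulator, so allocating the output once and filling it by index might avoid most of that work. The function name `transform_prealloc` and the `feature_cols` parameter are hypothetical, and the sketch assumes every row of a segment shares the same 'id' and that the label is taken from the segment's first row, as in my function above. I have not timed it on the full data.

```python
import numpy as np
import pandas as pd

def transform_prealloc(dataframe, chunk_size, feature_cols=('A', 'B', 'C', 'D')):
    # Stable sort by id so each segment's rows are contiguous and keep their order.
    ordered = dataframe.sort_values('id', kind='stable')
    values = ordered[list(feature_cols)].to_numpy()
    labels = ordered['class'].to_numpy()
    ids = ordered['id'].to_numpy()

    # Start position and row count of every segment.
    _, starts, counts = np.unique(ids, return_index=True, return_counts=True)
    # Chunks per segment (ceiling division) and index of each segment's first chunk.
    n_chunks = -(-counts // chunk_size)
    chunk_offsets = np.concatenate([[0], np.cumsum(n_chunks)[:-1]])

    # Position of each row inside its segment, then its target chunk and row.
    pos = np.arange(len(ids)) - np.repeat(starts, counts)
    chunk_idx = np.repeat(chunk_offsets, counts) + pos // chunk_size
    row_idx = pos % chunk_size

    # Allocate the zero-filled output once; untouched rows remain as padding.
    X = np.zeros((n_chunks.sum(), 1, chunk_size, len(feature_cols)))
    X[chunk_idx, 0, row_idx] = values
    # One label per chunk, taken from the first row of the segment.
    y = np.repeat(labels[starts], n_chunks).astype(float)
    return X, y

# Small demo: segments of 250, 100 and 30 rows give 3 + 1 + 1 chunks.
demo = pd.DataFrame(np.random.randn(380, 4), columns=list('ABCD'))
demo['class'] = np.random.randint(0, 5, len(demo))
demo['id'] = np.repeat([1, 2, 3], [250, 100, 30])
X_demo, y_demo = transform_prealloc(demo, 100)
print(X_demo.shape, y_demo.shape)
```

On the demo frame this gives 5 chunks in total, with the padded tail of each segment's last chunk left at zero.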

0 answers: