Making an array operation efficient

Date: 2020-08-13 12:48:10

Tags: python numpy multidimensional-array

I have a custom function that converts a DataFrame into a NumPy ndarray. The DataFrame has over 21 million rows, and the target array has shape (m, 1, 100, 4).

This operation takes a very long time to finish. At this point the function has been running for nearly 2 hours. I am on Ubuntu 18.04 with 16 GB of RAM (and I rarely shut down/restart the system within 2-3 days).

Is there a way to speed this operation up? Would multiprocessing improve performance? I am working in a Jupyter notebook.

Edit

df.head()
+---+-----------+-------+--------------+-----------+----------------+------------+
|   |    id     | speed | acceleration |   jerk    | bearing_change | travelmode |
+---+-----------+-------+--------------+-----------+----------------+------------+
| 0 | 533815001 | 17.63 | 0.000000     | -0.000714 | 209.028008     |          3 |
| 1 | 533815001 | 17.63 | -0.092872    | 0.007090  | 56.116237      |          3 |
| 2 | 533815001 | 0.17  | 1.240000     | -2.040000 | 108.494680     |          3 |
| 3 | 533815001 | 1.41  | -0.800000    | 0.510000  | 11.847480      |          3 |
| 4 | 533815001 | 0.61  | -0.290000    | 0.150000  | 36.7455703     |          3 |
+---+-----------+-------+--------------+-----------+----------------+------------+

df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21713545 entries, 0 to 21713544
Data columns (total 6 columns):
 #   Column          Dtype  
---  ------          -----  
 0   id              int64  
 1   speed           float64
 2   acceleration    float64
 3   jerk            float64
 4   bearing_change  float64
 5   travelmode      int64  
dtypes: float64(4), int64(2)
memory usage: 994.0 MB

Here is my custom function that converts the df into a NumPy ndarray:

import numpy as np

def transform(dataframe, chunk_size=5):

    grouped = dataframe.groupby('id')

    # initialize accumulators
    X, y = np.zeros([0, 1, chunk_size, 4]), np.zeros([0,])

    # loop over each id group
    for _, group in grouped:

        inputs = group.loc[:, 'speed':'bearing_change'].values
        label = group.loc[:, 'travelmode'].values[0]

        # calculate number of splits
        N = (len(inputs)-1) // chunk_size

        if N > 0:
            inputs = np.array_split(
                 inputs, [chunk_size + (chunk_size*i) for i in range(N)])
        else:
            inputs = [inputs]

        # loop over splits
        for inpt in inputs:
            inpt = np.pad(
                inpt, [(0, chunk_size-len(inpt)),(0, 0)], 
                mode='constant')
            # add each split to the accumulators (note: this reallocates
            # X and y on every iteration)
            X = np.concatenate([X, inpt[np.newaxis, np.newaxis]], axis=0)
            y = np.concatenate([y, label[np.newaxis]], axis=0) 

    return X, y

Called like this, it takes nearly 2 hours, so I have to cancel the operation:

Input, Label = transform(df, 100)
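For reference, the slowdown is almost certainly the repeated `np.concatenate` inside the loop: every append copies the entire accumulator, which makes the function quadratic in the number of chunks. A minimal sketch of a fix (the name `transform_fast` is hypothetical, not from the original post): zero-pad each group to a multiple of `chunk_size`, reshape it into chunks in one step, collect the pieces in Python lists, and concatenate once at the end.

```python
import numpy as np

def transform_fast(dataframe, chunk_size=5):
    """Hypothetical rewrite of transform(): produces the same output,
    but pads and reshapes each group in one step and concatenates only
    once, instead of growing X and y inside the loop."""
    X_parts, y_parts = [], []

    for _, group in dataframe.groupby('id'):
        inputs = group.loc[:, 'speed':'bearing_change'].to_numpy()
        label = group['travelmode'].iloc[0]

        # number of chunks = ceil(len / chunk_size), which matches the
        # N+1 splits produced by np.array_split in the original
        n_chunks = -(-len(inputs) // chunk_size)

        # zero-pad the whole group once instead of padding chunk by chunk
        padded = np.zeros((n_chunks * chunk_size, inputs.shape[1]))
        padded[:len(inputs)] = inputs

        # one reshape replaces the split/pad/concatenate inner loop
        X_parts.append(
            padded.reshape(n_chunks, 1, chunk_size, inputs.shape[1]))
        y_parts.append(np.full(n_chunks, label, dtype=float))

    return np.concatenate(X_parts), np.concatenate(y_parts)
```

This keeps the per-id loop but turns each append into an O(1) list operation, so the total work is linear in the number of rows; on 21 million rows it should take minutes rather than hours.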
