I have a custom function that converts a DataFrame into a NumPy ndarray. The DataFrame has over 21 million rows, and the target array has shape (m, 1, 100, 4).
This operation takes a very long time to complete — at the moment the function has been running for nearly 2 hours. I'm on Ubuntu 18.04 with 16 GB of RAM (and I almost never shut down/restart the system within 2-3 days).
Is there a way to speed this operation up? Would multiprocessing improve performance? I'm working in a Jupyter notebook.
Edit:
df.head()
+---+-----------+-------+--------------+-----------+----------------+------------+
| | id | speed | acceleration | jerk | bearing_change | travelmode |
+---+-----------+-------+--------------+-----------+----------------+------------+
| 0 | 533815001 | 17.63 | 0.000000 | -0.000714 | 209.028008 | 3 |
| 1 | 533815001 | 17.63 | -0.092872 | 0.007090 | 56.116237 | 3 |
| 2 | 533815001 | 0.17 | 1.240000 | -2.040000 | 108.494680 | 3 |
| 3 | 533815001 | 1.41 | -0.800000 | 0.510000 | 11.847480 | 3 |
| 4 | 533815001 | 0.61 | -0.290000 | 0.150000 | 36.7455703 | 3 |
+---+-----------+-------+--------------+-----------+----------------+------------+
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 21713545 entries, 0 to 21713544
Data columns (total 6 columns):
# Column Dtype
--- ------ -----
0 id int64
1 speed float64
2 acceleration float64
3 jerk float64
4 bearing_change float64
5 travelmode int64
dtypes: float64(4), int64(2)
memory usage: 994.0 MB
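(Aside on the memory footprint above: since all four feature columns are float64, downcasting them to float32 would roughly halve the ~1 GB frame before conversion, if the precision loss is acceptable — a minimal sketch on a toy frame with the same column layout:)

```python
import numpy as np
import pandas as pd

# toy frame with the same column layout as df.head() above
df = pd.DataFrame({
    'id': [533815001, 533815001],
    'speed': [17.63, 17.63],
    'acceleration': [0.0, -0.092872],
    'jerk': [-0.000714, 0.007090],
    'bearing_change': [209.028008, 56.116237],
    'travelmode': [3, 3],
})

# downcast only the float64 feature columns; the int columns stay untouched
float_cols = df.select_dtypes(include='float64').columns
df[float_cols] = df[float_cols].astype(np.float32)
```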
Here is my custom function that converts df into a NumPy ndarray:
import numpy as np

def transform(dataframe, chunk_size=5):
    grouped = dataframe.groupby('id')

    # initialize accumulators
    X, y = np.zeros([0, 1, chunk_size, 4]), np.zeros([0,])

    # loop over each id group
    for _, group in grouped:
        inputs = group.loc[:, 'speed':'bearing_change'].values
        label = group.loc[:, 'travelmode'].values[0]
        # calculate number of splits
        N = (len(inputs) - 1) // chunk_size
        if N > 0:
            inputs = np.array_split(
                inputs, [chunk_size + (chunk_size * i) for i in range(N)])
        else:
            inputs = [inputs]
        # loop over splits
        for inpt in inputs:
            # zero-pad the last (shorter) chunk up to chunk_size rows
            inpt = np.pad(
                inpt, [(0, chunk_size - len(inpt)), (0, 0)],
                mode='constant')
            # append each padded chunk to the accumulators
            X = np.concatenate([X, inpt[np.newaxis, np.newaxis]], axis=0)
            y = np.concatenate([y, label[np.newaxis]], axis=0)
    return X, y
This call takes nearly 2 hours, so I end up having to cancel it:
Input, Label = transform(df, 100)
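For what it's worth, most of the time here likely goes into the repeated np.concatenate calls, which copy the full accumulator on every chunk (quadratic cost in the number of chunks). A sketch of the same transform with preallocated outputs — the name transform_prealloc is mine, and it assumes every id group is non-empty, which groupby guarantees:

```python
import numpy as np
import pandas as pd

def transform_prealloc(dataframe, chunk_size=5):
    """Same output as transform(), but writes into preallocated arrays.

    np.concatenate inside the loop reallocates and copies X on every
    chunk; counting the chunks up front and filling a preallocated
    array makes each step O(chunk_size) instead.
    """
    grouped = dataframe.groupby('id')
    # ceil(rows / chunk_size) chunks per group, summed over all groups
    n_chunks = int(np.ceil(grouped.size() / chunk_size).sum())

    X = np.zeros([n_chunks, 1, chunk_size, 4])
    y = np.zeros([n_chunks])

    i = 0
    for _, group in grouped:
        inputs = group.loc[:, 'speed':'bearing_change'].values
        label = group['travelmode'].iloc[0]
        for start in range(0, len(inputs), chunk_size):
            chunk = inputs[start:start + chunk_size]
            # a short final chunk stays zero-padded (X is zero-initialized)
            X[i, 0, :len(chunk)] = chunk
            y[i] = label
            i += 1
    return X, y
```

Multiprocessing could be layered on top (e.g. one worker per batch of ids), but removing the quadratic copying is usually the bigger win.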