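My DataFrame has about 12 million rows, and I need to speed up a for loop over it, but I don't know how.
I am using Python / Pandas. The code works correctly, but it is very slow.
For each timestamp, I need the columns num_pr and num_pu to be computed as the sums of the pr and pu columns of TempTable, which is updated according to the following conditions.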
import numpy as np
import pandas as pd

# TempTable holds one pr flag and one pu flag per account.
TempTable = pd.DataFrame({'account': np.arange(1, 2), 'pr': 0, 'pu': 0})
TempTable = TempTable.set_index('account')
df['num_pr'] = 0
df['num_pu'] = 0
for row in range(0, 10000):
    # SA or I sets the flag that matches the row's status; SO clears it.
    if (df.action[row] == 'SA' and df.status[row] == 'PR') or (df.action[row] == 'I' and df.status[row] == 'PR'):
        TempTable.loc[df.account[row], 'pr'] = 1
    elif (df.action[row] == 'SA' and df.status[row] == 'PU') or (df.action[row] == 'I' and df.status[row] == 'PU'):
        TempTable.loc[df.account[row], 'pu'] = 1
    elif (df.action[row] == 'SO' and df.status[row] == 'PR'):
        TempTable.loc[df.account[row], 'pr'] = 0
    elif (df.action[row] == 'SO' and df.status[row] == 'PU'):
        TempTable.loc[df.account[row], 'pu'] = 0
    # Record how many accounts currently have each flag set.
    df.loc[row, 'num_pr'] = TempTable.loc[:, 'pr'].sum()
    df.loc[row, 'num_pu'] = TempTable.loc[:, 'pu'].sum()
    account action                timestamp status  num_pr  num_pu
0   1111111     SA  2018-06-28 02:00:01.024     PU       0       1
1   2222222      I  2018-06-28 02:00:02.032     PU       0       2
2   1111111      I  2018-06-28 02:00:03.382     PU       0       2
3   3333333     SO  2018-06-28 02:00:04.395     PR       0       2
4   1111111      I  2018-06-28 02:00:05.401     PU       0       2
5   1111111      I  2018-06-28 02:00:05.407     PU       0       2
6   2222222      I  2018-06-28 02:00:06.409     PU       0       2
7   3333333     SA  2018-06-28 02:00:06.413     PR       1       2
8   1111111     SO  2018-06-28 02:00:07.414     PU       1       1
9   3333333     SO  2018-06-28 02:00:07.467     PR       0       1
10  1111111     SA  2018-06-28 02:00:08.414     PR       1       1
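
For reference, the computation described above could probably be done without a row-by-row loop. The following is only a minimal sketch, not a tested solution for the full 12 million rows. It assumes that the rows are already in timestamp order, that every flag starts at 0, and that actions other than SA, I, and SO leave the flags unchanged. The names running_flag_counts and sample are hypothetical and only used for illustration. The idea is to compute, per account and status, how much each row changes the flag (+1, 0, or -1) and then take a cumulative sum of those changes.

```python
import pandas as pd


def running_flag_counts(df):
    """Return a copy of df with num_pr / num_pu computed without a Python loop.

    Assumes df is sorted by timestamp and has 'account', 'action', 'status'.
    """
    # Value a row writes to its flag: 1 for SA/I, 0 for SO, NaN for anything else.
    new_state = df["action"].map({"SA": 1.0, "I": 1.0, "SO": 0.0})
    out = df.copy()
    for status, col in (("PR", "num_pr"), ("PU", "num_pu")):
        # Rows that actually touch this flag.
        mask = df["status"].eq(status) & new_state.notna()
        state = new_state[mask]
        # Previous value of the same account's flag (0 before its first event).
        prev = state.groupby(df.loc[mask, "account"]).shift(fill_value=0.0)
        # Change contributed by each row; rows that do not touch the flag add 0.
        delta = (state - prev).reindex(df.index, fill_value=0.0)
        # Running number of accounts whose flag is currently set.
        out[col] = delta.cumsum().astype(int)
    return out


# The sample rows from the table above (without the num_pr / num_pu columns).
sample = pd.DataFrame({
    "account": [1111111, 2222222, 1111111, 3333333, 1111111, 1111111,
                2222222, 3333333, 1111111, 3333333, 1111111],
    "action": ["SA", "I", "I", "SO", "I", "I", "I", "SA", "SO", "SO", "SA"],
    "timestamp": pd.to_datetime([
        "2018-06-28 02:00:01.024", "2018-06-28 02:00:02.032",
        "2018-06-28 02:00:03.382", "2018-06-28 02:00:04.395",
        "2018-06-28 02:00:05.401", "2018-06-28 02:00:05.407",
        "2018-06-28 02:00:06.409", "2018-06-28 02:00:06.413",
        "2018-06-28 02:00:07.414", "2018-06-28 02:00:07.467",
        "2018-06-28 02:00:08.414",
    ]),
    "status": ["PU", "PU", "PU", "PR", "PU", "PU", "PU", "PR", "PU", "PR", "PR"],
})

result = running_flag_counts(sample)
print(result[["account", "action", "status", "num_pr", "num_pu"]])
```

On the sample rows this should give the same num_pr and num_pu values as the table above. Because it relies on a grouped shift and a cumulative sum rather than per-row .loc assignments, it would be expected to scale much better, though that has not been measured here.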