I have the following function, where df is a pandas DataFrame with 159538 rows x 3 columns:
dfs = []
for i in df['email_address']:
    data = df[df['email_address'] == i]
    data['difference'] = data['ts_placed'].diff().astype('timedelta64[D]')
    repeat = []
    for a in data['difference']:
        if a > 10:
            repeat.append(0)
        elif a <= 10:
            repeat.append(1)
        else:
            repeat.append(0)
    data['repeat'] = repeat
    dfs.append(data)
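This function runs extremely slowly. I would like to speed it up using multiprocessing. This SO question shows how to do this in R. What is the equivalent code in Python?
Here is a sample of the data after running: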
df['difference'] = df.groupby('email_address')['ts_placed'].diff()
df
Out[6]:
email_address ts_placed difference
0 aaaaaaaaaaaaa@sky.com 2015-08-06 00:00:34 NaT
1 dfdfdfdfdfd@babcock.co.uk 2015-08-06 00:05:38 NaT
2 littlemifddreen85@hotmail.co.uk 2015-08-06 00:09:20 NaT
3 smifdfddfms@aol.com 2015-08-06 00:10:01 NaT
4 terry.wfdfdfdfdfy-holdings.co.uk 2015-08-06 00:14:00 NaT
5 r.dfdfdfdfd16@hotmail.com 2015-08-06 00:14:00 NaT
6 kdfdfdf979@outlook.com 2015-08-06 00:14:00 NaT
7 dd@ggggggggggg.eclipse.co.uk 2015-08-06 00:14:20 NaT
8 gggz45@hotmail.co.uk 2015-08-06 00:14:43 NaT
9 gggggggggi@hotmail.co.uk 2015-08-06 00:17:03 NaT
10 mggggggggyke1@hotmail.com 2015-08-06 00:17:58 NaT
...
22 ffdddfddd@yahoo.com 2015-08-06 00:46:12 0 days 00:04:15
Answer 0 (score: 1)
IIUC, you can do the following:
df['difference'] = df.groupby('email_address')['ts_placed'].diff()
# 1 if the gap to the same customer's previous order is at most 10 whole days, else 0 (NaT, i.e. a first order, gives 0)
df['repeat'] = (df['difference'].dt.days <= 10).astype(int)
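For anyone who wants to check the vectorized approach on a small case, here is a minimal sketch. The toy data and the helper name add_repeat_flag are made up for illustration; only the column names come from the question. It assumes the rows are already in chronological order within each email address, and that "within 10 days" means at most 10 whole days, which is how the question's loop behaves after truncating the gap to days.

```python
import pandas as pd

# Hypothetical toy data; only the column names come from the question.
df = pd.DataFrame({
    "email_address": ["a@x.com", "b@x.com", "a@x.com", "b@x.com", "a@x.com"],
    "ts_placed": pd.to_datetime([
        "2015-08-01 00:00:00",
        "2015-08-01 00:05:00",
        "2015-08-04 12:00:00",
        "2015-08-20 00:05:00",
        "2015-08-30 00:00:00",
    ]),
})


def add_repeat_flag(frame, max_days=10):
    """Return a copy with 'difference' and 'repeat' columns added.

    Assumes rows are already ordered by ts_placed within each customer.
    """
    out = frame.copy()
    # Gap to the previous order of the same customer; NaT for a first order.
    out["difference"] = out.groupby("email_address")["ts_placed"].diff()
    # .dt.days truncates to whole days and is NaN for NaT, so a first
    # order compares as False and ends up with repeat == 0.
    out["repeat"] = (out["difference"].dt.days <= max_days).astype(int)
    return out


result = add_repeat_flag(df)
print(result)
```

Because the work is done by a single groupby instead of one filtered copy per row, this may well make multiprocessing unnecessary for a frame of this size, though that is worth timing on the real data.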