I am doing this operation on a large Pandas DataFrame and, unsurprisingly, it is very slow.
from datetime import timedelta

import numpy as np
import pandas as pd

def get_last_status_in_range(df, created_dt, created_id, window_size=15, gap_size=5):
    since = created_dt - timedelta(days=(window_size + gap_size))
    until = created_dt - timedelta(days=gap_size)
    try:
        status = df[(df.created_dt >= since) & (df.created_dt < until) &
                    (df.number_id == created_id)]['status'].iloc[-1]
    except IndexError:
        # Not found
        status = None
    return status
idx = 0
last_status_in_range = np.array([None] * len(df), dtype=str)
for row in df.itertuples():
    created_dt = row.created_dt
    created_id = row.number_id
    last_status_in_range[idx] = get_last_status_in_range(df, created_dt, created_id)
    idx += 1
My goal, given a DF with the columns 'created_dt', 'number_id' and 'status', is to get for each row the last 'status' of the same 'number_id', restricted to a specified date range in the past.
So far, the only way I have found is the one shown above, but it is very slow on a large DataFrame and I cannot find a vectorized way to do it.
How can I vectorize an operation that uses other values from the same DataFrame?
Edit:
Given the following DF:
In [120]: df
Out[120]:
number_id created_dt status
20 BBB 2018-05-18 20:28:51.388001 u
12 BBB 2018-05-19 12:28:51.388001 u
2 CCC 2018-05-19 23:28:51.388001 u
27 CCC 2018-05-20 22:28:51.388001 a
1 CCC 2018-05-21 05:28:51.388001 u
14 BBB 2018-05-21 12:28:51.388001 r
17 AAA 2018-05-24 21:28:51.388001 a
28 CCC 2018-05-30 16:28:51.388001 a
0 AAA 2018-05-31 23:28:51.388001 r
24 CCC 2018-06-01 00:28:51.388001 r
4 BBB 2018-06-01 11:28:51.388001 r
23 BBB 2018-06-01 19:28:51.388001 r
6 AAA 2018-06-03 14:28:51.388001 a
3 CCC 2018-06-04 15:28:51.388001 u
19 AAA 2018-06-05 06:28:51.388001 u
5 AAA 2018-06-05 20:28:51.388001 r
21 AAA 2018-06-06 04:28:51.388001 a
9 BBB 2018-06-06 18:28:51.388001 r
25 AAA 2018-06-07 15:28:51.388001 r
11 BBB 2018-06-08 09:28:51.388001 r
10 BBB 2018-06-08 21:28:51.388001 u
13 BBB 2018-06-09 04:28:51.388001 a
7 AAA 2018-06-09 16:28:51.388001 r
22 AAA 2018-06-12 07:28:51.388001 r
26 BBB 2018-06-13 03:28:51.388001 u
15 AAA 2018-06-14 08:28:51.388001 a
8 CCC 2018-06-14 14:28:51.388001 r
18 CCC 2018-06-15 17:28:51.388001 u
16 BBB 2018-06-16 02:28:51.388001 a
29 AAA 2018-06-16 08:28:51.388001 r
30 AAA 2018-06-17 02:28:51.388001 a
I would like the output to be:
In [124]: df
Out[124]:
number_id created_dt status prev_status
20 BBB 2018-05-18 20:28:51.388001 u None
12 BBB 2018-05-19 12:28:51.388001 u None
2 CCC 2018-05-19 23:28:51.388001 u None
27 CCC 2018-05-20 22:28:51.388001 a None
1 CCC 2018-05-21 05:28:51.388001 u None
14 BBB 2018-05-21 12:28:51.388001 r None
17 AAA 2018-05-24 21:28:51.388001 a None
28 CCC 2018-05-30 16:28:51.388001 a u
0 AAA 2018-05-31 23:28:51.388001 r a
24 CCC 2018-06-01 00:28:51.388001 r u
4 BBB 2018-06-01 11:28:51.388001 r r
23 BBB 2018-06-01 19:28:51.388001 r r
6 AAA 2018-06-03 14:28:51.388001 a a
3 CCC 2018-06-04 15:28:51.388001 u u
19 AAA 2018-06-05 06:28:51.388001 u a
5 AAA 2018-06-05 20:28:51.388001 r a
21 AAA 2018-06-06 04:28:51.388001 a r
9 BBB 2018-06-06 18:28:51.388001 r r
25 AAA 2018-06-07 15:28:51.388001 r r
11 BBB 2018-06-08 09:28:51.388001 r r
10 BBB 2018-06-08 21:28:51.388001 u r
13 BBB 2018-06-09 04:28:51.388001 a r
7 AAA 2018-06-09 16:28:51.388001 r a
22 AAA 2018-06-12 07:28:51.388001 r a
26 BBB 2018-06-13 03:28:51.388001 u r
15 AAA 2018-06-14 08:28:51.388001 a r
8 CCC 2018-06-14 14:28:51.388001 r u
18 CCC 2018-06-15 17:28:51.388001 u u
16 BBB 2018-06-16 02:28:51.388001 a a
29 AAA 2018-06-16 08:28:51.388001 r r
30 AAA 2018-06-17 02:28:51.388001 a r
As you can see, the value in the 'prev_status' column is the 'status' of the previous row with the same 'number_id' (where "previous" means after applying the date-range condition to the 'created_dt' column).
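To make the window logic concrete, here is a minimal sketch of the question's function on a tiny frame with made-up values (the function body and its defaults are copied from the question; the data is hypothetical):

```python
from datetime import timedelta

import pandas as pd

def get_last_status_in_range(df, created_dt, created_id, window_size=15, gap_size=5):
    # Look for rows of the same number_id inside the window
    # [created_dt - (window_size + gap_size) days, created_dt - gap_size days)
    since = created_dt - timedelta(days=window_size + gap_size)
    until = created_dt - timedelta(days=gap_size)
    try:
        return df[(df.created_dt >= since) & (df.created_dt < until) &
                  (df.number_id == created_id)]['status'].iloc[-1]
    except IndexError:
        return None

# Toy frame with made-up values
df = pd.DataFrame({
    'number_id': ['AAA', 'AAA', 'AAA'],
    'created_dt': pd.to_datetime(['2018-06-01', '2018-06-10', '2018-06-25']),
    'status': ['a', 'b', 'c'],
})

# The 2018-06-25 row searches [2018-06-05, 2018-06-20): only 'b' qualifies
prev = get_last_status_in_range(df, pd.Timestamp('2018-06-25'), 'AAA')
```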
Answer 0 (score: 3)
This technique speeds up the operation with relational algebra rather than vectorization.
Using pandas.merge_asof, we can merge two DataFrames, picking from the second frame the last row whose comparison field falls below the comparison field of the first frame.
Create a column named until. This is a temporary column that we will discard later:
df['until'] = df.created_dt - pd.Timedelta(days=5)
Merge df onto itself on until & created_dt, i.e. take the last row of the right df whose created_dt comes before the left df's until, with number_id equal in both dfs:
merged = pd.merge_asof(df, df, left_on='until', right_on='created_dt', by='number_id', suffixes=('', '_y'), allow_exact_matches=False)
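As a side note, the asof semantics can be seen on a minimal pair of toy frames (hypothetical data, not from the question): with allow_exact_matches=False, each left row picks the last right row in the same by-group whose key is strictly smaller.

```python
import pandas as pd

# Toy frames with made-up values; both must be sorted on the key column 'k'.
left = pd.DataFrame({'g': ['x', 'x'], 'k': [5, 10], 'v': ['L1', 'L2']})
right = pd.DataFrame({'g': ['x', 'x'], 'k': [5, 7], 'w': ['R1', 'R2']})

# k=5 has no strictly smaller right key -> w is NaN;
# k=10 picks the last right row with k < 10, i.e. k=7 -> 'R2'
out = pd.merge_asof(left, right, on='k', by='g', allow_exact_matches=False)
```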
Set status_y to np.nan where created_dt_y comes before created_dt - 20 days:
merged.loc[~(merged.created_dt_y >= merged.created_dt - pd.Timedelta(days=20)), 'status_y'] = np.nan
Here we have to negate the "after" condition rather than filter on "before" directly, because merged.created_dt_y contains nulls, which never match the filter.
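The negation matters because comparisons against NaT are always False, as a quick check on a throwaway Series shows:

```python
import pandas as pd

# NaT never satisfies a comparison, so unmatched rows would silently slip past
# a direct `<` filter; negating `>=` keeps them in the mask instead.
s = pd.Series(pd.to_datetime(['2018-06-01', None]))
cutoff = pd.Timestamp('2018-05-20')
mask = ~(s >= cutoff)  # NaT >= cutoff is False, so its negation is True
```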
Finally, select the desired columns:
merged[['number_id', 'created_dt', 'status', 'status_y']]
# outputs:
number_id created_dt status status_y
0 BBB 2018-05-18 20:28:51.388001 u NaN
1 BBB 2018-05-19 12:28:51.388001 u NaN
2 CCC 2018-05-19 23:28:51.388001 u NaN
3 CCC 2018-05-20 22:28:51.388001 a NaN
4 CCC 2018-05-21 05:28:51.388001 u NaN
5 BBB 2018-05-21 12:28:51.388001 r NaN
6 AAA 2018-05-24 21:28:51.388001 a NaN
7 CCC 2018-05-30 16:28:51.388001 a u
8 AAA 2018-05-31 23:28:51.388001 r a
9 CCC 2018-06-01 00:28:51.388001 r u
10 BBB 2018-06-01 11:28:51.388001 r r
11 BBB 2018-06-01 19:28:51.388001 r r
12 AAA 2018-06-03 14:28:51.388001 a a
13 CCC 2018-06-04 15:28:51.388001 u u
14 AAA 2018-06-05 06:28:51.388001 u a
15 AAA 2018-06-05 20:28:51.388001 r a
16 AAA 2018-06-06 04:28:51.388001 a r
17 BBB 2018-06-06 18:28:51.388001 r r
18 AAA 2018-06-07 15:28:51.388001 r r
19 BBB 2018-06-08 09:28:51.388001 r r
20 BBB 2018-06-08 21:28:51.388001 u r
21 BBB 2018-06-09 04:28:51.388001 a r
22 AAA 2018-06-09 16:28:51.388001 r a
23 AAA 2018-06-12 07:28:51.388001 r a
24 BBB 2018-06-13 03:28:51.388001 u r
25 AAA 2018-06-14 08:28:51.388001 a r
26 CCC 2018-06-14 14:28:51.388001 r u
27 CCC 2018-06-15 17:28:51.388001 u u
28 BBB 2018-06-16 02:28:51.388001 a a
29 AAA 2018-06-16 08:28:51.388001 r r
30 AAA 2018-06-17 02:28:51.388001 a r
Even on this small 30-row DataFrame, we see roughly a 7x speedup:
%timeit slow(df)
# outputs:
41 ms ± 1.11 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
%timeit fast(df)
# outputs:
5.69 ms ± 34 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
The code used:
def slow(df):
    idx = 0
    last_status_in_range = np.array([None] * len(df), dtype=str)
    for row in df.itertuples():
        created_dt = row.created_dt
        created_id = row.number_id
        last_status_in_range[idx] = get_last_status_in_range(df, created_dt, created_id)
        idx += 1
    return df.assign(prev_status=last_status_in_range)
def fast(df):
    d = df.assign(until=df.created_dt - pd.Timedelta(days=5))
    merged = pd.merge_asof(
        d, d, left_on='until', right_on='created_dt',
        by='number_id', suffixes=('', '_y'),
        allow_exact_matches=False
    )
    merged.loc[
        ~(merged.created_dt_y >= merged.created_dt - pd.Timedelta(days=20)),
        'status_y'
    ] = np.nan
    return merged[['number_id', 'created_dt', 'status', 'status_y']]
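To sanity-check the two implementations against each other, here is a self-contained sketch on a small frame with made-up values (column names as in the question; the frame must already be sorted by created_dt, as merge_asof requires):

```python
from datetime import timedelta

import numpy as np
import pandas as pd

def get_last_status_in_range(df, created_dt, created_id, window_size=15, gap_size=5):
    since = created_dt - timedelta(days=window_size + gap_size)
    until = created_dt - timedelta(days=gap_size)
    try:
        return df[(df.created_dt >= since) & (df.created_dt < until) &
                  (df.number_id == created_id)]['status'].iloc[-1]
    except IndexError:
        return None

def fast(df):
    d = df.assign(until=df.created_dt - pd.Timedelta(days=5))
    merged = pd.merge_asof(
        d, d, left_on='until', right_on='created_dt',
        by='number_id', suffixes=('', '_y'),
        allow_exact_matches=False
    )
    merged.loc[
        ~(merged.created_dt_y >= merged.created_dt - pd.Timedelta(days=20)),
        'status_y'
    ] = np.nan
    return merged[['number_id', 'created_dt', 'status', 'status_y']]

# Toy frame (hypothetical values), sorted by created_dt
df = pd.DataFrame({
    'number_id': ['AAA', 'AAA', 'AAA', 'BBB', 'AAA', 'AAA'],
    'created_dt': pd.to_datetime(['2018-05-01', '2018-06-01', '2018-06-10',
                                  '2018-06-11', '2018-06-12', '2018-06-25']),
    'status': ['z', 'a', 'b', 'x', 'c', 'd'],
})

slow_res = [get_last_status_in_range(df, r.created_dt, r.number_id)
            for r in df.itertuples()]
fast_res = [s if isinstance(s, str) else None for s in fast(df).status_y]
```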