Vectorize an operation in Pandas

Date: 2018-05-21 18:11:45

Tags: python pandas numpy vectorization

I am performing this operation on a large Pandas DataFrame and, unsurprisingly, it is very slow.

from datetime import timedelta

def get_last_status_in_range(df, created_dt, created_id, window_size=15, gap_size=5):
    # Window: [created_dt - (window_size + gap_size), created_dt - gap_size)
    since = created_dt - timedelta(days=(window_size + gap_size))
    until = created_dt - timedelta(days=gap_size)
    try:
        status = df[(df.created_dt >= since) & (df.created_dt < until) &
                    (df.number_id == created_id)]['status'].iloc[-1]
    except IndexError:
        # No matching row in the window
        status = None
    return status

import numpy as np

idx = 0
last_status_in_range = np.array([None] * len(df), dtype=str)
for row in df.itertuples():
    created_dt = row.created_dt
    created_id = row.number_id
    last_status_in_range[idx] = get_last_status_in_range(df, created_dt, created_id)
    idx += 1

My goal is, given a DF with columns "created_dt", "number_id" and "status", to fetch for each row the last "status" of the same "number_id", but restricted to a specified date range in the past.

So far the only way I have found is the loop above, but on a large DataFrame it is very slow, and I cannot find a vectorized way to do it.

How can I vectorize an operation that uses values from the same DataFrame?

Edit:

Given the following DF:

In [120]: df
Out[120]: 
   number_id                 created_dt status
20     BBB 2018-05-18 20:28:51.388001      u
12     BBB 2018-05-19 12:28:51.388001      u
2      CCC 2018-05-19 23:28:51.388001      u
27     CCC 2018-05-20 22:28:51.388001      a
1      CCC 2018-05-21 05:28:51.388001      u
14     BBB 2018-05-21 12:28:51.388001      r
17     AAA 2018-05-24 21:28:51.388001      a
28     CCC 2018-05-30 16:28:51.388001      a
0      AAA 2018-05-31 23:28:51.388001      r
24     CCC 2018-06-01 00:28:51.388001      r
4      BBB 2018-06-01 11:28:51.388001      r
23     BBB 2018-06-01 19:28:51.388001      r
6      AAA 2018-06-03 14:28:51.388001      a
3      CCC 2018-06-04 15:28:51.388001      u
19     AAA 2018-06-05 06:28:51.388001      u
5      AAA 2018-06-05 20:28:51.388001      r
21     AAA 2018-06-06 04:28:51.388001      a
9      BBB 2018-06-06 18:28:51.388001      r
25     AAA 2018-06-07 15:28:51.388001      r
11     BBB 2018-06-08 09:28:51.388001      r
10     BBB 2018-06-08 21:28:51.388001      u
13     BBB 2018-06-09 04:28:51.388001      a
7      AAA 2018-06-09 16:28:51.388001      r
22     AAA 2018-06-12 07:28:51.388001      r
26     BBB 2018-06-13 03:28:51.388001      u
15     AAA 2018-06-14 08:28:51.388001      a
8      CCC 2018-06-14 14:28:51.388001      r
18     CCC 2018-06-15 17:28:51.388001      u
16     BBB 2018-06-16 02:28:51.388001      a
29     AAA 2018-06-16 08:28:51.388001      r
30     AAA 2018-06-17 02:28:51.388001      a

I would like the output to be:

In [124]: df
Out[124]: 
   number_id                 created_dt status prev_status
20     BBB 2018-05-18 20:28:51.388001      u        None
12     BBB 2018-05-19 12:28:51.388001      u        None
2      CCC 2018-05-19 23:28:51.388001      u        None
27     CCC 2018-05-20 22:28:51.388001      a        None
1      CCC 2018-05-21 05:28:51.388001      u        None
14     BBB 2018-05-21 12:28:51.388001      r        None
17     AAA 2018-05-24 21:28:51.388001      a        None
28     CCC 2018-05-30 16:28:51.388001      a           u
0      AAA 2018-05-31 23:28:51.388001      r           a
24     CCC 2018-06-01 00:28:51.388001      r           u
4      BBB 2018-06-01 11:28:51.388001      r           r
23     BBB 2018-06-01 19:28:51.388001      r           r
6      AAA 2018-06-03 14:28:51.388001      a           a
3      CCC 2018-06-04 15:28:51.388001      u           u
19     AAA 2018-06-05 06:28:51.388001      u           a
5      AAA 2018-06-05 20:28:51.388001      r           a
21     AAA 2018-06-06 04:28:51.388001      a           r
9      BBB 2018-06-06 18:28:51.388001      r           r
25     AAA 2018-06-07 15:28:51.388001      r           r
11     BBB 2018-06-08 09:28:51.388001      r           r
10     BBB 2018-06-08 21:28:51.388001      u           r
13     BBB 2018-06-09 04:28:51.388001      a           r
7      AAA 2018-06-09 16:28:51.388001      r           a
22     AAA 2018-06-12 07:28:51.388001      r           a
26     BBB 2018-06-13 03:28:51.388001      u           r
15     AAA 2018-06-14 08:28:51.388001      a           r
8      CCC 2018-06-14 14:28:51.388001      r           u
18     CCC 2018-06-15 17:28:51.388001      u           u
16     BBB 2018-06-16 02:28:51.388001      a           a
29     AAA 2018-06-16 08:28:51.388001      r           r
30     AAA 2018-06-17 02:28:51.388001      a           r

As you can see, the value in the "prev_status" column is the "status" of the previous row with the same "number_id" (where "previous" means after applying the date-range condition to the "created_dt" column).

1 Answer:

Answer 0 (score: 3)

This technique speeds up the operation with relational algebra (a merge) rather than element-wise vectorization.

Using pandas.merge_asof, we can merge two DataFrames, picking from the second frame the last row whose comparison field is lower than the comparison field of the first frame.
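To make merge_asof's behavior concrete, here is a minimal toy example (made-up data, not the question's; note that merge_asof requires both frames to be sorted on their merge keys):

```python
import pandas as pd

# With direction='backward' (the default) and allow_exact_matches=False,
# merge_asof picks, for each left row, the last right row whose key is
# strictly below the left key.
left = pd.DataFrame({'t': pd.to_datetime(['2018-06-03', '2018-06-05'])})
right = pd.DataFrame({'t': pd.to_datetime(['2018-06-01', '2018-06-02', '2018-06-04']),
                      'status': ['a', 'b', 'c']})
out = pd.merge_asof(left, right, on='t', allow_exact_matches=False)
print(out['status'].tolist())  # ['b', 'c']
```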

Create a column named until. This is a temporary column that we will discard later:

df['until'] = df.created_dt - pd.Timedelta(days=5)

Merge df onto itself on until and created_dt, i.e. take the last row of the right df whose created_dt is before the left df's until, with number_id equal in both dfs:

merged = pd.merge_asof(df, df, left_on='until', right_on='created_dt', by='number_id', suffixes=('', '_y'), allow_exact_matches=False)

Set status_y to np.nan wherever created_dt_y falls before created_dt - 20 days (the lower bound of the 15-day window plus the 5-day gap):

merged.loc[~(merged.created_dt_y >= merged.created_dt - pd.Timedelta(days=20)), 'status_y'] = np.nan

Here we have to negate the condition, because merged.created_dt_y contains nulls (NaT) for unmatched rows, and those never satisfy the filter directly.
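A quick illustration of why the negation matters: any comparison against NaT evaluates to False, so only the negated form `~(... >= ...)` keeps the unmatched (NaT) rows inside the mask:

```python
import pandas as pd

# A series with one real timestamp and one missing value (NaT)
s = pd.Series(pd.to_datetime(['2018-06-01', None]))
bound = pd.Timestamp('2018-05-20')

# Comparisons with NaT are always False, in either direction:
print((s < bound).tolist())      # [False, False]
print((s >= bound).tolist())     # [True, False]
# Negating the >= test therefore includes the NaT row in the mask:
print((~(s >= bound)).tolist())  # [False, True]
```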

Finally, select the desired columns:

merged[['number_id', 'created_dt', 'status', 'status_y']]
# outputs:
   number_id                 created_dt status status_y
0        BBB 2018-05-18 20:28:51.388001      u      NaN
1        BBB 2018-05-19 12:28:51.388001      u      NaN
2        CCC 2018-05-19 23:28:51.388001      u      NaN
3        CCC 2018-05-20 22:28:51.388001      a      NaN
4        CCC 2018-05-21 05:28:51.388001      u      NaN
5        BBB 2018-05-21 12:28:51.388001      r      NaN
6        AAA 2018-05-24 21:28:51.388001      a      NaN
7        CCC 2018-05-30 16:28:51.388001      a        u
8        AAA 2018-05-31 23:28:51.388001      r        a
9        CCC 2018-06-01 00:28:51.388001      r        u
10       BBB 2018-06-01 11:28:51.388001      r        r
11       BBB 2018-06-01 19:28:51.388001      r        r
12       AAA 2018-06-03 14:28:51.388001      a        a
13       CCC 2018-06-04 15:28:51.388001      u        u
14       AAA 2018-06-05 06:28:51.388001      u        a
15       AAA 2018-06-05 20:28:51.388001      r        a
16       AAA 2018-06-06 04:28:51.388001      a        r
17       BBB 2018-06-06 18:28:51.388001      r        r
18       AAA 2018-06-07 15:28:51.388001      r        r
19       BBB 2018-06-08 09:28:51.388001      r        r
20       BBB 2018-06-08 21:28:51.388001      u        r
21       BBB 2018-06-09 04:28:51.388001      a        r
22       AAA 2018-06-09 16:28:51.388001      r        a
23       AAA 2018-06-12 07:28:51.388001      r        a
24       BBB 2018-06-13 03:28:51.388001      u        r
25       AAA 2018-06-14 08:28:51.388001      a        r
26       CCC 2018-06-14 14:28:51.388001      r        u
27       CCC 2018-06-15 17:28:51.388001      u        u
28       BBB 2018-06-16 02:28:51.388001      a        a
29       AAA 2018-06-16 08:28:51.388001      r        r
30       AAA 2018-06-17 02:28:51.388001      a        r

Benchmark results:

Even on this small 30-row DataFrame, we see roughly a 7x performance improvement:

%timeit slow(df)
# outputs:
41 ms ± 1.11 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit fast(df)
# outputs:
5.69 ms ± 34 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Code used:

def slow(df):
  idx = 0
  last_status_in_range = np.array([None] * len(df), dtype=str)
  for row in df.itertuples():
    created_dt = row.created_dt
    created_id = row.number_id
    last_status_in_range[idx] = get_last_status_in_range(df, created_dt, created_id)
    idx += 1
  return df.assign(prev_status=last_status_in_range)

def fast(df):
  # merge_asof requires both frames to be sorted on their merge keys;
  # df is assumed sorted by created_dt (and hence by until as well)
  d = df.assign(until = df.created_dt - pd.Timedelta(days=5))
  merged = pd.merge_asof(
      d, d, left_on='until', right_on='created_dt',
      by='number_id', suffixes=('', '_y'),
      allow_exact_matches=False
  )
  merged.loc[
      ~(merged.created_dt_y >= merged.created_dt - pd.Timedelta(days=20)),
      'status_y'
  ] = np.nan
  return merged[['number_id', 'created_dt', 'status', 'status_y']]
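As a sanity check, here is a self-contained run of the same merge_asof pipeline on three hypothetical rows for a single number_id (made-up data, not from the question). Only the middle row has a match inside the 15-day window that ends 5 days before its own date:

```python
import pandas as pd
import numpy as np

# Three rows for one id, sorted by created_dt (required by merge_asof)
df = pd.DataFrame({
    'number_id': ['AAA', 'AAA', 'AAA'],
    'created_dt': pd.to_datetime(['2018-05-01', '2018-05-10', '2018-06-10']),
    'status': ['u', 'a', 'r'],
})

# The answer's pipeline, inlined
d = df.assign(until=df.created_dt - pd.Timedelta(days=5))
merged = pd.merge_asof(
    d, d, left_on='until', right_on='created_dt',
    by='number_id', suffixes=('', '_y'), allow_exact_matches=False,
)
merged.loc[
    ~(merged.created_dt_y >= merged.created_dt - pd.Timedelta(days=20)),
    'status_y'
] = np.nan

# Row 1 has no earlier row at all; row 3's only candidates fall outside
# the 20-day lower bound, so status_y should be [NaN, 'u', NaN]
print(merged[['created_dt', 'status', 'status_y']])
```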