Question

I have a dataframe with the following general structure (I know, it could be better, but this is what I have to work with):

| patient_id | inclusion_timestamp | pre_event_1      | post_event_1     | post_event_2     |
|------------|---------------------|------------------|------------------|------------------|
| 1          | NaN                 | 27-06-2020 12:26 | NaN              | NaN              |
| 1          | 28-06-2020 13:05    | NaN              | NaN              | NaN              |
| 1          | NaN                 | NaN              | 29-06-2020 14:00 | NaN              |
| 1          | NaN                 | NaN              | NaN              | 29-06-2020 23:57 |
| 2          | NaN                 | 29-06-2020 10:11 | NaN              | NaN              |
| 2          | 29-06-2020 18:26    | NaN              | NaN              | NaN              |
| 2          | NaN                 | NaN              | 30-06-2020 19:36 | NaN              |
| 2          | NaN                 | NaN              | NaN              | 31-06-2020 21:20 |
| 3          | NaN                 | 29-06-2020 06:35 | NaN              | NaN              |
| 3          | NaN                 | 29-06-2020 07:28 | NaN              | NaN              |
| 3          | 30-06-2020 09:06    | NaN              | NaN              | NaN              |
| 3          | NaN                 | NaN              | NaN              | 01-07-2020 12:10 |

And so on.

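The only way I know to calculate anything relative to inclusion_timestamp is to forward fill from inclusion_timestamp. However, the rows holding pre_event_1 values usually come before the row holding the inclusion value, so a forward fill alone leads to wrong calculations.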
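To make the problem concrete, here is a minimal sketch on a hypothetical two-patient miniature of the frame above (the shortened data is an assumption for illustration). It shows that a plain forward fill leaves the first pre-event row empty and carries one patient's timestamp into the next patient's rows.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the frame above: the pre-event row comes first,
# then the inclusion row, then a post-event row.
df = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2, 2],
    "inclusion_timestamp": [np.nan, "28-06-2020 13:05", np.nan,
                            np.nan, "29-06-2020 18:26", np.nan],
})

# A plain forward fill knows nothing about patient boundaries.
naive = df["inclusion_timestamp"].ffill()

# Row 0 stays empty, and row 3 (patient 2) inherits patient 1's timestamp.
print(pd.concat([df["patient_id"], naive], axis=1))
```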

Is there a way to do both a forward and a backward fill, but only within the same index_col (patient_id)? The resulting dataframe would then look like this:

| patient_id | inclusion_timestamp | pre_event_1      | post_event_1     | post_event_2     |
|------------|---------------------|------------------|------------------|------------------|
| 1          | 28-06-2020 13:05    | 27-06-2020 12:26 | NaN              | NaN              |
| 1          | 28-06-2020 13:05    | NaN              | NaN              | NaN              |
| 1          | 28-06-2020 13:05    | NaN              | 29-06-2020 14:00 | NaN              |
| 1          | 28-06-2020 13:05    | NaN              | NaN              | 29-06-2020 23:57 |
| 2          | 29-06-2020 18:26    | 29-06-2020 10:11 | NaN              | NaN              |
| 2          | 29-06-2020 18:26    | NaN              | NaN              | NaN              |
| 2          | 29-06-2020 18:26    | NaN              | 30-06-2020 19:36 | NaN              |
| 2          | 29-06-2020 18:26    | NaN              | NaN              | 31-06-2020 21:20 |
| 3          | 30-06-2020 09:06    | 29-06-2020 06:35 | NaN              | NaN              |
| 3          | 30-06-2020 09:06    | 29-06-2020 07:28 | NaN              | NaN              |
| 3          | 30-06-2020 09:06    | NaN              | NaN              | NaN              |
| 3          | 30-06-2020 09:06    | NaN              | NaN              | 01-07-2020 12:10 |

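I think the answer is to iterate over the index column and apply the forward and backward fill within each patient_id, but I could not get my code to work...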

Answer 1

Use DataFrame.groupby on the patient_id column and run ffill and bfill on each group with apply:

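# group_keys=False keeps the original row index, so the result aligns on assignment
# (on recent pandas versions the default would add patient_id to the index)
df['inclusion_timestamp'] = df.groupby('patient_id', group_keys=False)['inclusion_timestamp']\
                              .apply(lambda x: x.ffill().bfill())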
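A self-contained sketch of the same idea follows, on a hypothetical two-patient sample. Note that it swaps apply for transform, which is a substitution made here and not the snippet above: transform returns a result indexed like the input, so it can be assigned back directly.

```python
import numpy as np
import pandas as pd

# Hypothetical sample with the same layout as the question's frame.
df = pd.DataFrame({
    "patient_id": [1, 1, 1, 1, 2, 2, 2, 2],
    "inclusion_timestamp": [np.nan, "28-06-2020 13:05", np.nan, np.nan,
                            np.nan, "29-06-2020 18:26", np.nan, np.nan],
})

# Within each patient: forward fill first, then backward fill what is left.
filled = (
    df.groupby("patient_id")["inclusion_timestamp"]
      .transform(lambda s: s.ffill().bfill())
)

# transform preserves the original index, so plain assignment lines up row by row.
df["inclusion_timestamp"] = filled
print(df)
```

Because each patient has a single inclusion timestamp here, the order of ffill and bfill does not matter. With several distinct timestamps per patient it would.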

Another idea is to combine DataFrame.groupby with Series.combine_first:

g = df.groupby('patient_id')['inclusion_timestamp']
df['inclusion_timestamp'] = g.ffill().combine_first(g.bfill())
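The sketch below spells out the intermediate results of the combine_first variant on a hypothetical sample. It assumes a third patient with no inclusion timestamp at all, added only to show that such a group is left empty instead of borrowing a neighbour's value.

```python
import numpy as np
import pandas as pd

# Hypothetical sample; patient 3 has no inclusion timestamp at all.
df = pd.DataFrame({
    "patient_id": [1, 1, 1, 2, 2, 2, 3, 3],
    "inclusion_timestamp": [np.nan, "28-06-2020 13:05", np.nan,
                            np.nan, "29-06-2020 18:26", np.nan,
                            np.nan, np.nan],
})

g = df.groupby("patient_id")["inclusion_timestamp"]

forward = g.ffill()    # rows before the timestamp in each group stay empty
backward = g.bfill()   # rows after the timestamp in each group stay empty

# combine_first takes values from `forward` and fills its gaps from `backward`.
combined = forward.combine_first(backward)
print(pd.DataFrame({"forward": forward, "backward": backward, "combined": combined}))
```

Both group-wise fills return a Series on the original index, which is why combine_first can align them without further work.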

Yet another idea uses two consecutive Series.groupby calls:

df['inclusion_timestamp'] = df['inclusion_timestamp'].groupby(df['patient_id'])\
                           .ffill().groupby(df['patient_id']).bfill()

Result:

    patient_id inclusion_timestamp       pre_event_1      post_event_1      post_event_2
0            1    28-06-2020 13:05  27-06-2020 12:26               NaN               NaN
1            1    28-06-2020 13:05               NaN               NaN               NaN
2            1    28-06-2020 13:05               NaN  29-06-2020 14:00               NaN
3            1    28-06-2020 13:05               NaN               NaN  29-06-2020 23:57
4            2    29-06-2020 18:26  29-06-2020 10:11               NaN               NaN
5            2    29-06-2020 18:26               NaN               NaN               NaN
6            2    29-06-2020 18:26               NaN  30-06-2020 19:36               NaN
7            2    29-06-2020 18:26               NaN               NaN  31-06-2020 21:20
8            3    30-06-2020 09:06  29-06-2020 06:35               NaN               NaN
9            3    30-06-2020 09:06  29-06-2020 07:28               NaN               NaN
10           3    30-06-2020 09:06               NaN               NaN               NaN
11           3    30-06-2020 09:06               NaN               NaN  01-07-2020 12:10
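Since the goal in the question is to calculate relative to inclusion_timestamp, here is one possible follow-up sketch. It assumes the columns are still strings in day-month-year format and uses a hypothetical three-row excerpt. The offset column names are made up for illustration.

```python
import pandas as pd

# Hypothetical excerpt of the filled frame, still holding strings.
filled = pd.DataFrame({
    "patient_id": [1, 1, 2],
    "inclusion_timestamp": ["28-06-2020 13:05", "28-06-2020 13:05", "29-06-2020 18:26"],
    "pre_event_1": ["27-06-2020 12:26", None, None],
    "post_event_2": [None, "29-06-2020 23:57", "31-06-2020 21:20"],
})

# Parse day-first strings; errors="coerce" turns unparseable values into NaT.
for col in ["inclusion_timestamp", "pre_event_1", "post_event_2"]:
    filled[col] = pd.to_datetime(filled[col], format="%d-%m-%Y %H:%M", errors="coerce")

# Offsets relative to inclusion: negative before inclusion, positive after.
filled["pre_event_1_offset"] = filled["pre_event_1"] - filled["inclusion_timestamp"]
filled["post_event_2_offset"] = filled["post_event_2"] - filled["inclusion_timestamp"]
print(filled)
```

The sample value 31-06-2020 is not a valid calendar date, so with errors="coerce" it becomes NaT and its offset is empty as well.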

Performance (measured with timeit):

df.shape
(1200000, 5)

%%timeit -n10 @Method 1 (Best Method)
263 ms ± 1.72 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit -n10 @Method 2
342 ms ± 1.58 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%%timeit -n10 @Method 3
297 ms ± 4.83 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
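To reproduce a comparison like this outside a notebook, a rough sketch with the standard timeit module is shown below. The synthetic data, its size, and the helper names method_1 to method_3 are assumptions made here. Absolute timings will vary with hardware and pandas version, so none are quoted.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in (hypothetical): 1,000 patients with 4 rows each and
# one inclusion timestamp on the second row of every patient.
n_patients, rows_per_patient = 1_000, 4
n_rows = n_patients * rows_per_patient
ts = pd.Series(pd.NaT, index=range(n_rows), dtype="datetime64[ns]")
ts.iloc[1::rows_per_patient] = pd.Timestamp("2020-06-28 13:05")
df = pd.DataFrame({
    "patient_id": np.repeat(np.arange(n_patients), rows_per_patient),
    "inclusion_timestamp": ts,
})

def method_1(frame):
    # groupby + apply with ffill/bfill per group
    grouped = frame.groupby("patient_id", group_keys=False)["inclusion_timestamp"]
    return grouped.apply(lambda x: x.ffill().bfill())

def method_2(frame):
    # group-wise ffill, gaps filled from group-wise bfill
    g = frame.groupby("patient_id")["inclusion_timestamp"]
    return g.ffill().combine_first(g.bfill())

def method_3(frame):
    # two consecutive Series.groupby calls
    keys = frame["patient_id"]
    return frame["inclusion_timestamp"].groupby(keys).ffill().groupby(keys).bfill()

results = {f.__name__: f(df) for f in (method_1, method_2, method_3)}

for func in (method_1, method_2, method_3):
    seconds = timeit.timeit(lambda: func(df), number=3)
    print(f"{func.__name__}: {seconds / 3:.4f} s per call")
```

Checking that all three entries in results agree before comparing speed guards against timing a variant that silently produces a different answer.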
