I have some data (the columns up to and including Event) and the expected output (Key, Time), as shown below:
+----------+------------+-------+-----+------+
| Location | Date | Event | Key | Time |
+----------+------------+-------+-----+------+
| i2 | 2019-03-02 | 1 | a | |
| i2 | 2019-03-02 | 1 | a | |
| i2 | 2019-03-02 | 1 | a | |
| i2 | 2019-03-04 | 1 | a | 2 |
| i2 | 2019-03-15 | 2 | b | 0 |
| i9 | 2019-02-22 | 2 | c | 0 |
| i9 | 2019-03-10 | 3 | d | |
| i9 | 2019-03-10 | 3 | d | 0 |
| s8 | 2019-04-22 | 1 | e | |
| s8 | 2019-04-25 | 1 | e | |
| s8 | 2019-04-28 | 1 | e | 6 |
| t14 | 2019-05-13 | 3 | f | |
+----------+------------+-------+-----+------+
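A new Key is created whenever Location or Event (or both) changes. I am mainly interested in the Time output, which is the difference in days between the first and last row of each Key. If a Key has only one row, Time is 0. Do we still need to create Key, or can the Time difference be obtained directly?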
Answer 0 (score: 4)
I don't think you need to create Key here:
# Assumes df['Date'] is already datetime64, e.g. df['Date'] = pd.to_datetime(df['Date'])
df['Time']=df.groupby(['Location','Event']).Date.\
transform(lambda x : (x.iloc[-1]-x.iloc[0]))[~df.duplicated(['Location','Event'],keep='last')]
df
Out[107]:
Location Date Event Key Time
0 i2 2019-03-02 1 a NaT
1 i2 2019-03-02 1 a NaT
2 i2 2019-03-02 1 a NaT
3 i2 2019-03-04 1 a 2 days
4 i2 2019-03-15 2 b 0 days
5 i9 2019-02-22 2 c 0 days
6 i9 2019-03-10 3 d NaT
7 i9 2019-03-10 3 d 0 days
8 s8 2019-04-22 1 e NaT
9 s8 2019-04-25 1 e NaT
10 s8 2019-04-28 1 e 6 days
11 t14 2019-05-13 3 f 0 days
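Grouping by Location and Event works for the sample data because each (Location, Event) pair occupies a single contiguous run of rows. If the same pair could reappear later in non-adjacent rows, and a new Key should start at every change as the question describes, a run identifier can be derived first. The following is a minimal sketch under that assumption, with the rows already in the desired order; the names `changed`, `run_id`, `span_days` and `is_last` are illustrative and do not come from the original post.

```python
import pandas as pd

# Sample data from the question, without the Key and Time columns.
df = pd.DataFrame({
    "Location": ["i2", "i2", "i2", "i2", "i2", "i9", "i9", "i9",
                 "s8", "s8", "s8", "t14"],
    "Date": ["2019-03-02", "2019-03-02", "2019-03-02", "2019-03-04",
             "2019-03-15", "2019-02-22", "2019-03-10", "2019-03-10",
             "2019-04-22", "2019-04-25", "2019-04-28", "2019-05-13"],
    "Event": [1, 1, 1, 1, 2, 2, 3, 3, 1, 1, 1, 3],
})
df["Date"] = pd.to_datetime(df["Date"])

# A new run starts wherever Location or Event differs from the previous row.
changed = (df["Location"] != df["Location"].shift()) | (
    df["Event"] != df["Event"].shift()
)
run_id = changed.cumsum()  # hypothetical stand-in for the Key column

# Days between the first and last date of each run, broadcast to every row.
grouped = df.groupby(run_id)["Date"]
span_days = (grouped.transform("last") - grouped.transform("first")).dt.days

# Keep the value only on the last row of each run; other rows stay missing.
is_last = run_id != run_id.shift(-1)
df["Time"] = span_days.where(is_last)
```

With this sketch, single-row runs get 0, including the final t14 row, which follows the stated rule rather than the blank cell in the question's table.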
Answer 1 (score: 0)
A vectorized approach:
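import numpy as np
import pandas as pd
df['Date'] = pd.to_datetime(df['Date'])
# Flag the last row of each Key (requires an existing Key column)
df['diff'] = df['Key'].ne(df['Key'].shift(-1).ffill()).astype(int)
x = df.groupby(['Location','Event'])['Date'].transform(np.ptp)
df.loc[df['diff'] == 1, 'date_diff'] = x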
df
Location Date Event Key Time diff date_diff
1 i2 2019-03-02 1 a 0 NaT
2 i2 2019-03-02 1 a 0 NaT
3 i2 2019-03-02 1 a 0 NaT
4 i2 2019-03-04 1 a 2 1 2 days
5 i2 2019-03-15 2 b 0 1 0 days
6 i9 2019-02-22 2 c 0 1 0 days
7 i9 2019-03-10 3 d 0 NaT
8 i9 2019-03-10 3 d 0 1 0 days
9 s8 2019-04-22 1 e 0 NaT
10 s8 2019-04-25 1 e 0 NaT
11 s8 2019-04-28 1 e 6 1 6 days
12 t14 2019-05-13 3 f 0 NaT
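The two answers disagree on the final t14 row: the first shows 0 days, while this one shows NaT. A small sketch of why, using only the Key values from the sample (the variable names are illustrative): after shift(-1) the last entry is missing, and ffill() replaces it with the preceding shifted value, which is the last Key itself, so the final row is never flagged.

```python
import pandas as pd

# Key values from the sample data, one per row.
key = pd.Series(list("aaaabcddeeef"))

# Marker as written in the answer: the trailing NaN produced by shift(-1)
# is forward-filled with "f", so the last row compares equal and is not flagged.
with_ffill = key.ne(key.shift(-1).ffill())

# Without ffill, the last row is compared against NaN and is flagged.
without_ffill = key.ne(key.shift(-1))

print(with_ffill.astype(int).tolist())
print(without_ffill.astype(int).tolist())
```

Dropping the ffill() call would therefore mark the last row as the end of its Key, so a single-row Key at the end of the frame would receive 0 days, in line with the stated rule.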