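I want to find the minimum of the date column (already converted with pd.to_datetime) while ignoring "1777-07-07", which is essentially an outlier. The input DataFrame looks like this: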
col2 date
b1a2 1777-07-07
b1a2 2012-09-14
b1a2 1777-07-07
b1a2 1777-07-07
b1a2 2017-09-14
b1a2 2019-09-24
b1a2 2012-09-14
b1a2 2012-09-14
b1a2 2012-09-28
a1l2 1777-07-07
a1l2 2012-09-24
a1l2 2012-09-24
a1l2 2002-09-28
a1l2 2012-09-24
a1l2 2008-09-14
a1l2 2012-09-24
When I try the following:
df = df.join(df.groupby('col2')['date'].agg(earliest='min'), on='col2')
df = df.join(df.groupby('col2')['date'].agg(latest='max'), on='col2')
it gives me the maximum and minimum values, as shown:
col2 date earliest latest
b1a2 1777-07-07 1777-07-07 2019-09-24
b1a2 2012-09-14 1777-07-07 2019-09-24
b1a2 2017-09-14 1777-07-07 2019-09-24
b1a2 2019-09-24 1777-07-07 2019-09-24
b1a2 2012-09-14 1777-07-07 2019-09-24
b1a2 2012-09-14 1777-07-07 2019-09-24
b1a2 2012-09-28 1777-07-07 2019-09-24
a1l2 1777-07-07 1777-07-07 2012-09-28
a1l2 2012-09-24 1777-07-07 2012-09-28
a1l2 2012-09-28 1777-07-07 2012-09-28
a1l2 2002-09-28 1777-07-07 2012-09-28
a1l2 2012-09-24 1777-07-07 2012-09-28
a1l2 2008-09-14 1777-07-07 2012-09-28
a1l2 2012-09-24 1777-07-07 2012-09-28
But I want to exclude the outlier. My expected output is:
b1a2 1777-07-07 2012-09-14 2019-09-24
b1a2 2012-09-14 2012-09-14 2019-09-24
b1a2 2017-09-14 2012-09-14 2019-09-24
b1a2 2019-09-24 2012-09-14 2019-09-24
b1a2 2012-09-14 2012-09-14 2019-09-24
b1a2 2012-09-14 2012-09-14 2019-09-24
b1a2 2012-09-28 2012-09-14 2019-09-24
a1l2 1777-07-07 2002-09-28 2012-09-28
a1l2 2012-09-24 2002-09-28 2012-09-28
a1l2 2012-09-28 2002-09-28 2012-09-28
a1l2 2002-09-28 2002-09-28 2012-09-28
a1l2 2012-09-24 2002-09-28 2012-09-28
a1l2 2008-09-14 2002-09-28 2012-09-28
a1l2 2012-09-24 2002-09-28 2012-09-28
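Answer 0 (score: 2)
When the outlier is a constant value, mask it before the groupby. Then use transform to broadcast the per-group result back to the original DataFrame.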
df['date'] = pd.to_datetime(df.date)
s = df.date.where(df.date.ne('1777-07-07')).groupby(df.col2)
df['earliest'] = s.transform('min')
df['latest'] = s.transform('max')
col2 date earliest latest
0 b1a2 1777-07-07 2012-09-14 2019-09-24
1 b1a2 2012-09-14 2012-09-14 2019-09-24
2 b1a2 1777-07-07 2012-09-14 2019-09-24
3 b1a2 1777-07-07 2012-09-14 2019-09-24
4 b1a2 2017-09-14 2012-09-14 2019-09-24
5 b1a2 2019-09-24 2012-09-14 2019-09-24
6 b1a2 2012-09-14 2012-09-14 2019-09-24
7 b1a2 2012-09-14 2012-09-14 2019-09-24
8 b1a2 2012-09-28 2012-09-14 2019-09-24
9 a1l2 1777-07-07 2002-09-28 2012-09-24
10 a1l2 2012-09-24 2002-09-28 2012-09-24
11 a1l2 2012-09-24 2002-09-28 2012-09-24
12 a1l2 2002-09-28 2002-09-28 2012-09-24
13 a1l2 2012-09-24 2002-09-28 2012-09-24
14 a1l2 2008-09-14 2002-09-28 2012-09-24
15 a1l2 2012-09-24 2002-09-28 2012-09-24
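For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same mask-then-transform idea. It uses only a subset of the question's rows, and the variable names `OUTLIER`, `masked`, and `grouped` are made up for illustration. It assumes the outlier is the single known value 1777-07-07.

```python
import pandas as pd

# A subset of the rows from the question, kept small for readability.
df = pd.DataFrame({
    "col2": ["b1a2", "b1a2", "b1a2", "b1a2", "a1l2", "a1l2", "a1l2"],
    "date": ["1777-07-07", "2012-09-14", "2017-09-14", "2019-09-24",
             "1777-07-07", "2002-09-28", "2012-09-24"],
})
df["date"] = pd.to_datetime(df["date"])

# Assumption: the outlier is one known constant value.
OUTLIER = pd.Timestamp("1777-07-07")

# where() keeps values where the condition is True and replaces the rest
# with NaT, which min/max skip by default.
masked = df["date"].where(df["date"].ne(OUTLIER))

# Group the masked Series by the original col2 values; transform() returns
# a result aligned with df's index, so it can be assigned directly.
grouped = masked.groupby(df["col2"])
df["earliest"] = grouped.transform("min")
df["latest"] = grouped.transform("max")

print(df)
```

The original date column is left untouched; only the temporary `masked` Series has the outlier replaced, so the outlier rows still appear in the output with the corrected earliest and latest values.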
Answer 1 (score: 1)
Mask the invalid value, then proceed as before.
u = df['date'].mask(df['date'].eq('1777-07-07')).groupby(df['col2']).agg(['min', 'max'])
df.merge(u, left_on='col2', right_index=True)
col2 date min max
0 b1a2 1777-07-07 2012-09-14 2019-09-24
1 b1a2 2012-09-14 2012-09-14 2019-09-24
2 b1a2 1777-07-07 2012-09-14 2019-09-24
3 b1a2 1777-07-07 2012-09-14 2019-09-24
4 b1a2 2017-09-14 2012-09-14 2019-09-24
5 b1a2 2019-09-24 2012-09-14 2019-09-24
6 b1a2 2012-09-14 2012-09-14 2019-09-24
7 b1a2 2012-09-14 2012-09-14 2019-09-24
8 b1a2 2012-09-28 2012-09-14 2019-09-24
9 a1l2 1777-07-07 2002-09-28 2012-09-24
10 a1l2 2012-09-24 2002-09-28 2012-09-24
11 a1l2 2012-09-24 2002-09-28 2012-09-24
12 a1l2 2002-09-28 2002-09-28 2012-09-24
13 a1l2 2012-09-24 2002-09-28 2012-09-24
14 a1l2 2008-09-14 2002-09-28 2012-09-24
15 a1l2 2012-09-24 2002-09-28 2012-09-24
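A runnable variant of the mask-then-merge approach might look like the sketch below. It again uses a subset of the question's rows and assumes a single constant outlier. The names `outlier`, `stats`, and `result` are illustrative. Named aggregation (available in pandas 0.25 and later) is used here so the new columns come out as earliest and latest rather than min and max.

```python
import pandas as pd

# A subset of the rows from the question.
df = pd.DataFrame({
    "col2": ["b1a2", "b1a2", "b1a2", "b1a2", "a1l2", "a1l2", "a1l2"],
    "date": ["1777-07-07", "2012-09-14", "2017-09-14", "2019-09-24",
             "1777-07-07", "2002-09-28", "2012-09-24"],
})
df["date"] = pd.to_datetime(df["date"])

# Assumption: the outlier is one known constant value.
outlier = pd.Timestamp("1777-07-07")

# mask() is the inverse of where(): it replaces values where the
# condition is True (here, the outlier) with NaT.
stats = (
    df["date"]
    .mask(df["date"].eq(outlier))
    .groupby(df["col2"])
    .agg(earliest="min", latest="max")  # one row per col2 value
)

# merge() returns a new DataFrame; df itself is not modified.
result = df.merge(stats, left_on="col2", right_index=True)

print(result)
```

Compared with transform, this computes one small per-group table first and joins it back, which can be convenient when the per-group summary is also needed on its own.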