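I am looking for a pandas equivalent of the SQL Redshift window function LAST_VALUE().
I have a pandas DataFrame of serial number reports that is appended to every day.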
import pandas as pd
data = {'serial_num': [123456, 678901, 123456, 678901],
'status': ['Good', 'Good', 'BAD', 'BAD'],
'last_check':['2020-03-02','2020-03-02','2020-03-01','2020-03-01']}
new_br = pd.DataFrame.from_dict(data)
new_br
serial_num status last_check
123456 Good 2020-03-02
678901 Good 2020-03-02
123456 BAD 2020-03-01
678901 BAD 2020-03-01
I want the row with the maximum last_check for each serial_num, keeping all columns (my actual dataset has many more columns).
My code so far is:
new_br.set_index('last_check').groupby('serial_num').max()
serial_num status
123456 Good
678901 Good
However, this drops the last_check column. How can I keep the date column, similar to the LAST_VALUE() function in SQL Redshift?
My expected output is:
serial_num status last_check
123456 Good 2020-03-02
678901 Good 2020-03-02
Answer 0 (score: 3)
Use groupby.idxmax together with loc:
data = {'serial_num': [123456, 678901, 123456, 678901],
'status': ['Good', 'Good', 'BAD', 'BAD'],
'last_check':['2020-03-02','2020-03-02','2020-03-01','2020-03-01']}
new_br = pd.DataFrame.from_dict(data)
print(new_br.dtypes)
# serial_num int64
# status object
# last_check object
# dtype: object
# if last_check is not datetime dtype run this first
new_br['last_check'] = pd.to_datetime(new_br['last_check'])
new_br.loc[new_br.groupby('serial_num')['last_check'].idxmax()]
[out]
serial_num status last_check
0 123456 Good 2020-03-02
1 678901 Good 2020-03-02
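Two alternatives are sometimes used for the same task. The sketch below reuses the sample data from the question. The variable names latest, is_latest and latest_with_ties are illustrative only. The first option sorts by date and keeps the last row per serial_num. The second compares each row with its group maximum, which is closer to how a window function behaves, and it keeps every tied row if several share the latest date.

```python
import pandas as pd

data = {'serial_num': [123456, 678901, 123456, 678901],
        'status': ['Good', 'Good', 'BAD', 'BAD'],
        'last_check': ['2020-03-02', '2020-03-02', '2020-03-01', '2020-03-01']}
new_br = pd.DataFrame(data)
new_br['last_check'] = pd.to_datetime(new_br['last_check'])

# Option 1: sort by date, then keep only the last (latest) row per serial_num.
latest = (new_br.sort_values('last_check')
                .drop_duplicates('serial_num', keep='last')
                .sort_values('serial_num')
                .reset_index(drop=True))

# Option 2: window-style. transform('max') broadcasts each group's maximum
# date back onto every row, so a boolean mask can select the latest rows.
# Unlike idxmax, this keeps all rows that tie for the latest date.
is_latest = (new_br['last_check']
             == new_br.groupby('serial_num')['last_check'].transform('max'))
latest_with_ties = new_br[is_latest]

print(latest)
print(latest_with_ties)
```

If last_check can contain missing values, it may be worth dropping or filling them first, since rows with NaT never equal the group maximum in the second option.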