I have a DataFrame like the one shown below, and I want to apply the SQL logic given further down to it.
df.head(25)
ORDER_ID CODE STATUS_DATE RNK
19837715 0400 22/10/19 08:11:08.000000000 AM GMT 2
19837715 0400 22/10/19 10:00:03.000000000 AM GMT 1
19837715 0400 22/10/19 10:47:08.000000000 AM GMT 3
19837715 0500 22/10/19 10:00:00.000000000 AM GMT 1
19837715 1100 01/11/19 10:02:00.000000000 AM GMT 1
19837715 1240 02/11/19 08:00:00.000000000 AM GMT 1
19837833 0400 22/10/19 08:13:09.000000000 AM GMT 3
19837833 0400 22/10/19 08:22:09.000000000 AM GMT 4
19837833 0400 23/10/19 04:30:10.000000000 AM GMT 1
19837833 0400 23/10/19 09:30:07.000000000 PM GMT 2
19837833 0500 23/10/19 01:08:00.000000000 AM GMT 1
19837833 0500 23/10/19 04:30:00.000000000 AM GMT 3
19840750 0500 23/10/19 12:30:00.000000000 PM GMT 1
19840750 1100 01/11/19 10:06:02.000000000 AM GMT 1
19840750 1240 02/11/19 08:40:05.000000000 AM GMT 1
19840750 1305 05/11/19 07:21:03.000000000 AM GMT 2
19840750 1305 05/11/19 08:22:03.000000000 AM GMT 1
19840750 1400 09/11/19 06:13:12.000000000 AM GMT 3
I want to apply the following SQL logic to this DataFrame.
select
order_id
, TRUNC(MAX(decode(df.code, '0400', STATUS_DATE, Null))) act_0400
, TRUNC(MAX(decode(df.code, '0500', STATUS_DATE, Null))) act_0500
from
dataframe df
where
df.rnk =1
group by
order_id
Here I am trying to create the new columns act_0400 and act_0500 by taking the maximum STATUS_DATE among the rows where RNK = 1, grouped by ORDER_ID.
Expected output
ORDER_ID ACT_0400 ACT_0500
19837715 22/10/2019 22/10/2019
19837833 23/10/2019 23/10/2019
19840750 23/10/2019
How can I do this in pandas?
Answer 0 (score: 2)
Here is one approach:
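# assumes CODE is stored as integers (leading zeros dropped)
# and STATUS_DATE has already been converted to dates
codes = [400, 500]
df1 = (df
       .query("CODE in @codes and RNK == 1")
       .groupby(['ORDER_ID', 'CODE'])['STATUS_DATE']
       # one row per (ORDER_ID, CODE) is expected once RNK == 1;
       # use .max() instead to mirror the SQL MAX if there can be several
       .first()
       .unstack())
# fix column names
df1.columns.name = None
df1 = df1.add_prefix('ACT_').reset_index()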
ORDER_ID ACT_400 ACT_500
0 19837715 2019-10-22 2019-10-22
1 19837833 2019-10-23 2019-10-23
2 19840750 NaN 2019-10-23
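For anyone who wants to run this end to end, here is a minimal self-contained sketch of the same idea. It uses a small hand-built subset of the question's rows, and it assumes that CODE is stored as integers and that STATUS_DATE is already a datetime column. It also swaps in max for first to follow the SQL MAX more literally, and uses dt.normalize() as a rough stand-in for TRUNC. The names df and result are only illustrative.

```python
import pandas as pd

# Hand-built subset of the question's data (assumption: CODE is an integer
# column and STATUS_DATE has already been parsed to datetimes).
df = pd.DataFrame({
    "ORDER_ID": [19837715, 19837715, 19837715, 19837833, 19837833, 19840750, 19840750],
    "CODE": [400, 400, 500, 400, 500, 500, 1100],
    "STATUS_DATE": pd.to_datetime([
        "2019-10-22 08:11:08", "2019-10-22 10:00:03", "2019-10-22 10:00:00",
        "2019-10-23 04:30:10", "2019-10-23 01:08:00", "2019-10-23 12:30:00",
        "2019-11-01 10:06:02",
    ]),
    "RNK": [2, 1, 1, 1, 1, 1, 1],
})

codes = [400, 500]
result = (
    df[df["RNK"].eq(1) & df["CODE"].isin(codes)]                    # WHERE rnk = 1, codes of interest
    .assign(STATUS_DATE=lambda d: d["STATUS_DATE"].dt.normalize())  # drop the time part, like TRUNC
    .groupby(["ORDER_ID", "CODE"])["STATUS_DATE"]
    .max()                                                          # MAX(...) per order and code
    .unstack()                                                      # one column per code
    .add_prefix("ACT_")
    .rename_axis(None, axis=1)
    .reset_index()
)
print(result)
```

Orders that have no RNK == 1 row for a given code end up with a missing value in that column, which corresponds to the NULL the SQL would return.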
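Answer 1 (score: 2)
First convert STATUS_DATE to datetimes with to_datetime and keep only the date part with Series.dt.date, then filter by boolean indexing with Series.isin, reshape with DataFrame.pivot_table using the max aggregation, and finally clean up the result with DataFrame.rename_axis, DataFrame.add_prefix and DataFrame.reset_index: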
# dayfirst=True so that e.g. 01/11/19 is read as 1 November, not 11 January
df['STATUS_DATE'] = pd.to_datetime(df['STATUS_DATE'], dayfirst=True).dt.date
df = (df[(df['RNK'] == 1) & df['CODE'].isin([400, 500])]
      .pivot_table(index="ORDER_ID", columns="CODE", values="STATUS_DATE", aggfunc='max')
      .rename_axis(None, axis=1)
      .add_prefix('ACT_')
      .reset_index())
print(df)
ORDER_ID ACT_400 ACT_500
0 19837715 2019-10-22 2019-10-22
1 19837833 2019-10-23 2019-10-23
2 19840750 NaN 2019-10-23
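If automatic parsing of the STATUS_DATE strings turns out to be unreliable, one option is to pass an explicit format. The sketch below assumes every value looks exactly like the samples in the question (day/month/two-digit year, 12-hour clock, nine fractional digits, a literal "GMT" suffix). The names raw, fmt, parsed and dates_only are only illustrative.

```python
import pandas as pd

# A few values copied from the question's STATUS_DATE column.
raw = pd.Series([
    "22/10/19 10:00:03.000000000 AM GMT",
    "01/11/19 10:02:00.000000000 AM GMT",
    "23/10/19 09:30:07.000000000 PM GMT",
])

# Assumed layout: day/month/year, 12-hour clock with AM/PM, fractional
# seconds, then the literal text "GMT" (matched as plain characters).
fmt = "%d/%m/%y %I:%M:%S.%f %p GMT"
parsed = pd.to_datetime(raw, format=fmt)

# Keep the datetime dtype but set the time to midnight, similar to TRUNC.
dates_only = parsed.dt.normalize()
print(dates_only)
```

Keeping a datetime column via dt.normalize(), rather than Python date objects via dt.date, leaves the usual datetime accessors and aggregations available on the result.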
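Answer 2 (score: 1)
You can do the following:
# each comparison needs its own parentheses, because & and | bind more tightly than ==
a = df.loc[(df['RNK'] == 1) & ((df['CODE'] == 400) | (df['CODE'] == 500))]
a.pivot(index="ORDER_ID", columns="CODE", values="STATUS_DATE").add_prefix('ACT_').reset_index().rename_axis(None, axis=1)
Output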
ORDER_ID ACT_400 ACT_500
0 19837715 22/10/19 22/10/19
1 19837833 23/10/19 23/10/19
2 19840750 NaN 23/10/19
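A note on filters of this kind: in Python, & and | bind more tightly than ==, so every comparison in a combined mask has to be wrapped in its own parentheses. The sketch below uses a tiny made-up frame (not the question's data) to show a fully parenthesised mask next to an equivalent isin version. The names mask, same and filtered are only illustrative.

```python
import pandas as pd

# Tiny illustrative frame; values are made up for the demonstration.
df = pd.DataFrame({"CODE": [400, 500, 400, 1100], "RNK": [1, 1, 2, 1]})

# Fully parenthesised: RNK must be 1 AND the code must be 400 or 500.
mask = (df["RNK"] == 1) & ((df["CODE"] == 400) | (df["CODE"] == 500))

# Equivalent and shorter: isin avoids the chain of | comparisons.
same = (df["RNK"] == 1) & df["CODE"].isin([400, 500])

filtered = df.loc[mask]
print(filtered)
```

Also keep in mind that DataFrame.pivot raises a ValueError if any (ORDER_ID, CODE) pair occurs more than once after filtering, whereas DataFrame.pivot_table aggregates the duplicates.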