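I have logic that joins several Hive tables to produce the output listed at the end of this post.
However, I need some help. Where the same status ID repeats (for example 5 or 17), I only want to keep the record with the smallest Ser_NO.
The catch is that a status ID can legitimately come back after the status has changed in between. For example, status ID 17 appears again in record 13, and that record should be kept because the process was re-initiated and returned to that status.
So sorting by date, time, and status and then dropping duplicates does not serve my purpose.
I would need to set up a loop that checks whether the status ID has changed compared with the previous record, and filters the record out if the status ID is the same.
The expected output is: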
Ser_NO  ID   ID_NO  STATUS  DESCRIPTION                  initiated_dt  time
1       100  10     5       Initiated                    20180426      000601
3       100  10     15      BM(O) review                 20180426      021424
4       100  10     17      BM(O) & SME Review           20180426      021552
7       100  10     40      Pending BSDA First Approval  20180426      021810
8       100  10     25      Pending Controller approval  20180426      021844
9       100  10     55      Booking SDA Completed        20180426      021917
11      100  10     4       Re-Initiated                 20180426      021944
12      100  10     15      BM(O) review                 20180426      030648
13      100  10     17      BM(O) & SME Review           20180426      030714
14      100  10     40      Pending BSDA First Approval  20180426      030734
16      100  10     25      Pending Controller approval  20180426      030805
17      100  10     55      Booking SDA Completed        20180426      030837
24      100  10     60      Shipping SDA Completed       20180426      031056
25      100  10     55      Booking SDA Completed        20180426      031124
But I wonder whether there is a simpler way to achieve this? For reference, the current output of the joined tables is:
Ser_NO  ID   ID_NO  STATUS  DESCRIPTION                  initiated_dt  time
1       100  10     5       Initiated                    20180426      000601
2       100  10     5       Initiated                    20180426      021408
3       100  10     15      BM(O) review                 20180426      021424
4       100  10     17      BM(O) & SME Review           20180426      021552
5       100  10     17      BM(O) & SME Review           20180426      021621
6       100  10     17      BM(O) & SME Review           20180426      021639
7       100  10     40      Pending BSDA First Approval  20180426      021810
8       100  10     25      Pending Controller approval  20180426      021844
9       100  10     55      Booking SDA Completed        20180426      021917
10      100  10     55      Booking SDA Completed        20180426      021917
11      100  10     4       Re-Initiated                 20180426      021944
12      100  10     15      BM(O) review                 20180426      030648
13      100  10     17      BM(O) & SME Review           20180426      030714
14      100  10     40      Pending BSDA First Approval  20180426      030734
15      100  10     40      Pending BSDA First Approval  20180426      030805
16      100  10     25      Pending Controller approval  20180426      030805
17      100  10     55      Booking SDA Completed        20180426      030837
18      100  10     55      Booking SDA Completed        20180426      030837
24      100  10     60      Shipping SDA Completed       20180426      031056
25      100  10     55      Booking SDA Completed        20180426      031124
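
One possible way to express the "compare with the previous record" check without an explicit loop is a LAG window function, which Hive supports. The sketch below is only an illustration under several assumptions: it runs the query through Python's built-in sqlite3 module as a stand-in for Hive (this needs SQLite 3.25 or newer for window functions), the table name status_history is hypothetical, and it assumes that ID plus ID_NO identify one item and that Ser_NO gives the correct order. Only the columns needed for the check are loaded.

```python
import sqlite3

# (Ser_NO, ID, ID_NO, STATUS) copied from the current output above.
ROWS = [
    (1, 100, 10, 5), (2, 100, 10, 5), (3, 100, 10, 15), (4, 100, 10, 17),
    (5, 100, 10, 17), (6, 100, 10, 17), (7, 100, 10, 40), (8, 100, 10, 25),
    (9, 100, 10, 55), (10, 100, 10, 55), (11, 100, 10, 4), (12, 100, 10, 15),
    (13, 100, 10, 17), (14, 100, 10, 40), (15, 100, 10, 40), (16, 100, 10, 25),
    (17, 100, 10, 55), (18, 100, 10, 55), (24, 100, 10, 60), (25, 100, 10, 55),
]

# Keep a row only when its STATUS differs from the previous row's STATUS
# within the same (ID, ID_NO) group; the first row of a group has no
# previous row (prev_status IS NULL) and is always kept.
# "status_history" is a hypothetical table name used for this sketch.
QUERY = """
SELECT Ser_NO, STATUS
FROM (
    SELECT Ser_NO, STATUS,
           LAG(STATUS) OVER (PARTITION BY ID, ID_NO ORDER BY Ser_NO) AS prev_status
    FROM status_history
) t
WHERE prev_status IS NULL OR prev_status <> STATUS
ORDER BY Ser_NO
"""


def first_of_each_run(rows):
    """Load rows into an in-memory table and return the filtered result."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE status_history "
        "(Ser_NO INTEGER, ID INTEGER, ID_NO INTEGER, STATUS INTEGER)"
    )
    conn.executemany("INSERT INTO status_history VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


kept = [ser_no for ser_no, _status in first_of_each_run(ROWS)]
print(kept)
```

Ordering by Ser_NO rather than by initiated_dt and time is a deliberate choice in this sketch, because records 15 and 16 share the same timestamp (030805) and a time-only ordering would be ambiguous there. On the sample data the kept Ser_NO values match the expected output listed earlier.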