I have a PySpark DataFrame mydf and am grouping by two columns (code and col1). For each code I want to keep only the col1 value with the highest distinct count of a third column (newid).
Eg: mydf
code col1 newid
100 MNO 1
100 MNO 2
230 LLL 3
245 TTE 4
230 LLL 5
230 LIO 6
100 FGH 7
Expected Result:
code col1 count(distinct newid)
100 MNO 2
230 LLL 2
245 TTE 1
Current results using the code below:
from pyspark.sql.functions import count, countDistinct, desc

mydf.groupBy("code", "col1").agg(count("newid"), countDistinct("newid")) \
    .orderBy(desc("count(DISTINCT newid)"))  # sort the aggregated DataFrame, not the Column
code col1 count(DISTINCT newid)
100 MNO 2
230 LLL 2
245 TTE 1
100 FGH 1
230 LIO 1
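
One direction I am considering to get from the current result to the expected one (just a minimal sketch, assuming a window function partitioned by code is acceptable; distinct_newid and rn are illustrative names I chose, not from the original code):

from pyspark.sql import Window
from pyspark.sql.functions import col, countDistinct, row_number

# Aggregate the distinct counts per (code, col1) pair
agg_df = mydf.groupBy("code", "col1") \
    .agg(countDistinct("newid").alias("distinct_newid"))

# Rank col1 values within each code by their distinct count, highest first
w = Window.partitionBy("code").orderBy(col("distinct_newid").desc())

# Keep only the top-ranked col1 per code, then drop the helper column
result = agg_df.withColumn("rn", row_number().over(w)) \
    .filter(col("rn") == 1) \
    .drop("rn")

result.show()

Note that row_number() breaks ties arbitrarily, so if two col1 values share the same highest distinct count for a code, only one of them would be kept.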