PySpark grouping by 2 columns to get top 1 count per group

Time: 2018-06-04 17:39:52

Tags: python pandas pyspark pyspark-sql

I have a PySpark DataFrame mydf and am grouping by two columns (code and col1). For each code, I want to keep only the row whose group has the highest distinct count of a third column (newid).

Eg: mydf

code   col1   newid
100     MNO      1
100     MNO      2
230     LLL      3
245     TTE      4
230     LLL      5
230     LIO      6
100     FGH      7

Expected Result:

code   col1   count(distinct newid)
100     MNO      2
230     LLL      2
245     TTE      1

Current results using the code below:

from pyspark.sql.functions import count, countDistinct, desc

mydf.groupBy("code", "col1") \
    .agg(count("newid"), countDistinct("newid")) \
    .orderBy(desc("count(DISTINCT newid)"))

code   col1   count(distinct newid)
100     MNO      2
230     LLL      2
245     TTE      1
100     FGH      1
230     LIO      1

0 Answers:

No answers