I have a dataset like this:
list1 list2
a e
a e
b w
a e
a r
b c
I want to find the most frequent item in list2 for each group of list1:
list1 list2 max
a e 3
b w 1
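How can I get rid of elements with tied counts, such as b,w,1 and b,c,1? I want to keep just one of them, chosen at random.
I tried something like this: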
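# Count each (list1, list2) pair
qry1 = spark.sql("SELECT list1 AS clf1, list2, count(list2) AS value_count FROM table GROUP BY list1, list2")
qry1.createOrReplaceTempView("try1")
# Highest count per list1 group (max() instead of first(), which depends on row order)
qry2 = spark.sql("SELECT clf1 AS clf2, max(value_count) AS max_value FROM try1 GROUP BY clf1")
qry2.createOrReplaceTempView("try2")
# Keep rows that reach their group's maximum; tied rows (b,w,1 and b,c,1) both survive
qry3 = qry1.join(qry2, (qry1["clf1"] == qry2["clf2"]) & (qry1["value_count"] == qry2["max_value"]), "inner")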
Answer 0 (score: 0)
I'm not sure what output you are looking for. You could also try this query:
{{1}}
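As a separate illustration (not the answerer's query), one common way to keep exactly one row per group is to rank the counts with the ROW_NUMBER() window function and keep rank 1, breaking ties randomly. The sketch below runs the SQL against an in-memory SQLite database so that it is self-contained. It assumes SQLite 3.25 or newer for window-function support, and the table name `items` and the helper `most_frequent_per_group` are placeholders. Spark SQL also supports ROW_NUMBER() OVER (PARTITION BY ... ORDER BY ...), so a similar query could probably be passed to spark.sql, with rand() in place of SQLite's random().

```python
import sqlite3

# Sample rows from the question; the table name "items" is a placeholder.
rows = [("a", "e"), ("a", "e"), ("b", "w"), ("a", "e"), ("a", "r"), ("b", "c")]

# Count each (list1, list2) pair, then rank the counts inside each list1 group.
# random() breaks ties arbitrarily in SQLite; Spark SQL's counterpart is rand().
QUERY = """
SELECT list1, list2, cnt
FROM (
    SELECT list1, list2, cnt,
           ROW_NUMBER() OVER (PARTITION BY list1 ORDER BY cnt DESC, random()) AS rn
    FROM (
        SELECT list1, list2, COUNT(*) AS cnt
        FROM items
        GROUP BY list1, list2
    ) AS counted
) AS ranked
WHERE rn = 1
ORDER BY list1
"""


def most_frequent_per_group(pairs):
    """Return one (list1, list2, count) row per distinct list1 value."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute("CREATE TABLE items (list1 TEXT, list2 TEXT)")
        conn.executemany("INSERT INTO items VALUES (?, ?)", pairs)
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


result = most_frequent_per_group(rows)
print(result)
```

For the sample data this yields a,e,3 for group a and either b,w,1 or b,c,1 for group b, which one being decided by the random tie-break.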