I want to apply a group by on top_places (a list column).
tenant_id | device_id | top_places
-----------+-----------+------------
T1 | D2 | ['F', 'D']
T1 | D3 | ['F', 'D']
T1 | D4 | ['G', 'D']
T1 | D5 | ['G', 'Q']
T1 | D6 | ['A', 'F']
This is the result I get after running the following Scala snippet:
val results = rows.groupBy("top_places").agg(Map("*"->"count")).withColumnRenamed("COUNT(1)","Total").select("top_places","Total" ).orderBy("Total");
[List(G, D),1]
[List(A, F),1]
[List(G, Q),1]
[List(F, D),2]
What I need is a count per individual place, like the following. How can I get that?
[A,1]
[G,2]
[F,2]
[D,2]
[Q,1]
Answer 0 (score: 0)
You're almost there. Just flatten top_places with explode() first:
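// explode lives in sql.functions; toDF and $ need the session implicits (already in scope in spark-shell)
import org.apache.spark.sql.functions.explode
val rows = Seq(
("T1", "D2", Seq("F", "D")),
("T1", "D3", Seq("F", "D")),
("T1", "D4", Seq("G", "D")),
("T1", "D5", Seq("G", "Q")),
("T1", "D6", Seq("A", "F"))
).toDF("tenant_id", "device_id", "top_places")
// one output row per array element, then count the rows for each place
rows.withColumn("top_place", explode($"top_places")).
groupBy("top_place").agg(Map("*"->"count")).
withColumnRenamed("COUNT(1)","total").
orderBy("total").
show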
// +---------+-----+
// |top_place|total|
// +---------+-----+
// | Q| 1|
// | A| 1|
// | G| 2|
// | F| 3|
// | D| 3|
// +---------+-----+
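To see why flattening first changes the result, here is a rough analogue using plain Scala collections, with no Spark involved. It is only a sketch of the idea: flatten plays the role of explode(), and groupBy(identity) plays the role of groupBy("top_place"). The object name ExplodeSketch and the helper countPlaces are made up for this illustration.

```scala
object ExplodeSketch {
  // Hypothetical helper: counts how often each place occurs across all rows.
  def countPlaces(rows: Seq[Seq[String]]): Seq[(String, Int)] = {
    // Analogue of explode(): one element per place instead of one list per row.
    val exploded: Seq[String] = rows.flatten
    // Analogue of groupBy + count: group equal places and take each group's size.
    val totals: Map[String, Int] =
      exploded.groupBy(identity).map { case (place, hits) => place -> hits.size }
    // Sort by count, then by name so that ties come out in a stable order.
    totals.toSeq.sortBy { case (place, n) => (n, place) }
  }

  def main(args: Array[String]): Unit = {
    // Each inner Seq stands for one row's top_places value from the question.
    val topPlaces = Seq(
      Seq("F", "D"), Seq("F", "D"), Seq("G", "D"), Seq("G", "Q"), Seq("A", "F"))
    countPlaces(topPlaces).foreach(println)
  }
}
```

Grouping on the list itself compares whole lists, which is why the original query only merged the two identical ['F', 'D'] rows. After flattening, F and D each occur in three rows, so both get a count of 3.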
You can also replace agg(Map("*"->"count")) with agg(count(...)), which lets you name the result column directly:
rows.withColumn("top_place", explode($"top_places")).
groupBy("top_place").agg(count("top_place").as("total")).
orderBy("total")
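If you would rather not name or rename the count column at all, groupBy(...).count() is another option. Below is a self-contained sketch, assuming Spark 2.x or later (it uses SparkSession) running in local mode. The object name TopPlaceCounts and the helper countPlaces are made up for this illustration, and the result column is called count, which is the name Spark gives it.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

object TopPlaceCounts {
  // Hypothetical helper: expects an array<string> column named top_places.
  def countPlaces(df: DataFrame): DataFrame =
    df.select(explode(col("top_places")).as("top_place")) // one row per place
      .groupBy("top_place")
      .count()                                            // adds a column named "count"
      .orderBy(col("count"), col("top_place"))            // ties broken by place name

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough for trying this out.
    val spark = SparkSession.builder()
      .appName("TopPlaceCounts")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val rows = Seq(
      ("T1", "D2", Seq("F", "D")),
      ("T1", "D3", Seq("F", "D")),
      ("T1", "D4", Seq("G", "D")),
      ("T1", "D5", Seq("G", "Q")),
      ("T1", "D6", Seq("A", "F"))
    ).toDF("tenant_id", "device_id", "top_places")

    countPlaces(rows).show()
    spark.stop()
  }
}
```

Using select instead of withColumn drops the columns that the aggregation does not need, and the second sort key makes the order of tied counts deterministic.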