Scala: group by on a list-type Cassandra table column

Time: 2018-02-15 13:41:22

Tags: scala list cassandra

I want to apply a group by on top_places, which is a list column.

tenant_id | device_id | top_places
-----------+-----------+------------
        T1 |        D2 | ['F', 'D']
        T1 |        D3 | ['F', 'D']
        T1 |        D4 | ['G', 'D']
        T1 |        D5 | ['G', 'Q']
        T1 |        D6 | ['A', 'F']

This is the result I get when I run the following Scala snippet:

val results = rows.groupBy("top_places").agg(Map("*"->"count")).withColumnRenamed("COUNT(1)","Total").select("top_places","Total").orderBy("Total")

[List(G, D),1]                                                                  
[List(A, F),1]
[List(G, Q),1]
[List(F, D),2]

What I need is a count per individual place, as shown below. How can I get that?

[A,1]
[G,2]
[F,2]
[D,2]
[Q,1]

1 answer:

Answer 0: (score: 0)

You're almost there. Just flatten top_places with explode() first, so that each list element becomes its own row before grouping.

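// Assumes a SparkSession named `spark`, as in spark-shell
import org.apache.spark.sql.functions.explode
import spark.implicits._  // enables toDF and the $"col" syntax

val rows = Seq(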
  ("T1", "D2", Seq("F", "D")),
  ("T1", "D3", Seq("F", "D")),
  ("T1", "D4", Seq("G", "D")),
  ("T1", "D5", Seq("G", "Q")),
  ("T1", "D6", Seq("A", "F"))
).toDF("tenant_id", "device_id", "top_places")

// explode() emits one row per list element, so each place is counted on its own
rows.withColumn("top_place", explode($"top_places")).
  groupBy("top_place").agg(Map("*"->"count")).
  withColumnRenamed("COUNT(1)","total").
  orderBy("total").
  show

// +---------+-----+                                                               
// |top_place|total|
// +---------+-----+
// |        Q|    1|
// |        A|    1|
// |        G|    2|
// |        F|    3|
// |        D|    3|
// +---------+-----+

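You can also replace agg(Map("*"->"count")) with agg(count(...)) and name the result column directly with as(), which makes the rename step unnecessary: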

import org.apache.spark.sql.functions.{count, explode}

rows.withColumn("top_place", explode($"top_places")).
  groupBy("top_place").agg(count("top_place").as("total")).
  orderBy("total").
  show
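
To see what explode() followed by groupBy computes, the same logic can be sketched with plain Scala collections, with no Spark involved. This is only an illustration under stated assumptions: the object name ExplodeSketch and the tie-breaking sort on the place name are made up for this example, and flatMap stands in for explode().

```scala
// Plain Scala collections, no Spark required.
// ExplodeSketch is a hypothetical name used only for this illustration.
object ExplodeSketch {
  // Same sample rows as in the answer: (tenant_id, device_id, top_places)
  val rows: Seq[(String, String, Seq[String])] = Seq(
    ("T1", "D2", Seq("F", "D")),
    ("T1", "D3", Seq("F", "D")),
    ("T1", "D4", Seq("G", "D")),
    ("T1", "D5", Seq("G", "Q")),
    ("T1", "D6", Seq("A", "F"))
  )

  // The "explode" step: flatten every list into one element per place.
  val exploded: Seq[String] = rows.flatMap { case (_, _, places) => places }

  // Group identical places, count each group, then sort by count
  // (ties broken by place name, an assumption made for stable output).
  val totals: Seq[(String, Int)] =
    exploded
      .groupBy(identity)
      .map { case (place, hits) => place -> hits.size }
      .toSeq
      .sortBy { case (place, total) => (total, place) }

  def main(args: Array[String]): Unit =
    totals.foreach { case (place, total) => println(s"[$place,$total]") }
}
```

Grouping on the list column itself, as in the question, treats each whole list as one key, which is why List(F, D) was counted as a single value there.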