How do I count the elements in an array column?

Time: 2017-12-25 20:46:17

Tags: scala apache-spark apache-spark-sql spark-dataframe

I am trying to count the occurrences of each element in the FavouriteCities column of the following DataFrame.

+-----------------+
| FavouriteCities |
+-----------------+
|   [NY, Canada]  |
+-----------------+

The schema is as follows:

scala> data.printSchema
root
|-- FavouriteCities: array (nullable = true)
|    |-- element: string (containsNull = true)

The expected output is:

+------------+-------------+
|  City      |      Count  |
+------------+-------------+
| NY         |      1      |
| Canada     |      1      |
+------------+-------------+

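I tried agg() with count() as shown below, but that does not extract the individual elements from the array; it only counts the rows of the column. I also want to find the most common element in the column.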

data.agg(count("FavouriteCities").alias("count"))

Can someone guide me?

1 Answer:

Answer 0 (score: 2)

To match the schema you have shown:

scala> val data = Seq(Tuple1(Array("NY", "Canada"))).toDF("FavouriteCities")
data: org.apache.spark.sql.DataFrame = [FavouriteCities: array<string>]
scala> data.printSchema
root
 |-- FavouriteCities: array (nullable = true)
 |    |-- element: string (containsNull = true)

Explode the array into one row per element, then group and count:

val counts = data
  // Alias the exploded column (not the input array) so that groupBy("City") resolves
  .select(explode($"FavouriteCities") as "City")
  .groupBy("City")
  .count
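As a self-contained sketch of the same explode-then-count step outside the spark-shell, the snippet below creates its own local SparkSession. The object name `ExplodeCount`, the helper `cityCounts`, and the second sample row are assumptions added for illustration; they are not part of the question.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, explode}

object ExplodeCount {
  // Hypothetical helper: returns one row per distinct city with its total count.
  def cityCounts(data: DataFrame): DataFrame =
    data
      .select(explode(col("FavouriteCities")).as("City")) // one row per array element
      .groupBy("City")
      .count()
      .withColumnRenamed("count", "Count") // match the column name in the expected output

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; in spark-shell `spark` already exists.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("ExplodeCount")
      .getOrCreate()
    import spark.implicits._

    // Assumed sample data: the second row is added to show a count above 1.
    val data = Seq(
      Tuple1(Array("NY", "Canada")),
      Tuple1(Array("NY"))
    ).toDF("FavouriteCities")

    cityCounts(data).show()
    spark.stop()
  }
}
```

With this sample data, NY is counted twice and Canada once; the row order of the result is not guaranteed.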

Then aggregate to find the most common element:

import spark.implicits._
scala> counts.as[(String, Long)].reduce((a, b) => if (a._2 > b._2) a else b)
res3: (String, Long) = (Canada,1)
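Note that the reduce keeps the right-hand pair on ties, so when several cities share the highest count (as NY and Canada do here) the winner depends on row order. A minimal alternative sketch sorts by count and breaks ties alphabetically by city so the result is deterministic. The object name `MostFrequentCity` and the helper `mostFrequent` are hypothetical names introduced for this example.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, desc, explode}

object MostFrequentCity {
  // Hypothetical helper: the (city, count) pair with the highest count,
  // or None when the column contains no array elements at all.
  def mostFrequent(data: DataFrame): Option[(String, Long)] =
    data
      .select(explode(col("FavouriteCities")).as("City"))
      .groupBy("City")
      .count()
      .orderBy(desc("count"), col("City")) // highest count first, ties by city name
      .head(1)                             // Array[Row] with at most one element
      .headOption
      .map(row => (row.getString(0), row.getLong(1)))

  def main(args: Array[String]): Unit = {
    // Local session for illustration only; in spark-shell `spark` already exists.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("MostFrequentCity")
      .getOrCreate()
    import spark.implicits._

    val data = Seq(Tuple1(Array("NY", "Canada"))).toDF("FavouriteCities")
    println(mostFrequent(data))
    spark.stop()
  }
}
```

Sorting the whole result is more work than a single reduce, but the number of distinct cities is usually small compared with the number of input rows, so the trade-off is often acceptable.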