I am trying to count the elements of the FavouriteCities
column in the following DataFrame.
+-----------------+
| FavouriteCities |
+-----------------+
| [NY, Canada] |
+-----------------+
The schema is as follows:
scala> data.printSchema
root
|-- FavouriteCities: array (nullable = true)
| |-- element: string (containsNull = true)
The expected output should be:
+------------+-------------+
| City | Count |
+------------+-------------+
| NY | 1 |
| Canada | 1 |
+------------+-------------+
I have tried using agg() and count() as shown below, but this does not extract the individual elements from the arrays; it treats each array as a single value and simply counts the rows.
data.agg(count("FavouriteCities").alias("count"))
Could someone guide me?
Answer 0 (score: 2)
To match the schema you have shown:
scala> val data = Seq(Tuple1(Array("NY", "Canada"))).toDF("FavouriteCities")
data: org.apache.spark.sql.DataFrame = [FavouriteCities: array<string>]
scala> data.printSchema
root
|-- FavouriteCities: array (nullable = true)
| |-- element: string (containsNull = true)
Explode the array so each city becomes its own row, then group and count:
val counts = data
  .select(explode($"FavouriteCities") as "City")  // alias the exploded output, not the input column
  .groupBy("City")
  .count()
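For reference, here is a self-contained sketch of the whole pipeline outside the shell, assuming a local SparkSession (the app name and `local[*]` master are illustrative choices, not from the original):

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.explode

val spark = SparkSession.builder()
  .appName("city-counts")   // hypothetical app name
  .master("local[*]")
  .getOrCreate()
import spark.implicits._

val data = Seq(Tuple1(Array("NY", "Canada"))).toDF("FavouriteCities")

// explode turns each array element into its own row, so groupBy/count
// tallies individual cities rather than whole arrays
val counts = data
  .select(explode($"FavouriteCities") as "City")
  .groupBy("City")
  .count()

counts.show()  // one row per city; note the column is named "count", and row order is not guaranteed
```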
And aggregate to find the most frequent element:
import spark.implicits._
scala> counts.as[(String, Long)].reduce((a, b) => if (a._2 > b._2) a else b)
res3: (String, Long) = (Canada,1)