Scala: concatenate a column of Array[String] into a single Array[String]

Asked: 2018-12-12 05:37:13

Tags: scala apache-spark dataframe data-science

I have a Spark DataFrame (Scala) with two columns, id (Int) and tokens (array<string>):

id,tokens
0,["a","b","c"]
1,["a","b"]
...

Suppose I can retrieve the data through a SparkSession and convert it to a case class:

case class Token(id: Int, tokens: Array[String])

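Once I have a Dataset[Token], how can I concatenate all the token arrays into a single Array[String] and then count the occurrences of each string to find the most frequent ones?

Expected output: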

a,2
b,2
c,1
...

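1 answer:

Answer 0 (score: 2)

You need to explode the token column so that each array element becomes its own row, then group by the individual tokens and count: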

scala> val input = sc.parallelize(List(
  (0, Array("a","b","c")), 
  (1, Array("a","b"))
)).toDF("id","token")

scala> input.withColumn("token_split",explode($"token"))
         .groupBy($"token_split")
         .agg(count($"id") as "count")
         .orderBy($"count".desc)
         .show

Output:

+-----------+-----+
|token_split|count|
+-----------+-----+
|          b|    2|
|          a|    2|
|          c|    1|
+-----------+-----+
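
Since the question starts from a typed Dataset[Token] rather than a DataFrame, the same count can also be sketched with flatMap, which flattens every tokens array into a single Dataset[String] before grouping. This is only a minimal sketch under some assumptions: the object name TokenCounts, the helper countTokens, and the local[*] master are illustrative and not part of the original answer, and the secondary sort on the token is added here only to make the order of ties stable.

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Top-level case class (as in the question) so Spark can derive an encoder for it.
case class Token(id: Int, tokens: Array[String])

// Hypothetical wrapper object, used here only for illustration.
object TokenCounts {

  // Flatten all token arrays into one Dataset[String], then count each token.
  def countTokens(ds: Dataset[Token]): Array[(String, Long)] = {
    val spark = ds.sparkSession
    import spark.implicits._

    ds.flatMap(_.tokens.toSeq)            // Dataset[String]; its single column is named "value"
      .groupBy($"value")
      .count()                            // adds a "count" column of type Long
      .orderBy($"count".desc, $"value")   // most frequent first; ties broken by token
      .as[(String, Long)]
      .collect()
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session for demonstration purposes.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("token-counts")
      .getOrCreate()
    import spark.implicits._

    val ds = Seq(
      Token(0, Array("a", "b", "c")),
      Token(1, Array("a", "b"))
    ).toDS()

    // Prints one "token,count" line per distinct token: a,2 then b,2 then c,1
    countTokens(ds).foreach { case (token, n) => println(s"$token,$n") }

    spark.stop()
  }
}
```

Collecting to the driver is only reasonable when the number of distinct tokens is small. For a large vocabulary it is probably better to keep the result as a Dataset and use show or limit instead of collect.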