How do I print the keys and values separately from Scala mapValues output?

Date: 2019-01-31 03:36:27

Tags: scala apache-spark

I have the following Scala code:

    val aryDF = Seq(("g3,g4", Array("D2,D3,D1", "D2,D5,D1"))).toDF("v1", "v2")
    addresses.toSeq.flatMap(s => s.split(",")).groupBy(identity).mapValues(_.size)

It produces this output:

    [D2 -> 2, D5 -> 1, D1 -> 2, D3 -> 1]

But I am trying to print the keys and values separately, as an array [String, String] in the following format:

    [D2, D5, D1, D3][2, 1, 2, 1]

I tried the following, but it outputs a plain string. How can I convert it to an array [String, String]? Below is the UDF I wrote:

    val countAddresses = udf((addresses: Seq[String]) => {
      val mp = addresses.flatMap(s => s.split(",")).groupBy(identity).mapValues(_.size)
      mp.keySet.mkString("[", ", ", "]") ++ mp.values.mkString("[", ",", "]")
    })

    val df2 = aryDF.withColumn("output", countAddresses($"v2"))

1 Answer:

Answer 0 (score: 1)

Here's one approach that generates an ArrayType column whose elements are the assembled key string and value string:

    import org.apache.spark.sql.functions._

    val aryDF = Seq(
      ("g3,g4", Array("D2,D3,D1", "D2,D5,D1"))
    ).toDF("v1", "v2")

    val countAddresses = udf( (addresses: Seq[String]) => {
        val mp = addresses.flatMap(_.split(",")).groupBy(identity).mapValues(_.size)
        Array(mp.keys.mkString("[", ", ", "]"), mp.values.mkString("[", ", ", "]"))
      }
    )

    val df2 = aryDF.withColumn("output", countAddresses($"v2"))

    df2.show(false)
    // +-----+--------------------+--------------------------------+
    // |v1   |v2                  |output                          |
    // +-----+--------------------+--------------------------------+
    // |g3,g4|[D2,D3,D1, D2,D5,D1]|[[D2, D5, D1, D3], [2, 1, 2, 1]]|
    // +-----+--------------------+--------------------------------+
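If you want to sanity-check the counting logic by itself, here is a minimal standalone sketch (no Spark needed; the sample `addresses` value is assumed to stand in for one row of the `v2` column). It uses an eager `map` in place of `mapValues`, since `mapValues` returns a lazy view, which is a common source of surprises when the result is captured inside a Spark UDF closure:

```scala
// One row's worth of the v2 column (hypothetical sample data).
val addresses = Seq("D2,D3,D1", "D2,D5,D1")

// Split each string on commas, group identical tokens, and count them.
// An eager .map { case ... } materializes a plain Map[String, Int],
// unlike mapValues, which produces a lazy view.
val mp: Map[String, Int] =
  addresses.flatMap(_.split(",")).groupBy(identity).map { case (k, v) => k -> v.size }

// Two elements: a bracketed key string and a bracketed count string.
val output = Array(
  mp.keys.mkString("[", ", ", "]"),
  mp.values.mkString("[", ", ", "]")
)
```

Note that `groupBy` builds an unordered `Map`, so the key order inside the brackets is not guaranteed; if you need a stable order, sort the entries before calling `mkString`.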