I got the output below from the following Scala code:
val aryDF = Seq(("g3,g4", Array("D2,D3,D1", "D2,D5,D1"))).toDF("v1", "v2")
addresses.toSeq.flatMap(s => s.split(",")).groupBy(identity).mapValues(_.size)
The output looks like this:
[D2 -> 2, D5 -> 1, D1 -> 2, D3 -> 1]
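For reference, that counting idiom can be checked on plain Scala collections, without Spark. A minimal sketch using the sample values from the question (`.toMap` is added so the result is a strict `Map` even on Scala 2.13, where `mapValues` returns a lazy view):

```scala
// Count comma-separated IDs across an array of strings, no Spark needed.
val addresses = Array("D2,D3,D1", "D2,D5,D1")
val counts = addresses.toSeq
  .flatMap(_.split(","))  // flatten to individual IDs: D2, D3, D1, D2, D5, D1
  .groupBy(identity)      // group equal IDs together
  .mapValues(_.size)      // count each group
  .toMap                  // force a strict Map (Scala 2.13 returns a MapView)
// counts == Map("D2" -> 2, "D3" -> 1, "D1" -> 2, "D5" -> 1)
```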
but I am trying to print the keys and values into an array [String, String] in the following format:
[D2, D5, D1, D3][2, 1, 2, 1]
I tried this, but it outputs a String. How can I convert it to an array [String, String]? Below is the udf I wrote:
val countAddresses = udf((addresses: Seq[String]) => {
  val mp = addresses.toSeq.flatMap(s => s.split(",")).groupBy(identity).mapValues(_.size)
  mp.keySet.mkString("[", ", ", "]") ++ mp.values.mkString("[", ",", "]")
})
val df2 = aryDF.withColumn("output", countAddresses($"v2"))
Answer 0 (score: 1)
Here is one way to generate an ArrayType column whose elements are the concatenated key string and value string:
import org.apache.spark.sql.functions._
val aryDF = Seq(
  ("g3,g4", Array("D2,D3,D1", "D2,D5,D1"))
).toDF("v1", "v2")
val countAddresses = udf((addresses: Seq[String]) => {
  val mp = addresses.flatMap(_.split(",")).groupBy(identity).mapValues(_.size)
  Array(mp.keys.mkString("[", ", ", "]"), mp.values.mkString("[", ", ", "]"))
})
val df2 = aryDF.withColumn("output", countAddresses($"v2"))
df2.show(false)
// +-----+--------------------+--------------------------------+
// |v1 |v2 |output |
// +-----+--------------------+--------------------------------+
// |g3,g4|[D2,D3,D1, D2,D5,D1]|[[D2, D5, D1, D3], [2, 1, 2, 1]]|
// +-----+--------------------+--------------------------------+
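As a side note, the body of the udf can be exercised as a plain function before wrapping it in `udf(...)`. A sketch of the same logic (note that `groupBy` returns an unordered `Map`, so the key order inside the two strings is not guaranteed, only the pairing of each key with its count):

```scala
// The counting-and-formatting logic from the answer, as a plain function.
// Key order in the output is unspecified because Map is unordered.
def countAddresses(addresses: Seq[String]): Array[String] = {
  val mp = addresses.flatMap(_.split(",")).groupBy(identity).mapValues(_.size)
  Array(mp.keys.mkString("[", ", ", "]"), mp.values.mkString("[", ", ", "]"))
}

val out = countAddresses(Seq("D2,D3,D1", "D2,D5,D1"))
// out(0) is the bracketed key list, e.g. "[D2, D5, D1, D3]",
// out(1) the matching counts, e.g. "[2, 1, 2, 1]"
```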