How do I concatenate columns conditionally in a Spark DataFrame?

Asked: 2018-05-05 10:25:01

Tags: apache-spark apache-spark-sql

For example, given the following original DataFrame:

+--------+--------+
|    col1|    col2|
+--------+--------+
|    null|       A|
|       B|    null|
|       C|       D|
|    null|    null|
+--------+--------+

I want to concatenate col1 and col2 to obtain this DataFrame:

+--------+--------+-------------------+
|    col1|    col2|               col3|
+--------+--------+-------------------+
|    null|       A|         "{col2:A}"|
|       B|    null|         "{col1:B}"|
|       C|       D| "{col1:C, col2:D}"|
|    null|    null|               "{}"|
+--------+--------+-------------------+

The new col3 is built by concatenating the non-null values of col1 and col2, and is of string type. How do I add a null check to the concat method?
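(For context: a naive `concat` does not work here, because Spark SQL's `concat` returns null as soon as any of its inputs is null. A sketch of the failing attempt, assuming the DataFrame shown above:)

```scala
import org.apache.spark.sql.functions._

// concat yields null whenever ANY input is null, so three of the
// four rows would get a null col3 instead of a partial result:
df.withColumn("col3",
  concat(lit("{col1:"), col("col1"), lit(", col2:"), col("col2"), lit("}"))
).show
```

This is why the concatenation has to be made conditional per column.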

1 Answer:

Answer 0 (score: 2)

You can combine the columns into an array:

import org.apache.spark.sql.functions._
import spark.implicits._  // required for toDF

val df = Seq((null, "A"), ("B", null), ("C", "D"), (null, null)).toDF("colA", "colB")

val cols = array(df.columns.map(c =>
  // If the column is not null, render it as "name:value"; otherwise yield null
  when(col(c).isNotNull, concat_ws(":", lit(c), col(c)))
): _*)

and combine them with a UserDefinedFunction:

val combine = udf((xs: Seq[String]) => {
  // Drop the null entries, join the rest, and wrap in braces
  val tmp = xs.filter(_ != null).mkString(",")
  s"{$tmp}"
})
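The lambda inside the udf can be sanity-checked in plain Scala, with no Spark session needed (`combineLogic` is just an illustrative name for the same function body):

```scala
// Same logic as the UDF body, as a plain Scala function
val combineLogic = (xs: Seq[String]) => {
  val tmp = xs.filter(_ != null).mkString(",")
  s"{$tmp}"
}

println(combineLogic(Seq(null, "colB:A")))      // {colB:A}
println(combineLogic(Seq("colA:C", "colB:D"))) // {colA:C,colB:D}
println(combineLogic(Seq(null, null)))         // {}
```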

df.withColumn("col3", combine(cols)).show
// +----+----+---------------+
// |colA|colB|           col3|
// +----+----+---------------+
// |null|   A|       {colB:A}|
// |   B|null|       {colA:B}|
// |   C|   D|{colA:C,colB:D}|
// |null|null|             {}|
// +----+----+---------------+
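For completeness, the same result can also be produced without a UDF, relying on the fact that `concat_ws` skips null inputs. A sketch, assuming the same `df` and per-column `when` expressions as above:

```scala
import org.apache.spark.sql.functions._

// concat_ws drops null entries, so no explicit filtering is needed
val entries = df.columns.map(c =>
  when(col(c).isNotNull, concat_ws(":", lit(c), col(c)))
)

df.withColumn("col3",
  concat(lit("{"), concat_ws(",", entries: _*), lit("}"))
).show
```

Keeping everything in built-in column expressions like this avoids the serialization overhead of a UDF and stays visible to Catalyst's optimizer.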