Spark: combining DataFrame columns with a separator character between them

Asked: 2017-11-30 20:42:08

Tags: scala apache-spark spark-dataframe

I have a number of columns in a Spark DataFrame that I want to combine into a single column, with a separator character between each value. I don't want to combine all of the columns with the separator between them, just some of them. In this example, I want to add a pipe between the values of every column except the first two.

Here is some sample input:

+---+--------+----------+----------+---------+
|id | detail | context  |     col3 |     col4|
+---+--------+----------+----------+---------+
| 1 | {blah} | service  | null     | null    |
| 2 | { blah | """ blah | """blah} | service |
| 3 | { blah | """blah} | service  | null    |
+---+--------+----------+----------+---------+

The expected output would look like this:

+---+--------+----------+----------+---------+--------------------------+
|id | detail | context  |     col3 |     col4| data                     |
+---+--------+----------+----------+---------+--------------------------+
| 1 | {blah} | service  | null     | null    | service||                |
| 2 | { blah | """ blah | """blah} | service | """blah|"""blah}|service |
| 3 | { blah | """blah} | service  | null    | """blah}|service|        |
+---+--------+----------+----------+---------+--------------------------+

Currently, I have the following:

val columns = df.columns.filterNot(_ == "id").filterNot(_ == "detail")
val nonulls = df.na.fill("")
val combined = nonulls.select($"id", concat(columns.map(col): _*) as "data")

The above combines the columns, but doesn't add the separator character between them. I've tried the following variations, but I'm clearly doing it wrong:

scala> val combined = nonulls.select($"id", concat(columns.map(col):_|*) as "data")

scala> val combined = nonulls.select($"id", concat(columns.map(col):_*, lit('|')) as "data")

scala> val combined = nonulls.select($"id", concat(columns.map(col):_*|) as "data")

Any suggestions would be much appreciated! :) Thanks!

2 Answers:

Answer 0 (score: 1)

This should do the trick:

val columns = df.columns.filterNot(_ == "id").filterNot(_ == "detail")
// Interleave every remaining column with a literal "|", then drop the
// trailing separator: (context, "|", col3, "|", col4, "|") -> dropRight(1)
val columnsWithPipe = columns.flatMap(colname => Seq(col(colname), lit("|"))).dropRight(1)
val combined = nonulls.select($"id", concat(columnsWithPipe: _*) as "data")
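
For reference, here is a minimal, self-contained sketch of that approach end to end. The local SparkSession setup and the sample-data construction are illustrative assumptions; in spark-shell, `spark` and the implicits are already in scope:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, concat, lit}

val spark = SparkSession.builder().master("local[*]").appName("concat-pipe").getOrCreate()
import spark.implicits._

// Sample data mirroring the question's input, nulls included.
val df = Seq(
  (1, "{blah}", "service", null, null),
  (2, "{ blah", "\"\"\" blah", "\"\"\"blah}", "service"),
  (3, "{ blah", "\"\"\"blah}", "service", null)
).toDF("id", "detail", "context", "col3", "col4")

val columns = df.columns.filterNot(_ == "id").filterNot(_ == "detail")
val nonulls = df.na.fill("")  // empty strings keep their "|" slots below
val columnsWithPipe = columns.flatMap(c => Seq(col(c), lit("|"))).dropRight(1)
val combined = nonulls.select($"id", concat(columnsWithPipe: _*) as "data")
combined.show(false)
// |1  |service||                |
// |2  |""" blah|"""blah}|service|
// |3  |"""blah}|service|        |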

Answer 1 (score: 0)

Just use the concat_ws function... it concatenates the columns with a separator of your choice.

Import it with: import org.apache.spark.sql.functions.concat_ws
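
For example, a minimal sketch applied to the `columns` and `nonulls` values defined in the question (an assumption about how the pieces fit together). Note that concat_ws skips null inputs entirely rather than emitting empty slots, so the df.na.fill("") step is still needed to preserve the trailing pipes in the expected output:

import org.apache.spark.sql.functions.{col, concat_ws}

// concat_ws("|", c1, c2, ...) joins the columns with "|" between them.
// Null inputs are skipped (no separator is emitted for them), which is
// why this runs over `nonulls`, where nulls were already replaced by "".
val combined = nonulls.select($"id", concat_ws("|", columns.map(col): _*) as "data")
combined.show(false)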