How to add list values as a single column value in a Spark DataFrame using Scala?

Asked: 2019-05-31 10:49:34

Tags: scala list apache-spark dataframe

I have a DataFrame `df` and a list `lt`, shown below. I want to add the list as a new column to the DataFrame (`df`) so that I get the result shown. Please suggest the most efficient way to do this.

Input

df => 
+---+-----+
| id| temp|
+---+-----+
|  1|tmp01|
|  2|tmp02|
|  3|tmp03|
|  4|tmp04|
+---+-----+

lt => 
List(1#tmp01, 6#tmp06, 9#tmp09, 4#tmp04)

Output

+---+-----+-----------------------------------+
| id| temp|new_col                            |
+---+-----+-----------------------------------+
|  1|tmp01|1#tmp01, 6#tmp06, 9#tmp09, 4#tmp04 |
|  2|tmp02|1#tmp01, 6#tmp06, 9#tmp09, 4#tmp04 |
|  3|tmp03|1#tmp01, 6#tmp06, 9#tmp09, 4#tmp04 |
|  4|tmp04|1#tmp01, 6#tmp06, 9#tmp09, 4#tmp04 |
+---+-----+-----------------------------------+

2 answers:

Answer 0 (score: 1)

You can use the approach below: convert the list to a String and add it as a new column of the DataFrame with `lit`. Please check the following code:

df.withColumn("new_col", lit(lt.mkString(", "))).show(false)
+---+-----+-----------------------------------+
| id| temp|new_col                            |
+---+-----+-----------------------------------+
|  1|tmp01|1#tmp01, 6#tmp06, 9#tmp09, 4#tmp04 |
|  2|tmp02|1#tmp01, 6#tmp06, 9#tmp09, 4#tmp04 |
|  3|tmp03|1#tmp01, 6#tmp06, 9#tmp09, 4#tmp04 |
|  4|tmp04|1#tmp01, 6#tmp06, 9#tmp09, 4#tmp04 |
+---+-----+-----------------------------------+
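The list-to-string step is plain Scala; a minimal sketch of what `lt.mkString(", ")` produces (the Spark side only wraps that one string in `lit`, so every row gets the same value):

```scala
// The list that becomes the constant column value.
val lt = List("1#tmp01", "6#tmp06", "9#tmp09", "4#tmp04")

// mkString(", ") joins the elements with a comma-space separator;
// lit(lt.mkString(", ")) would broadcast this one string to every row.
val newColValue = lt.mkString(", ")
println(newColValue) // 1#tmp01, 6#tmp06, 9#tmp09, 4#tmp04
```

Note that a bare `lt.mkString` (no separator) would concatenate the elements with nothing between them, which does not match the output shown above.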

Answer 1 (score: -1)

You could build the DataFrame from a list of tuples:

List(("1","tmp01","a"),("2","tmp06","b"),("3","tmp09","c"),("4","tmp04","d"))
  .toDF("id","temp","new_col")

// Note: referencing a column that belongs to a different DataFrame like this
// fails with an AnalysisException in Spark; kept as in the original answer.
yourDf.withColumn("new_col", List("a", "b", "c", "d")
  .toDF("row1")
  .col("row1"))

This variant uses concat to combine the two existing columns (both should be strings):

import org.apache.spark.sql.functions._
yourDf.withColumn("new_col", concat(col("id"),col("temp")))
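For what it's worth, `concat(col("id"), col("temp"))` yields values like `1tmp01`; to reproduce the `#` separator seen in the list you would also need the separator as a literal, e.g. `concat(col("id"), lit("#"), col("temp"))` or `concat_ws("#", col("id"), col("temp"))`. A sketch of the per-row logic in plain Scala, using the sample rows from the question:

```scala
// Per-row equivalent of concat(col("id"), lit("#"), col("temp")):
// each (id, temp) pair becomes the string "id#temp".
val rows = List((1, "tmp01"), (2, "tmp02"), (3, "tmp03"), (4, "tmp04"))
val combined = rows.map { case (id, temp) => s"$id#$temp" }
println(combined) // List(1#tmp01, 2#tmp02, 3#tmp03, 4#tmp04)
```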