Question

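I have a DataFrame "df" and a list "lt", both shown below. I want to add the list as a new column in the DataFrame "df" so that I get the result shown under Output. Please suggest the most efficient way to do this.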

Input

df => 
+---+-----+
| id| temp|
+---+-----+
|  1|tmp01|
|  2|tmp02|
|  3|tmp03|
|  4|tmp04|
+---+-----+ 

lt => 
List(1#tmp01, 6#tmp06, 9#tmp09, 4#tmp04)

Output

+---+-----+----------------------------------+
| id| temp|                           new_col|
+---+-----+----------------------------------+
|  1|tmp01|1#tmp01, 6#tmp06, 9#tmp09, 4#tmp04|
|  2|tmp02|1#tmp01, 6#tmp06, 9#tmp09, 4#tmp04|
|  3|tmp03|1#tmp01, 6#tmp06, 9#tmp09, 4#tmp04|
|  4|tmp04|1#tmp01, 6#tmp06, 9#tmp09, 4#tmp04|
+---+-----+----------------------------------+

Answer 1

You can use the approach below. I convert the list to a String and add it as a new column in the DataFrame. Please check the following code:

import org.apache.spark.sql.functions.lit
df.withColumn("new_col", lit(lt.mkString(", "))).show(false)
+---+-----+----------------------------------+
|id |temp |new_col                           |
+---+-----+----------------------------------+
|1  |tmp01|1#tmp01, 6#tmp06, 9#tmp09, 4#tmp04|
|2  |tmp02|1#tmp01, 6#tmp06, 9#tmp09, 4#tmp04|
|3  |tmp03|1#tmp01, 6#tmp06, 9#tmp09, 4#tmp04|
|4  |tmp04|1#tmp01, 6#tmp06, 9#tmp09, 4#tmp04|
+---+-----+----------------------------------+
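If the list should stay a list rather than be flattened into one string, the same idea can be sketched with `typedLit`, which (in Spark 2.2 and later) accepts Scala collections and produces an array column. The object and helper names below (`ListColumnSketch`, `asString`, `asArray`) are hypothetical and only for illustration, and the local `SparkSession` setup is an assumption about how you run it.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{lit, typedLit}

object ListColumnSketch {
  // Hypothetical helper: the whole list as one comma-separated string per row.
  def asString(df: DataFrame, lt: Seq[String]): DataFrame =
    df.withColumn("new_col", lit(lt.mkString(", ")))

  // Hypothetical helper: the whole list as an array<string> value per row.
  def asArray(df: DataFrame, lt: Seq[String]): DataFrame =
    df.withColumn("new_col", typedLit(lt))

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, used here only to make the sketch runnable.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("list-column-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq((1, "tmp01"), (2, "tmp02"), (3, "tmp03"), (4, "tmp04")).toDF("id", "temp")
    val lt = List("1#tmp01", "6#tmp06", "9#tmp09", "4#tmp04")

    asString(df, lt).show(false)   // new_col is a string column
    asArray(df, lt).printSchema()  // new_col is an array of strings

    spark.stop()
  }
}
```

Both variants add a literal, so no join or shuffle is involved. The array form is likely easier to work with later if you need `explode` or `array_contains` on the values.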

Answer 2

You need to add the new value to each tuple in the list:

List(("1","tmp01","a"),("2","tmp06","b"),("3","tmp09","c"),("4","tmp04","d"))
  .toDF("id","temp","new_col")

Or

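// withColumn cannot take a column that belongs to a different DataFrame,
// so key the new values by id and join (assumes ids 1 to 4, as in the example)
val extra = List((1, "a"), (2, "b"), (3, "c"), (4, "d")).toDF("id", "new_col")
yourDf.join(extra, Seq("id"))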

This solution builds the new column from your existing columns using concat (both columns should be strings):

import org.apache.spark.sql.functions._
yourDf.withColumn("new_col", concat(col("id"),col("temp")))
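Note that plain `concat` joins the values with no separator, giving `1tmp01`. If the goal is the `id#temp` form used by the list elements in the question, a minimal sketch with `concat_ws` could look like this. The object and helper names (`ConcatSketch`, `withIdTemp`) are hypothetical, and the explicit cast is there on the assumption that `id` may not be a string column.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, concat_ws}

object ConcatSketch {
  // Hypothetical helper: builds "<id>#<temp>" per row, e.g. "1#tmp01".
  // The id column is cast to string in case it is numeric.
  def withIdTemp(df: DataFrame): DataFrame =
    df.withColumn("new_col", concat_ws("#", col("id").cast("string"), col("temp")))
}
```

Unlike the list-based answers, this derives each row's value from that row's own columns, so every row gets a different value.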

How do I add list values as a single row in a Spark DataFrame using Scala?
