I have a Spark DataFrame that looks like this:
+---+------+----+
| id|animal|talk|
+---+------+----+
|  1|   bat|done|
|  2| mouse|mone|
|  3| horse| gun|
|  4| horse|some|
+---+------+----+
I want to generate a new column, say merged.
Basically, it should combine all the columns of a row into a list of name-value objects.
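Can someone help me with this in Scala? For simplicity I have used only two columns here, but a generic answer that works for N columns would be very helpful.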
Answer 0: (score: 1)
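Your expected output doesn't seem to reflect your requirement of generating a list of name-value structured objects. If I understand it correctly, consider using foldLeft to iteratively convert the required columns into StructType name-value columns, and then group them into an ArrayType column: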
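import org.apache.spark.sql.functions._
// Needed for toDF and $"..."; already in scope in spark-shell,
// where `spark` is the active SparkSession
import spark.implicits._

val df = Seq(
  (1, "bat", "done"),
  (2, "mouse", "mone"),
  (3, "horse", "gun"),
  (4, "horse", "some")
).toDF("id", "animal", "talk")

// Every column except the key is turned into a name-value struct
val cols = df.columns.filter(_ != "id")

// foldLeft replaces each column in turn with struct(name, value);
// the final select gathers those struct columns into one array column
val resultDF = cols.
  foldLeft(df)( (accDF, c) =>
    accDF.withColumn(c, struct(lit(c).as("name"), col(c).as("value")))
  ).
  select($"id", array(cols.map(col): _*).as("merged"))

resultDF.show(false)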
// +---+-----------------------------+
// |id |merged                       |
// +---+-----------------------------+
// |1  |[[animal,bat], [talk,done]]  |
// |2  |[[animal,mouse], [talk,mone]]|
// |3  |[[animal,horse], [talk,gun]] |
// |4  |[[animal,horse], [talk,some]]|
// +---+-----------------------------+
resultDF.printSchema
// root
//  |-- id: integer (nullable = false)
//  |-- merged: array (nullable = false)
//  |    |-- element: struct (containsNull = false)
//  |    |    |-- name: string (nullable = false)
//  |    |    |-- value: string (nullable = true)
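
As a possible variation on the same idea, the structs can also be built inside a single select, without rewriting the DataFrame once per column. The sketch below is only an illustration: the object MergeColumns and the helper mergeColumns are hypothetical names, and casting every value to string is an added assumption, made so that columns of different types can share one array element type.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{array, col, lit, struct}

object MergeColumns {
  // Hypothetical helper: packs every column except `keyCol` into an
  // array of (name, value) structs in a new column called "merged".
  def mergeColumns(df: DataFrame, keyCol: String): DataFrame = {
    val cols = df.columns.filter(_ != keyCol)
    // Assumption: values are cast to string so all structs have the same type
    val pairs = cols.map(c => struct(lit(c).as("name"), col(c).cast("string").as("value")))
    df.select(col(keyCol), array(pairs: _*).as("merged"))
  }

  def main(args: Array[String]): Unit = {
    // Local session for a standalone run; in spark-shell `spark` already exists
    val spark = SparkSession.builder().master("local[1]").appName("merge-columns").getOrCreate()
    import spark.implicits._

    val df = Seq(
      (1, "bat", "done"),
      (2, "mouse", "mone")
    ).toDF("id", "animal", "talk")

    mergeColumns(df, "id").show(false)
    spark.stop()
  }
}
```

Because the column list is computed from df.columns, the same helper should apply unchanged to N columns; only the key column name has to be supplied.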