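I'm looking for a way to prepend the column name to the data in every row of a DataFrame. The number of columns can vary from one run to the next.
I'm on Spark 1.4.1.
I have a DataFrame like this:
Edit: all of the data is of type String.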
+---+--------+
|key|   value|
+---+--------+
|foo|     bar|
|bar|one, two|
+---+--------+
I would like to get:
+-------+--------------------+
|    key|               value|
+-------+--------------------+
|key_foo|           value_bar|
|key_bar|value_one, value_two|
+-------+--------------------+
I tried:
import org.apache.spark.sql._
import org.apache.spark.sql.functions._
val concatColNamesWithElems = udf { seq: Seq[Row] =>
seq.map { case Row(y: String) => (col +"_"+y)}}
Answer 0 (score: 1)
Register the DataFrame as a temporary table (for example, dfTable) so that you can write SQL against it.
df.registerTempTable("dfTable")
Create a UDF and register it. I'm assuming your value column is of type String.
sqlContext.udf.register("prefix", (columnVal: String, prefix: String) =>
columnVal.split(",").map(x => prefix + "_" + x.trim).mkString(", ")
)
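If you want to check the function's logic outside Spark first, here is a minimal plain-Scala sketch of the same body. The name prefixValues is mine, purely for illustration, and I'm assuming the values are never null (a null would throw on split).

```scala
// Same logic as the registered UDF, written as a plain function.
// `prefixValues` is a hypothetical name used only for this sketch.
def prefixValues(columnVal: String, prefix: String): String =
  columnVal
    .split(",")                      // "one, two" -> Array("one", " two")
    .map(x => prefix + "_" + x.trim) // trim the stray space, then add the prefix
    .mkString(", ")                  // join back with the original separator

println(prefixValues("foo", "key"))        // key_foo
println(prefixValues("one, two", "value")) // value_one, value_two
```

A single value without commas goes through the same path, since split just returns a one-element array.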
Use the UDF in the query:
// Build the SELECT list: wrap every column in the UDF and alias it back to its own name,
// e.g. prefix(key, "key") AS key
val columns = df.columns.map(col => s"""prefix($col, "$col") AS $col """).mkString(",")
println(columns) // print the generated column list to check it
val resultDf = sqlContext.sql("SELECT " + columns + " FROM dfTable")
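Because the SELECT list is generated from the column names, it doesn't matter how many columns there are. As a rough sketch of what the generated query looks like for the two-column example (columnNames is a hard-coded stand-in for df.columns, so this runs without Spark):

```scala
// Stand-in for df.columns; with a real DataFrame this comes from the schema.
val columnNames = Array("key", "value")

// One prefix(...) call per column, aliased back to the original column name.
val selectList = columnNames
  .map(c => s"""prefix($c, "$c") AS $c""")
  .mkString(", ")

val query = "SELECT " + selectList + " FROM dfTable"
println(query)
// SELECT prefix(key, "key") AS key, prefix(value, "value") AS value FROM dfTable
```

Note that this relies on every column being a String, as stated in the question; a non-String column would need a cast before being passed to the UDF.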