DataFrame: prepend the column name to each row's data

Asked: 2017-01-30 14:48:01

Tags: scala apache-spark apache-spark-sql

I'm looking for a way to prepend the column name to the data in each DataFrame row. The number of columns may vary from time to time.

I'm on Spark 1.4.1.

I have a DataFrame:

Edit: all data is of String type only.

+---+----------+
|key|     value|
+---+----------+
|foo|       bar|
|bar|  one, two|
+---+----------+

I want to get:

+-------+--------------------+
|key    |               value|
+-------+--------------------+
|key_foo|           value_bar|
|key_bar|value_one, value_two|
+-------+--------------------+

I tried:

 import org.apache.spark.sql._
 import org.apache.spark.sql.functions._

 val concatColNamesWithElems = udf { seq: Seq[Row] =>
   seq.map { case Row(y: String) => (col + "_" + y) }
 }

1 Answer:

Answer 0 (score: 1):

Register the DataFrame as a temporary table (e.g. dfTable) so you can write SQL against it:

df.registerTempTable("dfTable")

Create a UDF and register it. I'm assuming your value column is of type String:

sqlContext.udf.register("prefix", (columnVal: String, prefix: String) =>
  columnVal.split(",").map(x => prefix + "_" + x.trim).mkString(", ")
)
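The prefixing logic inside the UDF is plain Scala and can be checked on its own, without a Spark session. This is a minimal sketch of that logic, with the function name `prefix` matching the registered UDF above; the sample inputs mirror the rows from the question:

```scala
// Pure version of the registered UDF: split a comma-separated value,
// trim each piece, and prepend "<prefix>_" to it.
def prefix(columnVal: String, p: String): String =
  columnVal.split(",").map(x => p + "_" + x.trim).mkString(", ")

// Single value: no comma, so the whole string gets one prefix.
println(prefix("bar", "value"))      // value_bar

// Multi-value cell: each comma-separated element is prefixed.
println(prefix("one, two", "value")) // value_one, value_two
```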

Use the UDF in the query:

//prepare the column expressions: each column wrapped in the UDF
//and aliased back to its own name, e.g. prefix(key, "key") AS key
val columns = df.columns.map(col => s"""prefix($col, "$col") AS $col""").mkString(", ")


println(columns) //for checking how the column expressions are built

val resultDf = sqlContext.sql("SELECT " + columns + " FROM dfTable")
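The SELECT list built from `df.columns` is an ordinary Scala string, so that step can also be sketched without Spark. Assuming the two columns from the question (`key` and `value` stand in for `df.columns` here), the interpolation produces:

```scala
// Stand-in for df.columns in the question's example.
val columnNames = Seq("key", "value")

// Same string-building step as in the answer: wrap each column
// in the prefix UDF and alias it back to its original name.
val columns = columnNames.map(c => s"""prefix($c, "$c") AS $c""").mkString(", ")

println(columns)
// prefix(key, "key") AS key, prefix(value, "value") AS value
```

Passing that string into `sqlContext.sql("SELECT " + columns + " FROM dfTable")` then applies the UDF to every column, regardless of how many columns the DataFrame has, which addresses the "number of columns may vary" requirement in the question.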