Selecting all columns in Spark SQL at runtime without a predefined schema

Date: 2018-10-03 21:57:58

Tags: apache-spark apache-spark-sql

I have a DataFrame whose values look like this:

+----------+------------+--------+--------+
|resourceId|resourceType|seasonId|seriesId|
+----------+------------+--------+--------+
|1234      |cM-type     |883838  |8838832 |
|1235      |cM-type     |883838  |8838832 |
|1236      |cM-type     |883838  |8838832 |
|1237      |CNN-type    |883838  |8838832 |
|1238      |cM-type     |883838  |8838832 |
+----------+------------+--------+--------+

I want to transform the DataFrame into this format:

+----------+----------------------------------------------------------------------------------------+
|resourceId|value                                                                                   |
+----------+----------------------------------------------------------------------------------------+
|1234      |{"resourceId":"1234","resourceType":"cM-type","seasonId":"883838","seriesId":"8838832"} |
|1235      |{"resourceId":"1235","resourceType":"cM-type","seasonId":"883838","seriesId":"8838832"} |
|1236      |{"resourceId":"1236","resourceType":"cM-type","seasonId":"883838","seriesId":"8838832"} |
|1237      |{"resourceId":"1237","resourceType":"CNN-type","seasonId":"883838","seriesId":"8838832"}|
|1238      |{"resourceId":"1238","resourceType":"cM-type","seasonId":"883838","seriesId":"8838832"} |
+----------+----------------------------------------------------------------------------------------+

I know I can get the desired output by listing the fields manually:

val jsonformated = df.select($"resourceId", to_json(struct($"resourceId", $"resourceType", $"seasonId", $"seriesId")).alias("value"))

However, I am trying to pass the column names to struct programmatically:

val cols = df.columns.toSeq
val jsonformatted = df.select($"resourceId", to_json(struct("colval", cols)).alias("value"))

For some reason struct does not accept the sequence, even though the API appears to have a method signature that takes one:

struct(String colName, scala.collection.Seq<String> colNames)

Is there a better way to solve this?

Update:

The answer pointed out the exact syntax to get the output:

val colsList = df.columns.toList
val column: List[Column] = colsList.map(df(_))
val jsonformatted = df.select($"resourceId", to_json(struct(column: _*)).alias("value"))

1 answer:

Answer 0 (score: 2)

struct does take a sequence; you are just looking at the wrong variant. Use

def struct(cols: Column*): Column 

for example:

import org.apache.spark.sql.functions._

val cols: Seq[String] = ???

struct(cols map col: _*)
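Putting it together, here is a minimal end-to-end sketch of the approach from the answer. It builds a local SparkSession and a small DataFrame mirroring the one in the question; the variable names `allCols` and `jsonDf` are illustrative, not from the original post:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, struct, to_json}

object ToJsonAllColumns {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("to-json-all-columns")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      ("1234", "cM-type", "883838", "8838832"),
      ("1237", "CNN-type", "883838", "8838832")
    ).toDF("resourceId", "resourceType", "seasonId", "seriesId")

    // Turn every column name into a Column at runtime -- no schema is hard-coded.
    val allCols = df.columns.map(col)

    // Expand the Array[Column] into the varargs of struct with `: _*`.
    val jsonDf = df.select($"resourceId", to_json(struct(allCols: _*)).alias("value"))
    jsonDf.show(false)

    spark.stop()
  }
}
```

The key detail is the `: _*` ascription, which tells the Scala compiler to pass the collection as the varargs parameter `cols: Column*` rather than as a single argument; this is why the earlier attempt that handed struct a bare Seq failed to compile.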