Spark to_json with static name/value field names

Time: 2018-07-26 10:44:58

Tags: apache-spark-sql

I have a DataFrame with two array columns:

+---------+-----------------------+
|itemval  |fruit                  |
+---------+-----------------------+
|[1, 2, 3]|[apple, banana, orange]|
+---------+-----------------------+

I am trying to zip them together to create name/value pairs:

+---------+-----------------------+--------------------------------------+
|itemval  |fruit                  |ziped                                 |
+---------+-----------------------+--------------------------------------+
|[1, 2, 3]|[apple, banana, orange]|[[1, apple], [2, banana], [3, orange]]|
+---------+-----------------------+--------------------------------------+

I then convert the zipped column to JSON, and to_json produces output in the following format:

+---------------------------------------------------------------------------+
|ziped                                                                      |
+---------------------------------------------------------------------------+
|[{"_1":"1","_2":"apple"},{"_1":"2","_2":"banana"},{"_1":"3","_2":"orange"}]|
+---------------------------------------------------------------------------+
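
The `_1` and `_2` keys are not chosen by to_json. They are the field names of the Scala tuples that `zip` returns, and Spark carries those names over into the struct it builds for each tuple. A minimal plain-Scala sketch (no Spark needed; the object name `ZipFieldNames` is only illustrative) shows where the names come from:

```scala
object ZipFieldNames {
  val itemval = Seq("1", "2", "3")
  val fruit   = Seq("apple", "banana", "orange")

  // zip pairs the elements up as Tuple2 values; a Tuple2 only has the fields _1 and _2
  val pairs: Seq[(String, String)] = itemval zip fruit

  def main(args: Array[String]): Unit =
    pairs.foreach(p => println(s"_1=${p._1}, _2=${p._2}"))
}
```

Because the tuple itself carries no other names, the JSON keys can only change if the struct fields are renamed before to_json is applied.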

The format I expect is:

+------------------------------------------------------------------------------------------------+
|ziped                                                                                           |
+------------------------------------------------------------------------------------------------+
|[{"itemval":"1","name":"apple"},{"itemval":"2","name":"banana"},{"itemval":"3","name":"orange"}]|
+------------------------------------------------------------------------------------------------+

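This is my implementation:

import org.apache.spark.sql.functions.{to_json, udf}
import spark.implicits._  // already in scope in spark-shell

val df1 = Seq((Array(1,2,3),Array("apple","banana","orange"))).toDF("itemval","fruit")
df1.show(false)
// zip returns tuples, which Spark stores as structs with fields _1 and _2
def zipper = udf((list1: Seq[String], list2: Seq[String]) => {
  val zipList = list2 zip list1
  zipList
})
df1.withColumn("ziped", to_json(zipper($"fruit", $"itemval"))).drop("itemval", "fruit").show(false)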

1 Answer:

Answer 0 (score: 0)

This is the solution that worked for me. Create a schema with the new field names and cast the zipped column to it:

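import org.apache.spark.sql.functions.to_json
import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}

// Target type: the same two string fields as the zipped struct, but with the desired names
val schema = ArrayType(
  StructType(
    Array(
      StructField("itemval",StringType),
      StructField("name",StringType)
    )
  )
)

// zival is assumed to be the DataFrame that already holds the "ziped" column
val casted = zival.withColumn("result", $"ziped".cast(schema))
casted.show(false)
casted.select(to_json($"result")).show(false)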

The output will be:

casted: org.apache.spark.sql.DataFrame
  ziped: array
    element: struct
      _1: string
      _2: string
  result: array
    element: struct
      itemval: string
      name: string

+-----------------------------------------------------------------+
|structstojson(result)                                            |
+-----------------------------------------------------------------+
|[{"itemval":"3","name":"orange"},{"itemval":"2","name":"banana"}]|
+-----------------------------------------------------------------+
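
For reference, the zip step from the question and the cast step from the answer can be combined into one end-to-end sketch. Several things here are assumptions made for illustration and are not part of the original posts: the object name `ZipToNamedJson`, the helper `zippedJson`, the local SparkSession, and the use of string values for `itemval` so that the UDF input types match without relying on an implicit cast.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{to_json, udf}
import org.apache.spark.sql.types.{ArrayType, StringType, StructField, StructType}

object ZipToNamedJson {
  // Same shape as the zipped struct, but with the field names wanted in the JSON
  val schema = ArrayType(StructType(Seq(
    StructField("itemval", StringType),
    StructField("name", StringType))))

  // Hypothetical helper: builds the sample row and returns its JSON string
  def zippedJson(spark: SparkSession): String = {
    import spark.implicits._

    val df1 = Seq((Array("1", "2", "3"), Array("apple", "banana", "orange")))
      .toDF("itemval", "fruit")

    // zip yields tuples, which Spark stores as structs with fields _1 and _2
    val zipper = udf((values: Seq[String], names: Seq[String]) => values zip names)

    df1
      .withColumn("ziped", zipper($"itemval", $"fruit"))
      .withColumn("result", $"ziped".cast(schema)) // rename fields by casting
      .select(to_json($"result"))
      .as[String]
      .head()
  }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session, for illustration only
    val spark = SparkSession.builder().master("local[*]").appName("zip-to-json").getOrCreate()
    println(zippedJson(spark))
    spark.stop()
  }
}
```

The cast only renames the struct fields by position, so the order of the fields in the schema has to match the order in which the arrays were zipped.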