Save a String column as a real JSON value - Scala

Time: 2018-10-30 23:25:59

Tags: json scala apache-spark apache-spark-sql

I have a use case where a column's schema is String, but it actually holds JSON (e.g. """ {"a":"b"} """). For example:

scala> val list = List("a" -> """ {"a":"b","c":"d"} """, "b" -> """ {"foo" : "bar"} """)
list: List[(String, String)] = List((a," {"a":"b","c":"d"} "), (b," {"foo" : "bar"} "))

scala> val df = list.toDF("colA","colB")
df: org.apache.spark.sql.DataFrame = [colA: string, colB: string]

scala> df.show(2,false)
+----+-------------------+
|colA|colB               |
+----+-------------------+
|a   | {"a":"b","c":"d"} |
|b   | {"foo" : "bar"}   |
+----+-------------------+

I need to write df out as JSON, but for colB I need the output to be real JSON, not a String. For example, if I do this:

scala> df.repartition(1).write.json("/Users/myuser/sparkjson/3")

I get colB as a String in the JSON file:

{"colA":"a","colB":" {\"a\":\"b\",\"c\":\"d\"} "}
{"colA":"b","colB":" {\"foo\":\"bar\"} "}

But I want colB output as real JSON (not a String), like this:

{"colA":"a","colB": {"a":"b","c":"d"} }
{"colA":"b","colB": {"foo":"bar"} }

Unfortunately, I don't have a schema for colB; it can be any valid JSON. How can I achieve this?
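One workaround when no schema is available at all: since colB already contains valid JSON text, each output line can be assembled by hand and written as plain text (in Spark this would mean building the line with string functions and using write.text instead of write.json). Below is a minimal sketch of the idea in plain Scala, without Spark; the trim-and-concatenate helper is illustrative, not the poster's code:

```scala
// Each pair is (colA, raw JSON text in colB), as in the question.
val rows = List("a" -> """ {"a":"b","c":"d"} """, "b" -> """ {"foo" : "bar"} """)

// Embed colB unescaped instead of re-serializing it as a string value.
val jsonLines = rows.map { case (colA, colB) =>
  s"""{"colA":"$colA","colB":${colB.trim}}"""
}

jsonLines.foreach(println)
// {"colA":"a","colB":{"a":"b","c":"d"}}
// {"colA":"b","colB":{"foo" : "bar"}}
```

This keeps colB byte-for-byte as it was (including inner whitespace), which is fine as long as every colB value really is valid JSON.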

2 answers:

Answer 0 (score: 0)

You need to create the DataFrame with the correct schema. In this case colB is really a Map[String, String]; the simple way to do this is to define a case class, and Spark will figure out the schema automatically. Here is the code:

import org.json4s._
import org.json4s.jackson.JsonMethods._

implicit val formats = DefaultFormats

// colB is modeled as a real map, not a String
case class Data(colA: String, colB: Map[String, String])

val list: List[Data] =
  List("a" -> """ {"a":"b","c":"d"} """, "b" -> """ {"foo" : "bar"} """).map {
    case (colA, colB) =>
      Data(
        colA,
        // parse the raw JSON text and extract it into a Map
        parse(colB).extract[Map[String, String]]
      )
  }

val df = spark.createDataset(list)
df.write.json("/tmp/a.json")
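For illustration, here is roughly the shape those rows take once colB is a Map[String, String]: each value is written as a real JSON object instead of an escaped string. The render helper below is hypothetical (a hand-rolled approximation, not Spark's writer), just to show the expected lines without needing a Spark session:

```scala
case class Data(colA: String, colB: Map[String, String])

// Hypothetical helper approximating the line a JSON writer would emit
// for one row; small immutable Maps preserve insertion order.
def render(d: Data): String = {
  val body = d.colB.map { case (k, v) => s"\"$k\":\"$v\"" }.mkString(",")
  s"{\"colA\":\"${d.colA}\",\"colB\":{$body}}"
}

val outRows = List(
  Data("a", Map("a" -> "b", "c" -> "d")),
  Data("b", Map("foo" -> "bar"))
)

val lines = outRows.map(render)
lines.foreach(println)
// {"colA":"a","colB":{"a":"b","c":"d"}}
// {"colA":"b","colB":{"foo":"bar"}}
```

Note that extract[Map[String, String]] only works when every colB value is a flat object of string values; nested or non-string JSON would need a looser type.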

Answer 1 (score: 0)

Not too sure about this solution, but you can try adding an option like the one below -

scala> df.repartition(1).write.option("escapeQuotes","false").json("/Users/myuser/sparkjson/3")