Spark - How to parse a JSON-escaped String field as a JSON object in DataFrames?

Time: 2017-06-23 19:10:44

Tags: scala apache-spark apache-spark-sql spark-dataframe

My input is a set of files with one JSON object per line. The problem is that one field of these JSON objects is a JSON-escaped string. Example:

{"clientAttributes":{"backfillId":null,"clientPrimaryKey":"abc"},"escapedJsonPayload":"{\"name\":\"Akash\",\"surname\":\"Patel\",\"items\":[{\"itemId\":\"abc\",\"itemName\":\"xyz\"}"}

When I create a DataFrame by reading the JSON file, I get the following:

val df = spark.sqlContext.read.json("file:///home/akaspate/sample.json")
df: org.apache.spark.sql.DataFrame = [clientAttributes: struct<backfillId: string, clientPrimaryKey: string>, escapedJsonPayload: string]

As we can see, "escapedJsonPayload" is a String, and I need it to be a Struct.

Note: I found a similar question on StackOverflow and followed it (How to let Spark parse a JSON-escaped String field as a JSON Object to infer the proper structure in DataFrames?), but it gave me "[_corrupt_record: string]".

I have tried the following steps:

  1. val df = spark.sqlContext.read.json("file:///home/akaspate/sample.json") (works fine)

  2. val escapedJsons: RDD[String] = sc.parallelize(Seq("""df"""))

  3. val unescapedJsons: RDD[String] = escapedJsons.map(_.replace("\"{", "{").replace("\"}", "}").replace("\\\"", "\""))

  4. val dfJsons: DataFrame = spark.sqlContext.read.json(unescapedJsons) (This results in [_corrupt_record: string])

Any help would be appreciated.

1 answer:

Answer 0 (score: 3)

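First, the JSON you provided is syntactically malformed: the items array and the payload object are never closed. The corrected input is as follows: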

{"clientAttributes":{"backfillId":null,"clientPrimaryKey":"abc"},"escapedJsonPayload":{\"name\":\"Akash\",\"surname\":\"Patel\",\"items\":[{\"itemId\":\"abc\",\"itemName\":\"xyz\"}]}}

Next, to parse the input above correctly, use the following code:

val rdd = spark.read.textFile("file:///home/akaspate/sample.json").toJSON.map(value => value.replace("\\", "").replace("{\"value\":\"", "").replace("}\"}", "}")).rdd

val df = spark.read.json(rdd)
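
To see what the chain of replace calls does to a single input line, the transformation can be traced without Spark. This is only a sketch: wrapLikeToJson is a hypothetical stand-in that assumes toJSON wraps each text line as {"value":"..."} and escapes backslashes and double quotes; ReplaceTrace, line and clean are illustrative names as well.

```scala
object ReplaceTrace {
  // One input line in the corrected form shown in this answer.
  val line: String =
    """{"clientAttributes":{"backfillId":null,"clientPrimaryKey":"abc"},"escapedJsonPayload":{\"name\":\"Akash\",\"surname\":\"Patel\",\"items\":[{\"itemId\":\"abc\",\"itemName\":\"xyz\"}]}}"""

  // Hypothetical stand-in for Dataset[String].toJSON (an assumption, not Spark code):
  // wrap the line in a {"value":"..."} object, escaping backslashes and double quotes.
  def wrapLikeToJson(s: String): String =
    "{\"value\":\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\"}"

  // The same three replacements used in the answer's map call.
  def clean(wrapped: String): String =
    wrapped
      .replace("\\", "")            // remove every backslash
      .replace("{\"value\":\"", "") // remove the wrapper's prefix
      .replace("}\"}", "}")         // remove the wrapper's suffix

  def main(args: Array[String]): Unit = {
    val wrapped = wrapLikeToJson(line)
    println(wrapped)        // the line as it looks after wrapping
    println(clean(wrapped)) // the plain JSON object handed to spark.read.json
  }
}
```

Because the first replace removes every backslash, a payload whose string values contain genuinely escaped characters (for example an embedded double quote) would probably be altered or broken by this approach.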

Displaying the resulting df gives the following output:

df.show(false)

+----------------+-------------------------------------+
|clientAttributes|escapedJsonPayload                   |
+----------------+-------------------------------------+
|[null,abc]      |[WrappedArray([abc,xyz]),Akash,Patel]|
+----------------+-------------------------------------+

and the following schema:

df.printSchema

root
 |-- clientAttributes: struct (nullable = true)
 |    |-- backfillId: string (nullable = true)
 |    |-- clientPrimaryKey: string (nullable = true)
 |-- escapedJsonPayload: struct (nullable = true)
 |    |-- items: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- itemId: string (nullable = true)
 |    |    |    |-- itemName: string (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- surname: string (nullable = true)
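
If the payload stays a quoted, escaped string (the question's sample with its missing ]} restored), one possible alternative is to let Spark unescape it while reading and then parse that column with from_json. A minimal sketch, assuming Spark 2.2 or later and a payload schema that is known in advance; FromJsonSketch, sampleLine, payloadSchema and parsePayload are illustrative names, not part of the original post:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{ArrayType, StringType, StructType}

object FromJsonSketch {
  // The question's sample line with the payload closed properly (assumed input shape).
  val sampleLine: String =
    """{"clientAttributes":{"backfillId":null,"clientPrimaryKey":"abc"},"escapedJsonPayload":"{\"name\":\"Akash\",\"surname\":\"Patel\",\"items\":[{\"itemId\":\"abc\",\"itemName\":\"xyz\"}]}"}"""

  // Schema of the escaped payload, written by hand (an assumption about the data).
  val payloadSchema: StructType = new StructType()
    .add("name", StringType)
    .add("surname", StringType)
    .add("items", ArrayType(new StructType()
      .add("itemId", StringType)
      .add("itemName", StringType)))

  // Replace the string column with a struct parsed from its JSON text.
  def parsePayload(raw: DataFrame): DataFrame =
    raw.withColumn("escapedJsonPayload",
      from_json(col("escapedJsonPayload"), payloadSchema))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("from-json-sketch")
      .getOrCreate()
    import spark.implicits._

    // Reading the line as JSON already unescapes the payload into a plain string column.
    val raw = spark.read.json(Seq(sampleLine).toDS())
    val parsed = parsePayload(raw)

    parsed.printSchema()
    parsed.select("escapedJsonPayload.name", "escapedJsonPayload.items").show(false)
    spark.stop()
  }
}
```

With a file as input, raw would instead come from spark.read.json("file:///home/akaspate/sample.json"). This variant avoids editing the raw text, at the cost of having to state the payload schema explicitly.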

I hope this helps!