How to write a new column as JSON via a DataFrame

Time: 2019-06-14 16:29:28

Tags: apache-spark apache-spark-sql

How can I write a new column as JSON through a DataFrame? I have tried several approaches, but the data is always written as a JSON-escaped string field. Currently it is written as {"test":{"id":1,"name":"name","problem_field": "{\"x\":100,\"y\":200}"}}

Instead, I want it written as {"test":{"id":1,"name":"name","problem_field": {"x":100,"y":200}}}

problem_field is a new column created from values read from other fields, e.g.:

val dataFrame = oldDF.withColumn("problem_field", s)
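
Since s is not shown in the question, here is a hedged guess at how the escaping arises: if s serializes the other fields with to_json, the resulting column is a plain string, and write.json escapes string contents. A hypothetical reconstruction (field names x and y assumed from the sample data):

import org.apache.spark.sql.functions._

// Hypothetical: to_json returns a *string* column, so when the
// DataFrame is written out, its contents get JSON-escaped.
// (Assumes spark.implicits._ is in scope for the $ syntax.)
val s = to_json(struct($"x", $"y"))
val dataFrame = oldDF.withColumn("problem_field", s)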

I have tried the following:

  1. dataFrame.write.json(<<outputPath>>)
  2. dataFrame.toJSON.map(value => value.replace("\\", "").replace("{\"value\":\"", "").replace("}\"}", "}")).write.json(<<outputPath>>)

I also tried converting to a Dataset, but no luck. Any pointers would be greatly appreciated.

I have already tried the logic mentioned here: How to let Spark parse a JSON-escaped String field as a JSON Object to infer the proper structure in DataFrames?

1 Answer:

Answer 0: (score: 1)

For starters, the sample data has an extraneous comma after "y\":200, which will prevent it from being parsed since it is not valid JSON.

From there, assuming you know the schema, you can parse the field with from_json. In this example, I parse the field separately first to obtain its schema:

scala> import org.apache.spark.sql.functions._
import org.apache.spark.sql.functions._

scala> val json = spark.read.json(Seq("""{"test":{"id":1,"name":"name","problem_field": "{\"x\":100,\"y\":200}"}}""").toDS)
json: org.apache.spark.sql.DataFrame = [test: struct<id: bigint, name: string ... 1 more field>]

scala> json.printSchema
root
 |-- test: struct (nullable = true)
 |    |-- id: long (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- problem_field: string (nullable = true)


scala> val problem_field = spark.read.json(json.select($"test.problem_field").map{
case org.apache.spark.sql.Row(x : String) => x
})
problem_field: org.apache.spark.sql.DataFrame = [x: bigint, y: bigint]          

scala> problem_field.printSchema
root
 |-- x: long (nullable = true)
 |-- y: long (nullable = true)

scala> val fixed = json.withColumn("test", struct($"test.id", $"test.name", from_json($"test.problem_field", problem_field.schema).as("problem_field")))
fixed: org.apache.spark.sql.DataFrame = [test: struct<id: bigint, name: string ... 1 more field>]

scala> fixed.printSchema
root
 |-- test: struct (nullable = false)
 |    |-- id: long (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- problem_field: struct (nullable = true)
 |    |    |-- x: long (nullable = true)
 |    |    |-- y: long (nullable = true)
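
With problem_field now a struct rather than a string, writing the DataFrame should produce the nested shape the question asked for. A quick sketch, reusing the question's <<outputPath>> placeholder:

// problem_field is a struct column now, so write.json emits it as a
// nested JSON object instead of an escaped string:
fixed.write.json(<<outputPath>>)
// expected shape: {"test":{"id":1,"name":"name","problem_field":{"x":100,"y":200}}}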

If the schema of the contents of problem_field is inconsistent across rows, this solution will still work, but it may not be the best way to handle things, as it will produce a sparse DataFrame in which each row contains every field ever encountered in problem_field. For example:

scala> val json = spark.read.json(Seq("""{"test":{"id":1,"name":"name","problem_field": "{\"x\":100,\"y\":200}"}}""", """{"test":{"id":1,"name":"name","problem_field": "{\"a\":10,\"b\":20}"}}""").toDS)
json: org.apache.spark.sql.DataFrame = [test: struct<id: bigint, name: string ... 1 more field>]

scala> val problem_field = spark.read.json(json.select($"test.problem_field").map{case org.apache.spark.sql.Row(x : String) => x})
problem_field: org.apache.spark.sql.DataFrame = [a: bigint, b: bigint ... 2 more fields]

scala> problem_field.printSchema
root
 |-- a: long (nullable = true)
 |-- b: long (nullable = true)
 |-- x: long (nullable = true)
 |-- y: long (nullable = true)

scala> val fixed = json.withColumn("test", struct($"test.id", $"test.name", from_json($"test.problem_field", problem_field.schema).as("problem_field")))
fixed: org.apache.spark.sql.DataFrame = [test: struct<id: bigint, name: string ... 1 more field>]

scala> fixed.printSchema
root
 |-- test: struct (nullable = false)
 |    |-- id: long (nullable = true)
 |    |-- name: string (nullable = true)
 |    |-- problem_field: struct (nullable = true)
 |    |    |-- a: long (nullable = true)
 |    |    |-- b: long (nullable = true)
 |    |    |-- x: long (nullable = true)
 |    |    |-- y: long (nullable = true)

scala> fixed.select($"test.problem_field.*").show
+----+----+----+----+
|   a|   b|   x|   y|
+----+----+----+----+
|null|null| 100| 200|
|  10|  20|null|null|
+----+----+----+----+

Over hundreds, thousands, or millions of rows, you can see how this would become a problem.
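
A hedged alternative, not part of the original answer: if the schema of problem_field is known up front, it can be passed to from_json directly, skipping the inference pass and avoiding the sparse DataFrame altogether. The field names x and y below are assumed from the question's sample data:

import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

// Explicit schema for problem_field, assumed from the sample data.
val problemSchema = StructType(Seq(
  StructField("x", LongType),
  StructField("y", LongType)
))

// Rows whose problem_field carries other keys simply get nulls for x/y,
// instead of widening the schema with every new field encountered.
val fixed = json.withColumn("test",
  struct(
    $"test.id",
    $"test.name",
    from_json($"test.problem_field", problemSchema).as("problem_field")
  ))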