Question

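I have a Spark application that reads from multiple JSON files. Every file has a similar structure, but special_field has different keys and values in each file, so it has no consistent schema.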

{
    "name": "Bob",
    "age": 35,
    "special_field": {
        "my_field1": "abc",
        "my_field2": 12345,
        "my_field3": "xyz"
    }
}

Code:

case class MyObject(name: String, age: Int, specialField: JSONObject)

val myDataFrame = spark.read.json(path = "s3://bucket/*.json")
      .select(properties.head, properties.tail: _*)    
      .map(line =>
        MyObject(
          name = line.getAs[String]("name"),
          age = line.getAs[Int]("age"),
          specialField = line.getAs[JSONObject]("special_field")
       )).toDF

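The problem is with the special_field field in the JSON input files. It is dynamic because its schema is unpredictable: the keys and values are not known in advance.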

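If possible, I would like to read it into the MyObject class as a JSONObject. I tried the approach above, but it seems to throw an exception about being unable to cast to Any. Is it possible to read the value of this field as a JSONObject or something similar?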

Answer 1

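I'm not very familiar with Scala, but in most cases you can convert the JSONObject to a JSONArray, call an iterator on it, and then use the Map.Entry interface to get the keys and values.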

Answer 2

Suppose you have the following two JSON files:

json_data1

{
    "name": "Bob",
    "age": 35,
    "special_field": {
        "my_field1": "abc",
        "my_field2": 12345
    }
}

json_data2

{
    "name": "Bob",
    "age": 35,
    "special_field": {
        "my_field1": "abc",
        "my_field2": 12345,
        "my_field3": "xyz"
    }
}

To read these two files into a single DataFrame, you can do something similar to what you have already implemented:

val myDataFrame = spark
                  .read
                  .option("multiLine", true)
                  .option("mode", "PERMISSIVE")
                  .json(path = "s3://bucket/*.json")

Spark will try to merge the two schemas into one:

scala> myDataFrame.printSchema
root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)
 |-- special_field: struct (nullable = true)
 |    |-- my_field1: string (nullable = true)
 |    |-- my_field2: long (nullable = true)
 |    |-- my_field3: string (nullable = true)

The output of myDataFrame.show() is:

+---+----+-----------------+
|age|name|    special_field|
+---+----+-----------------+
| 35| Bob|[abc, 12345, xyz]|
| 35| Bob|    [abc, 12345,]|
+---+----+-----------------+

As you can see, Spark has already put special_field into a struct column, whose fields you can easily access with a select statement:

myDataFrame.select(
       "special_field.my_field1", 
       "special_field.my_field2", 
       "special_field.my_field3"
).show

//Output
+---------+---------+---------+
|my_field1|my_field2|my_field3|
+---------+---------+---------+
|      abc|    12345|      xyz|
|      abc|    12345|     null|
+---------+---------+---------+
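Because the keys of special_field are not known in advance, the column names in the select above would normally have to be discovered rather than hard-coded. A minimal sketch of one way to do that, by reading the field names from the merged schema, is shown here. The helper name flattenSpecialField is hypothetical, and the sketch assumes the keys contain no dots or other characters that would need backtick quoting.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.StructType

object SpecialFieldColumns {
  // Hypothetical helper: selects every key that Spark inferred under
  // special_field, whatever those keys turned out to be.
  def flattenSpecialField(df: DataFrame): DataFrame = {
    // Look up the inferred type of special_field; it is a struct when
    // Spark found at least one JSON object in that position.
    val keys: Seq[String] = df.schema("special_field").dataType match {
      case struct: StructType => struct.fieldNames.toSeq
      case _                  => Seq.empty[String]
    }
    // Assumption: key names need no escaping inside a column path.
    df.select(keys.map(k => col(s"special_field.$k")): _*)
  }
}
```

Calling SpecialFieldColumns.flattenSpecialField(myDataFrame) should give the same columns as the hard-coded select, with null wherever a file lacks a key.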

Or you can even extract the content of special_field and store it as a string, using to_json as a next step:

myDataFrame.withColumn("special_field_str", to_json($"special_field"))

//Schema
//root
// |-- age: long (nullable = true)
// |-- name: string (nullable = true)
// |-- special_field: struct (nullable = true)
// |    |-- my_field1: string (nullable = true)
// |    |-- my_field2: long (nullable = true)
// |    |-- my_field3: string (nullable = true)
// |-- special_field_str: string (nullable = true)

//Output
// +---+----+-----------------+-------------------------------------------------------+
// |age|name|special_field    |special_field_str                                      |
// +---+----+-----------------+-------------------------------------------------------+
// |35 |Bob |[abc, 12345, xyz]|{"my_field1":"abc","my_field2":12345,"my_field3":"xyz"}|
// |35 |Bob |[abc, 12345,]    |{"my_field1":"abc","my_field2":12345}                  |
// +---+----+-----------------+-------------------------------------------------------+

Then access the items of special_field_str with:

myDataFrame
.withColumn("special_field_str", to_json($"special_field"))
.select(
         get_json_object($"special_field_str", "$.my_field1").as("f1"),
         get_json_object($"special_field_str", "$.my_field2").as("f2"),
         get_json_object($"special_field_str", "$.my_field3").as("f3")
).show

//Output
// +---+-----+----+
// | f1|   f2|  f3|
// +---+-----+----+
// |abc|12345| xyz|
// |abc|12345|null|
// +---+-----+----+

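Finally, if you really need to use a case class, then instead of storing special_field as a JSONObject I suggest converting it to a map, so that the final case class looks like this: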

case class MyObject(name: String, age: Int, specialField: Map[String, String])

You can convert the JSON string to a Map[String, String] with the following code:

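import org.apache.spark.sql.types.{MapType, StringType}

// Every value is read as a string, whatever its original JSON type
val schema = MapType(StringType, StringType)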

myDataFrame
.withColumn("special_field_str", to_json($"special_field"))
.withColumn("special_field_map", from_json($"special_field_str", schema))
.show(false)

Output:

+---+----+-----------------+-------------------------------------------------------+--------------------------------------------------------+
|age|name|special_field    |special_field_str                                      |special_field_map                                       |
+---+----+-----------------+-------------------------------------------------------+--------------------------------------------------------+
|35 |Bob |[abc, 12345, xyz]|{"my_field1":"abc","my_field2":12345,"my_field3":"xyz"}|[my_field1 -> abc, my_field2 -> 12345, my_field3 -> xyz]|
|35 |Bob |[abc, 12345,]    |{"my_field1":"abc","my_field2":12345}                  |[my_field1 -> abc, my_field2 -> 12345]                  |
+---+----+-----------------+-------------------------------------------------------+--------------------------------------------------------+
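To get from the map column to actual MyObject instances, the pieces above could be combined roughly as follows. This is a sketch rather than a definitive implementation: the object and method names (DynamicJsonField, toMyObjects) are hypothetical, the input is built from in-memory strings instead of the S3 path so that the example is self-contained, and age is cast to int because Spark infers JSON integers as long.

```scala
import org.apache.spark.sql.{DataFrame, Dataset, SparkSession}
import org.apache.spark.sql.functions.{col, from_json, to_json}
import org.apache.spark.sql.types.{MapType, StringType}

// Same shape as the case class suggested above.
case class MyObject(name: String, age: Int, specialField: Map[String, String])

object DynamicJsonField {
  // Hypothetical helper: turns the inferred DataFrame into typed objects.
  def toMyObjects(spark: SparkSession, df: DataFrame): Dataset[MyObject] = {
    import spark.implicits._
    df.select(
      col("name"),
      // JSON integers are inferred as long, so narrow to match age: Int.
      col("age").cast("int").as("age"),
      // Struct -> JSON string -> map with string keys and string values.
      from_json(to_json(col("special_field")), MapType(StringType, StringType))
        .as("specialField")
    ).as[MyObject]
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("dynamic-json-field")
      .master("local[*]") // assumption: local run, for illustration only
      .getOrCreate()
    import spark.implicits._

    // Stand-in for spark.read.json("s3://bucket/*.json"): one record per string.
    val input = Seq(
      """{"name":"Bob","age":35,"special_field":{"my_field1":"abc","my_field2":12345}}""",
      """{"name":"Bob","age":35,"special_field":{"my_field1":"abc","my_field2":12345,"my_field3":"xyz"}}"""
    ).toDS

    toMyObjects(spark, spark.read.json(input)).show(false)
    spark.stop()
  }
}
```

Going through a map keeps the case class stable no matter which keys appear, at the cost of losing the original value types: numbers such as 12345 come back as the string "12345".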

Spark application reading the value of a dynamic JSON field from JSON files
