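I have a Spark application that reads from multiple JSON files. Every file has a similar shape, but the keys/values inside special_field differ from file to file, so that field has no consistent schema.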
{
"name": "Bob",
"age": 35,
"special_field": {
"my_field1": "abc",
"my_field2": 12345,
"my_field3": "xyz"
}
}
Code:
case class MyObject(name: String, age: Int, specialField: JSONObject)
val myDataFrame = spark.read.json(path = "s3://bucket/*.json")
.select(properties.head, properties.tail: _*)
.map(line =>
MyObject(
name = line.getAs[String]("name"),
age = line.getAs[Int]("age"),
specialField = line.getAs[JSONObject]("special_field")
)).toDF
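The problem is with the special_field field in the JSON input files. It is dynamic: its schema is not known in advance, so the keys and values cannot be predicted.
If possible, I would like to read it into the MyObject class as a JSONObject. I tried the approach above, but it throws an exception saying the value cannot be cast to Any. Is it possible to read this field's value as a JSONObject or something similar?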
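Answer 0 (score: 0)
I'm not very familiar with Scala, but in most cases you can convert the JSONObject to a JSONArray, call an iterator on it, and then use the Map.Entry interface to get the keys and values.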
Answer 1 (score: 0)
Suppose you have the following two JSON files:
json_data1
{
"name": "Bob",
"age": 35,
"special_field": {
"my_field1": "abc",
"my_field2": 12345
}
}
json_data2
{
"name": "Bob",
"age": 35,
"special_field": {
"my_field1": "abc",
"my_field2": 12345,
"my_field3": "xyz"
}
}
To read both files into a single DataFrame, you can do something similar to what you have already implemented:
val myDataFrame = spark
.read
.option("multiLine", true)
.option("mode", "PERMISSIVE")
.json(path = "s3://bucket/*.json")
Spark will try to merge the two schemas into one:
scala> myDataFrame.printSchema
root
|-- age: long (nullable = true)
|-- name: string (nullable = true)
|-- special_field: struct (nullable = true)
| |-- my_field1: string (nullable = true)
| |-- my_field2: long (nullable = true)
| |-- my_field3: string (nullable = true)
The output of myDataFrame.show() is:
+---+----+-----------------+
|age|name| special_field|
+---+----+-----------------+
| 35| Bob|[abc, 12345, xyz]|
| 35| Bob| [abc, 12345,]|
+---+----+-----------------+
As you can see, Spark has already put special_field into a struct column, whose nested fields you can easily access with a select statement:
myDataFrame.select(
"special_field.my_field1",
"special_field.my_field2",
"special_field.my_field3"
).show
//Output
+---------+---------+---------+
|my_field1|my_field2|my_field3|
+---------+---------+---------+
| abc| 12345| xyz|
| abc| 12345| null|
+---------+---------+---------+
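Or, as a next step, you can even extract the contents of special_field and store them as a JSON string with to_json:
import org.apache.spark.sql.functions.{from_json, get_json_object, to_json}
import spark.implicits._ // enables the $"column" syntax (already in scope in spark-shell)
myDataFrame.withColumn("special_field_str", to_json($"special_field"))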
//Schema
//root
// |-- age: long (nullable = true)
// |-- name: string (nullable = true)
// |-- special_field: struct (nullable = true)
// | |-- my_field1: string (nullable = true)
// | |-- my_field2: long (nullable = true)
// | |-- my_field3: string (nullable = true)
// |-- special_field_str: string (nullable = true)
//Output
// +---+----+-----------------+-------------------------------------------------------+
// |age|name|special_field |special_field_str |
// +---+----+-----------------+-------------------------------------------------------+
// |35 |Bob |[abc, 12345, xyz]|{"my_field1":"abc","my_field2":12345,"my_field3":"xyz"}|
// |35 |Bob |[abc, 12345,] |{"my_field1":"abc","my_field2":12345} |
// +---+----+-----------------+-------------------------------------------------------+
Then access the items of special_field_str with get_json_object:
myDataFrame
.withColumn("special_field_str", to_json($"special_field"))
.select(
get_json_object($"special_field_str", "$.my_field1").as("f1"),
get_json_object($"special_field_str", "$.my_field2").as("f2"),
get_json_object($"special_field_str", "$.my_field3").as("f3")
).show
//Output
// +---+-----+----+
// | f1| f2| f3|
// +---+-----+----+
// |abc|12345| xyz|
// |abc|12345|null|
// +---+-----+----+
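Finally, if you really need a case class, then instead of storing special_field in a JSONObject I suggest converting it to a dictionary (a Map), so that the final case class looks like this: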
case class MyObject(name: String, age: Int, specialField: Map[String, String])
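You can convert the JSON string to a Map[String, String] with the following code:
import org.apache.spark.sql.types.{MapType, StringType}
// Every value is read as a string, so the numeric my_field2 becomes "12345".
val schema = MapType(StringType, StringType)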
myDataFrame
.withColumn("special_field_str", to_json($"special_field"))
.withColumn("special_field_map", from_json($"special_field_str", schema))
.show(false)
Output:
+---+----+-----------------+---------------------------------------------------------+--------------------------------------------------------+
|age|name|special_field |special_field_str |special_field_map |
+---+----+-----------------+---------------------------------------------------------+--------------------------------------------------------+
|35 |Bob |[abc, 12345, xyz]|{"my_field1":"abc","my_field2":"12345","my_field3":"xyz"}|[my_field1 -> abc, my_field2 -> 12345, my_field3 -> xyz]|
|35 |Bob |[abc, 12345,] |{"my_field1":"abc","my_field2":"12345"} |[my_field1 -> abc, my_field2 -> 12345] |
+---+----+-----------------+---------------------------------------------------------+--------------------------------------------------------+
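To tie this back to the question, the map column can then be loaded into the MyObject case class suggested above. This is a minimal sketch, not a definitive implementation: the object name SpecialFieldAsMap, the local[*] master, the inline sample records, and the cast of age to int (JSON integers are inferred as long, while the case class declares Int) are assumptions added for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, from_json, to_json}
import org.apache.spark.sql.types.{MapType, StringType}

// Same shape as the case class above: every special_field value is kept as a string.
case class MyObject(name: String, age: Int, specialField: Map[String, String])

object SpecialFieldAsMap {
  def main(args: Array[String]): Unit = {
    // Assumption: a local session for experimentation; in spark-shell, `spark` already exists.
    val spark = SparkSession.builder()
      .appName("special-field-as-map")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Inline copies of json_data1 and json_data2 so the sketch runs without S3.
    val records = Seq(
      """{"name": "Bob", "age": 35, "special_field": {"my_field1": "abc", "my_field2": 12345}}""",
      """{"name": "Bob", "age": 35, "special_field": {"my_field1": "abc", "my_field2": 12345, "my_field3": "xyz"}}"""
    ).toDS()
    val myDataFrame = spark.read.json(records)

    val myObjects = myDataFrame
      .select(
        col("name"),
        // JSON integers are inferred as long, so cast to match the Int field.
        col("age").cast("int").as("age"),
        // Struct -> JSON string -> map, the same round trip as in the answer.
        from_json(to_json(col("special_field")), MapType(StringType, StringType)).as("specialField")
      )
      .as[MyObject]

    myObjects.collect().foreach(println)

    spark.stop()
  }
}
```

The column names in the select have to match the case class field names exactly (specialField rather than special_field) for .as[MyObject] to resolve them.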