I have the following DataFrame:
df.show()
+--------------------+--------------------+----+--------+---------+--------------------+--------+--------------------+
| address| coordinates| id|latitude|longitude| name|position| json|
+--------------------+--------------------+----+--------+---------+--------------------+--------+--------------------+
|Balfour St / Brun...|[-27.463431, 153.041031]|79.0| null| null|79 - BALFOUR ST /...| null|[-27.463431, 153.041031]|
+--------------------+--------------------+----+--------+---------+--------------------+--------+--------------------+
I want to flatten the json column. Here is what I tried:
val jsonSchema = StructType(Seq(
StructField("latitude", DoubleType, nullable = true),
StructField("longitude", DoubleType, nullable = true)))
val a = df.select(from_json(col("json"), jsonSchema) as "content")
However, a.show() gives me:
+-------+
|content|
+-------+
| null|
+-------+
Any idea how to parse the json column correctly, so that the content column in the second DataFrame (a) is not null?
The raw data looks like this:
{
"id": 79,
"name": "79 - BALFOUR ST / BRUNSWICK ST",
"address": "Balfour St / Brunswick St",
"coordinates": {
"latitude": -27.463431,
"longitude": 153.041031
}
}
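Thanks a lot.
Answer 0 (score: 0)
The problem is your schema. latitude and longitude are nested inside the coordinates object, but your schema declares them as top-level fields. The schema has to mirror the nesting of the JSON, so coordinates must be declared as a StructType of its own. I changed your schema accordingly and it works for me.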
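import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types._
import spark.implicits._ // supplies the Encoder[String] that createDataset needs
// A Dataset[String] exposes its contents in a single column named "value"
val df = spark.createDataset(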
"""
|{
| "id": 79,
| "name": "79 - BALFOUR ST / BRUNSWICK ST",
| "address": "Balfour St / Brunswick St",
| "coordinates": {
| "latitude": -27.463431,
| "longitude": 153.041031
| }
| }
""".stripMargin :: Nil)
val jsonSchema = StructType(Seq(
  StructField("name", StringType, nullable = true),
  // coordinates is a nested JSON object, so it gets its own StructType
  StructField("coordinates", StructType(Seq(
    StructField("latitude", DoubleType, nullable = true),
    StructField("longitude", DoubleType, nullable = true)
  )), nullable = true)
))
val a = df.select(from_json(col("value"), jsonSchema) as "content")
a.show(false)
Output:
+--------------------------------------------------------+
|content |
+--------------------------------------------------------+
|[79 - BALFOUR ST / BRUNSWICK ST,[-27.463431,153.041031]]|
+--------------------------------------------------------+
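Since the question asks for a flattened result, the parsed struct can then be expanded into top-level columns by selecting its fields with dotted paths. The following is only a minimal sketch under some assumptions: the json column is assumed to hold the raw JSON as a string, and the object name FlattenJson, the helper flatten, and the local SparkSession settings are illustrative names that do not appear in the original post.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{col, from_json}
import org.apache.spark.sql.types.{DoubleType, StringType, StructField, StructType}

// Hypothetical helper object, named here only for illustration.
object FlattenJson {
  // Same nested schema as in the answer: coordinates is a struct of two doubles.
  val jsonSchema: StructType = StructType(Seq(
    StructField("name", StringType, nullable = true),
    StructField("coordinates", StructType(Seq(
      StructField("latitude", DoubleType, nullable = true),
      StructField("longitude", DoubleType, nullable = true)
    )), nullable = true)
  ))

  // Assumes df has a string column called "json" containing the raw JSON text.
  def flatten(df: DataFrame): DataFrame =
    df.select(from_json(col("json"), jsonSchema).as("content"))
      // Dotted paths reach into the parsed struct and its nested struct.
      .select(
        col("content.name").as("name"),
        col("content.coordinates.latitude").as("latitude"),
        col("content.coordinates.longitude").as("longitude")
      )

  def main(args: Array[String]): Unit = {
    // Local session for experimentation only; adjust master/appName as needed.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("flatten-json")
      .getOrCreate()
    import spark.implicits._

    val raw =
      """{"id": 79, "name": "79 - BALFOUR ST / BRUNSWICK ST", "address": "Balfour St / Brunswick St", "coordinates": {"latitude": -27.463431, "longitude": 153.041031}}"""

    flatten(Seq(raw).toDF("json")).show(false)
    spark.stop()
  }
}
```

If every field of the struct is wanted, selecting col("content.*") expands only the first level; the nested coordinates struct would still need the dotted paths shown above.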