I have some JSON records like the following:
{"Location":
{"filter":
{"name": "houston", "Disaster": "hurricane"},
}
}
{"Location":
{"filter":
{"name": "florida", "Disaster": "hurricane"},
}
}
{"Location":
{"filter":
{"name": "seattle"},
}
}
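After loading the file with spark.read.json("myfile.json"), I want to filter out the rows that do not contain a Disaster field. In my example, the seattle row should be filtered out.
I tried
val newTable = df.filter($"Location.filter.Disaster".isNotNull)
but this gives me an error saying that the struct field Disaster does not exist.
How should I do this?
Thanks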
Answer 0: (score: 0)
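Your JSON data appears to be malformed: spark.read.json("myfile.json") cannot parse it into a valid DataFrame, because by default it expects one complete JSON object per line. Instead, read the file with the wholeTextFiles API and rewrite each record onto a single line: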
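// Read each file as a single (path, content) pair so records spanning several lines stay together.
val rdd = sc.wholeTextFiles("myfile.json")
// Collapse every record onto one line; note that replace(" ", "") also strips spaces inside string values.
val json = rdd.flatMap(_._2.replace(":\n", ":").replace(",\n", "").replace("}\n", "}").replace(" ", "").replace("}{", "}\n{").split("\n"))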
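To see what the replace chain does without starting Spark, here is a minimal plain-Scala sketch that applies the same steps to a two-record sample. The object name NormalizeJsonDemo and the helper normalize are hypothetical names chosen for illustration; the replacements themselves are the ones used above.

```scala
object NormalizeJsonDemo {
  // Same replace chain as in the answer, factored into a function.
  // Caveat: replace(" ", "") also strips spaces inside string values.
  def normalize(raw: String): Array[String] =
    raw
      .replace(":\n", ":")    // join a key with the value on the next line
      .replace(",\n", "")     // drop the trailing comma before a closing brace
      .replace("}\n", "}")    // join consecutive closing braces
      .replace(" ", "")       // remove all spaces
      .replace("}{", "}\n{")  // put each top-level object on its own line
      .split("\n")

  def main(args: Array[String]): Unit = {
    // A shortened copy of the input from the question, joined with explicit newlines.
    val raw = Seq(
      """{"Location":""",
      """{"filter":""",
      """{"name": "houston", "Disaster": "hurricane"},""",
      """}""",
      """}""",
      """{"Location":""",
      """{"filter":""",
      """{"name": "seattle"},""",
      """}""",
      """}"""
    ).mkString("\n") + "\n"

    normalize(raw).foreach(println)
  }
}
```

Running main should print one compact JSON object per line, first the houston record and then the seattle record. Because the chain is purely textual, it only works for input laid out like the sample; a value such as "new york" would lose its space.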
This should give you an RDD of valid JSON strings, one per line:
{"Location":{"filter":{"name":"houston","Disaster":"hurricane"}}}
{"Location":{"filter":{"name":"florida","Disaster":"hurricane"}}}
{"Location":{"filter":{"name":"seattle"}}}
Now you can read the JSON RDD into a DataFrame:
val df = sqlContext.read.json(json)
which should give you
+---------------------+
|Location             |
+---------------------+
|[[hurricane,houston]]|
|[[hurricane,florida]]|
|[[null,seattle]]     |
+---------------------+
with the schema
root
 |-- Location: struct (nullable = true)
 |    |-- filter: struct (nullable = true)
 |    |    |-- Disaster: string (nullable = true)
 |    |    |-- name: string (nullable = true)
Now that you have a valid DataFrame, you can apply the filter you were trying to use:
val newTable = df.filter($"Location.filter.Disaster".isNotNull)
newTable will then be
+---------------------+
|Location             |
+---------------------+
|[[hurricane,houston]]|
|[[hurricane,florida]]|
+---------------------+
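Since the question uses spark.read.json, which suggests a SparkSession is available, the read and filter steps can also be sketched without sqlContext. The following is a minimal sketch, assuming Spark 2.2 or later (where DataFrameReader.json accepts a Dataset[String]) and a local master; the object name FilterDisasterSketch and the helper withDisaster are hypothetical, and the input lines stand in for the normalised records.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object FilterDisasterSketch {
  // Keeps only rows whose nested Location.filter.Disaster field is non-null.
  def withDisaster(df: DataFrame): DataFrame =
    df.filter(col("Location.filter.Disaster").isNotNull)

  def main(args: Array[String]): Unit = {
    // Assumption: a local session for experimentation; adjust master and appName as needed.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("filter-disaster-sketch")
      .getOrCreate()
    import spark.implicits._

    // One valid JSON object per line, as produced by the normalisation step.
    val lines = Seq(
      """{"Location":{"filter":{"name":"houston","Disaster":"hurricane"}}}""",
      """{"Location":{"filter":{"name":"florida","Disaster":"hurricane"}}}""",
      """{"Location":{"filter":{"name":"seattle"}}}"""
    ).toDS()

    // Schema inference sees Disaster in at least one record, so the column exists
    // and is simply null for the seattle row.
    val df = spark.read.json(lines)

    withDisaster(df).show(false)
    spark.stop()
  }
}
```

The filter can only resolve Location.filter.Disaster when at least one parsed record contains that field, which is why the input has to be valid line-delimited JSON first. The exact rendering of struct values in show() differs between Spark versions, so the printed table may not match the one above character for character.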