Checking whether a nested JSON column exists in a DataFrame

Date: 2017-09-14 18:28:54

Tags: scala apache-spark

So I have some JSON records like the following:

{"Location":
  {"filter":
      {"name": "houston", "Disaster": "hurricane"},
  }
}
{"Location":
  {"filter":
      {"name": "florida", "Disaster": "hurricane"},
  }
}
{"Location":
  {"filter":
      {"name": "seattle"},
  }
}

After using spark.read.json("myfile.json"), I want to filter out the rows that do not contain a Disaster. In my example, the Seattle row should be filtered out.

I tried

val newTable = df.filter($"Location.filter.Disaster".isNotNull)

But this gives me an error saying the struct field Disaster does not exist.

How should I go about this?

Thanks

1 Answer:

Answer 0 (score: 0)

Your JSON data appears to be corrupt, i.e. spark.read.json("myfile.json") cannot read it into a valid DataFrame (with the default line-delimited reader, the whole input typically ends up in a single _corrupt_record column). Using the wholeTextFiles API works around this:
val rdd = sc.wholeTextFiles("myfile.json")
// collapse each multi-line object onto a single line, drop the trailing commas and spaces,
// then split the concatenated objects back into one json string per line
val json = rdd.flatMap(_._2.replace(":\n", ":").replace(",\n", "").replace("}\n", "}").replace(" ", "").replace("}{", "}\n{").split("\n"))
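
One caveat with the snippet above (a side remark, not part of the original answer): the blanket .replace(" ", "") also removes spaces inside values, so a city like "new york" would become "newyork". If that matters for your data, a drop-in replacement for the json line, reusing the same rdd, could be sketched as:

// sketch: strip only the trailing commas and the line breaks, keeping spaces inside values
val json = rdd.flatMap { case (_, content) =>
  content
    .replaceAll(",\\s*}", "}")   // drop trailing commas before a closing brace
    .replaceAll("\\n\\s*", "")   // join each multi-line object onto one line
    .replace("}{", "}\n{")       // one top-level object per line
    .split("\n")
}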

This should give you an RDD of valid JSON strings:

{"Location":{"filter":{"name":"houston","Disaster":"hurricane"}}}
{"Location":{"filter":{"name":"florida","Disaster":"hurricane"}}}
{"Location":{"filter":{"name":"seattle"}}}

Now you can read the JSON RDD into a DataFrame:

val df = sqlContext.read.json(json)

which should give you

+---------------------+
|Location             |
+---------------------+
|[[hurricane,houston]]|
|[[hurricane,florida]]|
|[[null,seattle]]     |
+---------------------+

with the schema

root
 |-- Location: struct (nullable = true)
 |    |-- filter: struct (nullable = true)
 |    |    |-- Disaster: string (nullable = true)
 |    |    |-- name: string (nullable = true)
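
A side note on the read step above: on Spark 2.x the RDD[String] overload of read.json is deprecated, so with a SparkSession (assumed here to be named spark) the equivalent read would be roughly:

import spark.implicits._
// wrap the cleaned json strings in a Dataset[String] instead of passing the RDD directly
val df = spark.read.json(spark.createDataset(json))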

Now that you have a valid DataFrame, you can apply the filter you were trying to use:

val newTable = df.filter($"Location.filter.Disaster".isNotNull)

and newTable will be

+---------------------+
|Location             |
+---------------------+
|[[hurricane,houston]]|
|[[hurricane,florida]]|
+---------------------+
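
Since the title also asks how to check whether a nested column exists at all, here is a small defensive sketch (the hasNestedColumn helper is hypothetical, not part of the answer above) that guards the filter against the original "struct field does not exist" error:

import scala.util.Try
import org.apache.spark.sql.DataFrame

// returns true if the (possibly nested) column path resolves against df's schema
def hasNestedColumn(df: DataFrame, path: String): Boolean =
  Try(df.select(path)).isSuccess

val newTable =
  if (hasNestedColumn(df, "Location.filter.Disaster"))
    df.filter($"Location.filter.Disaster".isNotNull)
  else
    df  // the column is missing everywhere, so there is nothing to filter on

Once the cleaned data is loaded, Disaster is part of the merged schema even for the Seattle row (it is simply null there), which is why the isNotNull filter removes just that row.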