Question

I have some JSON records like the following:

{"Location":
  {"filter":
      {"name": "houston", "Disaster": "hurricane"},
  }
}
{"Location":
  {"filter":
      {"name": "florida", "Disaster": "hurricane"},
  }
}
{"Location":
  {"filter":
      {"name": "seattle"},
  }
}

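After reading the file with spark.read.json("myfile.json"), I want to filter out the rows that do not contain a Disaster. In my example, the seattle row should be filtered out.

I tried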

val newTable = df.filter($"Location.filter.Disaster".isNotNull)

But this gives me an error saying that the struct field Disaster does not exist.

How should I do this?

Thanks

Answer 1

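Your JSON data appears to be malformed: each record spans several lines and contains trailing commas, so spark.read.json("myfile.json") will not read it into a valid dataframe.

Reading the file with the wholeTextFiles API and cleaning up the text works around this: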

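// Read the whole file as one (path, content) pair so that multi-line records stay together.
val rdd = sc.wholeTextFiles("myfile.json")
// Join broken lines, drop trailing commas and all spaces, then put each object on its own line.
val json = rdd.flatMap(_._2.replace(":\n", ":").replace(",\n", "").replace("}\n", "}").replace(" ", "").replace("}{", "}\n{").split("\n"))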

This should give you an RDD of valid JSON strings, one per record:

{"Location":{"filter":{"name":"houston","Disaster":"hurricane"}}}
{"Location":{"filter":{"name":"florida","Disaster":"hurricane"}}}
{"Location":{"filter":{"name":"seattle"}}}
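
To see what the chain of replace calls does, the same cleanup can be tried on a plain string without Spark. This is only a sketch: raw is a two-record excerpt of the file, and the names raw and cleaned are illustrative. Note that replace(" ", "") strips every space, including any inside string values, so the approach only suits data like this sample, where no value contains a space.

```scala
// Illustrative two-record excerpt of myfile.json, including the trailing commas.
val raw =
  """{"Location":
    |  {"filter":
    |      {"name": "houston", "Disaster": "hurricane"},
    |  }
    |}
    |{"Location":
    |  {"filter":
    |      {"name": "seattle"},
    |  }
    |}
    |""".stripMargin

val cleaned: Array[String] = raw
  .replace(":\n", ":")    // join a key with the value that starts on the next line
  .replace(",\n", "")     // drop the trailing comma at the end of a line
  .replace("}\n", "}")    // join closing braces onto one line
  .replace(" ", "")       // remove indentation (and every other space)
  .replace("}{", "}\n{")  // put each top-level object on its own line
  .split("\n")

cleaned.foreach(println)
```

Each element of cleaned is then a single-line JSON object, which is the layout spark.read.json expects by default.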

Now you can read the JSON RDD into a dataframe

val df = sqlContext.read.json(json)

which should give you

+---------------------+
|Location             |
+---------------------+
|[[hurricane,houston]]|
|[[hurricane,florida]]|
|[[null,seattle]]     |
+---------------------+

with the schema

root
 |-- Location: struct (nullable = true)
 |    |-- filter: struct (nullable = true)
 |    |    |-- Disaster: string (nullable = true)
 |    |    |-- name: string (nullable = true)
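
On Spark 2.2 and later, the json(RDD[String]) overload is deprecated in favor of json(Dataset[String]). A minimal sketch of the same read follows, assuming a local SparkSession; the names spark and jsonDs, the local master, and the inlined records are illustrative and not part of the original answer.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session for experimentation; in spark-shell, `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("nested-json-demo")
  .getOrCreate()
import spark.implicits._

// The three cleaned records, one single-line JSON object per element.
val jsonDs = Seq(
  """{"Location":{"filter":{"name":"houston","Disaster":"hurricane"}}}""",
  """{"Location":{"filter":{"name":"florida","Disaster":"hurricane"}}}""",
  """{"Location":{"filter":{"name":"seattle"}}}"""
).toDS()

// Schema inference merges the fields seen across all records,
// so Disaster is present even though the seattle record lacks it.
val df = spark.read.json(jsonDs)
df.printSchema()
```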

Now that you have a valid dataframe, you can apply the filter you were trying to use

val newTable = df.filter($"Location.filter.Disaster".isNotNull)

newTable will be

+---------------------+
|Location             |
+---------------------+
|[[hurricane,houston]]|
|[[hurricane,florida]]|
+---------------------+
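
A similar "field does not exist" error can also appear when no record in the input contains Disaster at all, because the field is then absent from the inferred schema and the column path cannot be resolved. A hedged sketch of a guard for that case follows. hasColumn is a hypothetical helper, not a Spark API, and the session setup and inlined records are assumptions for illustration.

```scala
import scala.util.Try
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

// Assumption: a local session for experimentation; in spark-shell, `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("nested-filter-demo")
  .getOrCreate()
import spark.implicits._

// Hypothetical helper: true if the (possibly nested) column path resolves against df's schema.
// select is analyzed eagerly, so an unresolvable path fails here rather than at an action.
def hasColumn(df: DataFrame, path: String): Boolean =
  Try(df.select(col(path))).isSuccess

val df = spark.read.json(Seq(
  """{"Location":{"filter":{"name":"houston","Disaster":"hurricane"}}}""",
  """{"Location":{"filter":{"name":"florida","Disaster":"hurricane"}}}""",
  """{"Location":{"filter":{"name":"seattle"}}}"""
).toDS())

val path = "Location.filter.Disaster"

// If the field is missing from the schema, no row can have a Disaster, so return an empty result.
val newTable =
  if (hasColumn(df, path)) df.filter(col(path).isNotNull)
  else df.limit(0)

newTable.show(false)
```

Returning an empty dataframe when the field is missing is one possible choice; failing fast with a clear message may suit other pipelines better.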
