I have a JSON file that I am reading into a Spark DataFrame using Scala 2.10 and
val df = sqlContext.read.json("file_path")
The JSON looks like this:
{ "data": [{ "id":"20180218","parent": [{"name": "Market"}]}, { "id":"20180219","parent": [{"name": "Client"},{"name": "Market" }]}, { "id":"20180220","parent": [{"name": "Client"}]},{ "id":"20180221","parent": []}]}
data is an array of structs, and each struct has a parent key. parent is itself an array of structs and can contain zero or more values.
I need to filter each parent array so that it keeps only structs whose name is "Market", or nothing at all. My output should look like this:
{ "data": [{ "id":"20180218","parent": [{"name": "Market"}]}, { "id":"20180219","parent": [{"name": "Market" }]}, { "id":"20180220","parent": []},{ "id":"20180221","parent": []}]}
So, basically: filter out any struct whose name is anything other than "Market", and keep empty parent arrays (whether they become empty as a result of the operation or were already empty).
Can anyone help?
Thanks
Answer 0 (score: 2)
We need to use the explode function to query this kind of nested JSON structure with arrays.
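As a quick illustration of how explode behaves (a toy example with made-up data, not the question's file): each element of the array becomes its own row, and a row whose array is empty produces no output rows at all.
scala> import org.apache.spark.sql.functions.{col, explode}
// Toy data; toDF comes from the implicits preloaded in spark-shell.
scala> val toy = Seq((1, Seq("a", "b")), (2, Seq.empty[String])).toDF("id", "arr")
scala> toy.select(col("id"), explode(col("arr"))).show()
+---+---+
| id|col|
+---+---+
|  1|  a|
|  1|  b|
+---+---+
// id = 2 disappears entirely because its array is empty.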
scala> val df = spark.read.json("/Users/pavithranrao/Desktop/test.json")
scala> df.printSchema
root
|-- data: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: string (nullable = true)
| | |-- parent: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- name: string (nullable = true)
scala> val oneDF = df.select(col("data"), explode(col("data"))).toDF("data", "element").select(col("data"), col("element.parent"))
scala> oneDF.show
"""
+--------------------+--------------------+
| data| parent|
+--------------------+--------------------+
|[[20180218,Wrappe...| [[Market]]|
|[[20180218,Wrappe...|[[Client], [Market]]|
|[[20180218,Wrappe...| [[Client]]|
|[[20180218,Wrappe...| []|
+--------------------+--------------------+
"""
scala> val twoDF = oneDF.select(col("data"), explode(col("parent"))).toDF("data", "names")
scala> twoDF.printSchema
root
|-- data: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: string (nullable = true)
| | |-- parent: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- name: string (nullable = true)
|-- names: struct (nullable = true)
| |-- name: string (nullable = true)
scala> twoDF.show
"""
+--------------------+--------+
| data| names|
+--------------------+--------+
|[[20180218,Wrappe...|[Market]|
|[[20180218,Wrappe...|[Client]|
|[[20180218,Wrappe...|[Market]|
|[[20180218,Wrappe...|[Client]|
+--------------------+--------+
"""
scala> import org.apache.spark.sql.functions.length
// Check for names entries whose name string is empty (none here, so all false)
scala> twoDF.select(length(col("names.name")) === 0).show
+------------------------+
|(length(names.name) = 0)|
+------------------------+
| false|
| false|
| false|
| false|
+------------------------+
// Extract names structs that don't contain Market
scala> twoDF.select(!col("names.name").contains("Market")).show()
+----------------------------------+
|(NOT contains(names.name, Market))|
+----------------------------------+
| false|
| true|
| false|
| true|
+----------------------------------+
// Combining these two
scala> val ansDF = twoDF.select("data").filter(!col("names.name").contains("Market") || length(col("names.name")) === 0)
scala> ansDF.printSchema
// Schema same as input df
root
|-- data: array (nullable = true)
| |-- element: struct (containsNull = true)
| | |-- id: string (nullable = true)
| | |-- parent: array (nullable = true)
| | | |-- element: struct (containsNull = true)
| | | | |-- name: string (nullable = true)
scala> ansDF.show(false)
+----------------------------------------------------------------------------------------------------------------------------------------------+
|data |
+----------------------------------------------------------------------------------------------------------------------------------------------+
|[[20180218,WrappedArray([Market])], [20180219,WrappedArray([Client], [Market])], [20180220,WrappedArray([Client])], [20180221,WrappedArray()]]|
|[[20180218,WrappedArray([Market])], [20180219,WrappedArray([Client], [Market])], [20180220,WrappedArray([Client])], [20180221,WrappedArray()]]|
+----------------------------------------------------------------------------------------------------------------------------------------------+
The final ansDF contains the filtered records that satisfy the condition that name does not contain 'Market' or is empty.
PS: If I have missed the exact filtering scenario, please correct the filter function in the code above.
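For reference, here is one way the parent arrays could be filtered in place with a UDF, so that non-Market entries are dropped and empty arrays survive. This is only a sketch (Spark 2.x spark-shell syntax; Parent and keepMarket are names I made up), not something I have run against the exact setup in the question:
scala> import org.apache.spark.sql.Row
scala> import org.apache.spark.sql.functions.{col, explode, udf}

// Element type for the rebuilt parent array.
scala> case class Parent(name: String)

// Keep only entries named "Market"; an array with no match becomes empty.
scala> val keepMarket = udf { (parents: Seq[Row]) =>
     |   parents.filter(_.getAs[String]("name") == "Market")
     |          .map(r => Parent(r.getAs[String]("name")))
     | }

// One row per data element, with its parent array filtered in place.
scala> val filtered = df.select(explode(col("data")).as("d"))
     |   .select(col("d.id").as("id"), keepMarket(col("d.parent")).as("parent"))

scala> filtered.show(false)
// Expected, given the sample JSON:
// +--------+----------+
// |id      |parent    |
// +--------+----------+
// |20180218|[[Market]]|
// |20180219|[[Market]]|
// |20180220|[]        |
// |20180221|[]        |
// +--------+----------+
This produces one row per record rather than re-nesting everything under a single data array; if the original wrapper is needed, the rows can be re-aggregated with struct and collect_list.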
Hope this helps!