Question

I need the number of rows in a JSON text where A.adList.optionalField = null.

The JSON looks like this:

{  
   "A":{  
      "adList":[  
         {  
            "a":"qwfqw"
         },
         {  
            "b":"fqw",
            "c":23423,
            "optionalField":null
         }
      ]
   }
}

This works:

df.select(df("id")).where(array_contains(df("A.adList.optionalField"),4)).registerTempTable("hb")


select count(*) from hb

However, I can't do the same thing for NULL:

df.select(df("id")).where(array_contains(df("A.adList.optionalField"),"null")).registerTempTable("hb")

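Any ideas on how I can do this easily? The question "Check if arraytype column contains null" discusses a possible NULL in a Seq[Int]; as shown above, I am dealing with a possibly NULL field inside a struct in a Seq[Struct].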

Answer 1

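array_contains() does not allow null as its second argument. To check whether the array contains a null, you can call sort_array() with ascending = true. Any null then sorts to the first position, so you can test the first element with isnull(sort_array(col(a), true)(0)).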

Check this out:

scala> val df = spark.read.format("json").option("multiLine","true").load("/tmp/stack/tanvi.json").toDF("id")
df: org.apache.spark.sql.DataFrame = [id: struct<adList: array<struct<a:string,b:string,c:bigint,optionalField:string>>>]

scala> df.printSchema
root
 |-- id: struct (nullable = true)
 |    |-- adList: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- a: string (nullable = true)
 |    |    |    |-- b: string (nullable = true)
 |    |    |    |-- c: long (nullable = true)
 |    |    |    |-- optionalField: string (nullable = true)


scala> df.select(sort_array(df("id.adList.optionalField"),true)(0),size(df("id.adList.optionalField"))).show(false)
+---------------------------------------------------------------+------------------------------------------------+
|sort_array(id.adList.optionalField AS `optionalField`, true)[0]|size(id.adList.optionalField AS `optionalField`)|
+---------------------------------------------------------------+------------------------------------------------+
|null                                                           |2                                               |
+---------------------------------------------------------------+------------------------------------------------+


scala> df.select(sort_array(df("id.adList.optionalField"),true)(1),size(df("id.adList.optionalField"))).show(false)
+---------------------------------------------------------------+------------------------------------------------+
|sort_array(id.adList.optionalField AS `optionalField`, true)[1]|size(id.adList.optionalField AS `optionalField`)|
+---------------------------------------------------------------+------------------------------------------------+
|null                                                           |2                                               |
+---------------------------------------------------------------+------------------------------------------------+


scala> df.select(isnull(sort_array(df("id.adList.optionalField"),true)(0)),size(df("id.adList.optionalField"))).show(false)
+-------------------------------------------------------------------------+------------------------------------------------+
|(sort_array(id.adList.optionalField AS `optionalField`, true)[0] IS NULL)|size(id.adList.optionalField AS `optionalField`)|
+-------------------------------------------------------------------------+------------------------------------------------+
|true                                                                     |2                                               |
+-------------------------------------------------------------------------+------------------------------------------------+


scala>
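
The transcript above stops at the isnull test and never produces the row count the question asks for. A minimal sketch of that last step follows, assuming Spark 2.4 or later, a spark-shell style session, and the original column name A (without the toDF("id") rename). The sample rows, the id values and the val names are made up for illustration. The second variant uses the SQL higher-order function exists, which is an alternative to the sort_array trick and not part of the answer above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, expr, isnull, sort_array}

// In spark-shell this simply returns the existing session.
val spark = SparkSession.builder().master("local[1]").appName("null-in-struct-array").getOrCreate()
import spark.implicits._

// Hypothetical sample data: row 1 mirrors the JSON from the question,
// row 2 has optionalField set in every struct.
val sample = Seq(
  """{"id":1,"A":{"adList":[{"a":"qwfqw"},{"b":"fqw","c":23423,"optionalField":null}]}}""",
  """{"id":2,"A":{"adList":[{"a":"x","optionalField":"set"}]}}"""
).toDS()
val sampleDf = spark.read.json(sample)

// Variant 1: the sort_array approach from the answer, used as a filter.
// Ascending sort puts nulls first, so element 0 is null if any element is null.
val viaSort = sampleDf
  .where(isnull(sort_array(col("A.adList.optionalField"), true)(0)))
  .count()

// Variant 2 (assumption: Spark 2.4+): test each struct directly with exists.
val viaExists = sampleDf
  .where(expr("exists(A.adList, x -> x.optionalField IS NULL)"))
  .count()

println(s"sort_array: $viaSort, exists: $viaExists")
```

Note that a struct which omits optionalField entirely (the first element in the question's JSON) also reads back as null after schema inference, so both variants count it. The two variants can differ on empty arrays: with ANSI mode off, sort_array(...)(0) on an empty array evaluates to null and the row is counted, whereas exists returns false.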

Find null values in a list of structs in Spark SQL
