Finding null values in a list of structs in Spark SQL

Time: 2019-01-30 07:47:39

Tags: scala apache-spark apache-spark-sql

I need the number of rows in my JSON data where A.adList.optionalField is null.

The JSON looks like this:

{  
   "A":{  
      "adList":[  
         {  
            "a":"qwfqw"
         },
         {  
            "b":"fqw",
            "c":23423,
            "optionalField":null
         }
      ]
   }
}

This works:

df.select(df("id")).where(array_contains(df("A.adList.optionalField"),4)).registerTempTable("hb")


select count(*) from hb

However, I can't do the same thing for NULL:

df.select(df("id")).where(array_contains(df("A.adList.optionalField"),"null")).registerTempTable("hb")

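Any ideas on how I can do this easily? The question "Check if arraytype column contains null" discusses possible NULLs in a Seq[Int]. As shown above, I am dealing with a field that may be NULL inside the structs of a Seq[Struct].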

1 answer:

Answer 0 (score: -1)

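array_contains() does not allow null as its second argument. To check whether the array contains a null, you can apply sort_array() with ascending = true, which places null elements first. If the first element of the sorted array is null, the array contains a null, and you can test for that with isnull(sort_array(col(a), true)(0)).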
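As a minimal sketch of this idea, the following uses a small in-memory DataFrame to show that the check separates an array with a null element from one without. The object name, the column names `id` and `values`, and the local SparkSession are assumptions made for illustration and are not part of the original answer. Note also that, under default (non-ANSI) settings, an empty or null array would probably yield null at index 0 as well, so a size check may be needed in that case.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, isnull, sort_array}

object SortArrayNullCheck {
  // Returns id -> "array contains a null" for the hypothetical sample rows.
  def run(): Map[Int, Boolean] = {
    // Local session, for illustration only.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("sort-array-null-check")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical data: row 1 has a null element, row 2 has none.
    val df = Seq(
      (1, Seq(Some("x"), None)),
      (2, Seq(Some("y"), Some("z")))
    ).toDF("id", "values")

    // An ascending sort_array puts null elements first, so element 0
    // of the sorted array is null when the array contains a null.
    val flagged = df.select(
      col("id"),
      isnull(sort_array(col("values"), true)(0)).as("hasNull")
    )

    val result = flagged.as[(Int, Boolean)].collect().toMap
    spark.stop()
    result
  }

  def main(args: Array[String]): Unit =
    println(run()) // row 1 is flagged, row 2 is not
}
```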

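Check this out (note that toDF("id") renames the top-level column A to id, so the paths below start with id rather than A):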

scala> val df = spark.read.format("json").option("multiLine","true").load("/tmp/stack/tanvi.json").toDF("id")
df: org.apache.spark.sql.DataFrame = [id: struct<adList: array<struct<a:string,b:string,c:bigint,optionalField:string>>>]

scala> df.printSchema
root
 |-- id: struct (nullable = true)
 |    |-- adList: array (nullable = true)
 |    |    |-- element: struct (containsNull = true)
 |    |    |    |-- a: string (nullable = true)
 |    |    |    |-- b: string (nullable = true)
 |    |    |    |-- c: long (nullable = true)
 |    |    |    |-- optionalField: string (nullable = true)


scala> df.select(sort_array(df("id.adList.optionalField"),true)(0),size(df("id.adList.optionalField"))).show(false)
+---------------------------------------------------------------+------------------------------------------------+
|sort_array(id.adList.optionalField AS `optionalField`, true)[0]|size(id.adList.optionalField AS `optionalField`)|
+---------------------------------------------------------------+------------------------------------------------+
|null                                                           |2                                               |
+---------------------------------------------------------------+------------------------------------------------+


scala> df.select(sort_array(df("id.adList.optionalField"),true)(1),size(df("id.adList.optionalField"))).show(false)
+---------------------------------------------------------------+------------------------------------------------+
|sort_array(id.adList.optionalField AS `optionalField`, true)[1]|size(id.adList.optionalField AS `optionalField`)|
+---------------------------------------------------------------+------------------------------------------------+
|null                                                           |2                                               |
+---------------------------------------------------------------+------------------------------------------------+


scala> df.select(isnull(sort_array(df("id.adList.optionalField"),true)(0)),size(df("id.adList.optionalField"))).show(false)
+-------------------------------------------------------------------------+------------------------------------------------+
|(sort_array(id.adList.optionalField AS `optionalField`, true)[0] IS NULL)|size(id.adList.optionalField AS `optionalField`)|
+-------------------------------------------------------------------------+------------------------------------------------+
|true                                                                     |2                                               |
+-------------------------------------------------------------------------+------------------------------------------------+


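To get the row count the question asks for, the same expression can be used as a filter. The sketch below is a hedged illustration rather than part of the original answer: the two sample JSON rows, the object name, and the local SparkSession are assumptions. It also shows an alternative that was not proposed in the answer, the SQL higher-order function `exists`, which assumes Spark 2.4 or later. As the transcript above shows, a struct that lacks optionalField also contributes a null to the extracted array, so both variants count rows where the field is absent as well as rows where it is explicitly null.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, expr, isnull, sort_array}

object CountRowsWithNullField {
  // Returns (count via sort_array, count via exists) for the sample rows.
  def run(): (Long, Long) = {
    // Local session, for illustration only.
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("count-null-field")
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample: row 1 mirrors the question's JSON,
    // row 2 has optionalField set in every struct.
    val json = Seq(
      """{"id":1,"A":{"adList":[{"a":"qwfqw"},{"b":"fqw","c":23423,"optionalField":null}]}}""",
      """{"id":2,"A":{"adList":[{"optionalField":"x"},{"optionalField":"y"}]}}"""
    )
    val df = spark.read.json(json.toDS())

    // Selecting a field through an array of structs yields an array of that field.
    val field = col("A.adList.optionalField")

    // Approach from the answer: nulls sort first, so test element 0.
    val viaSort = df.where(isnull(sort_array(field, true)(0))).count()

    // Alternative (assumes Spark 2.4+): higher-order function in SQL syntax.
    val viaExists = df
      .where(expr("exists(A.adList.optionalField, x -> x IS NULL)"))
      .count()

    spark.stop()
    (viaSort, viaExists)
  }

  def main(args: Array[String]): Unit = {
    val (bySort, byExists) = run()
    println(s"sort_array: $bySort, exists: $byExists")
  }
}
```

With this sample, only the first row matches under either approach.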