Spark filter on multiple columns not working as expected

Time: 2020-03-04 07:10:53

Tags: java apache-spark apache-spark-sql

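I have a DataFrame with 30 rows and 10 columns. The column names change dynamically. In 2 of the rows, 8 of the columns are null, and I am trying to remove those records using a Spark SQL expression with the filter operator and `and`.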

I am seeing odd behaviour from the filter. When I write the query as
df.filter("(`1Heavy Buyers` is null) and (`2Non Buyers` is null) and (`3Total Buyers` is null) and (`4Heavy Buyers` is null) and (`7Non Buyers` is null) and (`5Total Buyers` is null) and (`6Heavy Buyers` is null) and (`8Non Buyers` is null) and (`9otal Buyers` is null)")

I get back 2 records, which is correct.

But when I try to get the non-null rows with

df.filter("(`1Heavy Buyers` is not null) and (`2Non Buyers` is not null) and (`3Total Buyers` is not null) and (`4Heavy Buyers` is not null) and (`7Non Buyers` is not null) and (`5Total Buyers` is not null) and (`6Heavy Buyers` is not null) and (`8Non Buyers` is not null) and (`9otal Buyers` is not null)")

I only get 16 records, when it should be 28. In this case, a row is removed even if only one of the listed columns is null.

I am using Spark 2.3.0.

I don't understand what I am doing wrong.

2 Answers:

Answer 0: (score: 1)

The first query you ran keeps only the records in which every one of the listed columns is null.

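The second query, however, keeps only the records in which every one of the listed columns is non-null. That means a record is filtered out as soon as a single column is null. By De Morgan's law, the opposite of "all columns are null" is "at least one column is not null", so the conditions have to be joined with `or` rather than `and`. To get the result you want, you can run the following query: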

df.filter("(`1Heavy Buyers` is not null) or (`2Non Buyers` is not null) or (`3Total Buyers` is not null) or (`4Heavy Buyers` is not null) or (`7Non Buyers` is not null) or (`5Total Buyers` is not null) or (`6Heavy Buyers` is not null) or (`8Non Buyers` is not null) or (`9otal Buyers` is not null)")

The query above should give you the 28 records you expect. Feel free to comment if you have any questions.
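To see why the counts come out as 2, 16 and 28, here is a minimal plain-Java sketch that does not use Spark at all. The data set is hypothetical: I am assuming 2 rows that are entirely null, 12 rows with exactly one null, and 16 rows with no nulls, which is one distribution consistent with the numbers in the question. The class and method names are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Objects;

class NullFilterDemo {
    // Hypothetical 30-row data set with 9 nullable columns per row:
    // rows 0..1 are entirely null, rows 2..13 have exactly one null,
    // rows 14..29 have no nulls. (Assumed distribution, not the asker's data.)
    static List<Integer[]> sampleRows() {
        List<Integer[]> rows = new ArrayList<>();
        for (int i = 0; i < 30; i++) {
            Integer[] row = new Integer[9];   // starts out all null
            if (i >= 2) {
                Arrays.fill(row, i);          // populate every column
            }
            if (i >= 2 && i < 14) {
                row[i % 9] = null;            // knock out a single column
            }
            rows.add(row);
        }
        return rows;
    }

    // Mirrors "(a is null) and (b is null) and ..."
    static long countAllNull(List<Integer[]> rows) {
        return rows.stream()
                .filter(r -> Arrays.stream(r).allMatch(Objects::isNull))
                .count();
    }

    // Mirrors "(a is not null) and (b is not null) and ..."
    static long countAllNotNull(List<Integer[]> rows) {
        return rows.stream()
                .filter(r -> Arrays.stream(r).allMatch(Objects::nonNull))
                .count();
    }

    // Mirrors "(a is not null) or (b is not null) or ..."
    static long countAnyNotNull(List<Integer[]> rows) {
        return rows.stream()
                .filter(r -> Arrays.stream(r).anyMatch(Objects::nonNull))
                .count();
    }

    public static void main(String[] args) {
        List<Integer[]> rows = sampleRows();
        System.out.println("all null:     " + countAllNull(rows));
        System.out.println("all not null: " + countAllNotNull(rows));
        System.out.println("any not null: " + countAnyNotNull(rows));
    }
}
```

Only the `or` version is the exact complement of the first query: its count plus the all-null count adds up to the full 30 rows.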

Answer 1: (score: 0)

If you want to keep the SQL expression, you can also negate it as a whole in Java, like this:

df.filter(functions.not(functions.expr("(`1Heavy Buyers` is null) and (`2Non Buyers` is null) and (`3Total Buyers` is null) and (`4Heavy Buyers` is null) and (`7Non Buyers` is null) and (`5Total Buyers` is null) and (`6Heavy Buyers` is null) and (`8Non Buyers` is null) and (`9otal Buyers` is null)")))
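Since the question says the column names change dynamically, the negated condition could also be built from a list of names instead of being typed out by hand. The following is only a sketch: `NullFilterExpr` and `notAllNull` are hypothetical names, and it assumes the list is non-empty and that no column name itself contains a backtick. The Spark call appears only in a comment, because no `Dataset` is constructed here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

class NullFilterExpr {
    // Builds "not ((`c1` is null) and (`c2` is null) and ...)".
    // Assumes a non-empty list and column names without backticks.
    static String notAllNull(List<String> columns) {
        String allNull = columns.stream()
                .map(c -> "(`" + c + "` is null)")
                .collect(Collectors.joining(" and "));
        return "not (" + allNull + ")";
    }

    public static void main(String[] args) {
        List<String> cols = Arrays.asList("1Heavy Buyers", "2Non Buyers");
        System.out.println(notAllNull(cols));
        // With an existing Dataset<Row> named df, the string could be passed
        // straight to the filter:
        //   df.filter(NullFilterExpr.notAllNull(cols));
    }
}
```

Spark also has `df.na().drop("all", columnArray)`, which is documented to drop a row only when all of the listed columns are null. That may be simpler, though I have not checked it against 2.3.0 with this data.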