I have a dataframe with 30 rows and 10 columns. The column names change dynamically. The dataframe has null values in 8 columns for 2 of its records, and I am trying to drop those records using Spark SQL with the filter operator.
I am seeing unexpected behavior from filter. When I build the query as
构建查询框架时df.filter('(`1Heavy Buyers` is null) and (`2Non Buyers` is null) and (`3Total Buyers` is null) and (`4Heavy Buyers` is null) and (`7Non Buyers` is null) and (`5Total Buyers` is null) and (`6Heavy Buyers` is null) and (`8Non Buyers` is null) and (`9otal Buyers` is null)')
I get back the 2 correct records.
But when I try to keep the non-null rows with
df.filter('(`1Heavy Buyers` is not null) and (`2Non Buyers` is not null) and (`3Total Buyers` is not null) and (`4Heavy Buyers` is not null) and (`7Non Buyers` is not null) and (`5Total Buyers` is not null) and (`6Heavy Buyers` is not null) and (`8Non Buyers` is not null) and (`9otal Buyers` is not null)')
I get only 16 records back when it should be 28. With this filter, a row is dropped even if just one of the mentioned columns has a null value.
I am using Spark 2.3.0.
I don't understand what I am doing wrong.
Answer 0 (score: 1)
Your first query keeps every record in the dataframe where all nine of the mentioned columns are null.
In your second query, however, only records in which every one of the mentioned columns is non-null are kept. That means a row is filtered out even if just a single one of those columns holds a null. To get the result you want, join the `is not null` conditions with `or` instead of `and` (by De Morgan's law, this is the exact negation of your first, all-null query):
df.filter('(`1Heavy Buyers` is not null) or (`2Non Buyers` is not null) or (`3Total Buyers` is not null) or (`4Heavy Buyers` is not null) or (`7Non Buyers` is not null) or (`5Total Buyers` is not null) or (`6Heavy Buyers` is not null) or (`8Non Buyers` is not null) or (`9otal Buyers` is not null)')
The query above will give you the 28 records you expect. If you have any questions, feel free to comment.
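The difference between the two filters is just De Morgan's law: NOT(A AND B) is (NOT A) OR (NOT B). A minimal sketch in plain Python (not Spark; rows are dicts and `None` stands in for SQL NULL, with hypothetical column names `a` and `b`) shows why AND-ing `is not null` drops any row that has even a single null, while OR-ing them removes only the fully null rows:

```python
# Plain-Python illustration of the filter semantics; None plays the role of SQL NULL.
rows = [
    {"a": 1,    "b": 2},     # fully populated row
    {"a": None, "b": 2},     # one null -> unexpectedly dropped by the AND filter
    {"a": None, "b": None},  # all null -> the only kind of row we want to remove
]
cols = ["a", "b"]

# First query in the question: keep rows where ALL columns are null.
all_null = [r for r in rows if all(r[c] is None for c in cols)]

# Second query in the question: AND over "is not null" keeps only fully populated rows.
and_not_null = [r for r in rows if all(r[c] is not None for c in cols)]

# The fix: OR over "is not null" keeps every row that has at least one value.
or_not_null = [r for r in rows if any(r[c] is not None for c in cols)]

print(len(all_null))      # 1  (analogous to the 2 all-null records)
print(len(and_not_null))  # 1  (analogous to getting only 16 of 30)
print(len(or_not_null))   # 2  (analogous to the desired 28)
```

The same reasoning carries over to the Spark SQL predicate: OR-ing the `is not null` conditions is the exact complement of the all-null filter, so the two result sets together cover all 30 rows.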
Answer 1 (score: 0)
Since it is a SQL predicate, we can also negate the whole all-null expression; in Java it looks like this (note the double-quoted string literal, and that `not` comes from `org.apache.spark.sql.functions`):
df.filter(functions.not(functions.expr("(`1Heavy Buyers` is null) and (`2Non Buyers` is null) and (`3Total Buyers` is null) and (`4Heavy Buyers` is null) and (`7Non Buyers` is null) and (`5Total Buyers` is null) and (`6Heavy Buyers` is null) and (`8Non Buyers` is null) and (`9otal Buyers` is null)")))