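We are working with Spark SQL and ranking rows on some nullable string fields.
The problem: in Spark SQL, null values rank first. We want null values to come last instead, so we applied CASE WHEN logic. Because our data is Unicode, "ZZZZZZZZ" does not sort last; it sorts before Japanese and Chinese address lines.
Please let us know which string constant to substitute for null values so that they sort last in ORDER BY.
Sample code is below.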
SELECT O.CompanyName,
       ROW_NUMBER() OVER
       (
           PARTITION BY O.CompanyName
           ORDER BY
               CASE WHEN O.AddressLine1 IS NOT NULL THEN O.AddressLine1 ELSE 'ZZZZZZZZ' END ASC
       ) AS BestDataForCompany
FROM CompanyData O
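To illustrate why the sentinel fails, here is a minimal Python sketch. It assumes Spark's default string ordering is a byte-wise comparison of the UTF-8 encoding, and the sample addresses are made up for the example.

```python
# Hypothetical sample values; "ZZZZZZZZ" is the sentinel from the query above.
sentinel = "ZZZZZZZZ"
addresses = ["Baker St", "東京都千代田区", "北京市朝阳区", sentinel]

# Assumption: sort by raw UTF-8 bytes to mimic a binary string comparison.
ordered = sorted(addresses, key=lambda s: s.encode("utf-8"))

for position, value in enumerate(ordered, start=1):
    print(position, value)

# 'Z' encodes as the single byte 0x5A, while CJK characters encode to
# multi-byte sequences starting at 0xE0 or above, so the sentinel
# lands before every CJK address instead of after them.
sentinel_position = ordered.index(sentinel)
```

Any ASCII-only constant has the same problem, which is why ordering on a separate null flag is more robust than picking a "large" string.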
Answer 0 (score: 2)
The ORDER BY of a ranking (window) function in Spark SQL supports the NULLS LAST option, so this works:
SELECT
CompanyName,
AddressLine1,
ROW_NUMBER() OVER ( PARTITION BY CompanyName ORDER BY AddressLine1 ) BestDataForCompany1,
ROW_NUMBER() OVER ( PARTITION BY CompanyName ORDER BY CASE WHEN AddressLine1 IS NULL THEN 1 ELSE 0 END, AddressLine1 DESC ) BestDataForCompany2,
ROW_NUMBER() OVER ( PARTITION BY CompanyName ORDER BY AddressLine1 NULLS LAST ) BestDataForCompany3
FROM CompanyData
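If you want to check the ordering logic without a Spark session, the sketch below runs the null-flag variant against Python's built-in sqlite3 module as a stand-in. This is an assumption-laden illustration, not Spark itself: it needs SQLite 3.25 or newer for window functions, it relies on SQLite also sorting NULL first in ascending order, and the three sample rows are invented.

```python
import sqlite3

# In-memory stand-in table with the same column names as the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CompanyData (CompanyName TEXT, AddressLine1 TEXT)")
conn.executemany(
    "INSERT INTO CompanyData VALUES (?, ?)",
    [("Acme", None), ("Acme", "東京都"), ("Acme", "Baker St")],  # hypothetical rows
)

query = """
SELECT
    AddressLine1,
    -- Default ascending order: NULL sorts first.
    ROW_NUMBER() OVER (PARTITION BY CompanyName ORDER BY AddressLine1) AS DefaultRank,
    -- Null flag first (0 = has value, 1 = NULL), then the value ascending.
    ROW_NUMBER() OVER (
        PARTITION BY CompanyName
        ORDER BY CASE WHEN AddressLine1 IS NULL THEN 1 ELSE 0 END, AddressLine1
    ) AS NullsLastRank
FROM CompanyData
"""

# Map each address to its (DefaultRank, NullsLastRank) pair.
ranks = {addr: (default_rank, nulls_last_rank)
         for addr, default_rank, nulls_last_rank in conn.execute(query)}

for addr, pair in ranks.items():
    print(addr, pair)
```

The flag column never depends on the character set of the data, so the NULL row ends up last regardless of which scripts appear in AddressLine1.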
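Answer 1 (score: 1)
I have not tested this, but I think you are better off splitting the null rows into a separate group and sorting after that. Then apply the real ranking you want: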
SELECT O.CompanyName,
       ROW_NUMBER() OVER
       (
           PARTITION BY O.CompanyName, CASE WHEN O.AddressLine1 IS NOT NULL THEN 0 ELSE 1 END
           ORDER BY
               CASE WHEN O.AddressLine1 IS NOT NULL THEN 0 ELSE 1 END, O.AddressLine1
       ) AS BestDataForCompany
FROM CompanyData O
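One thing worth checking before relying on this version: because the null flag is part of PARTITION BY, row numbering should restart inside each (company, flag) group. The sketch below tries that with Python's built-in sqlite3 module as a stand-in for Spark (assumptions: SQLite 3.25 or newer, and invented sample rows), so treat it as a rough check rather than a Spark result.

```python
import sqlite3

# In-memory stand-in table with the same column names as the question.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE CompanyData (CompanyName TEXT, AddressLine1 TEXT)")
conn.executemany(
    "INSERT INTO CompanyData VALUES (?, ?)",
    [("Acme", None), ("Acme", "東京都"), ("Acme", "Baker St")],  # hypothetical rows
)

query = """
SELECT
    AddressLine1,
    ROW_NUMBER() OVER (
        -- The null flag splits each company into two separate partitions.
        PARTITION BY CompanyName, CASE WHEN AddressLine1 IS NOT NULL THEN 0 ELSE 1 END
        ORDER BY AddressLine1
    ) AS SplitRank
FROM CompanyData
"""

# Map each address to the row number it receives within its own partition.
split_ranks = dict(conn.execute(query).fetchall())

for addr, rank in split_ranks.items():
    print(addr, rank)
```

In this stand-in run the NULL row gets row number 1 within its own partition, the same number as the first non-null address. If the goal is a single ranking per company with nulls at the end, keeping the flag only in ORDER BY (or using NULLS LAST) looks like the safer choice.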