SparkSQL "CASE WHEN THEN" with two table columns in pyspark

Asked: 2017-11-15 15:19:52

Tags: apache-spark apache-spark-sql spark-dataframe pyspark-sql

I have two temporary tables, table_a and table_b, and I am trying to get the following query, with all of its conditions, to work.

SELECT DISTINCT CASE WHEN a.id IS NULL THEN b.id ELSE a.id END id,
    CASE WHEN a.num IS NULL THEN b.num ELSE a.num END num,
    CASE WHEN a.testdate IS NULL THEN b.testdate ELSE a.testdate END testdate
FROM table_a a
    FULL OUTER JOIN table_b b
    ON (a.id=b.id AND a.num=b.num AND a.testdate=b.testdate)
WHERE
    (CASE WHEN a.t_amt IS NULL THEN 0 ELSE a.t_amt END)
    <>
    (CASE WHEN b.t_amt IS NULL THEN 0 ELSE b.t_amt END) OR
    (CASE WHEN a.qty IS NULL THEN 0 ELSE a.qty END)
    <>
    (CASE WHEN b.qty IS NULL THEN 0 ELSE b.qty END)
ORDER BY
    CASE WHEN a.id IS NULL THEN b.id ELSE a.id END,
    CASE WHEN a.num IS NULL THEN b.num ELSE a.num END,
    CASE WHEN a.testdate IS NULL THEN b.testdate ELSE a.testdate END

Running the query above against these two tables with SparkSQL produces the following error:

sqlq = <the sql from above>

df = sqlContext.sql(sqlq)

AnalysisException: u"cannot resolve 'a.id' given input columns: [id, num, testdate];"

1 Answer:

Answer 0 (score: 1)

The error appears to be in your ORDER BY clause: at that point the query has no notion of the tables a and b, only of the names and aliases produced by the SELECT clause.
That makes sense, because you should only order the results by columns that actually exist in the result set.

SELECT DISTINCT (CASE WHEN a.id IS NULL THEN b.id ELSE a.id END) AS id,
    (CASE WHEN a.num IS NULL THEN b.num ELSE a.num END) AS num,
    (CASE WHEN a.testdate IS NULL THEN b.testdate ELSE a.testdate END) AS testdate
FROM table_a AS a
    FULL OUTER JOIN table_b AS b
    ON (a.id=b.id AND a.num=b.num AND a.testdate=b.testdate)
WHERE
    (CASE WHEN a.t_amt IS NULL THEN 0 ELSE a.t_amt END) <> (CASE WHEN b.t_amt IS NULL THEN 0 ELSE b.t_amt END)
    OR
    (CASE WHEN a.qty IS NULL THEN 0 ELSE a.qty END) <> (CASE WHEN b.qty IS NULL THEN 0 ELSE b.qty END)
ORDER BY id, num, testdate