I am trying to use a subquery in a LEFT JOIN condition, but I get the error message "Error in SQL statement: AnalysisException: Table or view not found: TableD;", which points to the FROM TableD D2 line in the subquery.
SELECT D1.Code, D1.Description, C.InstanceKey
FROM TableA A
INNER JOIN TableB B
ON A.Key = B.Key
INNER JOIN TableC C
ON B.DetailKey = C.DetailKey
LEFT JOIN TableD D1
ON C.InstanceKey = D1.InstanceKey
AND D1.RankCnt = (SELECT MIN(D2.RankCnt)
FROM TableD D2
WHERE C.InstanceKey = D2.InstanceKey);
If I remove the subquery and hard-code D1.RankCnt = [anyValidRankCnt], the query runs without problems.
This question has also been posted on the Databricks community forum at https://forums.databricks.com/questions/14588/why-is-subquery-in-left-join-causing-error-msg.html.
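Answer 0 (score: 1)
I am not sure whether Spark currently supports this particular type of correlated subquery, but I was able to rewrite it in a few different ways, including one that uses ROW_NUMBER. Please check that these queries are semantically equivalent for your data: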
%sql
-- Rewrite 1: CTE with ROW_NUMBER (returns one row per InstanceKey, even when RankCnt ties)
WITH cte AS
(
SELECT D1.Code, D1.Description, C.InstanceKey, ROW_NUMBER() OVER ( PARTITION BY c.InstanceKey ORDER BY D1.RankCnt ) xrank
FROM TableA A
INNER JOIN TableB B
ON A.Key = B.Key
INNER JOIN TableC C
ON B.DetailKey = C.DetailKey
LEFT JOIN TableD D1
ON C.InstanceKey = D1.InstanceKey
)
SELECT *
FROM cte
WHERE xrank = 1;
-- Rewrite 2: subquery
SELECT x.Code, x.Description, C.InstanceKey
FROM TableA A
INNER JOIN TableB B
ON A.Key = B.Key
INNER JOIN TableC C
ON B.DetailKey = C.DetailKey
LEFT JOIN
(
SELECT D1.InstanceKey, D1.Code, D1.Description, D1.RankCnt
FROM TableD D1
INNER JOIN
(
SELECT InstanceKey, MIN(RankCnt) RankCnt
FROM TableD
GROUP BY InstanceKey
) D2 ON D1.InstanceKey = D2.InstanceKey
AND D1.RankCnt = D2.RankCnt
) x
ON c.InstanceKey = x.InstanceKey;
-- Rewrite 3: UNION ALL
SELECT D1.Code, D1.Description, C.InstanceKey
FROM TableA A
INNER JOIN TableB B
ON A.Key = B.Key
INNER JOIN TableC C
ON B.DetailKey = C.DetailKey
INNER JOIN TableD D1
ON C.InstanceKey = D1.InstanceKey
INNER JOIN
(
SELECT D2.InstanceKey, MIN(D2.RankCnt) RankCnt
FROM TableD D2
GROUP BY D2.InstanceKey
) x ON C.InstanceKey = x.InstanceKey
AND D1.RankCnt = x.RankCnt
UNION ALL
SELECT NULL AS Code, NULL AS Description, C.InstanceKey
FROM TableA A
INNER JOIN TableB B
ON A.Key = B.Key
INNER JOIN TableC C
ON B.DetailKey = C.DetailKey
WHERE NOT EXISTS
(
SELECT *
FROM TableD D1
WHERE C.InstanceKey = D1.InstanceKey
);
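One way to act on the advice to check semantic equivalence is to run the queries against a small hand-made dataset. The following is a minimal sketch that uses Python's built-in sqlite3 module rather than Spark, so it only checks the logic of the queries and says nothing about which forms Spark accepts. The sample rows, the `BASE` fragment, and the helper `run` are invented for illustration, and `Key` is quoted only as a precaution because it is an SQL keyword.

```python
import sqlite3
from collections import Counter

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE TableA ("Key" INTEGER);
    CREATE TABLE TableB ("Key" INTEGER, DetailKey INTEGER);
    CREATE TABLE TableC (DetailKey INTEGER, InstanceKey INTEGER);
    CREATE TABLE TableD (InstanceKey INTEGER, Code TEXT, Description TEXT, RankCnt INTEGER);

    -- Hypothetical sample data: 100 has a unique minimum, 200 has a tie, 300 has no TableD rows.
    INSERT INTO TableA VALUES (1), (2), (3);
    INSERT INTO TableB VALUES (1, 10), (2, 20), (3, 30);
    INSERT INTO TableC VALUES (10, 100), (20, 200), (30, 300);
    INSERT INTO TableD VALUES
        (100, 'a1', 'lowest rank', 1),
        (100, 'a2', 'higher rank', 2),
        (200, 'b1', 'tied lowest', 5),
        (200, 'b2', 'tied lowest', 5),
        (200, 'b3', 'higher rank', 7);
""")

# The A/B/C joins shared by every query in the question and the answer.
BASE = """
    FROM TableA A
    INNER JOIN TableB B ON A."Key" = B."Key"
    INNER JOIN TableC C ON B.DetailKey = C.DetailKey
"""

# Original query: correlated subquery inside the LEFT JOIN condition.
original = f"""
    SELECT D1.Code, D1.Description, C.InstanceKey
    {BASE}
    LEFT JOIN TableD D1
        ON C.InstanceKey = D1.InstanceKey
       AND D1.RankCnt = (SELECT MIN(D2.RankCnt)
                         FROM TableD D2
                         WHERE C.InstanceKey = D2.InstanceKey)
"""

# Rewrite 2: LEFT JOIN to a derived table that pre-filters TableD to its minimum rank.
rewrite2 = f"""
    SELECT x.Code, x.Description, C.InstanceKey
    {BASE}
    LEFT JOIN (
        SELECT D1.InstanceKey, D1.Code, D1.Description, D1.RankCnt
        FROM TableD D1
        INNER JOIN (
            SELECT InstanceKey, MIN(RankCnt) RankCnt
            FROM TableD
            GROUP BY InstanceKey
        ) D2 ON D1.InstanceKey = D2.InstanceKey
            AND D1.RankCnt = D2.RankCnt
    ) x ON C.InstanceKey = x.InstanceKey
"""

# Rewrite 3: matched rows via INNER JOIN, unmatched rows added back with UNION ALL.
rewrite3 = f"""
    SELECT D1.Code, D1.Description, C.InstanceKey
    {BASE}
    INNER JOIN TableD D1 ON C.InstanceKey = D1.InstanceKey
    INNER JOIN (
        SELECT D2.InstanceKey, MIN(D2.RankCnt) RankCnt
        FROM TableD D2
        GROUP BY D2.InstanceKey
    ) x ON C.InstanceKey = x.InstanceKey
       AND D1.RankCnt = x.RankCnt
    UNION ALL
    SELECT NULL AS Code, NULL AS Description, C.InstanceKey
    {BASE}
    WHERE NOT EXISTS (SELECT * FROM TableD D1 WHERE C.InstanceKey = D1.InstanceKey)
"""

def run(sql):
    """Return the result as a multiset of rows, so row order does not matter."""
    return Counter(conn.execute(sql).fetchall())

original_rows = run(original)
rewrite2_rows = run(rewrite2)
rewrite3_rows = run(rewrite3)

print("Rewrite 2 matches original:", rewrite2_rows == original_rows)
print("Rewrite 3 matches original:", rewrite3_rows == original_rows)
```

Rewrite 1 is left out of this comparison on purpose: ROW_NUMBER keeps a single row per InstanceKey, so on data with a tie for the lowest RankCnt (InstanceKey 200 here) it returns fewer rows than the original query would. Whether that difference matters depends on your data.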