Inner join on multiple IDs when they are not NULL

Time: 2019-12-24 17:24:54

Tags: sql database join hive

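I'm having trouble writing a join that captures the records described below. I spent about five hours on it yesterday without success.

I have two tables, Table A and Table B, and both have the following columns: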

ID_1, ID_2, ID_3, ID_4 

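Now I need to create a join between the two tables so that the result pulls records on the matching IDs that are not NULL. If more than one ID matches, I want to use all of the matching IDs to pull the record. That gives a few scenarios:

Scenario 1: all IDs match exactly in both tables (this is easy to code)

Here I would join on all of the IDs.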

+--------+---------+---------+--------+
| A.ID_1 |  A.ID_2 |  A.ID_3 | A.ID_4 |
+--------+---------+---------+--------+
| CAD    |   AAPL  |     853 |    200 |
+--------+---------+---------+--------+

+--------+--------+--------+--------+
| B.ID_1 | B.ID_2 | B.ID_3 | B.ID_4 |
+--------+--------+--------+--------+
| CAD    | AAPL   |    853 |    200 |
+--------+--------+--------+--------+

Scenario 2: one or more IDs match in both tables and the rest are NULL (also simple)

Here I can join on ID_1 and ID_3 only.

+--------+--------+--------+--------+
| A.ID_1 | A.ID_2 | A.ID_3 | A.ID_4 |
+--------+--------+--------+--------+
| CAD    | NULL   |    933 | NULL   |
+--------+--------+--------+--------+

+--------+--------+--------+--------+
| B.ID_1 | B.ID_2 | B.ID_3 | B.ID_4 |
+--------+--------+--------+--------+
| CAD    | NULL   |    933 | NULL   |
+--------+--------+--------+--------+
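Scenario 2 is where a plain `=` comparison stops working, because `NULL = NULL` is not true in SQL. The sketch below reproduces that on the two rows above using Python's built-in `sqlite3` module. SQLite is only a stand-in here, and the helper name `join_rows` is made up for illustration. SQLite spells null-safe equality as `IS`; Hive documents a `<=>` operator for the same purpose.

```python
import sqlite3

COLUMNS = ("ID_1", "ID_2", "ID_3", "ID_4")


def join_rows(op: str) -> list:
    """Join the scenario 2 rows, comparing every ID column with `op`."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE A (ID_1 TEXT, ID_2 TEXT, ID_3 INTEGER, ID_4 INTEGER);
        CREATE TABLE B (ID_1 TEXT, ID_2 TEXT, ID_3 INTEGER, ID_4 INTEGER);
        INSERT INTO A VALUES ('CAD', NULL, 933, NULL);
        INSERT INTO B VALUES ('CAD', NULL, 933, NULL);
    """)
    # Build "A.ID_1 <op> B.ID_1 AND A.ID_2 <op> B.ID_2 AND ..."
    on_clause = " AND ".join(f"A.{c} {op} B.{c}" for c in COLUMNS)
    rows = conn.execute(f"SELECT A.* FROM A INNER JOIN B ON {on_clause}").fetchall()
    conn.close()
    return rows


plain = join_rows("=")       # NULL = NULL is not true, so the row is lost
null_safe = join_rows("IS")  # SQLite's IS treats two NULLs as equal

print(plain)      # []
print(null_safe)  # [('CAD', None, 933, None)]
```

A null-safe comparison alone would also pair two rows whose IDs are all NULL, so scenario 4 still needs its own filter.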

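Scenario 3: one or more IDs match between the tables, but some do not

Here I only need to join on ID_1 and ID_2, because ID_3 is NULL in Table B and ID_4 is NULL in Table A.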

+--------+--------+--------+--------+
| A.ID_1 | A.ID_2 | A.ID_3 | A.ID_4 |
+--------+--------+--------+--------+
| CAD    |  TSLA  |    341 | NULL   |
+--------+--------+--------+--------+

+--------+--------+--------+--------+
| B.ID_1 | B.ID_2 | B.ID_3 | B.ID_4 |
+--------+--------+--------+--------+
| CAD    |  TSLA  |  NULL  |    250 |
+--------+--------+--------+--------+
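One way to read scenario 3 is "compare only the columns that are populated on both sides". The following plain-Python sketch states that reading explicitly. The helper names `shared_keys` and `rows_match` are hypothetical, and the rule itself is an interpretation of the question rather than a confirmed requirement.

```python
COLUMNS = ("ID_1", "ID_2", "ID_3", "ID_4")


def shared_keys(row_a: dict, row_b: dict) -> list:
    """Return the columns that are non-NULL (not None) in both rows."""
    return [c for c in COLUMNS if row_a[c] is not None and row_b[c] is not None]


def rows_match(row_a: dict, row_b: dict) -> bool:
    """Rows match if they share at least one populated column and agree on all of them."""
    keys = shared_keys(row_a, row_b)
    return bool(keys) and all(row_a[k] == row_b[k] for k in keys)


a = {"ID_1": "CAD", "ID_2": "TSLA", "ID_3": 341, "ID_4": None}
b = {"ID_1": "CAD", "ID_2": "TSLA", "ID_3": None, "ID_4": 250}
all_null = dict.fromkeys(COLUMNS)  # every ID is None

print(shared_keys(a, b))               # ['ID_1', 'ID_2']
print(rows_match(a, b))                # True
print(rows_match(all_null, all_null))  # False
```

Requiring at least one shared column is what rejects the all-NULL rows of scenario 4.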

Scenario 4: all IDs are NULL, so the record is rejected

If Table A contains the following:

+--------+--------+--------+--------+
| A.ID_1 | A.ID_2 | A.ID_3 | A.ID_4 |
+--------+--------+--------+--------+
| CAD    |  AAPL  |  853   |   200  |
+--------+--------+--------+--------+
| CAD    | NULL   |  933   | NULL   |
+--------+--------+--------+--------+ 
| CAD    |  TSLA  |  341   | NULL   |
+--------+--------+--------+--------+
| NULL   |  NULL  |  NULL  | NULL   |
+--------+--------+--------+--------+

And Table B contains the following:

+--------+--------+--------+--------+
| B.ID_1 | B.ID_2 | B.ID_3 | B.ID_4 |
+--------+--------+--------+--------+
| CAD    |  AAPL  |  853   |   200  |
+--------+--------+--------+--------+
| CAD    |  NULL  |  933   |  NULL  |
+--------+--------+--------+--------+ 
| CAD    |  TSLA  |  NULL  |   250  |
+--------+--------+--------+--------+
| NULL   |  NULL  |  NULL  |  NULL  |
+--------+--------+--------+--------+

Then the result would be:

+--------+--------+--------+--------+
| ID_1   | ID_2   | ID_3   | ID_4   |
+--------+--------+--------+--------+
| CAD    |  AAPL  |  853   |   200  |
+--------+--------+--------+--------+
| CAD    |  NULL  |  933   |  NULL  |
+--------+--------+--------+--------+ 
| CAD    |  TSLA  |  341   |  NULL  |
+--------+--------+--------+--------+
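The sample data contains a pitfall worth checking before settling on a single join condition. If every column is compared with "equal, or NULL on either side", the row (CAD, NULL, 933, NULL) from Table A pairs with two rows of Table B, so it would appear twice in the output. The sketch below demonstrates this with Python's `sqlite3` module. SQLite is used only as a convenient stand-in, and `loose_join` is a hypothetical helper name.

```python
import sqlite3

COLUMNS = ("ID_1", "ID_2", "ID_3", "ID_4")
ROWS_A = [
    ("CAD", "AAPL", 853, 200),
    ("CAD", None, 933, None),
    ("CAD", "TSLA", 341, None),
    (None, None, None, None),
]
ROWS_B = [
    ("CAD", "AAPL", 853, 200),
    ("CAD", None, 933, None),
    ("CAD", "TSLA", None, 250),
    (None, None, None, None),
]


def loose_join() -> list:
    """Return (A row + B row) tuples for every pair the loose condition accepts."""
    conn = sqlite3.connect(":memory:")
    for name, rows in (("A", ROWS_A), ("B", ROWS_B)):
        conn.execute(
            f"CREATE TABLE {name} (ID_1 TEXT, ID_2 TEXT, ID_3 INTEGER, ID_4 INTEGER)"
        )
        conn.executemany(f"INSERT INTO {name} VALUES (?, ?, ?, ?)", rows)
    # Each column must be equal or NULL on at least one side...
    compatible = " AND ".join(
        f"(A.{c} = B.{c} OR A.{c} IS NULL OR B.{c} IS NULL)" for c in COLUMNS
    )
    # ...and at least one column must really be equal (rejects all-NULL rows).
    some_equal = " OR ".join(f"A.{c} = B.{c}" for c in COLUMNS)
    sql = f"SELECT A.*, B.* FROM A INNER JOIN B ON {compatible} AND ({some_equal})"
    pairs = conn.execute(sql).fetchall()
    conn.close()
    return pairs


pairs = loose_join()
for pair in pairs:
    print(pair[:4], "<->", pair[4:])
print(len(pairs))  # 4 pairs, although only 3 result rows are wanted
```

The extra pair comes from (CAD, NULL, 933, NULL) in A meeting (CAD, TSLA, NULL, 250) in B, which share only ID_1. Avoiding it requires giving exact matches priority over loose ones.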

Thanks

1 answer:

Answer 0 (score: 0)

Maybe you want something like this? Someone can always write shorter code... :-)

WITH directMatch AS (
    SELECT
        A.*
    FROM
        A
        INNER JOIN B
            -- exclude scenario 4
            ON (
                A.ID_1 IS NOT NULL
                OR A.ID_2 IS NOT NULL
                OR A.ID_3 IS NOT NULL
                OR A.ID_4 IS NOT NULL
            )
            AND (
                B.ID_1 IS NOT NULL
                OR B.ID_2 IS NOT NULL
                OR B.ID_3 IS NOT NULL
                OR B.ID_4 IS NOT NULL
            )
            -- keep scenario 1+2
            AND (
                A.ID_1 = B.ID_1
                OR A.ID_1 IS NULL AND B.ID_1 IS NULL
            )
            AND (
                A.ID_2 = B.ID_2
                OR A.ID_2 IS NULL AND B.ID_2 IS NULL
            )
            AND (
                A.ID_3 = B.ID_3
                OR A.ID_3 IS NULL AND B.ID_3 IS NULL
            )
            AND (
                A.ID_4 = B.ID_4
                OR A.ID_4 IS NULL AND B.ID_4 IS NULL
            )
)
SELECT
    *
FROM
    -- scenario 1+2
    directMatch
UNION ALL SELECT
    A.*
FROM
    A
    INNER JOIN B
        -- exclude scenario 4
        ON (
            A.ID_1 IS NOT NULL
            OR A.ID_2 IS NOT NULL
            OR A.ID_3 IS NOT NULL
            OR A.ID_4 IS NOT NULL
        )
        AND (
            B.ID_1 IS NOT NULL
            OR B.ID_2 IS NOT NULL
            OR B.ID_3 IS NOT NULL
            OR B.ID_4 IS NOT NULL
        )
        -- scenario 3
        AND (
            COALESCE(A.ID_1, B.ID_1) = COALESCE(B.ID_1, A.ID_1)
            OR A.ID_1 IS NULL AND B.ID_1 IS NULL
        )
        AND (
            COALESCE(A.ID_2, B.ID_2) = COALESCE(B.ID_2, A.ID_2)
            OR A.ID_2 IS NULL AND B.ID_2 IS NULL
        )
        AND (
            COALESCE(A.ID_3, B.ID_3) = COALESCE(B.ID_3, A.ID_3)
            OR A.ID_3 IS NULL AND B.ID_3 IS NULL
        )
        AND (
            COALESCE(A.ID_4, B.ID_4) = COALESCE(B.ID_4, A.ID_4)
            OR A.ID_4 IS NULL AND B.ID_4 IS NULL
        )
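    -- Filters moved to WHERE: equivalent for an INNER JOIN, and some engines
    -- do not accept subqueries inside ON.
    WHERE NOT EXISTS(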
            SELECT
                *
            FROM
                directMatch m
            WHERE 
                (
                    A.ID_1 = m.ID_1
                    OR A.ID_1 IS NULL AND m.ID_1 IS NULL
                )
                AND (
                    A.ID_2 = m.ID_2
                    OR A.ID_2 IS NULL AND m.ID_2 IS NULL
                )
                AND (
                    A.ID_3 = m.ID_3
                    OR A.ID_3 IS NULL AND m.ID_3 IS NULL
                )
                AND (
                    A.ID_4 = m.ID_4
                    OR A.ID_4 IS NULL AND m.ID_4 IS NULL
                )
        )
        AND NOT EXISTS(
            SELECT
                *
            FROM
                directMatch m
            WHERE 
                (
                    B.ID_1 = m.ID_1
                    OR B.ID_1 IS NULL AND m.ID_1 IS NULL
                )
                AND (
                    B.ID_2 = m.ID_2
                    OR B.ID_2 IS NULL AND m.ID_2 IS NULL
                )
                AND (
                    B.ID_3 = m.ID_3
                    OR B.ID_3 IS NULL AND m.ID_3 IS NULL
                )
                AND (
                    B.ID_4 = m.ID_4
                    OR B.ID_4 IS NULL AND m.ID_4 IS NULL
                )
        )
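
To check the query against the sample data from the question, the same two-step idea (exact matches first, then loose matches for rows that had no exact match) can be run in SQLite through Python's `sqlite3` module. This is a condensed sketch and not the answer's exact text. It assumes SQLite instead of Hive, uses SQLite's `IS` for null-safe equality, rewrites the `COALESCE` comparison as the equivalent "equal or NULL on either side", and builds the repeated per-column predicates with a hypothetical helper `all_cols`.

```python
import sqlite3

COLUMNS = ("ID_1", "ID_2", "ID_3", "ID_4")


def all_cols(template: str, glue: str) -> str:
    """Repeat a per-column predicate for every ID column and join the pieces."""
    return "(" + glue.join(template.format(c=c) for c in COLUMNS) + ")"


EXACT = all_cols("A.{c} IS B.{c}", " AND ")  # scenarios 1 and 2
LOOSE = all_cols("(A.{c} = B.{c} OR A.{c} IS NULL OR B.{c} IS NULL)", " AND ")
A_HAS_ID = all_cols("A.{c} IS NOT NULL", " OR ")  # excludes scenario 4
B_HAS_ID = all_cols("B.{c} IS NOT NULL", " OR ")
A_IN_DIRECT = all_cols("A.{c} IS m.{c}", " AND ")
B_IN_DIRECT = all_cols("B.{c} IS m.{c}", " AND ")

QUERY = f"""
WITH directMatch AS (
    SELECT A.* FROM A INNER JOIN B ON {A_HAS_ID} AND {EXACT}
)
SELECT * FROM directMatch
UNION ALL
SELECT A.* FROM A INNER JOIN B ON {A_HAS_ID} AND {B_HAS_ID} AND {LOOSE}
WHERE NOT EXISTS (SELECT 1 FROM directMatch m WHERE {A_IN_DIRECT})
  AND NOT EXISTS (SELECT 1 FROM directMatch m WHERE {B_IN_DIRECT})
"""

ROWS_A = [
    ("CAD", "AAPL", 853, 200),
    ("CAD", None, 933, None),
    ("CAD", "TSLA", 341, None),
    (None, None, None, None),
]
ROWS_B = [
    ("CAD", "AAPL", 853, 200),
    ("CAD", None, 933, None),
    ("CAD", "TSLA", None, 250),
    (None, None, None, None),
]


def run(rows_a: list, rows_b: list) -> list:
    """Load both tables into an in-memory database and run the two-step query."""
    conn = sqlite3.connect(":memory:")
    for name, rows in (("A", rows_a), ("B", rows_b)):
        conn.execute(
            f"CREATE TABLE {name} (ID_1 TEXT, ID_2 TEXT, ID_3 INTEGER, ID_4 INTEGER)"
        )
        conn.executemany(f"INSERT INTO {name} VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(QUERY).fetchall()
    conn.close()
    return result


result = run(ROWS_A, ROWS_B)
for row in result:
    print(row)
```

On the sample data this returns the three rows the question expects. Row order is not guaranteed because the query has no ORDER BY, and duplicate rows in either table would be repeated in the output.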