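The problem is how to join two tables with similar data when the join key is not always unique (new data is being imported from a temporary table into a live table).
In that case I need to resolve the duplicates in ID order: each ID should appear only once, meaning each next table1.id takes the first unused table2.id, and vice versa.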
Consider the following data and tables:
Imported:
| Id | guid | key1 | key2 | unimportant |
|----|------|------|------|-------------|
| 1 | 1001 | 1 | '01' | 'a' |
| 2 | 1002 | null | '02' | 'b' |
| 3 | 1003 | null | '02' | 'c' |
| 5 | 1005 | 5 | '05' | 'd' |
Live:
| Id | origGuid | key1 | key2 | important |
|----|----------|------|------|-----------|
| 15 | 1001 | 1 | '01' | 'imported' |
| 16 | 1002 | null | '02' | 'imported' |
| 17 | null | null | '02' | 'user restor' |
| 18 | 1004 | 4 | '04' | 'imported' |
| 19 | null | null | '02' | 'user new' |
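What I want to get is shown in the Expected column of the results comparison further down.
Here are the queries that prepare the data: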
CREATE TEMPORARY TABLE imported (id serial, guid decimal(30,0), key1 integer, key2 varchar, unimportant varchar);
INSERT INTO imported VALUES (1, 1001, 1, '01', 'a');
INSERT INTO imported VALUES (2, 1002, null, '02', 'b');
INSERT INTO imported VALUES (3, 1003, null, '02', 'c');
INSERT INTO imported VALUES (5, 1005, 5, '05', 'd');
CREATE TEMPORARY TABLE live (id serial, orig_guid integer, key1 integer, key2 varchar, important varchar);
INSERT INTO live VALUES (15, 1001, 1, '01', 'imported');
INSERT INTO live VALUES (16, 1002, null, '02', 'imported');
INSERT INTO live VALUES (17, null, null, '02', 'user restor');
INSERT INTO live VALUES (18, 1004, 4, '04', 'imported');
INSERT INTO live VALUES (19, null, null, '02', 'user new');
My old query looks like this. It is slow (nested loop join) and the result is not right (duplicates are not resolved):
SELECT DISTINCT imported.id AS imported_id, live.id AS live_id
FROM live
INNER JOIN imported ON
live.orig_guid = imported.guid OR (
(live.orig_guid IS NULL OR imported.guid IS NULL) AND
(live.key1 IS NULL AND imported.key1 IS NULL OR live.key1 = imported.key1) AND
(live.key2 IS NULL AND imported.key2 IS NULL OR live.key2 = imported.key2)
)
ORDER BY live.id ASC, imported.id ASC
In the optimized query I split the SELECT in two with UNION and used COALESCE to cut down on the ORs and speed things up:
WITH
liveT AS (SELECT id, COALESCE(orig_guid,0) AS guid, COALESCE(key1,0) AS key1, COALESCE(key2,'null') AS key2 FROM live),
importedT AS (SELECT id, COALESCE(guid,0) AS guid, COALESCE(key1,0) AS key1, COALESCE(key2,'null') AS key2 FROM imported),
join1 AS (
SELECT imported.id AS imported_id, live.id AS live_id FROM imported
INNER JOIN live ON imported.guid = live.orig_guid AND imported.guid <> 0 AND live.orig_guid <> 0
),
joins AS (
SELECT imported.id AS imported_id, live.id AS live_id FROM importedT imported
INNER JOIN liveT live ON
(live.guid = 0 OR imported.guid = 0) AND
live.key1 = imported.key1 AND
(live.key2 = imported.key2) -- for one key I also have "OR imported.key2 = 'null'", because it is a new property and not as strict
-- an anti-join is used here to reduce the number of records
LEFT OUTER JOIN join1 ON join1.imported_id = imported.id
WHERE join1.imported_id IS NULL
UNION
SELECT imported_id, live_id FROM join1
)
SELECT DISTINCT imported_id, live_id FROM joins
ORDER BY imported_id ASC NULLS LAST, live_id ASC NULLS LAST
But the result is still not right, and it takes three similar queries.
The query results are:
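Old:
| import_id | live_id |
|-----------|---------|
| 1 | 15 |
| 2 | 16 |
| 2 | 17 |
| 2 | 19 |
| 3 | 17 |
| 3 | 19 |
Optimized:
| import_id | live_id |
|-----------|---------|
| 1 | 15 |
| 2 | 16 |
| 3 | 17 |
| 3 | 19 |
Expected:
| import_id | live_id |
|-----------|---------|
| 1 | 15 |
| 2 | 16 |
| 3 | 17 |
| 5 | null |
| null | 18 |
| null | 19 |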
Answer 0 (score: 1)
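Your question is not very clear.
You included the sample data as `INSERT` statements, which is very useful and makes the question easier to answer. You also showed the expected result, which is great as well. It usually helps, though, to explain in plain English the logic behind that result. That part of the question is not very clear.
Looking at the queries you tried, I'm guessing that the `Imported` and `Live` tables should be joined on both `key1` and `key2`. On top of that, if a `(key1, key2)` pair is not unique, the tables should be joined row by row in the order defined by the `id` column.
Also, both `key1` and `key2` can be `NULL`, so `NULL` values should be replaced with `0` and `"null"`.
**Query**
`rn_imported` and `rn_live` are subqueries with an extra column of row numbers generated by the `ROW_NUMBER()` function.
These subqueries are then joined with a `FULL JOIN` on `key1, key2, rn`.
See SQL Fiddle.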
SELECT
imported_id
,live_id
FROM
(
SELECT
id AS imported_id
,COALESCE(key1, 0) AS key1
,COALESCE(key2, 'null') AS key2
,ROW_NUMBER() OVER (PARTITION BY key1, key2 ORDER BY id) AS rn
FROM imported
) AS rn_imported
FULL JOIN
(
SELECT
id AS live_id
,COALESCE(key1, 0) AS key1
,COALESCE(key2, 'null') AS key2
,ROW_NUMBER() OVER (PARTITION BY key1, key2 ORDER BY id) AS rn
FROM live
) AS rn_live
ON rn_imported.key1 = rn_live.key1
AND rn_imported.key2 = rn_live.key2
AND rn_imported.rn = rn_live.rn
ORDER BY imported_id ASC NULLS LAST, live_id ASC NULLS LAST
**Result**
| imported_id | live_id |
|-------------|---------|
| 1 | 15 |
| 2 | 16 |
| 3 | 17 |
| 5 | (null) |
| (null) | 18 |
| (null) | 19 |
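
To see why the rows pair up this way, it can help to look at the row numbers before the join. The following is only an illustrative sketch against the sample tables from the question; the `src` column is a label added here for readability and is not part of the original query.

```sql
-- Illustrative only: list the row number each row receives within its
-- (key1, key2) group. PARTITION BY puts all NULL keys into the same group.
SELECT 'imported' AS src, id, key1, key2,
       ROW_NUMBER() OVER (PARTITION BY key1, key2 ORDER BY id) AS rn
FROM imported
UNION ALL
SELECT 'live' AS src, id, key1, key2,
       ROW_NUMBER() OVER (PARTITION BY key1, key2 ORDER BY id) AS rn
FROM live
ORDER BY src, key1, key2, rn;
```

For the key `(NULL, '02')`, imported ids 2 and 3 get `rn` 1 and 2, while live ids 16, 17 and 19 get `rn` 1, 2 and 3. Joining on `rn` as well as the keys therefore pairs 2 with 16 and 3 with 17, and leaves 19 without a partner.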
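To make this approach as efficient as possible, you should make the `key1` and `key2` columns `NOT NULL` so that the `COALESCE` calls are no longer needed. The function itself is fast, but using a function like this usually makes it impossible to use an index. Once the function calls are gone, add an index on `(key1, key2, id)` in both tables: one index on the three columns, in this order. Making it a unique index wouldn't hurt and may give the optimizer some extra hints. With this index, `ROW_NUMBER` should be able to generate the needed numbers without an extra sort. Having two ordered sets of data should also help the join.
I want to repeat this: just adding an index without making the columns `NOT NULL` would most likely be useless.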
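
As a rough sketch of that advice for the `live` table (the same steps would apply to `imported`): the index name is made up for illustration, and the backfill assumes that `0` and `'null'` never occur as real key values, which the question does not confirm.

```sql
-- Sketch only. Assumes 0 and 'null' are safe placeholders for missing keys.
UPDATE live
SET key1 = COALESCE(key1, 0),
    key2 = COALESCE(key2, 'null')
WHERE key1 IS NULL OR key2 IS NULL;

-- Forbid NULLs from now on, so the query no longer needs COALESCE.
ALTER TABLE live
    ALTER COLUMN key1 SET NOT NULL,
    ALTER COLUMN key2 SET NOT NULL;

-- One index over the three columns, in this order (hypothetical name).
CREATE INDEX live_key1_key2_id_idx ON live (key1, key2, id);
```

With both tables changed this way, the subqueries could select `key1` and `key2` directly. Whether the planner then actually avoids the extra sort is worth checking with `EXPLAIN` on real data.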