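I want to delete exact duplicate records from a table while keeping one copy of each. However, I can't use the intermediate-table approach, because every column except the ID column contains duplicates. For example:
ID,
COL1,
COL2,
COL3,
COL4
The duplicates are on COL1, COL2, COL3 and COL4.
Some sample rows: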
ID COL1 COL2 COL3 COL4
123 ABC 4RTFD FGY 12346
234 ABC 4RTFD FGY 12346
586 ABC 4RTFD FGY 12346
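Here only the ID column differs; the other four columns are duplicated. I want to keep only the row with the highest ID.
What approach can I use here?
Thanks, Amit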
Answer 0: (score: 2)
Try joining the table to itself on all columns, with the IDs being different...
CREATE TABLE Dups
(
    ID int IDENTITY(1,1) PRIMARY KEY,
    Col1 int NOT NULL,
    Col2 date NOT NULL,
    Col3 char(1) NOT NULL,
    Col4 char(1) NOT NULL
);
INSERT dbo.Dups (Col1,Col2,Col3,Col4)
VALUES ('1','20170925','A','Z'),
       ('1','20170925','A','Z'),
       ('1','20170925','A','Z'),
       ('2','20170925','A','Z'),
       ('2','20170925','A','Z'),
       ('2','20170925','A','Z'),
       ('3','20170925','A','Z');
SELECT * FROM Dups;
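-- This solution retains the first (lowest) ID in each group of duplicates.
-- The self-join returns every duplicated row once per partner, so DISTINCT
-- collapses those repeats, and PARTITION BY restarts row_number() for each
-- group (without it the numbering would run across the whole table).
DELETE FROM Dups
WHERE ID IN (
    SELECT ID
    FROM (
        SELECT dup.ID,
               row_number() OVER (PARTITION BY dup.Col1, dup.Col2, dup.Col3, dup.Col4
                                  ORDER BY dup.ID) AS DupSeq
        FROM (
            SELECT DISTINCT d1.ID, d1.Col1, d1.Col2, d1.Col3, d1.Col4
            FROM dbo.Dups AS d1
            INNER JOIN dbo.Dups AS d2 ON d2.Col1 = d1.Col1 AND d2.Col2 = d1.Col2
                                     AND d2.Col3 = d1.Col3 AND d2.Col4 = d1.Col4
            WHERE d1.ID <> d2.ID
        ) AS dup
    ) AS t
    WHERE DupSeq > 1
);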
-- This solution retains the last (highest) ID in each group of duplicates.
DELETE FROM Dups
WHERE ID NOT IN (
SELECT DISTINCT
max(t.ID) OVER(PARTITION BY t.Col1,t.Col2,t.Col3,t.Col4 ORDER BY WindowOrder) AS KeepID
FROM (
SELECT d1.ID,
d1.Col1,
d1.Col2,
d1.Col3,
d1.Col4,
1 AS WindowOrder
FROM dbo.Dups AS d1
LEFT OUTER JOIN dbo.Dups AS d2 ON d2.Col1 = d1.Col1
AND d2.Col2 = d1.Col2
AND d2.Col3 = d1.Col3
AND d2.Col4 = d1.Col4
AND d1.ID <> d2.ID
) AS t
);
SELECT * FROM Dups;
DROP TABLE dbo.Dups
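You need row_number() in the first solution because the match is symmetric: ID 1 matches ID 3, so ID 3 also matches ID 1.
In the second solution the join is a LEFT OUTER join so that values with no duplicates are kept.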
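For anyone who wants to try the self-join idea without a SQL Server instance, here is a minimal sketch that runs a keep-the-highest-ID variant against an in-memory SQLite database from Python. SQLite is only a stand-in here (my assumption, not something the answer tested), and the names `ANSWER0_ROWS` and `delete_lower_id_duplicates` are made up for illustration. Instead of row_number(), it breaks the symmetry of the join by only matching a row to partners with a higher ID.

```python
import sqlite3

# Same values as the INSERT above: rows 1-3 and 4-6 are duplicates, row 7 is unique.
ANSWER0_ROWS = [
    (1, "20170925", "A", "Z"),
    (1, "20170925", "A", "Z"),
    (1, "20170925", "A", "Z"),
    (2, "20170925", "A", "Z"),
    (2, "20170925", "A", "Z"),
    (2, "20170925", "A", "Z"),
    (3, "20170925", "A", "Z"),
]


def delete_lower_id_duplicates(rows):
    """Load rows into a throwaway table, delete duplicates, return surviving IDs."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE Dups ("
        "ID INTEGER PRIMARY KEY AUTOINCREMENT, "
        "Col1 INTEGER NOT NULL, Col2 TEXT NOT NULL, "
        "Col3 TEXT NOT NULL, Col4 TEXT NOT NULL)"
    )
    conn.executemany(
        "INSERT INTO Dups (Col1, Col2, Col3, Col4) VALUES (?, ?, ?, ?)", rows
    )
    # A row is deleted when some other row has the same four values and a
    # higher ID, so only the highest ID of each group survives.
    conn.execute(
        """
        DELETE FROM Dups
        WHERE ID IN (
            SELECT d1.ID
            FROM Dups AS d1
            INNER JOIN Dups AS d2
                ON d2.Col1 = d1.Col1 AND d2.Col2 = d1.Col2
               AND d2.Col3 = d1.Col3 AND d2.Col4 = d1.Col4
               AND d1.ID < d2.ID
        )
        """
    )
    ids = [row[0] for row in conn.execute("SELECT ID FROM Dups ORDER BY ID")]
    conn.close()
    return ids


if __name__ == "__main__":
    print(delete_lower_id_duplicates(ANSWER0_ROWS))  # [3, 6, 7]
```

Flipping the comparison to `d1.ID > d2.ID` would keep the lowest ID of each group instead.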
Answer 1: (score: 0)
You can do what many others have done before in SQL Server (and Teradata); see "How to delete duplicate rows in sql server?". It can also be done without something like a CTE:
DELETE t1 -- SQL Server form: the derived table's alias is the delete target
FROM (
    SELECT ROW_NUMBER()
           OVER (PARTITION BY col1, col2, col3, col4
                 ORDER BY ID DESC) rn
    FROM tbl -- tbl is "your" table ...
) t1
WHERE rn > 1
It works in SQL Server. I haven't tested it on Teradata, but since ROW_NUMBER() exists there, I would expect it to work too...
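Since the answer was not tested outside SQL Server, here is a rough, portable sketch of the same ROW_NUMBER() idea, run on an in-memory SQLite database from Python. This is an assumption-laden stand-in: SQLite does not accept a derived table as a delete target, so the numbered rows are used in an `IN` list instead, and window functions need SQLite 3.25 or newer. The names `QUESTION_ROWS` and `delete_with_row_number` are hypothetical.

```python
import sqlite3

# The three sample rows from the question.
QUESTION_ROWS = [
    (123, "ABC", "4RTFD", "FGY", 12346),
    (234, "ABC", "4RTFD", "FGY", 12346),
    (586, "ABC", "4RTFD", "FGY", 12346),
]


def delete_with_row_number(rows):
    """Keep the highest ID per (col1..col4) group; return the surviving IDs."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tbl ("
        "ID INTEGER PRIMARY KEY, col1 TEXT, col2 TEXT, col3 TEXT, col4 INTEGER)"
    )
    conn.executemany("INSERT INTO tbl VALUES (?, ?, ?, ?, ?)", rows)
    # Number the rows of each group from the highest ID down; rn = 1 is the
    # row to keep, everything with rn > 1 is a duplicate to delete.
    conn.execute(
        """
        DELETE FROM tbl
        WHERE ID IN (
            SELECT ID
            FROM (
                SELECT ID,
                       ROW_NUMBER() OVER (PARTITION BY col1, col2, col3, col4
                                          ORDER BY ID DESC) AS rn
                FROM tbl
            ) AS numbered
            WHERE rn > 1
        )
        """
    )
    ids = [row[0] for row in conn.execute("SELECT ID FROM tbl ORDER BY ID")]
    conn.close()
    return ids


if __name__ == "__main__":
    print(delete_with_row_number(QUESTION_ROWS))  # [586]
```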
Answer 2: (score: 0)
You can use a correlated subquery with the max function to get the desired result, as shown below.
DELETE
FROM table1 t1
WHERE t1.Id <> (
SELECT max(t2.Id)
FROM table1 t2
WHERE t1.col1 = t2.col1
AND t1.col2 = t2.col2
AND t1.col3 = t2.col3
AND t1.col4 = t2.col4
);
The query above assumes table1 is the name of your table.
select * from table1;
Result:
ID Col1 Col2 Col3 Col4
---------------------------------
586 ABC 4RTFD FGY 12346
You can view the demo *here
Update
The rows below were added to the sample data set.
id col1 col2 col3 col4
----------------------------------
345 XYZ 4FTFD FGY 12346
745 XYZ 4FTFD FGY 12346
945 XYZ 4FTFD FGY 12346
Result:
id col1 col2 col3 col4
-----------------------------------
586 ABC 4RTFD FGY 12346
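945 XYZ 4FTFD FGY 12346
DEMO
*Note: Because no online Teradata demo tool was available, a PostgreSQL demo was used, since PostgreSQL also supports correlated subqueries. The query was also tried in a local Teradata environment.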
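In the same spirit as the PostgreSQL demo, the result tables above can be reproduced locally with a small sketch on an in-memory SQLite database from Python. SQLite is my stand-in here, not something the answer used, and `UPDATED_ROWS` and `delete_by_correlated_max` are illustrative names. The delete target is left without an alias, so the subquery refers to the outer row as `table1.col1` rather than `t1.col1`.

```python
import sqlite3

# Sample rows from the question plus the three rows added in the update.
UPDATED_ROWS = [
    (123, "ABC", "4RTFD", "FGY", 12346),
    (234, "ABC", "4RTFD", "FGY", 12346),
    (586, "ABC", "4RTFD", "FGY", 12346),
    (345, "XYZ", "4FTFD", "FGY", 12346),
    (745, "XYZ", "4FTFD", "FGY", 12346),
    (945, "XYZ", "4FTFD", "FGY", 12346),
]


def delete_by_correlated_max(rows):
    """Delete every row whose id is not the max id of its group; return what is left."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE table1 ("
        "id INTEGER PRIMARY KEY, col1 TEXT, col2 TEXT, col3 TEXT, col4 INTEGER)"
    )
    conn.executemany("INSERT INTO table1 VALUES (?, ?, ?, ?, ?)", rows)
    # For each outer row, the subquery finds the highest id among rows with
    # the same col1..col4; the row is deleted unless it is that highest id.
    conn.execute(
        """
        DELETE FROM table1
        WHERE id <> (
            SELECT max(t2.id)
            FROM table1 AS t2
            WHERE t2.col1 = table1.col1
              AND t2.col2 = table1.col2
              AND t2.col3 = table1.col3
              AND t2.col4 = table1.col4
        )
        """
    )
    remaining = conn.execute(
        "SELECT id, col1, col2, col3, col4 FROM table1 ORDER BY id"
    ).fetchall()
    conn.close()
    return remaining


if __name__ == "__main__":
    for row in delete_by_correlated_max(UPDATED_ROWS):
        print(row)
    # (586, 'ABC', '4RTFD', 'FGY', 12346)
    # (945, 'XYZ', '4FTFD', 'FGY', 12346)
```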
Answer 3: (score: 0)
Isn't this just a simple use of grouping?
select max(ID) ID, COL1, COL2, COL3, COL4
from tableA
group by 2,3,4,5
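and save the result to a new table. If you need to delete the duplicate rows from the existing table instead, you can run the following delete statement: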
delete from tableA as a1
where not exists (
    select 1 from (
        select max(ID) ID, COL1, COL2, COL3, COL4 from tableA group by 2,3,4,5) a2
    where a1.ID = a2.ID
      and a1.COL1 = a2.COL1
      and a1.COL2 = a2.COL2
      and a1.COL3 = a2.COL3
      and a1.COL4 = a2.COL4
)
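The "save it to a new table" half of this answer can be sketched as follows, again on an in-memory SQLite database from Python as a stand-in (an assumption on my part). The target table name `tableA_dedup` and the function name `copy_max_id_per_group` are hypothetical, and the grouping columns are spelled out by name instead of by position, including COL4 from the question.

```python
import sqlite3

# The three sample rows from the question.
TABLE_A_ROWS = [
    (123, "ABC", "4RTFD", "FGY", 12346),
    (234, "ABC", "4RTFD", "FGY", 12346),
    (586, "ABC", "4RTFD", "FGY", 12346),
]


def copy_max_id_per_group(rows):
    """Build a deduplicated copy of tableA holding the highest ID of each group."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE tableA ("
        "ID INTEGER PRIMARY KEY, COL1 TEXT, COL2 TEXT, COL3 TEXT, COL4 INTEGER)"
    )
    conn.executemany("INSERT INTO tableA VALUES (?, ?, ?, ?, ?)", rows)
    # One output row per distinct (COL1..COL4) combination, carrying max(ID).
    conn.execute(
        """
        CREATE TABLE tableA_dedup AS
        SELECT max(ID) AS ID, COL1, COL2, COL3, COL4
        FROM tableA
        GROUP BY COL1, COL2, COL3, COL4
        """
    )
    deduped = conn.execute(
        "SELECT ID, COL1, COL2, COL3, COL4 FROM tableA_dedup ORDER BY ID"
    ).fetchall()
    conn.close()
    return deduped


if __name__ == "__main__":
    print(copy_max_id_per_group(TABLE_A_ROWS))
    # [(586, 'ABC', '4RTFD', 'FGY', 12346)]
```

Note that the original poster said an intermediate table is not an option for them, so this only illustrates what the grouping query returns.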