我正在使用SQL Server 2008.我有一个表
Customers
customer_number int
field1 varchar
field2 varchar
field3 varchar
field4 varchar
...还有更多专栏,对我的查询无关紧要。
列 customer_number 是pk。我试图找到重复的值和它们之间的一些差异。
请帮我找到所有相同的行
1) field1,field2,field3,field4
2)只有3列相等而其中一列不相同(列表1中的行除外)
3)只有2列相等且其中两列不相符(列表1和列表2中的行除外)
最后,我将有3个表格,其中包含此结果和其他groupId,对于一组相似的组,它们将是相同的(例如,对于3列等于,具有3个相同列的行将是一个单独的组)
谢谢。
答案 0 :(得分:56)
这是一个方便的查询,用于查找表中的重复项。假设您要查找表中存在多个的所有电子邮件地址:
SELECT email, COUNT(email) AS NumOccurrences
FROM users
GROUP BY email
HAVING ( COUNT(email) > 1 )
您还可以使用此技术查找仅出现一次的行:
SELECT email
FROM users
GROUP BY email
HAVING ( COUNT(email) = 1 )
答案 1 :(得分:4)
最简单的可能是编写一个存储过程来迭代重复的每组客户,并分别为每个组号插入匹配的。
但是,我已经考虑过了,你可以用子查询做到这一点。希望我没有比它应该更复杂,但这应该得到你正在寻找的第一个重复表(所有四个字段)。请注意,这是未经测试的,因此可能需要稍微调整一下。
基本上,它会获得每组字段,其中有重复项,每组都有一个组号,然后获取所有这些字段的客户并分配相同的组号。
INSERT INTO FourFieldsDuplicates(group_no, customer_no)
SELECT Groups.group_no, custs.customer_no
FROM (SELECT ROW_NUMBER() OVER(ORDER BY c.field1) AS group_no,
c.field1, c.field2, c.field3, c.field4
FROM Customers c
GROUP BY c.field1, c.field2, c.field3, c.field4
HAVING COUNT(*) > 1) Groups
INNER JOIN Customers custs ON custs.field1 = Groups.field1
AND custs.field2 = Groups.field2
AND custs.field3 = Groups.field3
AND custs.field4 = Groups.field4
其他的有点复杂,但是你需要扩大可能性。然后是三个字段组:
INSERT INTO ThreeFieldsDuplicates(group_no, customer_no)
SELECT Groups.group_no, custs.customer_no
FROM (SELECT ROW_NUMBER() OVER(ORDER BY GroupsInner.field1) AS group_no,
GroupsInner.field1, GroupsInner.field2,
GroupsInner.field3, GroupsInner.field4
FROM (SELECT c.field1, c.field2, c.field3, NULL AS field4
FROM Customers c
WHERE NOT EXISTS(SELECT d.customer_no
FROM FourFieldsDuplicates d
WHERE d.customer_no = c.customer_no)
GROUP BY c.field1, c.field2, c.field3
UNION ALL
SELECT c.field1, c.field2, NULL AS field3, c.field4
FROM Customers c
WHERE NOT EXISTS(SELECT d.customer_no
FROM FourFieldsDuplicates d
WHERE d.customer_no = c.customer_no)
GROUP BY c.field1, c.field2, c.field4
UNION ALL
SELECT c.field1, NULL AS field2, c.field3, c.field4
FROM Customers c
WHERE NOT EXISTS(SELECT d.customer_no
FROM FourFieldsDuplicates d
WHERE d.customer_no = c.customer_no)
GROUP BY c.field1, c.field3, c.field4
UNION ALL
SELECT NULL AS field1, c.field2, c.field3, c.field4
FROM Customers c
WHERE NOT EXISTS(SELECT d.customer_no
FROM FourFieldsDuplicates d
WHERE d.customer_no = c.customer_no)
GROUP BY c.field2, c.field3, c.field4) GroupsInner
GROUP BY GroupsInner.field1, GroupsInner.field2,
GroupsInner.field3, GroupsInner.field4
HAVING COUNT(*) > 1) Groups
INNER JOIN Customers custs ON (Groups.field1 IS NULL OR custs.field1 = Groups.field1)
AND (Groups.field2 IS NULL OR custs.field2 = Groups.field2)
AND (Groups.field3 IS NULL OR custs.field3 = Groups.field3)
AND (Groups.field4 IS NULL OR custs.field4 = Groups.field4)
希望这会产生正确的结果,我会留下最后一个作为练习。 :-D
答案 2 :(得分:2)
我不确定您是否需要对不同字段进行相等检查(例如field1 = field2) 否则这可能就足够了。
修改
随意调整testdata,为我们提供根据您的规格输出错误的输入。
测试数据
DECLARE @Customers TABLE (
customer_number INTEGER IDENTITY(1, 1)
, field1 INTEGER
, field2 INTEGER
, field3 INTEGER
, field4 INTEGER)
INSERT INTO @Customers
SELECT 1, 1, 1, 1
UNION ALL SELECT 1, 1, 1, 1
UNION ALL SELECT 1, 1, 1, NULL
UNION ALL SELECT 1, 1, 1, 2
UNION ALL SELECT 1, 1, 1, 3
UNION ALL SELECT 2, 1, 1, 1
全部等于
SELECT ROW_NUMBER() OVER (ORDER BY c1.customer_number)
, c1.field1
, c1.field2
, c1.field3
, c1.field4
FROM @Customers c1
INNER JOIN @Customers c2 ON c2.customer_number > c1.customer_number
AND ISNULL(c2.field1, 0) = ISNULL(c1.field1, 0)
AND ISNULL(c2.field2, 0) = ISNULL(c1.field2, 0)
AND ISNULL(c2.field3, 0) = ISNULL(c1.field3, 0)
AND ISNULL(c2.field4, 0) = ISNULL(c1.field4, 0)
一个字段不同
SELECT ROW_NUMBER() OVER (ORDER BY field1, field2, field3, field4)
, field1
, field2
, field3
, field4
FROM (
SELECT DISTINCT c1.field1
, c1.field2
, c1.field3
, field4 = NULL
FROM @Customers c1
INNER JOIN @Customers c2 ON c2.customer_number > c1.customer_number
AND c2.field1 = c1.field1
AND c2.field2 = c1.field2
AND c2.field3 = c1.field3
AND ISNULL(c2.field4, 0) <> ISNULL(c1.field4, 0)
UNION ALL
SELECT DISTINCT c1.field1
, c1.field2
, NULL
, c1.field4
FROM @Customers c1
INNER JOIN @Customers c2 ON c2.customer_number > c1.customer_number
AND c2.field1 = c1.field1
AND c2.field2 = c1.field2
AND ISNULL(c2.field3, 0) <> ISNULL(c1.field3, 0)
AND c2.field4 = c1.field4
UNION ALL
SELECT DISTINCT c1.field1
, NULL
, c1.field3
, c1.field4
FROM @Customers c1
INNER JOIN @Customers c2 ON c2.customer_number > c1.customer_number
AND c2.field1 = c1.field1
AND ISNULL(c2.field2, 0) <> ISNULL(c1.field2, 0)
AND c2.field3 = c1.field3
AND c2.field4 = c1.field4
UNION ALL
SELECT DISTINCT NULL
, c1.field2
, c1.field3
, c1.field4
FROM @Customers c1
INNER JOIN @Customers c2 ON c2.customer_number > c1.customer_number
AND ISNULL(c2.field1, 0) <> ISNULL(c1.field1, 0)
AND c2.field2 = c1.field2
AND c2.field3 = c1.field3
AND c2.field4 = c1.field4
) c
答案 3 :(得分:0)
您可以简单地编写类似的内容来计算重复条目,我认为它正在起作用:
use *DATABASE_NAME*
go
SELECT *YOUR_FIELD*, COUNT(*) AS dupes
FROM *YOUR_TABLE_NAME*
GROUP BY *YOUR_FIELD*
HAVING (COUNT(*) > 1)
享受
答案 4 :(得分:0)
使用CUBE()
执行此操作有一种干净的方式,它将通过列的所有可能组合进行聚合
SELECT
field1,field2,field3,field4
,duplicate_row_count = COUNT(*)
,grp_id = GROUPING_ID(field1,field2,field3,field4)
INTO #duplicate_rows
FROM table_name
GROUP BY CUBE(field1,field2,field3,field4)
HAVING COUNT(*) > 1
AND GROUPING_ID(field1,field2,field3,field4) IN (0,1,2,4,8,3,5,6,9,10,12)
数字(0,1,2,4,8,3,5,6,9,10,12)只是位掩码(0000,0001,0010,0100,...,1010,1100)我们关心的分组集 - 有4个,3个或2个匹配的分组。
然后使用将#duplicate_rows中的NULL视为通配符的技术将其连接回原始表
SELECT a.*
FROM table_name a
INNER JOIN #duplicate_rows b
ON NULLIF(b.field1,a.field1) IS NULL
AND NULLIF(b.field2,a.field2) IS NULL
AND NULLIF(b.field3,a.field3) IS NULL
AND NULLIF(b.field4,a.field4) IS NULL
--WHERE grp_id IN (0) --Use this for 4 matches
--WHERE grp_id IN (1,2,4,8) --Use this for 3 matches
--WHERE grp_id IN (3,5,6,9,10,12) --Use this for 2 matches