Identify all rows with the same records

Date: 2018-04-13 10:01:56

Tags: sql sql-server

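I have a table like the one shown below.

I want to be able to identify every md5Hash value that is shared between different Custodians.

The result should list each ArtifactID, Custodian ID pair as a new row. For example: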

1098647, 1098624
1098648, 1098717
1098648, 1098624
1098647, 1098717

The table looks like this:

ArtifactID      md5Hash                             Custodian
1098647         e6ae2fbc906c42b55d25f6d660f4913a    1098624
1098648         e6ae2fbc906c42b55d25f6d660f4913a    1098717
1098649         9f0c88c40be3d01b6beed39b32dea3fb    1098624
1098650         39446d6f0a5b29fef001c184797349b4    1098624
1098651         35ec5012284256c97553b5342fd59530    1098624
1098652         0914cd30b41460efaab7d6703444a5de    1098624
1098653         929eefb170bc74ed3cfabae969a032ed    1098624
1098654         d8986a76130fde673bbf5f1f9fb82857    1098624
1098655         6399df1a2ca3fde7021da25e4aa9e722    1098624
1098656         a19701c034af4094bc3da149d1e9b8d1    1098624
1098657         8384d8e0562391ee02c731fc059b510c    1098624
1098658         94800202b4473f8ce3dc08ddea4aff0c    1098624
1098659         87388b9895c749147d5a19a8ccd9c865    1098624

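2 answers:

Answer 0 (score: 2)

First determine which hashes are repeated across different custodians, then retrieve those custodians.

Edit: Your desired result seems to involve an implicit relationship stored in the table. I tried to separate that relationship out in the CTEs below. This should get you what you need.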
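A minimal sketch of that two-step idea, not the answer's original code. It assumes the #Data temp table created in the script further down, and it returns each matching row with its own custodian only, without the cross-pairing the question asks for:

```sql
-- Sketch only: assumes the #Data temp table from the script below.
-- Step 1 (subquery): hashes that occur under more than one custodian.
-- Step 2 (outer query): the rows, and so the custodians, carrying those hashes.
SELECT
    D.ArtifactID,
    D.md5Hash,
    D.Custodian
FROM
    #Data AS D
WHERE
    D.md5Hash IN (
        SELECT R.md5Hash
        FROM #Data AS R
        GROUP BY R.md5Hash
        HAVING COUNT(DISTINCT R.Custodian) > 1
    );
```

On the sample data this picks out the two rows that share hash e6ae2fbc906c42b55d25f6d660f4913a.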

IF OBJECT_ID('tempdb..#Data') IS NOT NULL
    DROP TABLE #Data

CREATE TABLE #Data (
    ArtifactID INT,
    md5Hash VARCHAR(200),
    Custodian INT)

INSERT INTO #Data (
    ArtifactID,
    md5Hash,
    Custodian)
VALUES
    (1098647, 'e6ae2fbc906c42b55d25f6d660f4913a', 1098624), 
    (1098648, 'e6ae2fbc906c42b55d25f6d660f4913a', 1098717), 
    (1098649, '9f0c88c40be3d01b6beed39b32dea3fb', 1098624), 
    (1098650, '39446d6f0a5b29fef001c184797349b4', 1098624), 
    (1098651, '35ec5012284256c97553b5342fd59530', 1098624), 
    (1098652, '0914cd30b41460efaab7d6703444a5de', 1098624), 
    (1098653, '929eefb170bc74ed3cfabae969a032ed', 1098624), 
    (1098654, 'd8986a76130fde673bbf5f1f9fb82857', 1098624), 
    (1098655, '6399df1a2ca3fde7021da25e4aa9e722', 1098624), 
    (1098656, 'a19701c034af4094bc3da149d1e9b8d1', 1098624), 
    (1098657, '8384d8e0562391ee02c731fc059b510c', 1098624), 
    (1098658, '94800202b4473f8ce3dc08ddea4aff0c', 1098624), 
    (1098659, '87388b9895c749147d5a19a8ccd9c865', 1098624)

-- Each artifact paired with its hash
;WITH Artifacts AS
(
    SELECT DISTINCT
        D.ArtifactID,
        D.md5Hash
    FROM
        #Data AS D
),
-- Each custodian paired with the hashes it holds
Custodians AS
(
    SELECT DISTINCT
        D.Custodian,
        D.md5Hash
    FROM
        #Data AS D
),
-- Hashes held by more than one distinct custodian
RepeatedHash AS
(
    SELECT
        T.md5Hash
    FROM
        Custodians AS T
    GROUP BY
        T.md5Hash
    HAVING
        COUNT(DISTINCT(T.Custodian)) > 1
)
-- Pair every artifact with every custodian that shares a repeated hash
SELECT
    A.ArtifactID,
    C.Custodian
FROM
    RepeatedHash AS R
    INNER JOIN Custodians AS C ON R.md5Hash = C.md5Hash
    INNER JOIN Artifacts AS A ON R.md5Hash = A.md5Hash

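Answer 1 (score: 1)

You can self-join the table on the md5Hash field. The subquery groups the records by md5Hash and returns only the hashes that appear more than once: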

SELECT ArtifactID, Custodian
FROM table1 t
INNER JOIN (SELECT md5Hash
            FROM table1
            GROUP BY md5Hash
            HAVING COUNT(*) > 1
           ) tt ON t.md5Hash = tt.md5Hash

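Edit: Your update suggests that your table is not properly normalized. I strongly recommend normalizing it. To get the desired result with the current table design, you need two subqueries in addition to the one above, one pairing ArtifactID with md5Hash and another pairing Custodian with md5Hash, and then you can join the two on the implicit md5Hash relationship.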
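A minimal sketch of that join, written from the description rather than recovered from the answer. It assumes the table is named table1, as in the query above. The aliases a, c and tt are illustrative only:

```sql
-- Sketch only: table1 and its column names come from the query above;
-- the aliases a, c and tt are hypothetical.
SELECT a.ArtifactID, c.Custodian
FROM (
        -- Each artifact with its hash
        SELECT DISTINCT ArtifactID, md5Hash
        FROM table1
     ) AS a
INNER JOIN (
        -- Each custodian with the hashes it holds
        SELECT DISTINCT Custodian, md5Hash
        FROM table1
     ) AS c ON a.md5Hash = c.md5Hash
INNER JOIN (
        -- Hashes that appear on more than one row
        SELECT md5Hash
        FROM table1
        GROUP BY md5Hash
        HAVING COUNT(*) > 1
     ) AS tt ON a.md5Hash = tt.md5Hash;
```

On the sample data this should return the four ArtifactID, Custodian pairs listed in the question, because the two artifacts and the two custodians that share hash e6ae2fbc906c42b55d25f6d660f4913a are each paired with one another.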