Question


We have a table of duplicate customer numbers:

A varchar(16) NOT NULL,
B varchar(16) NOT NULL

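These columns started out as old and new (delete and keep), but drifted to the point where neither is preferred. The columns are really just "A" and "B": two numbers for the same customer, in either order.

Also, the table can hold any number of pairs for the same customer. You might see rows like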

a,b
b,c

meaning that a, b, and c are all for the same customer. You might also see rows like

a,b
b,a
c,a

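meaning that a, b, and c are all the same customer.

This is not a clean, acyclic representation like "old" and "new" values. A customer's list of IDs is represented in this table as a block of one or more rows, where the only connection is that a value in column A or B of one row may also appear in column A or B of other rows. My task is to combine them all into one list per customer.

I'd like to convert this mess into something like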

MasterKey int NOT NULL,
CustNum varchar(16) NOT NULL UNIQUE,
PRIMARY KEY( MasterKey, CustNum )

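A customer's one or more numbers will share a MasterKey in this table. As the UNIQUE constraint says, a given CustNum cannot appear more than once.

For example, rows like these in the original table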

1a,1b
1b,1c
2a,2b
2b,2c
2d,2a
...

should end up as rows like these in the new table

1 1a
1 1b
1 1c
2 2a
2 2b
2 2c
2 2d
...

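Edit: The values above are only there to make the pattern clear. Actual customer number values are arbitrary varchars.

What I've tried

This feels like a job for recursion, and hence a CTE. But the potentially cyclic nature of the source data makes it hard for me to get the anchor case right. I've tried pre-cleaning it into a more acyclic form, but I still can't seem to get there.

I've also stubbornly tried to do this as a set-based SQL operation rather than resorting to cursors and loops. But maybe that isn't possible.

I've spent 8 hours thinking about this and trying different approaches, but it keeps slipping away. Any thoughts or suggestions on the right approach, or even some sample code?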

Answer 1

Given the input data:

a,b
b,c
d,e
e,f
g,d

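I would add two new tables: one containing the pk values, and another containing pk and dup values in a one-to-many relationship with the pks, like this: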

pk
a
b
c
d
e
f
g


pk dup
a   b   
b   a
b   c
c   b   
d   e
e   d
e   f
f   e   
g   d
d   g

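The rows in the pk/dup table are populated from the input file, with each pair inserted in both (pk, dup) order and (dup, pk) order.

This gets you the first set of relationships between keys and duplicates, but you need to traverse the set again to pick up indirect relationships, like 'c is a duplicate of a'.

You can get these relationships by self-joining the pk/dup table on pkdup1.dup = pkdup2.pk. This joins row (a, b) to rows (b, a) and (b, c), letting you identify the relationship (a, c). It will also pick up (d, f), (f, d), and (g, e). You need to repeat the iteration to pick up (g, f).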
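The repeated self-join described above can be sketched outside the database to see when it stops. This is a minimal Python illustration, not T-SQL; the function name `close_pairs` is made up for the example, and it simply repeats the join on dup = pk until no new (pk, dup) rows appear.

```python
def close_pairs(pairs):
    """Repeat the self-join on dup = pk until no new (pk, dup) rows appear."""
    # Seed with both orderings, as in the pk/dup table.
    rel = set()
    for a, b in pairs:
        rel.add((a, b))
        rel.add((b, a))
    while True:
        # Self-join: (x, y) joined to (y, z) yields the indirect pair (x, z).
        found = {
            (x, z)
            for (x, y) in rel
            for (y2, z) in rel
            if y == y2 and x != z
        }
        if found <= rel:
            # Nothing new was discovered, so the relation is complete.
            return rel
        rel |= found


pairs = [("a", "b"), ("b", "c"), ("d", "e"), ("e", "f"), ("g", "d")]
closed = close_pairs(pairs)
```

With the sample input, (a, c) shows up after the first pass and (g, f) only after the second, which is why a single self-join is not enough.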

HTH

Answer 2

I think you have to do some looping. Here I look at one row at a time to make sure I get all the chained values that belong to a single masterKey.

-- assumes #temp (MasterKey, [Key]) already exists; Key is bracketed because it is a reserved word
declare @masterKey varchar(16)

while (1=1)
begin

    -- reset first: a SELECT that returns no rows leaves the variable unchanged
    set @masterKey = null

    -- get the next key that is not inserted yet as MasterKey or key
    select top 1 @masterKey = a
    from myTable
    where not exists (select 1
        from #temp
        where #temp.MasterKey = myTable.a
        or #temp.[Key] = myTable.a)

    if (@masterKey is null) -- out of a's so now work the b's
        select top 1 @masterKey = b
        from myTable
        where not exists (select 1
            from #temp
            where #temp.MasterKey = myTable.b
            or #temp.[Key] = myTable.b)

    if (@masterKey is null) -- totally done.
        break

    insert into #temp
    (MasterKey, [Key])
    values (@masterKey, @masterKey)


    while (1=1) -- now insert anything new with this masterKey in a
    begin
        insert into #temp (MasterKey, [Key])
        select top 1 @masterKey, myTable.b
        from myTable
        where myTable.a = @masterKey
        and not exists (select 1
            from #temp
            where #temp.MasterKey = myTable.b
            or #temp.[Key] = myTable.b)

        if @@rowcount < 1
            break
    end


    while (1=1) -- now insert anything with this masterKey in b
    begin
        insert into #temp (MasterKey, [Key])
        select top 1 @masterKey, myTable.a
        from myTable
        where myTable.b = @masterKey
        and not exists (select 1
            from #temp
            where #temp.MasterKey = myTable.a
            or #temp.[Key] = myTable.a)

        if @@rowcount < 1
            break

    end

end

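You'd actually have to wrap the two insert sections in another loop to make sure the chain is exhausted before moving on to the next masterKey, but you get the idea.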
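To show what "exhausting the chain" means, here is a rough Python sketch of the same idea (not T-SQL, and not the code above). The names `group_by_master_key` and `neighbours` are invented for the illustration; the inner `while` plays the role of the extra loop, following links from every newly added key rather than only from the masterKey itself.

```python
from collections import deque


def group_by_master_key(pairs):
    """Map every customer number to the master key of its chain."""
    # Record links in both directions, since A and B are interchangeable.
    neighbours = {}
    for a, b in pairs:
        neighbours.setdefault(a, set()).add(b)
        neighbours.setdefault(b, set()).add(a)

    assigned = {}  # customer number -> master key
    for start in neighbours:  # "get the next key that is not inserted yet"
        if start in assigned:
            continue
        assigned[start] = start
        queue = deque([start])
        while queue:  # keep going until this chain is exhausted
            current = queue.popleft()
            for other in neighbours[current]:
                if other not in assigned:
                    assigned[other] = start
                    queue.append(other)
    return assigned


rows = [("1a", "1b"), ("1b", "1c"), ("2a", "2b"), ("2b", "2c"), ("2d", "2a")]
master_of = group_by_master_key(rows)
```

Here 2c is linked to 2a only through 2b, so it is reached in the second round of the inner loop.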

Answer 3

This looks like a straightforward job to me. The code below assumes you cannot have 1a and 2b in the same record.

create table #temp (a varchar(10), b varchar(10))

insert into #temp
values ('1a', '1b')
,('1b', '1c')
,('2a', '2b')
,('2b', '2c')
,('2d', '2a')

select * from #temp

select a, b, left (a, 1) as id into #temp2 from #temp

select id, a from #temp2 
union 
select id, b from #temp2

Answer 4

What is the pattern for finding the key? If it is just the first number in the string, then this will pull it out:

select substring('FOO12',patindex('%[0-9]%','FOO12'),100)

If the string starts with the number, this will pull it out:

select substring('12FOO',1,patindex('%[A-Z]%','12FOO')-1)

Both return 12.

Answer 5

Based on some of the sample data in the comments, I think this should do the trick?

CREATE TABLE #sample
(A NVARCHAR(50)
,B NVARCHAR(50))

INSERT INTO #sample VALUES('FOO12','12DEF')
INSERT INTO #sample VALUES('12GHJ','12ABC')
INSERT INTO #sample VALUES('GURGLE721','GURGLZ721')
INSERT INTO #sample VALUES('word21','book721')
INSERT INTO #sample VALUES('orange21','apple21')

;WITH CTE as
(
SELECT A
,PATINDEX('%[A-Za-z]%',A) as text_start
,PATINDEX('%[0-9]%',A) as num_start
FROM #sample
UNION ALL
SELECT B
,PATINDEX('%[A-Za-z]%',B) as text_start
,PATINDEX('%[0-9]%',B) as num_start
FROM #sample
)
,cte2 AS
(
SELECT
*
,CASE WHEN text_start > num_start --Letters after numbers
    THEN SUBSTRING(A,text_start - num_start + 1,99999)
    WHEN text_start = 1 --Letters at start of string
    THEN SUBSTRING(A,1,num_start - 1)
    END AS letters
,CASE WHEN num_start > text_start --Numbers after letters
    THEN SUBSTRING(A,num_start - text_start + 1,99999)
    WHEN num_start = 1 --Numbers at start of string
    THEN SUBSTRING(A,1,text_start- 1)
    END AS numbers
FROM cte
)
SELECT DISTINCT
DENSE_RANK() OVER (ORDER BY numbers ASC) as group_num
,numbers + letters as cust_details
FROM cte2
ORDER BY numbers + letters asc

Answer 6

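I'm going to do something I haven't done before and post an answer to my own question. I owe sincere thanks to Beth and JBrooks for pointing me in the right direction. I really wanted to solve this in a set-based, declarative way. Maybe that is still possible using a CTE and recursion. But once I surrendered and accepted that it could be imperative and iterative, it became much easier.

Anyway, given this target table, as described in my question: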

CREATE TABLE UniqueCustomers
(
    uid     int NOT NULL,
    gpid    varchar(16) NOT NULL UNIQUE, -- Important: UNIQUE to disallow duplicates
    PRIMARY KEY( uid, gpid ) -- Important: Disallow duplicates
)

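I came up with the following stored procedure. It can be called whenever new dupes are reported, one pair at a time. It can also be called in a loop over the legacy table, which stores dupes as pairs in random order.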

CREATE PROCEDURE ReportDuplicateCustomerIDs
(
    @id1 varchar(16),
    @id2 varchar(16)
)
AS
BEGIN
    IF @id1 <> @id2
    BEGIN
        -- Retrieve the uid (if any) for each of the ids
        DECLARE @uid1 int
        SELECT @uid1 = NULL
        SELECT @uid1 = uid FROM UniqueCustomers WHERE gpid = @id1

        DECLARE @uid2 int
        SELECT @uid2 = NULL
        SELECT @uid2 = uid FROM UniqueCustomers WHERE gpid = @id2

        -- If we've seen NEITHER of the id's yet
        IF @uid1 IS NULL AND @uid2 IS NULL
        BEGIN
            -- Add both of them using a brand-new uid
            DECLARE @uidNew int
            SELECT @uidNew = Max(uid) + 1 FROM UniqueCustomers
            IF @uidNew IS NULL
                SET @uidNew = 0
            INSERT INTO UniqueCustomers VALUES( @uidNew, @id1 )
            INSERT INTO UniqueCustomers VALUES( @uidNew, @id2 )
        END
        ELSE
        BEGIN
            -- If we've seen BOTH id's already
            IF @uid1 IS NOT NULL AND @uid2 IS NOT NULL
            BEGIN
                -- If this pair bridges two existing chains.
                IF @uid1 <> @uid2
                BEGIN
                    -- Update everything using uid2 to use uid1 instead.
                    -- Consolidates the two dupe chains into one.
                    UPDATE UniqueCustomers SET uid = @uid1 WHERE uid = @uid2
                END
                -- ELSE nothing to do
            END
            ELSE
                -- If we've seen only id1, then insert id2 using
                -- the same uid that id1 is already using
                IF @uid1 IS NOT NULL
                    INSERT INTO UniqueCustomers VALUES( @uid1, @id2 )
                -- If we've seen only id2, then insert id1 using
                -- the same uid that id2 is already using
                ELSE -- @uid2 IS NOT NULL
                    INSERT INTO UniqueCustomers VALUES( @uid2, @id1 )
        END
    END
END
GO
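For readers who want to trace the branches without a database, the procedure's logic can be mirrored in a few lines of Python. This is only an illustrative port under the assumption that a dict from gpid to uid stands in for the UniqueCustomers table; `report_duplicate` and `unique_customers` are names invented for the sketch.

```python
def report_duplicate(table, id1, id2):
    """table maps customer number (gpid) -> uid, so each gpid appears once."""
    if id1 == id2:
        return
    uid1 = table.get(id1)
    uid2 = table.get(id2)
    if uid1 is None and uid2 is None:
        # Neither id seen yet: use a brand-new uid (max + 1, or 0 if empty).
        new_uid = max(table.values()) + 1 if table else 0
        table[id1] = new_uid
        table[id2] = new_uid
    elif uid1 is not None and uid2 is not None:
        if uid1 != uid2:
            # The pair bridges two chains: move everything on uid2 to uid1.
            for gpid, uid in list(table.items()):
                if uid == uid2:
                    table[gpid] = uid1
    elif uid1 is not None:
        # Only id1 seen: id2 joins id1's chain.
        table[id2] = uid1
    else:
        # Only id2 seen: id1 joins id2's chain.
        table[id1] = uid2


unique_customers = {}
legacy_pairs = [("1a", "1b"), ("2a", "2b"), ("2c", "2d"), ("1b", "1c"), ("2d", "2a")]
for a, b in legacy_pairs:
    report_duplicate(unique_customers, a, b)
```

The pairs are deliberately fed in an order where (2a, 2b) and (2c, 2d) start out as separate chains, so the final pair (2d, 2a) exercises the consolidation branch.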

How do I convert a union of column-pair values into a linear table?
