分区表但基于多个列组合在一起

时间:2013-03-21 00:32:44

标签: sql sql-server sql-server-2008 tsql sqlperformance

将表分成组时,我遇到了一个有趣的问题。我有一群游客 - 每个人都说一种语言和/或是一个家庭的一部分。 我需要将表格分组,但我希望将家庭和类似语言的人保持在一起。

假设我想将游客分成最多3个小组(如果小组必须更大,那是可以接受的)。解决方案不一定非常聪明,它完全填满所有组,但我正在尽最大努力。

输入:

TouristID | LanguageID | FamilyID
---------------------------------
    1     |     1      |    1
    2     |     1      |    1
    3     |     1      |    1
    4     |     2      |    1
    5     |     3      |    2
    6     |     4      |    2
    7     |     5      |    3
    8     |     5      |    4
    9     |     7      |    5

期望的结果:

TouristID | GroupID
-------------------
    1     |    1
    2     |    1
    3     |    1
    4     |    1
    5     |    2
    6     |    2
    7     |    3
    8     |    3
    9     |    2

第1组由所有语言1发言人组成,包括一个不能被遗漏的家庭成员。

第2组由两个家庭成员(5,6)和一个随机成员(9)组成,成为3人组。

第3组由两个相同的语言发言者(7,8)组成

我做了什么

INSERT TouristGroup
SELECT
  t.TouristID,
  DENSE_RANK() OVER (ORDER BY GroupID) AS [GroupID]
FROM Tourists t
CROSS APPLY (
  SELECT MIN(TouristID) AS [GroupID]
  FROM Tourists t2
  WHERE
    ( t2.LanguageID = t.LanguageID
    OR t2.FamilyID = t.FamilyID )
) x;

INSERT Groups
SELECT GroupID, COUNT(*)
FROM TouristGroup
GROUP BY GroupID;

declare 
  @matchID int = 0,
  @currentCount int,
  @desiredCount int = 0,
  @candidateGroupID int = null,
  @chunk int = 1

while exists (
  select null
  from Groups g
  left join Matches m
    on m.GroupID = g.GroupID
  where m.GroupID is null
)
begin
  set @currentCount = null
  set @candidateGroupID = null

  select
    @currentCount = isnull(SUM([Count]), 0)
  from Matches m
  join Groups g
    on g.GroupID = m.GroupID
  where m.MatchID = @matchID

  if @CurrentCount is not null
  begin
    set @desiredCount = @chunk - @desiredCount

    select top 1
      @candidateGroupID = g.GroupID
    from Groups g
    left join Matches m
      on m.GroupID = g.GroupID
    where g.[Count] <= @desiredCount
      and m.GroupID is null
    order by [Count] DESC

    if @candidateGroupID is not null
    begin
      insert Matches
      select @matchID, @candidateGroupID
    end
    else begin
      set @matchID = @matchID + 1
    end
  end
  else begin
    set @matchid = @matchID + 1
  end
end         

问题

是否有更好的方法来分区表,但是根据多列将行分组在一起?

1 个答案:

答案 0 :(得分:1)

这将产生你的“第1步”。也许它比你现在拥有的更好(没有循环)。

SELECT t.TouristID, DENSE_RANK() OVER (ORDER BY x.GroupNum) as GroupId
FROM Tourists t
CROSS APPLY (SELECT MIN(TouristId) AS GroupNum 
             FROM @Tourist t2 
             WHERE t2.LanguageId = t.LanguageId OR t2.FamilyId = t.FamilyId
            ) x

至于你至少得到至少三个小组成员的其他要求,如果可能的话,你可能需要做一个类似于你正在做的循环(我不确定它是否可以改进,因为你没有分享它。)

[更新]以下是我对“第2步”的建议:

DECLARE @MinGroupSize int = 3, @rc int = 1
WHILE @rc>0
BEGIN
    WITH GroupCount AS (
    SELECT GroupID, COUNT(*) AS GroupCount
    FROM TouristGroup
    GROUP BY GroupID
    ), CandidateGroups AS (
    SELECT TOP 1 gc1.GroupID AS ShortGroupId, singleton.GroupID as SingletonGroupID
    FROM GroupCount gc1
    CROSS APPLY (SELECT TOP 1 GroupID
                 FROM GroupCount AS gc2
                 WHERE gc2.GroupCount = 1 AND gc2.GroupID != gc1.GroupID
                 ORDER BY gc2.GroupID
                 ) AS singleton
    WHERE gc1.GroupCount < @MinGroupSize
    ORDER BY GroupCount DESC, gc1.GroupID ASC
    )
    UPDATE tg
    SET GroupID = cg.ShortGroupID
    FROM TouristGroup tg
    JOIN CandidateGroups cg ON cg.SingletonGroupID = tg.GroupID;
    SET @rc = @@ROWCOUNT;
END
--
-- If you're anal like me and want to eliminate gaps in GroupID values
--
UPDATE tg
SET GroupID = tg2.GroupID
FROM TouristGroup tg
JOIN (SELECT TouristID,  DENSE_RANK() OVER (ORDER BY GroupID) AS [GroupID]
      FROM TouristGroup) AS tg2 ON tg2.TouristID = tg.TouristID
WHERE tg.GroupID != tg2.GroupID;

这将找到小于所需最小组大小的组并找到单个组(仅1个成员)并使用另一个GroupID更新单例,逐个执行此操作,直到没有其他候选者为止。按顺序选择较小的组(按GroupCount降序,然后按GroupID升序),以便首先填充较大的组。只选择单身人士进行更新,以便不会破坏任何自然群体。