Complex overlapping periods in a MySQL query

Date: 2013-12-18 15:14:24

Tags: mysql sql

Here is my problem: I have a MySQL table with the following columns and sample data:

id | user | starting date | ending date | activity code
1 | Andy | 2010-04-01 | 2010-05-01 | 3
2 | Andy | 1988-11-01 | 1991-03-01 | 3
3 | Andy | 2005-06-01 | 2008-08-01 | 3
4 | Andy | 2005-08-01 | 2008-11-01 | 3
5 | Andy | 2005-06-01 | 2010-05-01 | 4
6 | Ben  | 2010-03-01 | 2011-06-01 | 3
7 | Ben  | 2010-03-01 | 2010-05-01 | 4
8 | Ben  | 2005-04-01 | 2011-05-01 | 3

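As you can see in this table, a user can have several rows with the same activity code and similar dates or periods. For the same user, a period may or may not overlap with other periods. The table may also contain several overlapping periods.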

What I want is a MySQL query that returns the following result:

new id | user | starting date | ending date | activity code
1 | Andy | 2010-04-01 | 2010-05-01 | 3 => ok, no overlapping period
2 | Andy | 1988-11-01 | 1991-03-01 | 3 => ok, no overlapping period
3 | Andy | 2005-06-01 | 2008-11-01 | 3 => same user and activity; ending date taken from row 4 to extend the period
4 | Andy | 2005-06-01 | 2010-05-01 | 4 => ok, other activity code
5 | Ben  | 2005-04-01 | 2011-06-01 | 3 => other user; rows 6 and 8 overlap for the same user and activity, so I take the widest range
6 | Ben  | 2010-03-01 | 2010-05-01 | 4 => ok, other activity for the second user

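In other words, for the same user and activity code, if there is no overlap, I need the starting and ending dates as they are. If periods overlap for the same user and activity code, I need the earliest starting date and the latest ending date across the related rows. I need this for every user and activity code in the table, written in SQL that runs on MySQL.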
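The rule described here amounts to merging overlapping intervals within each (user, activity code) group. As a language-neutral illustration of the expected result, not the SQL being asked for, here is a minimal Python sketch run against the sample rows. The function name `merge_periods` and the tuple layout are assumptions made for this example, and periods that merely touch (one starts on the day another ends) are assumed to count as overlapping. Dates are kept as ISO strings, so plain string comparison orders them correctly.

```python
from itertools import groupby

# Sample rows from the question: (id, user, start, end, activity_code).
ROWS = [
    (1, "Andy", "2010-04-01", "2010-05-01", 3),
    (2, "Andy", "1988-11-01", "1991-03-01", 3),
    (3, "Andy", "2005-06-01", "2008-08-01", 3),
    (4, "Andy", "2005-08-01", "2008-11-01", 3),
    (5, "Andy", "2005-06-01", "2010-05-01", 4),
    (6, "Ben", "2010-03-01", "2011-06-01", 3),
    (7, "Ben", "2010-03-01", "2010-05-01", 4),
    (8, "Ben", "2005-04-01", "2011-05-01", 3),
]


def merge_periods(rows):
    """Merge overlapping periods per (user, activity_code) group."""
    # Sort by group first, then by start and end date within the group.
    ordered = sorted(rows, key=lambda r: (r[1], r[4], r[2], r[3]))
    merged = []
    for (user, activity), group in groupby(ordered, key=lambda r: (r[1], r[4])):
        current_start = current_end = None
        for _, _, start, end, _ in group:
            if current_end is not None and start <= current_end:
                # Overlaps (or touches) the running period: extend it.
                current_end = max(current_end, end)
            else:
                # Gap: emit the finished period and start a new one.
                if current_end is not None:
                    merged.append((user, current_start, current_end, activity))
                current_start, current_end = start, end
        merged.append((user, current_start, current_end, activity))
    return merged


for row in merge_periods(ROWS):
    print(row)
```

On the sample data this yields six periods, the same set as the desired result above, though ordered by user, activity code and starting date rather than by the original ids.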

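I hope this is clear enough and that someone can help, because I have tried different code from solutions on this site and elsewhere without success.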

2 Answers:

Answer 0: (score: 0)

I have a somewhat complicated (and strictly MySQL-specific) solution:

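SET @user = NULL;
SET @activity = NULL;
SET @interval_id = 0;
-- Reset explicitly so a value left over from an earlier run in this session is not reused.
SET @interval_end = NULL;

SELECT
  MIN(inn.`starting date`) AS start,
  MAX(inn.`ending date`) AS end,
  inn.user,
  inn.`activity code`
  FROM
    (SELECT
       -- New user or activity: bump the interval id and reset the running end date.
       IF(user <> @user OR `activity code` <> @activity,
          @interval_id := @interval_id + 1, NULL),
       IF(user <> @user OR `activity code` <> @activity,
          @interval_end := STR_TO_DATE('',''), NULL),
       @user := user,
       @activity := `activity code`,
       -- A row starting after the running end date opens a new interval.
       @interval_id := IF(`starting date` > @interval_end,
                          @interval_id + 1,
                          @interval_id) AS interval_id,
       -- An overlapping row extends the running end date; otherwise it replaces it.
       @interval_end := IF(`starting date` < @interval_end,
                           GREATEST(@interval_end, `ending date`),
                           `ending date`) AS interval_end,
       t.*
     FROM Table1 t
     ORDER BY t.user, t.`activity code`, t.`starting date`, t.`ending date`) inn
-- Rows sharing an interval id collapse into one merged period.
GROUP BY inn.user, inn.`activity code`, inn.interval_id;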

The underlying idea is shamelessly borrowed from the first answer to this question.

You can use this SQL Fiddle to see the results and try different source data.
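To make the bookkeeping in the query easier to follow, the sketch below replays it row by row in Python. It illustrates the idea rather than reproducing MySQL's exact evaluation: `interval_id` and `interval_end` stand in for `@interval_id` and `@interval_end`, the `sweep` helper is a name made up for this example, and the group reset is simplified to a single branch. Only Andy's rows for activity code 3 are used.

```python
def sweep(rows):
    """Label each (user, activity, start, end) row with an interval id.

    Rows must already be sorted by user, activity, start, end,
    which is the job of the ORDER BY in the subquery.
    """
    prev_group = None
    interval_id = 0
    interval_end = None
    labelled = []
    for user, activity, start, end in rows:
        if (user, activity) != prev_group:
            # New user/activity group: always start a fresh interval.
            interval_id += 1
            interval_end = end
        elif start > interval_end:
            # Gap after the running interval: start a new one.
            interval_id += 1
            interval_end = end
        else:
            # Overlap: keep the id and push the end date out if needed.
            interval_end = max(interval_end, end)
        prev_group = (user, activity)
        labelled.append((interval_id, user, activity, start, end))
    return labelled


rows = sorted([
    ("Andy", 3, "2010-04-01", "2010-05-01"),
    ("Andy", 3, "1988-11-01", "1991-03-01"),
    ("Andy", 3, "2005-06-01", "2008-08-01"),
    ("Andy", 3, "2005-08-01", "2008-11-01"),
])
for row in sweep(rows):
    print(row)
```

The two rows starting in 2005 receive the same id, so grouping by that id and taking the minimum start and maximum end, as the outer GROUP BY does, collapses them into 2005-06-01 to 2008-11-01.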

Answer 1: (score: 0)

Here is a solution (see http://sqlfiddle.com/#!2/fda3d/15):

SELECT DISTINCT summarized.`user`
  , summarized.activity_code
  , summarized.true_begin
  , summarized.true_end
FROM (
  SELECT t1.id,t1.`user`,t1.activity_code
    , MIN(LEAST(t1.`starting`, COALESCE(overlap.`starting` ,t1.`starting`))) as true_begin
    , MAX(GREATEST(t1.`ending`, COALESCE(overlap.`ending` ,t1.`ending`))) as true_end
  FROM t1
  LEFT JOIN t1 AS overlap
    ON t1.`user` = overlap.`user`
      AND t1.activity_code = overlap.activity_code
      AND overlap.`ending` >= t1.`starting`
      AND overlap.`starting` <= t1.`ending`
      AND overlap.id <> t1.id
  GROUP BY t1.id, t1.`user`, t1.activity_code) AS summarized;

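I'm not sure how well it performs on large data sets with many overlaps. You will definitely need an index on the user and activity_code fields, possibly with the starting and ending date fields as part of that index as well.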
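One thing worth checking before relying on the join-based query: each row is combined only with the rows it overlaps directly, so a chain of periods in which the first and last do not overlap each other may not collapse into a single row. The sketch below emulates the join logic in plain Python for a single user and activity. The chained data and the `self_join_summary` name are hypothetical, not taken from the question, and this is an emulation of the idea rather than the query executed on MySQL.

```python
def self_join_summary(periods):
    """Emulate the LEFT JOIN + MIN/MAX + DISTINCT for one user/activity.

    periods: list of (start, end) pairs; list position plays the role of id.
    """
    results = set()  # the set plays the role of SELECT DISTINCT
    for i, (start, end) in enumerate(periods):
        true_begin, true_end = start, end
        for j, (o_start, o_end) in enumerate(periods):
            # Same join condition: a different row that overlaps row i.
            if j != i and o_end >= start and o_start <= end:
                true_begin = min(true_begin, o_start)
                true_end = max(true_end, o_end)
        results.add((true_begin, true_end))
    return sorted(results)


# Two directly overlapping periods collapse into one row, as in the question.
print(self_join_summary([("2005-06-01", "2008-08-01"),
                         ("2005-08-01", "2008-11-01")]))

# Hypothetical chain: A overlaps B, B overlaps C, but A and C do not overlap.
chain = [("2001-01-01", "2001-03-01"),
         ("2001-02-01", "2001-05-01"),
         ("2001-04-01", "2001-07-01")]
print(self_join_summary(chain))
```

On the chain the emulation returns three different rows instead of one merged period, which suggests that data with chained overlaps needs an ordered sweep like the one in the other answer, or a repeated application of the join.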