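I've been trying to write a fairly complex SQL query (or maybe it's simple?) to condense a table that contains duplicated information. I'm using MySQL 5.7.14 in Sequel Pro. I'm new to SQL and have a basic understanding of joins, unions, and so on. I think I need a subquery with some GROUP BYs for this, but I don't know how best to do it. The table below shows a simple example of what I'm trying to do:
For entries where col_1 is duplicated, I want to condense them into a single entry whenever the ranges set by col_2 and col_3 (the start and end of the range, respectively) overlap. For col_4 and col_5, the maximum value among the entries that fall within that range should be reported. In the example above, "a" in col_1 has three overlapping ranges, which I want to condense into the minimum of col_2 and the maximum of col_3, along with the maximum of col_4 and col_5. For "b" in col_1, there are two ranges (31-50, 12-15) that do not overlap, so both rows are returned as they are. For "c", one row is returned with a range of 100-300 and values of 3 and 2 for col_4 and col_5, respectively. The full desired result for the example is shown below:
I should add that there are "null" values in some places, which should be treated as zero. Does anyone know the best and simplest way to do this? Thanks in advance!
Update: I tried setting up a query using the suggested range approach, but I get an error. The query is as follows: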
WITH a AS (SELECT range
, lower(col_2) AS startdate
, max(upper(col_3)) OVER (ORDER BY range) AS `end`
FROM `combine`
)
, b AS (
SELECT *, lag(`end`) OVER (ORDER BY range) < `start` OR NULL AS step
FROM a
)
, c AS (
SELECT *, count(step) OVER (ORDER BY range) AS grp
FROM b
)
SELECT daterange(min(`start`), max(`end`)) AS range
FROM c
GROUP BY grp
ORDER BY 1;
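The error I get is: You have an error in your SQL syntax; check the manual that corresponds to your MySQL server version for the right syntax to use near 'a AS (SELECT range , lower(col_2) AS startdate , max(upper(col_3)) OVE' at line 1
Answer 0 (score: 0)
This is not trivial, but it can be done in a single query.
The hard part is merging a set of intervals into the largest possible contiguous intervals. The solution is detailed in this post.
To get the result you want, you now need to:
Merge the intervals of each col_1 value into these large intervals. With your sample values, the result would be: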
col_1 lower_bound upper_bound
a 20 60
b 12 15
b 31 50
c 100 300
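To make the merging step concrete, here is a minimal Python sketch of what that query is meant to compute. It is not the SQL from the linked post; the function name `merge_intervals` and the sample rows are assumptions, with the rows modeled on the tables in this answer.

```python
from collections import defaultdict

def merge_intervals(rows):
    """Merge overlapping or touching (key, start, end) intervals per key."""
    by_key = defaultdict(list)
    for key, start, end in rows:
        by_key[key].append((start, end))

    merged = []
    for key in sorted(by_key):
        current_start, current_end = None, None
        for start, end in sorted(by_key[key]):
            if current_start is None:
                current_start, current_end = start, end
            elif start <= current_end:
                # Overlaps (or touches) the running interval: extend it.
                current_end = max(current_end, end)
            else:
                # Gap found: emit the finished interval and start a new one.
                merged.append((key, current_start, current_end))
                current_start, current_end = start, end
        merged.append((key, current_start, current_end))
    return merged

# Hypothetical sample rows: (col1, col2, col3)
rows = [
    ("a", 45, 50), ("a", 50, 60), ("a", 20, 45),
    ("b", 31, 50), ("b", 12, 15),
    ("c", 100, 200), ("c", 150, 300),
]
large_intervals = merge_intervals(rows)
for row in large_intervals:
    print(*row)
```

Sorting by start and sweeping once is the same idea the window-function query expresses with a running maximum and a lag comparison.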
Associate one of these large intervals with each row of your_table. There can be only one such interval per row, so let's INNER JOIN:
SELECT your_table.*, large_intervals.lower_bound, large_intervals.upper_bound
FROM your_table
-- my_awesome_query(your_table) is a placeholder for the interval-merging query
INNER JOIN (my_awesome_query(your_table)) large_intervals
ON large_intervals.col1 = your_table.col1
AND large_intervals.lower_bound <= your_table.col2
AND large_intervals.upper_bound >= your_table.col3
You get:
col1 col2 col3 col4 col5 lower_bound upper_bound
a 45 50 1 0 20 60
a 50 61 6 0 20 60
a 20 45 0 5 20 60
b 31 50 0 1 31 50
b 12 15 5 0 12 15
c 100 200 3 2 100 300
c 150 300 1 2 100 300
Finally, aggregate the rows of each large interval:
SELECT col1, lower_bound AS col2, upper_bound AS col3, MAX(col4) AS col4, MAX(col5) AS col5
FROM (query above) decorated_table
GROUP BY col1, lower_bound, upper_bound
and you get the result you want.
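The join-then-aggregate steps above can be sketched in plain Python as follows. This is only an illustration under assumptions: the sample rows and the precomputed `large_intervals` list are hypothetical values modeled on the tables in this answer, and `None` is treated as zero as the question requests.

```python
from collections import defaultdict

# Hypothetical sample rows: (col1, col2, col3, col4, col5)
rows = [
    ("a", 45, 50, 1, 0), ("a", 50, 60, 6, 0), ("a", 20, 45, 0, 5),
    ("b", 31, 50, 0, 1), ("b", 12, 15, 5, 0),
    ("c", 100, 200, 3, 2), ("c", 150, 300, 1, 2),
]
# Assumed output of the interval-merging step: (col1, lower_bound, upper_bound)
large_intervals = [("a", 20, 60), ("b", 12, 15), ("b", 31, 50), ("c", 100, 300)]

groups = defaultdict(list)
for col1, col2, col3, col4, col5 in rows:
    for key, lower, upper in large_intervals:
        # Same condition as the INNER JOIN: same key, row range inside the interval.
        if key == col1 and lower <= col2 and upper >= col3:
            groups[(key, lower, upper)].append((col4 or 0, col5 or 0))

# Same aggregation as the GROUP BY with MAX(col4), MAX(col5).
condensed = [
    (*interval, max(v[0] for v in values), max(v[1] for v in values))
    for interval, values in sorted(groups.items())
]
for row in condensed:
    print(*row)
```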
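Back to the hard part: the post mentioned above presents a solution for PostgreSQL. MySQL has no range types, but the solution can be adapted. For example, instead of lower(range), use the lower bound col2 directly. The solution also relies on window functions, namely lag and lead. MySQL supports them with the same syntax, but only from version 8.0 onward: MySQL 5.7.14, as mentioned in the question, supports neither window functions nor WITH, which is why the attempt above fails with a syntax error. Also note that they use COALESCE(upper(range), 'infinity') to guard against unbounded ranges. Since your ranges are finite, you don't need to worry about that and can use the upper bound, col3, directly. Here is the adaptation: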
WITH a AS (
SELECT
col2,
col3,
col2 AS lower_bound,
MAX(col3) OVER (ORDER BY col2, col3) AS upper_bound
FROM combine
)
, b AS (
SELECT *, lag(upper_bound) OVER (ORDER BY col2, col3) < lower_bound OR NULL AS step
FROM a
)
, c AS (
SELECT *, count(step) OVER (ORDER BY col2, col3) AS grp
FROM b
)
SELECT
MIN(lower_bound) AS lower_bound,
MAX(upper_bound) AS upper_bound -- RANGE is a reserved word in MySQL, so it cannot be used as an unquoted alias
FROM c
GROUP BY grp
ORDER BY 1;
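To see how the step and grp columns behave, the sketch below runs the three CTEs on a handful of rows and prints the intermediate values. It uses SQLite through Python's sqlite3 module as a stand-in for MySQL 8.0, which is an assumption made for convenience: it needs SQLite 3.25 or later for window functions, and the sample rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE combine (col2 INTEGER, col3 INTEGER)")
conn.executemany(
    "INSERT INTO combine VALUES (?, ?)",
    [(45, 50), (50, 60), (20, 45), (100, 200), (150, 300)],  # hypothetical rows
)

trace = conn.execute("""
    WITH a AS (
        SELECT col2, col3,
               col2 AS lower_bound,
               -- running maximum of the upper bounds seen so far
               MAX(col3) OVER (ORDER BY col2, col3) AS upper_bound
        FROM combine
    ),
    b AS (
        SELECT *,
               -- 1 when a gap precedes this row, NULL otherwise
               LAG(upper_bound) OVER (ORDER BY col2, col3) < lower_bound OR NULL AS step
        FROM a
    ),
    c AS (
        -- running count of gaps numbers the islands 0, 1, 2, ...
        SELECT *, COUNT(step) OVER (ORDER BY col2, col3) AS grp
        FROM b
    )
    SELECT lower_bound, upper_bound, step, grp
    FROM c
    ORDER BY col2, col3
""").fetchall()

for row in trace:
    print(row)
```

Only the row starting at 100 gets a non-NULL step, because it is the only one whose lower bound exceeds every upper bound before it. The running count therefore assigns the first three rows to grp 0 and the last two to grp 1.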
The query above works for a single group. If you want the ranges per col1, you can adapt it like this:
WITH a AS (
SELECT
col1,
col2,
col3,
col2 AS lower_bound,
MAX(col3) OVER (PARTITION BY col1 ORDER BY col2, col3) AS upper_bound
FROM combine
)
, b AS (
SELECT *, lag(upper_bound) OVER (PARTITION BY col1 ORDER BY col2, col3) < lower_bound OR NULL AS step
FROM a
)
, c AS (
SELECT *, count(step) OVER (PARTITION BY col1 ORDER BY col2, col3) AS grp
FROM b
)
SELECT
col1,
MIN(lower_bound) AS lower_bound,
MAX(upper_bound) AS upper_bound -- RANGE is a reserved word in MySQL
FROM c
GROUP BY col1, grp
ORDER BY 1, 2;
Putting it all together, we get the following (tested on the example you provided), which returns exactly the output you expect:
WITH a AS (
SELECT
col1,
col2,
col3,
col2 AS lower_bound,
MAX(col3) OVER (PARTITION BY col1 ORDER BY col2, col3) AS upper_bound
FROM combine
)
, b AS (
SELECT *, lag(upper_bound) OVER (PARTITION BY col1 ORDER BY col2, col3) < lower_bound OR NULL AS step
FROM a
)
, c AS (
SELECT *, count(step) OVER (PARTITION BY col1 ORDER BY col2, col3) AS grp
FROM b
)
, large_intervals AS (
SELECT
col1,
MIN(lower_bound) AS lower_bound,
MAX(upper_bound) AS upper_bound
FROM c
GROUP BY col1, grp
ORDER BY 1
)
, combine_with_large_interval AS (
SELECT
combine.*,
large_intervals.lower_bound,
large_intervals.upper_bound
FROM combine
INNER JOIN large_intervals
ON large_intervals.col1 = combine.col1
AND large_intervals.lower_bound <= combine.col2
AND large_intervals.upper_bound >= combine.col3
)
SELECT
col1,
lower_bound AS col2,
upper_bound AS col3,
MAX(col4) AS col4,
MAX(col5) AS col5
FROM combine_with_large_interval
GROUP BY col1, lower_bound, upper_bound
ORDER BY col1, col2, col3;
Voilà!
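One detail from the question is that NULL values should be treated as zero. MAX already ignores NULLs, so the result only differs when every value in a group is NULL; wrapping the aggregate in COALESCE covers that case. The sketch below shows the difference, again using SQLite via Python's sqlite3 module as a stand-in for MySQL. The table `t` and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t (col1 TEXT, col4 INTEGER)")
conn.executemany(
    "INSERT INTO t VALUES (?, ?)",
    [("a", None), ("a", 3), ("b", None), ("b", None)],  # hypothetical rows
)

rows = conn.execute("""
    SELECT col1,
           MAX(col4)              AS plain_max,   -- NULL if the whole group is NULL
           COALESCE(MAX(col4), 0) AS max_or_zero  -- falls back to 0 in that case
    FROM t
    GROUP BY col1
    ORDER BY col1
""").fetchall()

for row in rows:
    print(row)
```

In the combined query, this would mean writing COALESCE(MAX(col4), 0) AS col4 and COALESCE(MAX(col5), 0) AS col5 in the final SELECT.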