Efficiently joining tables by interval (tricky)

Time: 2016-03-10 19:28:13

Tags: sql postgresql

I have two tables that look like this:

___A___       _____B____
id | a        id | s | e
1  | 5        1  | 4 | 6
2  | 4        2  | 2 | 7
3  | 3        3  | 3 | 4
              4  | 1 | 5

Tables A and B have about 1,500,000 and 200,000 rows, respectively. I want to join each row of A to the smallest interval in B that contains A.a.
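
To make the goal concrete, the sample data above can be loaded with a script like the one below. This is only a sketch: the lowercase table names a and b and the integer column types are assumptions, since the real DDL is not shown. The comments list, for each value of a, the intervals that strictly contain it (s < a < e) and the narrowest of them, which is the row the join should keep.

```sql
-- Assumed schema: the real DDL and column types are not shown in the question.
CREATE TABLE a (id integer PRIMARY KEY, a integer);
CREATE TABLE b (id integer PRIMARY KEY, s integer, e integer);

INSERT INTO a (id, a) VALUES (1, 5), (2, 4), (3, 3);
INSERT INTO b (id, s, e) VALUES (1, 4, 6), (2, 2, 7), (3, 3, 4), (4, 1, 5);

-- Intervals with s < a < e, and the narrowest one for each a:
--   a = 5: (4,6) width 2, (2,7) width 5   -> (4,6)
--   a = 4: (2,7) width 5, (1,5) width 4   -> (1,5)
--   a = 3: (2,7) width 5, (1,5) width 4   -> (1,5)
```

Because the comparisons are strict, an interval whose endpoint equals a (for example (1,5) for a = 5) does not count as a match.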

Here is my query, but it is slow:

select A.a,
       B.s,
       B.e
  from A
  join B
    on A.a > B.s
   and A.a < B.e
   and (B.e - B.s) = (
       select min(B.e - B.s)
         from B
        where A.a > B.s
          and A.a < B.e
   )

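The subquery ensures that only the smallest interval is used. Is there a way to make this run faster?

Thanks

4 answers:

Answer 0 (score: 0)

I'm not a PostgreSQL expert, but you could try using a CTE: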

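-- #A and #B are SQL Server temp tables. On PostgreSQL, use your own table
-- names and give the CTEs names that differ from them.
WITH A AS (
    -- Smallest interval width over all matching pairs (a single global value)
    SELECT MIN(B.e - B.s) AS MinInterval
    FROM #A AS A
         INNER JOIN #B AS B ON A.a > B.s AND A.a < B.e
), B AS (
    -- Matching pairs whose interval width equals that minimum
    SELECT A.a
         , B.s
         , B.e
    FROM #A AS A
         JOIN #B AS B ON A.a > B.s AND A.a < B.e
                     AND (B.e - B.s) = (SELECT MinInterval FROM A)
)
SELECT * FROM B;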

Answer 1 (score: 0)

A NOT EXISTS() version can sometimes avoid the aggregating subquery:

SELECT a.a,
       b.s,
       b.e
  FROM AAAA a
  JOIN BBBB b
    ON a.a > b.s
   AND a.a < b.e
   AND NOT EXISTS ( SELECT *
        FROM BBBB nx
        WHERE a.a > nx.s
         AND a.a < nx.e
         AND (nx.e - nx.s) < (b.e - b.s)
   );

Answer 2 (score: 0)

Using the RANK() window function makes this relatively simple:

SELECT ranked.id, ranked.val, ranked.start, ranked.end
FROM
(
    SELECT
        a.id,
        a.val,
        b.start,
        b.end,
        RANK() OVER (PARTITION BY a.id ORDER BY (b.end - b.start) ASC, b.id ASC) AS match_rank
    FROM a
    JOIN b
      ON a.val BETWEEN b.start AND b.end
) ranked
WHERE ranked.match_rank = 1

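You find all the matches and then, for each row of a, assign every match a rank based on how small its b range is. The smaller the range, the better (b.id is used as a tie-breaker to prevent duplicates). We then keep only the best match for each a.id.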
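
The query above uses column names val, start, and end, and BETWEEN, which includes the endpoints. As a sketch of the same ranking idea applied to the question's layout, the version below assumes tables a(id, a) and b(id, s, e) and keeps the question's strict comparisons; those names are taken from the question, not from this answer's own schema.

```sql
-- Assumes tables a(id, a) and b(id, s, e) as described in the question,
-- with strict comparisons (s < a < e) instead of the inclusive BETWEEN.
SELECT ranked.a, ranked.s, ranked.e
FROM (
    SELECT a.id,
           a.a,
           b.s,
           b.e,
           -- Rank 1 goes to the narrowest interval; b.id breaks ties
           RANK() OVER (PARTITION BY a.id
                        ORDER BY (b.e - b.s), b.id) AS match_rank
    FROM a
    JOIN b
      ON a.a > b.s
     AND a.a < b.e
) ranked
WHERE ranked.match_rank = 1
ORDER BY ranked.id;
```

On the question's sample data this returns (5, 4, 6), (4, 1, 5), and (3, 1, 5). Because of the b.id tie-breaker it yields exactly one row per row of a, whereas the question's original query returns every interval that ties for the minimum width.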

Answer 3 (score: 0)

Try the GROUP BY version:

select  A.a
      , B.s
      , B.e
from    A
join    B on A.a > B.s and     A.a < B.e
group by A.a
      , B.s
      , B.e
      , B.e - B.s
having (B.e - B.s) = min(B.e - B.s)
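
One thing to watch for in a query of this shape: because B.s and B.e are among the grouping columns, every group holds a single interval width, so the HAVING test compares that width with itself and every matching interval is kept. A possible grouped variant, sketched below, computes the minimum width per value of a first and then joins back to it. It assumes the question's tables a(id, a) and b(id, s, e), and it still evaluates the range join twice, so it is not necessarily faster than the original.

```sql
-- Assumes tables a(id, a) and b(id, s, e) as described in the question.
SELECT a.a, b.s, b.e
FROM a
JOIN b
  ON a.a > b.s
 AND a.a < b.e
JOIN (
    -- Narrowest containing width for each distinct value of a
    SELECT a.a AS val, min(b.e - b.s) AS min_width
    FROM a
    JOIN b
      ON a.a > b.s
     AND a.a < b.e
    GROUP BY a.a
) m
  ON m.val = a.a
 AND (b.e - b.s) = m.min_width;
```

Like the question's query, this returns more than one row for a value of a when several intervals tie for the minimum width.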