Efficient query to combine more than 2 subqueries

Time: 2012-04-23 00:59:27

Tags: python postgresql psycopg2

I have a database with the following tables:

books          (primary key: bookID)
characterNames (foreign key: books.bookID) 
locations      (foreign key: books.bookID)

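The text positions of character names and locations are stored in the respective tables. I am writing a Python script using psycopg2 that finds all occurrences of a given character name and location in the books. I only want occurrences in books that contain both the character name and the location. I have already found a solution for searching one location and one character: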

WITH b AS (  
    SELECT bookid  
    FROM   characternames  
    WHERE  name = 'XXX'  
    GROUP  BY 1  
    INTERSECT  
    SELECT bookid  
    FROM   locations  
    WHERE  locname = 'YYY'  
    GROUP  BY 1  
    )  
SELECT bookid, position, 'char' AS what  
FROM   b  
JOIN   characternames USING (bookid)  
WHERE  name = 'XXX'  
UNION  ALL  
SELECT bookid, position, 'loc' AS what  
FROM   b  
JOIN   locations USING (bookid)  
WHERE  locname = 'YYY'  
ORDER  BY bookid, position;  

The CTE 'b' contains every bookid in which both the character name 'XXX' and the location 'YYY' appear.

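Now I would also like to search for 2 locations and a name (or 2 names and a location, respectively). If all searched entities must occur in one book, that is simple, but what about this:
Searching for: Tim, Al, Toolshop. Result: books that include
(Tim, Al, Toolshop) or
(Tim, Al) or
(Tim, Toolshop) or
(Al, Toolshop)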

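The problem may repeat with 4, 5, 6 ... conditions. I thought about intersecting more subqueries, but I don't think that would work. Instead I would UNION the found bookIDs, GROUP them, and select the bookids that occur more than once: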

WITH b AS (  
    SELECT bookid, count(bookid) AS occurrences  
    FROM  
        (SELECT DISTINCT bookid  
        FROM characterNames  
        WHERE name='XXX'  
        UNION  
        SELECT DISTINCT bookid  
        FROM characterNames  
        WHERE name='YYY'  
        UNION  
        SELECT DISTINCT bookid  
        FROM locations  
        WHERE locname='ZZZ'  
        GROUP BY bookid)  
    WHERE occurrences>1)  

I think this works, though I can't test it at the moment. But is it the best way to do this?

1 answer:

Answer 0 (score: 4)

The idea of using a count for the generalized case is sound. The syntax needs a few adjustments, though:

WITH b AS (  
   SELECT bookid
   FROM  (
      SELECT DISTINCT bookid  
      FROM   characterNames  
      WHERE  name='XXX'  

      UNION ALL  
      SELECT DISTINCT bookid  
      FROM   characterNames  
      WHERE  name='YYY'  

      UNION ALL
      SELECT DISTINCT bookid  
      FROM   locations  
      WHERE  locname='ZZZ'  
      ) x
   GROUP  BY bookid
   HAVING count(*) > 1
   )
SELECT bookid, position, 'char' AS what
FROM   b
JOIN   characternames USING (bookid)
WHERE  name IN ('XXX', 'YYY')

UNION  ALL
SELECT bookid, position, 'loc' AS what
FROM   b
JOIN   locations USING (bookid)
WHERE  locname = 'ZZZ'
ORDER  BY bookid, position;

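Notes

  • Use UNION ALL (not UNION) to preserve duplicates between the subqueries. In this case you want them, so that you can count them.

  • The subqueries should produce distinct values. It works the way you have it, with DISTINCT. You may want to try GROUP BY 1 instead and see whether that performs better (I don't expect it to).

  • The GROUP BY has to go outside the subquery. Where you had it, it would only apply to the last subquery, and it makes no sense there because you already have DISTINCT bookid.

  • The check for more than one hit per book has to go into a HAVING clause:

     HAVING count(*) > 1


    You cannot use aggregate values in a WHERE clause.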
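
Since the question mentions that the search may grow to 4, 5, 6 ... conditions, the counting CTE can be generated rather than written by hand. The following is a minimal sketch, not a definitive implementation: the function name `build_match_query` and the `min_hits` parameter are hypothetical, while the table and column names are taken from the queries above. It uses `%s` placeholders so that psycopg2 can pass the values safely.

```python
def build_match_query(names, locations, min_hits=2):
    """Build (sql, params) selecting bookids hit by at least min_hits search terms.

    Hypothetical helper: one DISTINCT branch per search term, glued together
    with UNION ALL so that duplicates across branches survive and can be counted.
    """
    # Drop repeated search terms, otherwise one term would be counted twice.
    names = list(dict.fromkeys(names))
    locations = list(dict.fromkeys(locations))

    branches, params = [], []
    for name in names:
        branches.append("SELECT DISTINCT bookid FROM characterNames WHERE name = %s")
        params.append(name)
    for loc in locations:
        branches.append("SELECT DISTINCT bookid FROM locations WHERE locname = %s")
        params.append(loc)
    if not branches:
        raise ValueError("need at least one search term")

    union = "\n   UNION ALL\n   ".join(branches)
    sql = (
        "SELECT bookid\n"
        "FROM  (\n   " + union + "\n   ) x\n"
        "GROUP  BY bookid\n"
        "HAVING count(*) >= %s"
    )
    params.append(min_hits)
    return sql, params


sql, params = build_match_query(["Tim", "Al"], ["Toolshop"])
```

With an open psycopg2 cursor `cur`, the result would be run as `cur.execute(sql, params)`. The default `min_hits=2` is equivalent to `HAVING count(*) > 1` above.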


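Combining conditions on one table

You cannot simply combine multiple conditions on one table: how would you count the number of findings? But there is a somewhat more sophisticated way. It may or may not improve performance; you will have to test (with EXPLAIN ANALYZE). Both queries require at least two index scans on the table characterNames. At least it shortens the syntax.

Consider how I compute the number of hits for characterNames, and how I changed to sum(hits) in the outer SELECT: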

WITH b AS (  
   SELECT bookid
   FROM  (
      SELECT bookid
           , max((name='XXX')::int)
           + max((name='YYY')::int) AS hits
      FROM   characterNames  
      WHERE  (name='XXX'
           OR name='YYY')
      GROUP  BY bookid

      UNION ALL
      SELECT DISTINCT bookid, 1 AS hits  
      FROM   locations  
      WHERE  locname='ZZZ'  
      ) x
   GROUP  BY bookid
   HAVING sum(hits) > 1
   )
...

Casting boolean to integer gives 0 for FALSE and 1 for TRUE. That helps.
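
To see the hit counting at work without a PostgreSQL server, here is a small sketch that runs the same idea against an in-memory SQLite database as a stand-in. Everything about the sample data is invented for illustration, the helper names `make_demo_db` and `books_with_hits` are hypothetical, and `CAST(... AS INTEGER)` replaces the PostgreSQL-specific `::int` cast.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory database with made-up sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE characterNames (bookid INTEGER, name TEXT, position INTEGER);
        CREATE TABLE locations      (bookid INTEGER, locname TEXT, position INTEGER);
        INSERT INTO characterNames VALUES
            (1, 'Tim', 10), (1, 'Al', 20), (1, 'Tim', 30),
            (2, 'Tim', 5),
            (3, 'Al', 7),
            (4, 'Bob', 1);
        INSERT INTO locations VALUES
            (1, 'Toolshop', 15), (3, 'Toolshop', 9), (4, 'Toolshop', 2);
    """)
    return conn


def books_with_hits(conn, name_a, name_b, loc):
    """Return bookids matched by more than one of the three search terms."""
    sql = """
        SELECT bookid
        FROM  (
           -- one row per book: 0/1 for each name, added up
           SELECT bookid,
                  max(CAST(name = ? AS INTEGER))
                + max(CAST(name = ? AS INTEGER)) AS hits
           FROM   characterNames
           WHERE  name IN (?, ?)
           GROUP  BY bookid
           UNION ALL
           -- a location match always counts as exactly one hit
           SELECT DISTINCT bookid, 1 AS hits
           FROM   locations
           WHERE  locname = ?
           ) x
        GROUP  BY bookid
        HAVING sum(hits) > 1
        ORDER  BY bookid
    """
    rows = conn.execute(sql, (name_a, name_b, name_a, name_b, loc)).fetchall()
    return [row[0] for row in rows]


result = books_with_hits(make_demo_db(), "Tim", "Al", "Toolshop")
```

In this sample data, book 1 contains all three terms and book 3 contains Al and Toolshop, so both qualify; books 2 and 4 each match only one term and are filtered out by the HAVING clause.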


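Faster with EXISTS

While I was riding my bike to work, this kept nagging at the back of my head. I have reason to believe that the following query might be even faster. Please give it a try: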

WITH b AS (  
   SELECT *
   FROM  (
      SELECT bookid

           , (EXISTS (
               SELECT *
               FROM   characterNames c
               WHERE  c.bookid = b.bookid
               AND    c.name = 'XXX'))::int
           + (EXISTS (
               SELECT *
               FROM   characterNames c
               WHERE  c.bookid = b.bookid
               AND    c.name = 'YYY'))::int AS c_hits

           , (EXISTS (
               SELECT *
               FROM   locations l
               WHERE  l.bookid = b.bookid
               AND    l.locname='ZZZ'))::int AS l_hits
      FROM   books b
      ) sub
   -- output aliases are not visible in the WHERE clause of the same SELECT,
   -- so the filter is applied one level up
   WHERE  (c_hits + l_hits) > 1
   )
SELECT c.bookid, c.position, 'char' AS what
FROM   b
JOIN   characternames c USING (bookid)
WHERE  b.c_hits > 0
AND    c.name IN ('XXX', 'YYY')

UNION  ALL
SELECT l.bookid, l.position, 'loc' AS what
FROM   b
JOIN   locations l USING (bookid)
WHERE  b.l_hits > 0
AND    l.locname = 'ZZZ'
ORDER  BY 1,2,3;
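
  • The EXISTS semi-join can stop executing at the first match. Since the CTE only needs an all-or-nothing answer per search term, this may do the job considerably faster.

  • This way we don't need to aggregate either (no GROUP BY necessary).

  • I also remember whether any characters or locations were found, and only revisit the tables that actually had matches.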
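
The EXISTS variant can be tried out end to end in the same way. This is only a sketch under stated assumptions: SQLite stands in for PostgreSQL (so the `::int` casts are dropped, because SQLite's EXISTS already yields 0 or 1), the sample rows are invented, and the names `make_demo_db` and `find_positions` are hypothetical.

```python
import sqlite3


def make_demo_db():
    """Create an in-memory database with made-up sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE books          (bookid INTEGER PRIMARY KEY);
        CREATE TABLE characterNames (bookid INTEGER, name TEXT, position INTEGER);
        CREATE TABLE locations      (bookid INTEGER, locname TEXT, position INTEGER);
        INSERT INTO books VALUES (1), (2), (3), (4);
        INSERT INTO characterNames VALUES
            (1, 'Tim', 10), (1, 'Al', 20), (1, 'Tim', 30),
            (2, 'Tim', 5), (3, 'Al', 7), (4, 'Bob', 1);
        INSERT INTO locations VALUES
            (1, 'Toolshop', 15), (3, 'Toolshop', 9), (4, 'Toolshop', 2);
    """)
    return conn


def find_positions(conn, name_a, name_b, loc):
    """Return (bookid, position, what) rows for books with more than one hit."""
    sql = """
        WITH b AS (
           SELECT *
           FROM  (
              SELECT bk.bookid,
                     (EXISTS (SELECT 1 FROM characterNames c
                              WHERE c.bookid = bk.bookid AND c.name = :a))
                   + (EXISTS (SELECT 1 FROM characterNames c
                              WHERE c.bookid = bk.bookid AND c.name = :b)) AS c_hits,
                     (EXISTS (SELECT 1 FROM locations l
                              WHERE l.bookid = bk.bookid AND l.locname = :loc)) AS l_hits
              FROM   books bk
              ) sub
           WHERE  c_hits + l_hits > 1
           )
        SELECT c.bookid, c.position, 'char' AS what
        FROM   b JOIN characterNames c ON c.bookid = b.bookid
        WHERE  b.c_hits > 0 AND c.name IN (:a, :b)
        UNION ALL
        SELECT l.bookid, l.position, 'loc' AS what
        FROM   b JOIN locations l ON l.bookid = b.bookid
        WHERE  b.l_hits > 0 AND l.locname = :loc
        ORDER  BY 1, 2, 3
    """
    return conn.execute(sql, {"a": name_a, "b": name_b, "loc": loc}).fetchall()


rows = find_positions(make_demo_db(), "Tim", "Al", "Toolshop")
```

Whether this form really beats the aggregating versions depends on the data and the available indexes, so comparing the plans with EXPLAIN ANALYZE on the real PostgreSQL tables remains the deciding step.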