How can I use a join to eliminate a Cartesian product when using a subquery?

Asked: 2014-07-03 14:05:12

Tags: sql postgresql join postgresql-9.3 cartesian-product

I have the following table:

 paperid | authorid | name
---------+----------+---------------
 1889374 |   897449 | D. N. Page
 1889374 |  1795881 | C. N. Pope
 1889374 |  1952069 | S. W. Hawking

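I want to create a table with the following columns:

  • paperid
  • author - one row for each author of the paper
  • coauthors - all the other authors of that paper

The result should look like this: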

 paperid |    author     |          coauthors          
---------+---------------+---------------------------
 1889374 | D. N. Page    |  C. N. Pope S. W. Hawking
 1889374 | C. N. Pope    | D. N. Page  S. W. Hawking
 1889374 | S. W. Hawking | D. N. Page C. N. Pope 

This is achieved with the following query:

SELECT  foo.paperid, npa.name as author, foo.coauthors
INTO npatest
FROM newpaperauthor npa
CROSS JOIN (
   SELECT paperid, string_agg(name, ' ') as coauthors
   FROM newpaperauthor
   GROUP BY paperid
   ORDER BY paperid) foo;
UPDATE npatest SET coauthors = regexp_replace(coauthors, author, '');
SELECT * FROM npatest;

The problem arises when the table contains more than one paperid:

 paperid | authorid |       name       |      affiliation       
---------+----------+------------------+------------------------
 1889373 |   122817 | Kazuhiro Hongo   | 
 1889373 |  1091191 | Hiroshi NAKAGAWA | 
 1889373 |  1874415 | Hiroshi Nakagawa | University of Oklahoma
 1889373 |  2149773 | Han Soo Chang    | 
 1889374 |   897449 | D. N. Page       | 
 1889374 |  1795881 | C. N. Pope       | 
 1889374 |  1952069 | S. W. Hawking    | 

Then I get their Cartesian product, like this:

 paperid |      author      |                           coauthors                            
---------+------------------+----------------------------------------------------------------
 1889373 | Kazuhiro Hongo   |  Hiroshi NAKAGAWA Hiroshi Nakagawa Han Soo Chang
 1889374 | Kazuhiro Hongo   | D. N. Page C. N. Pope S. W. Hawking
 1889373 | Hiroshi NAKAGAWA | Kazuhiro Hongo  Hiroshi Nakagawa Han Soo Chang
 1889374 | Hiroshi NAKAGAWA | D. N. Page C. N. Pope S. W. Hawking
 1889373 | Hiroshi Nakagawa | Kazuhiro Hongo Hiroshi NAKAGAWA  Han Soo Chang
 1889374 | Hiroshi Nakagawa | D. N. Page C. N. Pope S. W. Hawking
 1889373 | Han Soo Chang    | Kazuhiro Hongo Hiroshi NAKAGAWA Hiroshi Nakagawa 
 1889374 | Han Soo Chang    | D. N. Page C. N. Pope S. W. Hawking
 1889373 | D. N. Page       | Kazuhiro Hongo Hiroshi NAKAGAWA Hiroshi Nakagawa Han Soo Chang
 1889374 | D. N. Page       |  C. N. Pope S. W. Hawking
 1889373 | C. N. Pope       | Kazuhiro Hongo Hiroshi NAKAGAWA Hiroshi Nakagawa Han Soo Chang
 1889374 | C. N. Pope       | D. N. Page  S. W. Hawking
 1889373 | S. W. Hawking    | Kazuhiro Hongo Hiroshi NAKAGAWA Hiroshi Nakagawa Han Soo Chang
 1889374 | S. W. Hawking    | D. N. Page C. N. Pope 

How can I get rid of the Cartesian product there?
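To reproduce the problem, the sample rows above can be loaded with a script along these lines. This is only a sketch: the column types are assumptions, and so is storing the blank affiliations as NULL (psql prints NULL and the empty string the same way by default).

```sql
-- Assumed schema: the question shows the columns but not their types.
CREATE TABLE newpaperauthor (
    paperid     integer,
    authorid    integer,
    name        text,
    affiliation text
);

-- Sample rows copied from the question; blank affiliations stored as NULL (assumption).
INSERT INTO newpaperauthor (paperid, authorid, name, affiliation) VALUES
    (1889373,  122817, 'Kazuhiro Hongo',   NULL),
    (1889373, 1091191, 'Hiroshi NAKAGAWA', NULL),
    (1889373, 1874415, 'Hiroshi Nakagawa', 'University of Oklahoma'),
    (1889373, 2149773, 'Han Soo Chang',    NULL),
    (1889374,  897449, 'D. N. Page',       NULL),
    (1889374, 1795881, 'C. N. Pope',       NULL),
    (1889374, 1952069, 'S. W. Hawking',    NULL);
```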

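3 answers:

Answer 0 (score: 3)

Here is how to approach this problem:

Generate the list of all coauthors as a subquery. Generate the list of all authors. Then join the two on paperid and use string manipulation to get what you want.

The authors are easy: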

select paperid, npa.name as author
from newpaperauthor npa;

The coauthors are easy, too:

select paperid, string_agg(npa.name, ' ') as coauthors
from newpaperauthor npa
group by paperid;

Combining them requires some string replacement:

select a.paperid, a.author,
       replace(replace(coauthors, author, ''), '  ', ' ') as coauthors
from (select paperid, npa.name as author
      from newpaperauthor npa
     ) a join
     (select paperid, string_agg(npa.name, ' ') as coauthors
      from newpaperauthor npa
      group by paperid
     ) ca
     on a.paperid = ca.paperid;
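The essential change from the question's query is the join condition on paperid, which replaces the CROSS JOIN. One detail remains: when the removed author is the first or last name in the list, replace() leaves a leading or trailing space. A possible variation (a sketch, not the answerer's query) wraps the expression in trim():

```sql
-- Same join as above; trim() removes the space left behind when the
-- author's own name was at the start or end of the aggregated string.
SELECT a.paperid,
       a.name AS author,
       trim(replace(replace(ca.coauthors, a.name, ''), '  ', ' ')) AS coauthors
FROM   newpaperauthor a
JOIN  (SELECT paperid, string_agg(name, ' ') AS coauthors
       FROM   newpaperauthor
       GROUP  BY paperid
      ) ca ON a.paperid = ca.paperid;
```

This still relies on substring replacement, so it shares the weakness with names that contain one another.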

Answer 1 (score: 2)

This can be very simple with array_agg() as a window aggregate function, combined with array_remove() (introduced in PostgreSQL 9.3):

CREATE TABLE npatest AS
SELECT paperid, name AS author
     , array_to_string(array_remove(array_agg(name) OVER (PARTITION BY paperid), name), ', ') AS coauthors
FROM   newpaperauthor n;

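If author names are not unique, there are complications. Then again, if author names are not unique, the whole operation is flawed.

Use array_agg() and array_remove() instead of string_agg() and regexp_replace(), because the latter easily fails for similar names like "Jon Fox" and "Jon Foxy", and is messy with delimiters, too.

array_to_string() converts the array to a string. I used ', ' as the separator, which seems more sensible to me than a space.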
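The difference between the two approaches can be seen on literal values. The following is a small illustration using the two names from the paragraph above; it does not touch any table:

```sql
-- String replacement removes every occurrence of the substring, so a name
-- that is a prefix of another name damages the longer one.
SELECT replace('Jon Fox Jon Foxy', 'Jon Fox', '') AS broken;
-- returns ' y'

-- array_remove() compares whole elements, so 'Jon Foxy' is left intact.
SELECT array_to_string(
         array_remove(ARRAY['Jon Fox', 'Jon Foxy'], 'Jon Fox'), ', ') AS ok;
-- returns 'Jon Foxy'
```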

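The use of SELECT INTO is discouraged. Use the superior CREATE TABLE AS instead. Per documentation:

  CREATE TABLE AS is the recommended syntax, since this form of
  SELECT INTO is not available in ECPG or PL/pgSQL, because they
  interpret the INTO clause differently. Furthermore, CREATE TABLE AS
  offers a superset of the functionality provided by SELECT INTO.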


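Answer 2 (score: 0)

The query by @GordonLinoff can be simplified by keeping each author out of their own aggregate through the join condition, so no string replacement is needed afterwards: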

SELECT DISTINCT
        p0.paperid , p0.authorid , p0.name as name1
        , string_agg(p1.name, ', ' ) AS others
FROM papers p0
JOIN papers p1 ON p1.paperid = p0.paperid AND p1.authorid <> p0.authorid
GROUP BY p0.paperid, p0.authorid, p0.name
ORDER BY p0.paperid, p0.authorid
        ;
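One consequence of the inner self-join is that a paper with a single author produces no row at all, because no p1 row satisfies the join condition. If such papers should still appear, a LEFT JOIN variant is one option. This is a sketch that reuses the table name papers from the query above; the ORDER BY inside string_agg() is an added assumption to make the order of the names deterministic:

```sql
-- LEFT JOIN keeps authors who have no coauthors on a paper.
-- GROUP BY already yields one row per author, so DISTINCT is not needed.
SELECT p0.paperid, p0.authorid, p0.name AS name1,
       string_agg(p1.name, ', ' ORDER BY p1.authorid) AS others  -- NULL when there are no coauthors
FROM   papers p0
LEFT   JOIN papers p1 ON p1.paperid = p0.paperid
                     AND p1.authorid <> p0.authorid
GROUP  BY p0.paperid, p0.authorid, p0.name
ORDER  BY p0.paperid, p0.authorid;
```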