SQL - Calculating the overlap between interests

Time: 2015-10-28 22:27:57

Tags: sql sqlite join query-optimization

I have a schema (millions of records, with proper indexes) that looks like this:

groups    |  interests
------    |  ---------
user_id   |  user_id
group_id  |  interest_id

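A user can like 0..many interests and belong to 0..many groups.

Problem: given a group ID, I want to get all the interests of every user who does not belong to that group but shares at least one interest with anyone who does belong to that same group.

Since the above may be confusing, here is a simple example (SQLFiddle):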

| 1 | 2 | 3 | 4 | 5 | (User IDs)
|-------------------|
| A |   | A |   |   |
| B | B | B |   | B |
|   | C |   |   |   |
|   |   | D | D |   |

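In the example above, users are labeled with numbers and interests with letters.

If we assume that users 1 and 2 belong to group -1, then users 3 and 5 are the interesting ones: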

user_id  interest_id
-------  -----------
      3            A
      3            B
      3            D
      5            B

I have already written a dumb and very inefficient query that correctly returns the above:

SELECT * FROM "interests" WHERE "user_id" IN (
    SELECT "user_id" FROM "interests" WHERE "interest_id" IN (
        SELECT "interest_id" FROM "interests" WHERE "user_id" IN (
            SELECT "user_id" FROM "groups" WHERE "group_id" = -1
        )
    ) AND "user_id" NOT IN (
        SELECT "user_id" FROM "groups" WHERE "group_id" = -1
    )
);
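A minimal sketch for checking the query above against the example, assuming SQLite driven from Python's built-in sqlite3 module. The in-memory database, the names `conn` and `NESTED_IN_QUERY`, the column types, and the added ORDER BY are illustrative choices, not part of the original setup:

```python
import sqlite3

# Assumption: an in-memory SQLite database holding only the example rows.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE "groups"    (user_id INTEGER NOT NULL, group_id    INTEGER NOT NULL);
    CREATE TABLE "interests" (user_id INTEGER NOT NULL, interest_id TEXT    NOT NULL);
    INSERT INTO "groups" VALUES (1, -1), (2, -1);
    INSERT INTO "interests" VALUES
        (1, 'A'), (1, 'B'), (2, 'B'), (2, 'C'),
        (3, 'A'), (3, 'B'), (3, 'D'), (4, 'D'), (5, 'B');
""")

# The nested-IN query from the question, with the group ID bound as a
# named parameter and an ORDER BY added so the output order is stable.
NESTED_IN_QUERY = """
SELECT "user_id", "interest_id" FROM "interests" WHERE "user_id" IN (
    SELECT "user_id" FROM "interests" WHERE "interest_id" IN (
        SELECT "interest_id" FROM "interests" WHERE "user_id" IN (
            SELECT "user_id" FROM "groups" WHERE "group_id" = :group_id
        )
    ) AND "user_id" NOT IN (
        SELECT "user_id" FROM "groups" WHERE "group_id" = :group_id
    )
)
ORDER BY "user_id", "interest_id"
"""

rows = conn.execute(NESTED_IN_QUERY, {"group_id": -1}).fetchall()
print(rows)  # [(3, 'A'), (3, 'B'), (3, 'D'), (5, 'B')]
```

Having the expected rows pinned down like this gives any rewritten query a baseline to be compared against.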

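But all my attempts to turn it into a proper join query have been fruitless: either the query returns more rows than it should, or it takes 10 times longer than the subquery version, for example: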

SELECT "iii"."user_id" FROM "interests" AS "iii"
WHERE EXISTS
(
    SELECT "ii"."user_id", "ii"."interest_id" FROM "groups" AS "gg"
    INNER JOIN "interests" AS "ii" ON "gg"."user_id" = "ii"."user_id"
    WHERE EXISTS
    (
        SELECT "i"."interest_id" FROM "groups" AS "g"
        INNER JOIN "interests" AS "i" ON "g"."user_id" = "i"."user_id"
        WHERE "group_id" = -1 AND "i"."interest_id" = "ii"."interest_id"
    ) AND "group_id" != -1 AND "ii"."user_id" = "iii"."user_id"
);

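I have spent the last two nights trying to optimize this query...

Any help or insight that gets me moving in the right direction would be greatly appreciated. :)

PS: Ideally, a query that also returns an aggregated count of the common interests would be even better: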

user_id  totalInterests  commonInterests
-------  --------------  ---------------
      3               3              1/2 (either is fine, but 2 is better)
      5               1                1

However, I'm not sure how much slower that would be compared to doing it in code.

2 Answers:

Answer 0: (score: 3)

Set up the test tables with the following:

--drop table Interests  ----------------------------
CREATE TABLE Interests
 (
   InterestId  char(1)  not null
  ,UserId      int      not null
 )

INSERT Interests values
  ('A',1)
 ,('A',3)
 ,('B',1)
 ,('B',2)
 ,('B',3)
 ,('B',5)
 ,('C',2)
 ,('D',3)
 ,('D',4)


--  drop table Groups  ---------------------
CREATE TABLE Groups
 (
   GroupId  int  not null
  ,UserId   int  not null
 )

INSERT Groups values
  (-1, 1)
 ,(-1, 2)


SELECT * from Groups
SELECT * from Interests

The following query appears to do what you want:

DECLARE @GroupId int

SET @GroupId = -1

;WITH cteGroupInterests (InterestId)
 as (--  List of the interests referenced by the target group
     select distinct InterestId
      from Groups gr
       inner join Interests nt
        on nt.UserId = gr.UserId
      where gr.GroupId = @GroupId)
--  Aggregate interests for each user
SELECT
   UserId
  ,count(OwnInterestId)     OwnInterests
  ,count(SharedInterestId)  SharedInterests
 from (--  Subquery lists all interests for each user
       select
          nt.UserId
         ,nt.InterestId   OwnInterestId
         ,cte.InterestId  SharedInterestId
        from Interests nt
         left outer join cteGroupInterests cte
          on cte.InterestId = nt.InterestId
        where not exists (--  Correlated subquery: is "this" user in the target group?)
                          select 1
                           from Groups gr
                           where gr.GroupId = @GroupId
                            and gr.UserId = nt.UserId)) xx
 group by UserId
 having count(SharedInterestId) > 0

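It seems to work, but I'd want to do more thorough testing, and I don't know how well it will perform against millions of rows. The key points are:

  • The CTE acts like a temporary table that the following query references; building an actual temp table might improve performance
  • Correlated subqueries can be tricky, but with indexes the not exists should be very fast
  • I was lazy and left out all the underscores, sorry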
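Since the question is tagged sqlite and the query above uses T-SQL syntax (DECLARE, @GroupId), here is a rough sketch of the same idea ported to SQLite through Python's sqlite3 module, using an actual temp table as the first bullet suggests. The temp table name `group_interests`, the snake_case column names taken from the question's schema, and the in-memory sample data are assumptions made for illustration; whether the temp table actually helps on millions of rows would need to be measured.

```python
import sqlite3

# Assumption: in-memory database with the question's example data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE "groups"    (user_id INTEGER NOT NULL, group_id    INTEGER NOT NULL);
    CREATE TABLE "interests" (user_id INTEGER NOT NULL, interest_id TEXT    NOT NULL);
    INSERT INTO "groups" VALUES (1, -1), (2, -1);
    INSERT INTO "interests" VALUES
        (1, 'A'), (1, 'B'), (2, 'B'), (2, 'C'),
        (3, 'A'), (3, 'B'), (3, 'D'), (4, 'D'), (5, 'B');
    -- Hypothetical temp table standing in for the CTE.
    CREATE TEMP TABLE group_interests (interest_id TEXT PRIMARY KEY);
""")

group_id = -1

# Step 1: materialise the distinct interests of the target group's members.
conn.execute("""
    INSERT INTO group_interests (interest_id)
    SELECT DISTINCT i.interest_id
    FROM "groups" AS g
    JOIN "interests" AS i ON i.user_id = g.user_id
    WHERE g.group_id = ?
""", (group_id,))

# Step 2: for every user outside the group, count own and shared interests.
# COUNT(gi.interest_id) skips the NULLs produced by the LEFT JOIN, so it
# counts only interests that also appear in the target group.
counts = conn.execute("""
    SELECT i.user_id,
           COUNT(i.interest_id)  AS own_interests,
           COUNT(gi.interest_id) AS shared_interests
    FROM "interests" AS i
    LEFT JOIN group_interests AS gi ON gi.interest_id = i.interest_id
    WHERE NOT EXISTS (
        SELECT 1 FROM "groups" AS g
        WHERE g.group_id = ? AND g.user_id = i.user_id
    )
    GROUP BY i.user_id
    HAVING COUNT(gi.interest_id) > 0
    ORDER BY i.user_id
""", (group_id,)).fetchall()

print(counts)  # [(3, 3, 2), (5, 1, 1)]
```

On the example data this gives user 3 three interests with two shared (A and B) and user 5 one interest with one shared, which matches the "2 is better" variant of the counts the question asks for.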

Answer 1: (score: 1)

This is a bit confusing. I think the best approach is exists and not exists:

select i.*
from interests i
where not exists (select 1
                  from groups g
                  where i.user_id = g.user_id and
                        g.group_id = $group_id
                 ) and
      exists (select 1
              from groups g join
                   interests i2
                   on g.user_id = i2.user_id
              where g.group_id = $group_id and
                    g.user_id <> i.user_id and
                    i.interest_id = i2.interest_id
             );

The first subquery says the user is not in the group. The second says the user shares an interest with a member of the group.
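A minimal sketch of how this exists / not exists approach behaves on the question's example, assuming SQLite via Python's sqlite3 module. The table and column names follow the question's schema (`interests`, `user_id`), and the `g.group_id = :group_id` filter inside the second subquery is an assumption made so that it only matches members of the target group:

```python
import sqlite3

# Assumption: in-memory database with the question's example data.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE "groups"    (user_id INTEGER NOT NULL, group_id    INTEGER NOT NULL);
    CREATE TABLE "interests" (user_id INTEGER NOT NULL, interest_id TEXT    NOT NULL);
    INSERT INTO "groups" VALUES (1, -1), (2, -1);
    INSERT INTO "interests" VALUES
        (1, 'A'), (1, 'B'), (2, 'B'), (2, 'C'),
        (3, 'A'), (3, 'B'), (3, 'D'), (4, 'D'), (5, 'B');
""")

EXISTS_QUERY = """
SELECT i.user_id, i.interest_id
FROM "interests" AS i
WHERE NOT EXISTS (              -- the user is not in the target group
          SELECT 1
          FROM "groups" AS g
          WHERE g.user_id = i.user_id
            AND g.group_id = :group_id)
  AND EXISTS (                  -- some group member has this same interest
          SELECT 1
          FROM "groups" AS g
          JOIN "interests" AS i2 ON g.user_id = i2.user_id
          WHERE g.group_id = :group_id      -- assumed filter, see lead-in
            AND i2.interest_id = i.interest_id)
ORDER BY i.user_id, i.interest_id
"""

shared_rows = conn.execute(EXISTS_QUERY, {"group_id": -1}).fetchall()
print(shared_rows)  # [(3, 'A'), (3, 'B'), (5, 'B')]
```

Because the exists is correlated on the interest of each row, this returns only the shared interests. The row (3, D) from the question's expected output is not included, so getting every interest of the matching users would still require an outer filter on user_id, as in the question's original query.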