I have a schema (millions of records, with appropriate indexes) that looks like this:
groups   | interests
-------- | -----------
user_id  | user_id
group_id | interest_id
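Users can like 0..many interests and belong to 0..many groups.
Problem: given a group ID, I want to get all the interests of every user who does not belong to that group and who shares at least one interest with anyone who does belong to it.
Since the above may be confusing, here is a simple example (SQLFiddle):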
| 1 | 2 | 3 | 4 | 5 | (User IDs)
|-------------------|
| A |   | A |   |   |
| B | B | B |   | B |
|   | C |   |   |   |
|   |   | D | D |   |
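In the example above, users are labeled with numbers and interests with letters.
If we assume that users 1 and 2 belong to group -1, then users 3 and 5 are the interesting ones: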
user_id interest_id
------- -----------
3 A
3 B
3 D
5 B
I have written a dumb and very inefficient query that correctly returns the above:
SELECT * FROM "interests" WHERE "user_id" IN (
    SELECT "user_id" FROM "interests" WHERE "interest_id" IN (
        SELECT "interest_id" FROM "interests" WHERE "user_id" IN (
            SELECT "user_id" FROM "groups" WHERE "group_id" = -1
        )
    ) AND "user_id" NOT IN (
        SELECT "user_id" FROM "groups" WHERE "group_id" = -1
    )
);
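To make the example easy to reproduce, here is a minimal sketch that loads the sample data into an in-memory SQLite database and runs the nested IN query. SQLite and Python's sqlite3 module are assumptions made only for illustration (the question does not say which database is used), and the `:gid` parameter name is hypothetical.

```python
import sqlite3

# In-memory SQLite database holding the example data from the question.
# "groups" is quoted, as in the question, to avoid any keyword clash.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE interests (user_id INTEGER NOT NULL, interest_id TEXT NOT NULL);
    CREATE TABLE "groups" (user_id INTEGER NOT NULL, group_id INTEGER NOT NULL);
    INSERT INTO interests VALUES
        (1,'A'),(1,'B'),(2,'B'),(2,'C'),(3,'A'),(3,'B'),(3,'D'),(4,'D'),(5,'B');
    INSERT INTO "groups" VALUES (1,-1),(2,-1);
""")

# The nested IN query, with the group ID passed as a named parameter.
rows = conn.execute("""
    SELECT user_id, interest_id FROM interests WHERE user_id IN (
        SELECT user_id FROM interests WHERE interest_id IN (
            SELECT interest_id FROM interests WHERE user_id IN (
                SELECT user_id FROM "groups" WHERE group_id = :gid
            )
        ) AND user_id NOT IN (
            SELECT user_id FROM "groups" WHERE group_id = :gid
        )
    )
    ORDER BY user_id, interest_id
""", {"gid": -1}).fetchall()

for user_id, interest_id in rows:
    print(user_id, interest_id)
```

On the sample data this prints the four rows listed above (3 A, 3 B, 3 D, 5 B), which gives a baseline to compare any rewritten query against.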
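However, all my attempts to convert this into a proper join query have been fruitless: either the query returns more rows than it should, or it takes ten times longer than the subquery version, for example: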
SELECT "iii"."user_id" FROM "interests" AS "iii"
WHERE EXISTS
(
    SELECT "ii"."user_id", "ii"."interest_id" FROM "groups" AS "gg"
    INNER JOIN "interests" AS "ii" ON "gg"."user_id" = "ii"."user_id"
    WHERE EXISTS
    (
        SELECT "i"."interest_id" FROM "groups" AS "g"
        INNER JOIN "interests" AS "i" ON "g"."user_id" = "i"."user_id"
        WHERE "group_id" = -1 AND "i"."interest_id" = "ii"."interest_id"
    ) AND "group_id" != -1 AND "ii"."user_id" = "iii"."user_id"
);
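I have spent the last two nights trying to optimize this query...
Any help or insight that points me in the right direction would be greatly appreciated. :)
PS: Ideally, a query that also returns an aggregated count of the common interests would be even better: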
user_id totalInterests commonInterests
------- -------------- ---------------
3 3 1/2 (either is fine, but 2 is better)
5 1 1
However, I'm not sure how slow that would be compared with doing the counting in code.
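For comparison, the "in code" alternative could look roughly like the sketch below. It assumes the application already holds the (user_id, interest_id) rows returned by the query and the set of interests held by the group's members; the names `candidate_rows`, `group_interests`, and `summarize` are hypothetical.

```python
from collections import defaultdict

# Rows as the query in the question returns them: (user_id, interest_id).
candidate_rows = [(3, "A"), (3, "B"), (3, "D"), (5, "B")]
# Interests held by members of the target group (users 1 and 2 in the example).
group_interests = {"A", "B", "C"}


def summarize(rows, shared):
    """Return {user_id: (total_interests, common_interests)}."""
    per_user = defaultdict(set)
    for user_id, interest_id in rows:
        per_user[user_id].add(interest_id)
    # Common interests are the intersection with the group's interests.
    return {user: (len(own), len(own & shared)) for user, own in per_user.items()}


summary = summarize(candidate_rows, group_interests)
for user_id, (total, common) in sorted(summary.items()):
    print(user_id, total, common)
```

This is a single pass over the returned rows, so its cost is dominated by how many rows have to be transferred from the database rather than by the counting itself.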
Answer 0: (score: 3)
Set up the test tables using the following:
--drop table Interests ----------------------------
CREATE TABLE Interests
(
InterestId char(1) not null
,UserId int not null
)
INSERT Interests values
('A',1)
,('A',3)
,('B',1)
,('B',2)
,('B',3)
,('B',5)
,('C',2)
,('D',3)
,('D',4)
-- drop table Groups ---------------------
CREATE TABLE Groups
(
GroupId int not null
,UserId int not null
)
INSERT Groups values
(-1, 1)
,(-1, 2)
SELECT * from Groups
SELECT * from Interests
The following query seems to do what you want:
DECLARE @GroupId int
SET @GroupId = -1

;WITH cteGroupInterests (InterestId)
 as (-- List of the interests referenced by the target group
     select distinct InterestId
      from Groups gr
       inner join Interests nt
        on nt.UserId = gr.UserId
      where gr.GroupId = @GroupId)
-- Aggregate interests for each user
SELECT
   UserId
  ,count(OwnInterestId)     OwnInterests
  ,count(SharedInterestId)  SharedInterests
 from (-- Subquery lists all interests for each user outside the target group
       select
          nt.UserId
         ,nt.InterestId   OwnInterestId
         ,cte.InterestId  SharedInterestId
        from Interests nt
         left outer join cteGroupInterests cte
          on cte.InterestId = nt.InterestId
        where not exists (-- Correlated subquery: is "this" user in the target group?
                          select 1
                           from Groups gr
                           where gr.GroupId = @GroupId
                            and gr.UserId = nt.UserId)) xx
 group by UserId
 having count(SharedInterestId) > 0
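As a quick sanity check of the CTE approach, the sketch below ports it to SQLite through Python's sqlite3 module. Both are assumptions for illustration only (the answer is written in T-SQL). The derived table is flattened into the outer query, and the `@GroupId` variable becomes a hypothetical `:gid` parameter.

```python
import sqlite3

# Same test data as the setup script, in an in-memory SQLite database.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Interests (InterestId TEXT NOT NULL, UserId INTEGER NOT NULL);
    CREATE TABLE "Groups" (GroupId INTEGER NOT NULL, UserId INTEGER NOT NULL);
    INSERT INTO Interests VALUES
        ('A',1),('A',3),('B',1),('B',2),('B',3),('B',5),('C',2),('D',3),('D',4);
    INSERT INTO "Groups" VALUES (-1,1),(-1,2);
""")

rows = conn.execute("""
    WITH cteGroupInterests (InterestId) AS (
        -- Distinct interests held by members of the target group
        SELECT DISTINCT nt.InterestId
        FROM "Groups" gr
        INNER JOIN Interests nt ON nt.UserId = gr.UserId
        WHERE gr.GroupId = :gid
    )
    SELECT nt.UserId,
           COUNT(nt.InterestId)  AS OwnInterests,    -- every interest row counts
           COUNT(cte.InterestId) AS SharedInterests  -- NULLs from the left join are skipped
    FROM Interests nt
    LEFT OUTER JOIN cteGroupInterests cte ON cte.InterestId = nt.InterestId
    WHERE NOT EXISTS (
        -- Exclude users who are themselves in the target group
        SELECT 1 FROM "Groups" gr
        WHERE gr.GroupId = :gid AND gr.UserId = nt.UserId
    )
    GROUP BY nt.UserId
    HAVING COUNT(cte.InterestId) > 0
    ORDER BY nt.UserId
""", {"gid": -1}).fetchall()

for user_id, own, shared in rows:
    print(user_id, own, shared)
```

On the sample data user 3 comes back with 3 own and 2 shared interests and user 5 with 1 and 1. User 4 is dropped by the HAVING clause because D is not held by anyone in the group.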
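It seems to work, but I would want to do more thorough testing, and I don't know how well it will perform on millions of rows. The key point is that not exists should be very fast.
Answer 1: (score: 1)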
This is a bit confusing. I think the best approach is exists and not exists:
select i.*
from interests i
where not exists (select 1
                  from groups g
                  where g.user_id = i.user_id and
                        g.group_id = $group_id
                 ) and
      exists (select 1
              from groups g join
                   interests i2
                   on g.user_id = i2.user_id
              where g.group_id = $group_id and
                    i2.interest_id = i.interest_id
             );
The first subquery says that the user is not in the group. The second says that an interest is shared with a member of the group.
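One detail worth checking with this approach is what the second subquery is correlated on. The sketch below runs two variants against the question's sample data. It assumes SQLite via Python's sqlite3 module, the question's table name interests, and a group filter inside the second subquery; the `:gid` parameter and the alias `mine` are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE interests (user_id INTEGER NOT NULL, interest_id TEXT NOT NULL);
    CREATE TABLE "groups" (user_id INTEGER NOT NULL, group_id INTEGER NOT NULL);
    INSERT INTO interests VALUES
        (1,'A'),(1,'B'),(2,'B'),(2,'C'),(3,'A'),(3,'B'),(3,'D'),(4,'D'),(5,'B');
    INSERT INTO "groups" VALUES (1,-1),(2,-1);
""")
params = {"gid": -1}

# Variant A: EXISTS is tied to the row's own interest_id,
# so only the interest rows that are themselves shared survive.
shared_only = conn.execute("""
    SELECT i.user_id, i.interest_id
    FROM interests i
    WHERE NOT EXISTS (SELECT 1 FROM "groups" g
                      WHERE g.user_id = i.user_id AND g.group_id = :gid)
      AND EXISTS (SELECT 1
                  FROM "groups" g
                  JOIN interests i2 ON i2.user_id = g.user_id
                  WHERE g.group_id = :gid
                    AND i2.interest_id = i.interest_id)
    ORDER BY i.user_id, i.interest_id
""", params).fetchall()

# Variant B: EXISTS is tied to the row's user_id via a second pass over
# interests, so every interest of a qualifying user is returned.
all_interests = conn.execute("""
    SELECT i.user_id, i.interest_id
    FROM interests i
    WHERE NOT EXISTS (SELECT 1 FROM "groups" g
                      WHERE g.user_id = i.user_id AND g.group_id = :gid)
      AND EXISTS (SELECT 1
                  FROM interests mine
                  JOIN interests i2 ON i2.interest_id = mine.interest_id
                  JOIN "groups" g ON g.user_id = i2.user_id
                  WHERE mine.user_id = i.user_id
                    AND g.group_id = :gid)
    ORDER BY i.user_id, i.interest_id
""", params).fetchall()

print(shared_only)
print(all_interests)
```

On the sample data, variant A returns only the shared rows (3 A, 3 B, 5 B), while variant B also returns 3 D and so matches the result the question asks for. Which one is wanted depends on whether the non-shared interests of matching users are needed.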