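I have a table of users and a table of Facebook "friend" relationships between them. Given a (known) list of users, I want to quickly find all users who are Facebook friends with 2 or more users in that group.
(This basically boils down to one question: can I rewrite a GROUP BY / HAVING to use JOINs instead?)
Here is a simplified version of the schema I'm working with. I'm using VARCHAR here to make the user names in my sample data (below) easier to follow; IRL the relevant columns are INT: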
-- Simplified Schema
CREATE TABLE _users (
    user_name VARCHAR NOT NULL PRIMARY KEY,
    fb_id     VARCHAR NULL UNIQUE
);
CREATE TABLE _fb_friends (
    id           SERIAL PRIMARY KEY,
    user_name    VARCHAR NULL REFERENCES _users(user_name),
    friend_fb_id VARCHAR NULL REFERENCES _users(fb_id),
    UNIQUE (user_name, friend_fb_id)
);
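Note that there is no (accessible) index on friend_fb_id.
Also note that the _fb_friends table is gigantic, orders of magnitude larger than the _users table, which makes the obvious GROUP BY / HAVING solution very slow. I.e., this is not feasible: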
-- Using GROUP BY/HAVING: Obvious solution, but way too slow.
-- Does a SEQ SCAN on the gigantic table
SELECT me.*
FROM
    _users me
    LEFT OUTER JOIN _fb_friends ff ON (
        ff.user_name = me.user_name
    )
    LEFT OUTER JOIN _users friend ON (
        friend.fb_id = ff.friend_fb_id
    )
GROUP BY me.user_name
HAVING COUNT(friend.user_name) >= 2;
I rewrote this to use JOINs, but I'm not sure whether the solution I came up with is valid or optimal:
-- Using JOINs: Way faster, but is it correct? Better way?
SELECT DISTINCT me.*
FROM (
    _users me
    LEFT OUTER JOIN _fb_friends ff1 ON (
        ff1.user_name = me.user_name
    )
    LEFT OUTER JOIN _fb_friends ff2 ON (
        ff2.user_name = me.user_name
        AND ff2.friend_fb_id <> ff1.friend_fb_id
    )
    LEFT OUTER JOIN _users friend ON (
        friend.fb_id = ff1.friend_fb_id
    )
    LEFT OUTER JOIN _users friend_2 ON (
        friend_2.fb_id = ff2.friend_fb_id
    )
)
WHERE (
    friend.user_name IS NOT NULL
    AND friend_2.user_name IS NOT NULL
);
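For what it's worth, I wrote a simple test case, and it seems to work correctly. But I'm really not sure whether it is actually correct, or whether I'm solving this the best way. Both strategies return the same users: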
BEGIN;
CREATE TABLE _users (
    user_name VARCHAR NOT NULL PRIMARY KEY,
    fb_id     VARCHAR NULL UNIQUE
);
CREATE TABLE _fb_friends (
    id           SERIAL PRIMARY KEY,
    user_name    VARCHAR NULL REFERENCES _users(user_name),
    friend_fb_id VARCHAR NULL REFERENCES _users(fb_id)
);
INSERT INTO _users (user_name, fb_id) VALUES
    ('Bob', 'bob'),
    ('Joe', 'joe'),
    ('Will', 'will'),
    ('Marcus', 'marcus'),
    ('Mitch', 'mitch'),
    ('Rick', 'rick');
INSERT INTO _fb_friends (user_name, friend_fb_id) VALUES
    ('Bob', 'joe'),
    ('Will', 'marcus'),
    ('Joe', 'bob'),
    ('Joe', 'marcus'),
    ('Joe', 'mitch'),
    ('Marcus', 'will'),
    ('Marcus', 'joe'),
    ('Mitch', 'joe');
SELECT 'GROUP BY/HAVING' AS Strategy, me.*
FROM
    _users me
    LEFT OUTER JOIN _fb_friends ff ON (
        ff.user_name = me.user_name
    )
    LEFT OUTER JOIN _users friend ON (
        friend.fb_id = ff.friend_fb_id
    )
GROUP BY me.user_name
HAVING COUNT(friend.user_name) >= 2;
SELECT DISTINCT 'JOIN' AS Strategy, me.*
FROM (
    _users me
    LEFT OUTER JOIN _fb_friends ff1 ON (
        ff1.user_name = me.user_name
    )
    LEFT OUTER JOIN _fb_friends ff2 ON (
        ff2.user_name = me.user_name
        AND ff2.friend_fb_id <> ff1.friend_fb_id
    )
    LEFT OUTER JOIN _users friend ON (
        friend.fb_id = ff1.friend_fb_id
    )
    LEFT OUTER JOIN _users friend_2 ON (
        friend_2.fb_id = ff2.friend_fb_id
    )
)
WHERE (
    friend.user_name IS NOT NULL
    AND friend_2.user_name IS NOT NULL
);
DROP TABLE _fb_friends;
DROP TABLE _users;
COMMIT;
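One point worth checking about the "is it correct?" part: the two strategies agree only as long as each (user_name, friend_fb_id) pair appears once. The simplified schema declares that as UNIQUE, but the test script's _fb_friends table does not. The sketch below is an illustration under stated assumptions, not a benchmark: it uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, replays the sample data, and then adds one duplicate friendship row (the duplicate is hypothetical test data, not part of the original sample).

```python
import sqlite3

# Assumption: SQLite stands in for PostgreSQL so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE _users (
    user_name TEXT NOT NULL PRIMARY KEY,
    fb_id     TEXT UNIQUE
);
-- As in the test script, there is no UNIQUE (user_name, friend_fb_id) here.
CREATE TABLE _fb_friends (
    id           INTEGER PRIMARY KEY,
    user_name    TEXT REFERENCES _users(user_name),
    friend_fb_id TEXT REFERENCES _users(fb_id)
);
INSERT INTO _users (user_name, fb_id) VALUES
    ('Bob', 'bob'), ('Joe', 'joe'), ('Will', 'will'),
    ('Marcus', 'marcus'), ('Mitch', 'mitch'), ('Rick', 'rick');
INSERT INTO _fb_friends (user_name, friend_fb_id) VALUES
    ('Bob', 'joe'), ('Will', 'marcus'), ('Joe', 'bob'), ('Joe', 'marcus'),
    ('Joe', 'mitch'), ('Marcus', 'will'), ('Marcus', 'joe'), ('Mitch', 'joe');
""")

# Strategy 1: GROUP BY / HAVING, which counts joined rows per user.
GROUP_BY_SQL = """
SELECT me.user_name
FROM _users me
LEFT JOIN _fb_friends ff ON ff.user_name = me.user_name
LEFT JOIN _users friend ON friend.fb_id = ff.friend_fb_id
GROUP BY me.user_name
HAVING COUNT(friend.user_name) >= 2
ORDER BY me.user_name
"""

# Strategy 2: self-join, which requires two rows with different friend ids.
JOIN_SQL = """
SELECT DISTINCT me.user_name
FROM _users me
LEFT JOIN _fb_friends ff1 ON ff1.user_name = me.user_name
LEFT JOIN _fb_friends ff2 ON ff2.user_name = me.user_name
                         AND ff2.friend_fb_id <> ff1.friend_fb_id
LEFT JOIN _users friend   ON friend.fb_id = ff1.friend_fb_id
LEFT JOIN _users friend_2 ON friend_2.fb_id = ff2.friend_fb_id
WHERE friend.user_name IS NOT NULL
  AND friend_2.user_name IS NOT NULL
ORDER BY me.user_name
"""


def names(sql):
    """Run a query and return the first column as a list."""
    return [row[0] for row in conn.execute(sql)]


before_group_by = names(GROUP_BY_SQL)
before_join = names(JOIN_SQL)

# Hypothetical duplicate: Bob -> joe is recorded a second time.
conn.execute(
    "INSERT INTO _fb_friends (user_name, friend_fb_id) VALUES ('Bob', 'joe')"
)
after_group_by = names(GROUP_BY_SQL)
after_join = names(JOIN_SQL)

print("before:", before_group_by, before_join)
print("after: ", after_group_by, after_join)
```

On the original sample data both queries agree. Once the duplicate row exists, GROUP BY / HAVING counts Bob's single friend twice, while the self-join still demands two different friend ids. With the UNIQUE constraint from the simplified schema in place, that divergence cannot arise.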
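Basically, my question is whether the JOIN rewrite is correct, and whether there is a better way to do this.
Indexing friend_fb_id, as well as changing the schema, is considered off-limits. I need to work with what I've got.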
Answer 0 (score: 0)
Can you use temporary tables? If so, give this a try:
drop table if exists friend_count;
create temporary table friend_count (
    user_name    varchar not null primary key,
    friend_count int not null
);
create index on friend_count (friend_count);
insert into friend_count select
    user_name,
    count(*)
from _fb_friends
/* place whatever extra code is needed here to count only the friends within
   a smaller group of users */
group by user_name;
select
    me.user_name,
    me.fb_id
from _users me
join friend_count fc on fc.user_name = me.user_name
where fc.friend_count >= 2;
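The placeholder comment in the INSERT leaves the group filter open. One possible way to fill it in is sketched below, using Python's built-in sqlite3 module as a stand-in for PostgreSQL. The `group` list is a hypothetical example of the "known" users, and restricting both the owner and the friend side to that group is an assumption about what the question intends.

```python
import sqlite3

# Assumption: SQLite stands in for PostgreSQL so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE _users (
    user_name TEXT NOT NULL PRIMARY KEY,
    fb_id     TEXT UNIQUE
);
CREATE TABLE _fb_friends (
    id           INTEGER PRIMARY KEY,
    user_name    TEXT REFERENCES _users(user_name),
    friend_fb_id TEXT REFERENCES _users(fb_id),
    UNIQUE (user_name, friend_fb_id)
);
INSERT INTO _users (user_name, fb_id) VALUES
    ('Bob', 'bob'), ('Joe', 'joe'), ('Will', 'will'),
    ('Marcus', 'marcus'), ('Mitch', 'mitch'), ('Rick', 'rick');
INSERT INTO _fb_friends (user_name, friend_fb_id) VALUES
    ('Bob', 'joe'), ('Will', 'marcus'), ('Joe', 'bob'), ('Joe', 'marcus'),
    ('Joe', 'mitch'), ('Marcus', 'will'), ('Marcus', 'joe'), ('Mitch', 'joe');
CREATE TEMP TABLE friend_count (
    user_name    TEXT NOT NULL PRIMARY KEY,
    friend_count INTEGER NOT NULL
);
""")

# Hypothetical "known" group of users; the names are illustrative only.
group = ["Bob", "Joe", "Marcus", "Mitch"]
placeholders = ", ".join("?" for _ in group)

# Count, per user in the group, only the friends who are also in the group.
conn.execute(
    f"""
    INSERT INTO friend_count
    SELECT ff.user_name, COUNT(*)
    FROM _fb_friends ff
    JOIN _users friend ON friend.fb_id = ff.friend_fb_id
    WHERE ff.user_name IN ({placeholders})
      AND friend.user_name IN ({placeholders})
    GROUP BY ff.user_name
    """,
    group + group,
)

result = [
    row[0]
    for row in conn.execute(
        """
        SELECT me.user_name
        FROM _users me
        JOIN friend_count fc ON fc.user_name = me.user_name
        WHERE fc.friend_count >= 2
        ORDER BY me.user_name
        """
    )
]
print(result)
```

Filtering on ff.user_name first may let the planner use the existing unique index on (user_name, friend_fb_id) instead of scanning the whole table, but that depends on the real data and should be confirmed with EXPLAIN on the actual database.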
Answer 1 (score: 0)
I don't have a large enough data set to check, but see whether this performs any faster.
select me.*
from _users me
where 2 = (
    select count(1)
    from (
        select 1
        from _fb_friends ff
        join _users friend on friend.fb_id = ff.friend_fb_id
        where ff.user_name = me.user_name
        limit 2
    ) x
);
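The idea in this query is that LIMIT 2 caps the inner count at 2, so comparing with `2 =` behaves like `>= 2` while letting the database stop after the second matching friend. A small sanity check on the question's sample data is sketched below. It uses Python's built-in sqlite3 module as a stand-in for PostgreSQL, so it only checks the logic and says nothing about performance on the real table.

```python
import sqlite3

# Assumption: SQLite stands in for PostgreSQL so the sketch is self-contained.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE _users (
    user_name TEXT NOT NULL PRIMARY KEY,
    fb_id     TEXT UNIQUE
);
CREATE TABLE _fb_friends (
    id           INTEGER PRIMARY KEY,
    user_name    TEXT REFERENCES _users(user_name),
    friend_fb_id TEXT REFERENCES _users(fb_id),
    UNIQUE (user_name, friend_fb_id)
);
INSERT INTO _users (user_name, fb_id) VALUES
    ('Bob', 'bob'), ('Joe', 'joe'), ('Will', 'will'),
    ('Marcus', 'marcus'), ('Mitch', 'mitch'), ('Rick', 'rick');
INSERT INTO _fb_friends (user_name, friend_fb_id) VALUES
    ('Bob', 'joe'), ('Will', 'marcus'), ('Joe', 'bob'), ('Joe', 'marcus'),
    ('Joe', 'mitch'), ('Marcus', 'will'), ('Marcus', 'joe'), ('Mitch', 'joe');
""")

# The inner query returns at most 2 rows per user, so COUNT(*) is 0, 1, or 2.
LIMIT_SQL = """
SELECT me.user_name
FROM _users me
WHERE 2 = (
    SELECT COUNT(*)
    FROM (
        SELECT 1
        FROM _fb_friends ff
        JOIN _users friend ON friend.fb_id = ff.friend_fb_id
        WHERE ff.user_name = me.user_name
        LIMIT 2
    ) AS x
)
ORDER BY me.user_name
"""

limited = [row[0] for row in conn.execute(LIMIT_SQL)]
print(limited)
```

Joe has three friends in the sample data and is still returned, because the LIMIT stops the inner query at two rows.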