Using JOIN instead of HAVING (COUNT > n) to improve performance

Asked: 2014-10-29 20:59:37

Tags: sql performance postgresql join group-by

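I have a table of users and a table of Facebook "friend" relationships between them. Given a (known) list of users, I want to quickly find all users who have 2 or more Facebook friends within that group.

(This basically boils down to one question: can I rewrite a GROUP BY / HAVING to use JOINs instead?)

Here is a simplified version of the schema I am working with. I use VARCHAR here to make the user names in my example data (below) easier to follow; in real life the relevant columns are INT: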

-- Simplified Schema
CREATE TABLE _users (
    user_name VARCHAR NOT NULL PRIMARY KEY,
    fb_id     VARCHAR NULL UNIQUE
);
CREATE TABLE _fb_friends (
    id           SERIAL PRIMARY KEY,
    user_name    VARCHAR NULL REFERENCES _users(user_name),
    friend_fb_id VARCHAR NULL REFERENCES _users(fb_id),
    UNIQUE (user_name, friend_fb_id)
);

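Note that there is no (accessible) index on friend_fb_id.

Also note that the _fb_friends table is huge, several orders of magnitude larger than the _users table, which makes the obvious GROUP BY / HAVING solution very slow. That is, this is not feasible: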

-- Using GROUP BY/HAVING: Obvious solution, but way too slow.
-- Does a SEQ SCAN on the gigantic table
SELECT me.*
FROM
    _users me
    LEFT OUTER JOIN _fb_friends ff ON (
        ff.user_name = me.user_name
    )
    LEFT OUTER JOIN _users friend ON (
        friend.fb_id = ff.friend_fb_id
    )
GROUP BY me.user_name
HAVING COUNT(friend.user_name) >= 2;

I rewrote this to use JOINs, but I'm not sure whether the solution I came up with is correct or optimal:

-- Using JOINs: Way faster, but is it correct? Better way?
SELECT DISTINCT me.*
FROM (
    _users me
    LEFT OUTER JOIN _fb_friends ff1 ON (
        ff1.user_name = me.user_name
    )
    LEFT OUTER JOIN _fb_friends ff2 ON (
        ff2.user_name = me.user_name
        AND ff2.friend_fb_id <> ff1.friend_fb_id
    )
    LEFT OUTER JOIN _users friend ON (
        friend.fb_id = ff1.friend_fb_id
    )
    LEFT OUTER JOIN _users friend_2 ON (
        friend_2.fb_id = ff2.friend_fb_id
    )
)
WHERE (
    friend.user_name IS NOT NULL
    AND friend_2.user_name IS NOT NULL
);

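For what it's worth, I wrote a simple test example, and it seems to work correctly. But I'm really not sure whether it is correct, or whether I'm going about this the best way. Both strategies return the same users: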

BEGIN;

CREATE TABLE _users (
    user_name VARCHAR NOT NULL PRIMARY KEY,
    fb_id     VARCHAR NULL UNIQUE
);
CREATE TABLE _fb_friends (
    id           SERIAL PRIMARY KEY,
    user_name    VARCHAR NULL REFERENCES _users(user_name),
    friend_fb_id VARCHAR NULL REFERENCES _users(fb_id)
);
INSERT INTO _users (user_name, fb_id) VALUES
    ('Bob',    'bob'),
    ('Joe',    'joe'),
    ('Will',   'will'),
    ('Marcus', 'marcus'),
    ('Mitch',  'mitch'),
    ('Rick',   'rick');
INSERT INTO _fb_friends (user_name, friend_fb_id) VALUES
    ('Bob',    'joe'),
    ('Will',   'marcus'),
    ('Joe',    'bob'),
    ('Joe',    'marcus'),
    ('Joe',    'mitch'),
    ('Marcus', 'will'),
    ('Marcus', 'joe'),
    ('Mitch',  'joe');

SELECT 'GROUP BY/HAVING' AS Strategy, me.*
FROM
    _users me
    LEFT OUTER JOIN _fb_friends ff ON (
        ff.user_name = me.user_name
    )
    LEFT OUTER JOIN _users friend ON (
        friend.fb_id = ff.friend_fb_id
    )
GROUP BY me.user_name
HAVING COUNT(friend.user_name) >= 2;

SELECT DISTINCT 'JOIN' AS Strategy, me.*
FROM (
    _users me
    LEFT OUTER JOIN _fb_friends ff1 ON (
        ff1.user_name = me.user_name
    )
    LEFT OUTER JOIN _fb_friends ff2 ON (
        ff2.user_name = me.user_name
        AND ff2.friend_fb_id <> ff1.friend_fb_id
    )
    LEFT OUTER JOIN _users friend ON (
        friend.fb_id = ff1.friend_fb_id
    )
    LEFT OUTER JOIN _users friend_2 ON (
        friend_2.fb_id = ff2.friend_fb_id
    )
)
WHERE (
    friend.user_name IS NOT NULL
    AND friend_2.user_name IS NOT NULL
);

DROP TABLE _fb_friends;
DROP TABLE _users;

COMMIT;

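Basically, my questions are:

  1. Is my JOIN solution correct?
  2. Is there a better/canonical way to solve this?
  3. Indexing friend_fb_id, as well as changing the schema, is considered off-limits. I need to make do with what I have.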

2 Answers:

Answer 0: (score: 0)

Can you use temporary tables? If so, give this a try...

drop table if exists friend_count; 

create temporary table friend_count ( 
  user_name varchar not null primary key, 
  friend_count int not null
); 

create index on friend_count (friend_count);

insert into friend_count select 
  user_name,
  count(*)
from _fb_friends
/* add whatever conditions are needed here to count only the friends within
  the smaller group of users */
group by user_name; 

select 
  me.user_name,
  me.fb_id
from _users me
join friend_count fc on fc.user_name = me.user_name
where fc.friend_count >= 2; 
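
One possible way to fill in the placeholder comment, as a sketch only: it assumes the known group is passed as a literal list of user names (the three names below are placeholders picked from the question's sample data) and that only friends who are themselves in that group should count. It would be used in place of the INSERT above, not in addition to it. The extra IS NOT NULL filter is there because _fb_friends.user_name is nullable while friend_count.user_name is not.

```sql
-- Sketch: count, per user, only the friends who belong to the known group.
-- Assumption: the group is given as a literal list; 'Bob', 'Joe', 'Marcus'
-- are placeholder names taken from the question's sample data.
INSERT INTO friend_count
SELECT
    ff.user_name,
    COUNT(*)
FROM _fb_friends ff
JOIN _users friend ON friend.fb_id = ff.friend_fb_id
WHERE friend.user_name IN ('Bob', 'Joe', 'Marcus')
  AND ff.user_name IS NOT NULL   -- friend_count.user_name is NOT NULL
GROUP BY ff.user_name;
```

With the question's sample data and this placeholder group, Joe is the only user whose count reaches 2, so the final select would return only Joe.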

Answer 1: (score: 0)

I don't have a large enough dataset to check, but see whether this performs any faster.

select me.*
from _users me
where 2=(select count(1) from
          (select 1 from _fb_friends ff 
           join _users friend on friend.fb_id=ff.friend_fb_id
           where ff.user_name=me.user_name
           limit 2) x
         )
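
Since this query was not benchmarked, one way to compare it with the other variants is to look at the plans PostgreSQL actually runs. A minimal sketch, assuming PostgreSQL 9.0 or later for the parenthesized EXPLAIN options; keep in mind that ANALYZE really executes the statement:

```sql
-- Show the executed plan, timings and buffer usage for the LIMIT-based query.
EXPLAIN (ANALYZE, BUFFERS)
SELECT me.*
FROM _users me
WHERE 2 = (
    SELECT count(1)
    FROM (
        SELECT 1
        FROM _fb_friends ff
        JOIN _users friend ON friend.fb_id = ff.friend_fb_id
        WHERE ff.user_name = me.user_name
        LIMIT 2   -- stop after two matching friends per user
    ) x
);
```

Running the same command on the GROUP BY/HAVING and JOIN versions should show whether _fb_friends is read with a sequential scan or through the index that backs UNIQUE (user_name, friend_fb_id).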