Question

I have the following table in PostgreSQL 8.4.12:

           Table "public.ratings"
 Column |          Type          | Modifiers
--------+------------------------+-----------
 userid | character varying(128) |
 item   | character varying(128) |
 score  | integer                |
Indexes:
    "ratings_item" btree (item)
    "ratings_ui" btree (userid, item)
    "ratings_userid" btree (userid)

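I want to perform a self-join to find the items rated by all users who rated a specific item. For simplicity, I will use a query that returns the number of ratings for each similar item, like this: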

select r2.item,sum(1)
from ratings r1
left join ratings r2 using (userid)
where r1.item='an3.php'
group by r2.item

The query works, but on my table of 36 million records it takes forever. When I EXPLAIN the statement, I get the following:

 GroupAggregate  (cost=8102958.42..8247621.18 rows=16978 width=17)
   ->  Sort  (cost=8102958.42..8151108.60 rows=19260072 width=17)
         Sort Key: r2.item
         ->  Hash Left Join  (cost=1458652.29..4192647.43 rows=19260072 width=17)
               Hash Cond: ((r1.userid)::text = (r2.userid)::text)
               ->  Bitmap Heap Scan on ratings r1  (cost=868.20..77197.24 rows=24509 width=22)
                     Recheck Cond: ((item)::text = 'an3.php'::text)
                     ->  Bitmap Index Scan on ratings_item  (cost=0.00..862.07 rows=24509 width=0)
                           Index Cond: ((item)::text = 'an3.php'::text)
               ->  Hash  (cost=711028.93..711028.93 rows=36763293 width=39)
                     ->  Seq Scan on ratings r2  (cost=0.00..711028.93 rows=36763293 width=39)

From experience, I assume the "Seq Scan on ratings r2" is the culprit.

On the other hand, if I search for an item that does not exist:

select r2.item,sum(1) from ratings r1 left join ratings r2 using (userid)
where r1.item='targetitem' group by r2.item;

it seems to work fine (that is, it returns no results, and it does so immediately):

GroupAggregate  (cost=2235887.19..2248234.70 rows=16978 width=17)
   ->  Sort  (cost=2235887.19..2239932.29 rows=1618038 width=17)
         Sort Key: r2.item
         ->  Nested Loop Left Join  (cost=0.00..1969469.94 rows=1618038 width=17)
               ->  Index Scan using ratings_item on ratings r1  (cost=0.00..8317.74 rows=2059 width=22)
                     Index Cond: ((item)::text = 'targetitem'::text)
               ->  Index Scan using ratings_userid on ratings r2  (cost=0.00..947.24 rows=419 width=39)
                     Index Cond: ((r1.userid)::text = (r2.userid)::text)

The same table and query run fine in MySQL, but I cannot migrate my recommender system to another database.

Am I doing something wrong, or is this a problem with Postgres? Is there a workaround?

Answer 1

To answer the (rhetorical) question in the title: no.

I see a number of problems here, starting with the first line.

Postgres 8.4 reached EOL last year. Nobody should be using it any more; it is simply too old. Upgrade to a current version if at all possible.

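Failing that, you should at least be on the latest minor release. 8.4.12 was released on 2012-06-04 and is missing two years of bug and security fixes. 8.4.22 is the last release of the dead version. Read the versioning policy of the project.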

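Next, varchar(128) as a PK / FK is very inefficient, especially for tables with millions of rows. It is needlessly big and expensive to process. Use integer or bigint instead. Or UUID if you really need the bigger number space (which I doubt).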
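One way to act on this is to move the text keys into lookup tables and reference them by integer. The following is only a minimal sketch: the tables users, items and ratings_int and the columns user_id and item_id are hypothetical names, not part of the original schema, and collapsing duplicate ratings with max(score) is an arbitrary choice made for illustration.

```sql
-- Hypothetical lookup tables; all names here are assumptions.
CREATE TABLE users (
   user_id serial PRIMARY KEY,
   userid  varchar(128) NOT NULL UNIQUE   -- the original text key
);

CREATE TABLE items (
   item_id serial PRIMARY KEY,
   item    varchar(128) NOT NULL UNIQUE   -- the original text key
);

-- Narrow ratings table keyed by integers instead of varchar(128).
CREATE TABLE ratings_int (
   user_id integer NOT NULL REFERENCES users,
   item_id integer NOT NULL REFERENCES items,
   score   integer,
   PRIMARY KEY (user_id, item_id)
);

-- Fill the lookup tables from the existing data.
INSERT INTO users (userid)
SELECT DISTINCT userid FROM ratings WHERE userid IS NOT NULL;

INSERT INTO items (item)
SELECT DISTINCT item FROM ratings WHERE item IS NOT NULL;

-- Copy the ratings; max(score) collapses duplicate (userid, item) pairs
-- (assumption: any rule for resolving duplicates would do here).
INSERT INTO ratings_int (user_id, item_id, score)
SELECT u.user_id, i.item_id, max(r.score)
FROM   ratings r
JOIN   users u USING (userid)
JOIN   items i USING (item)
GROUP  BY 1, 2;
```

The self-join would then run on ratings_int with 4-byte keys, and the item name would only be resolved once through items.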

Next, I see no UNIQUE or PRIMARY KEY constraint on (userid, item) (which would obsolete an additional index on the same). Either your table definition is lacking, or your query is wrong, or your question is broken.
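A rough sketch of how the missing constraint could be added, assuming each user is meant to rate an item at most once and that neither column contains NULL values (both are assumptions about the data, not facts from the question):

```sql
-- Look for duplicate (userid, item) pairs first; the constraint
-- cannot be created while any exist.
SELECT userid, item, count(*) AS ct
FROM   ratings
GROUP  BY userid, item
HAVING count(*) > 1;

-- If the query above returns no rows:
ALTER TABLE ratings ADD PRIMARY KEY (userid, item);

-- The primary key comes with its own index on (userid, item),
-- so the plain index on the same columns is now redundant.
DROP INDEX ratings_ui;
```

With the constraint in place, the subquery selecting userid for a single item can no longer return the same user twice.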

Try this rewritten query:

SELECT r2.item, count(*) AS ct
FROM  (
   SELECT userid
   FROM   ratings
   WHERE  item = 'an3.php'
   GROUP  BY 1  -- should not be necessary, but constraint is missing
   ) r1
JOIN   ratings r2 USING (userid)
GROUP  BY 1;

In modern Postgres you want two indexes for optimum performance: one on (item, userid) and one on (userid, item).
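As a sketch, only the index on (item, userid) should need to be created, because an index on (userid, item) is already listed in the question's table definition (ratings_ui), or is provided by a primary key on those columns. The index name ratings_iu is made up for this example.

```sql
-- Index for the inner lookup: all users who rated a given item.
-- The name ratings_iu is an assumption.
CREATE INDEX ratings_iu ON ratings (item, userid);

-- Refresh planner statistics after the change.
ANALYZE ratings;
```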

Is a composite index also good for queries on the first field?

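In Postgres 9.2+ you may even get index-only scans out of this. I am not sure how to get the most out of your outdated version. Either way, varchar(128) is also an expensive data type for indexes.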

Does a PostgreSQL self-join ignore indexes?
