PostgreSQL complex sum query

Date: 2015-12-02 09:47:57

Tags: postgresql

I have the following tables:

video (id, name) 

keyframe (id, name, video_id) /*video_id has fk on video.id*/

detector (id, concepts)

score (detector_id, keyframe_id, score) /*detector_id has fk on detector.id and keyframe_id has fk on keyframe.id*/
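For concreteness, here is a minimal DDL sketch of that schema; the column types are my assumptions, since the question only gives column names:

CREATE TABLE video (
    id   integer PRIMARY KEY,
    name text
);

CREATE TABLE keyframe (
    id       integer PRIMARY KEY,
    name     text,
    video_id integer REFERENCES video (id)
);

CREATE TABLE detector (
    id       integer PRIMARY KEY,
    concepts text
);

CREATE TABLE score (
    detector_id integer REFERENCES detector (id),
    keyframe_id integer REFERENCES keyframe (id),
    score       double precision
);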

Essentially, a video has multiple keyframes associated with it, and every keyframe has been scored by every detector. Each detector has a set of concepts on which it scores the keyframes.

Now, I want to find the following in a single query:

Given an array of detector IDs (at most 5, say), return the 10 videos that score highest on that combination of detectors. Score a video by averaging, per detector, the scores of its keyframes, and then summing those per-detector averages.
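In formula terms (my notation), for a video v with keyframe set K_v, a chosen detector set D, and s(d, k) the score detector d gave keyframe k:

score(v) = Σ_{d ∈ D} ( Σ_{k ∈ K_v} s(d, k) ) / |K_v|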

Example: for a video with 3 associated keyframes and the following scores for two detectors:

detector_id | keyframe_id | score
1             1             0.0281
1             2             0.0012
1             3             0.0269
2             1             0.1341
2             2             0.9726
2             3             0.7125

This would give the video a score of:

sum(avg(0.0281, 0.0012, 0.0269), avg(0.1341, 0.9726, 0.7125))
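Working that through (my arithmetic):

avg(0.0281, 0.0012, 0.0269) = 0.0562 / 3 ≈ 0.0187
avg(0.1341, 0.9726, 0.7125) = 1.8192 / 3 ≈ 0.6064
0.0187 + 0.6064 ≈ 0.6251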

Ultimately, I want a result like this:

video_id | score
1          0.417328
2          ...

I think it must be something like this, but I haven't been able to get it right:

select
    (select
        (select sum(avg_score) summed_score
        from
        (select
            avg(s.score) avg_score
        from score s
        where s.detector_id = ANY(array[1,2,3,4,5]) and s.keyframe_id = kf.id) x)
    from keyframe kf
    where kf.video_id = v.id) y
from video v

My score table is pretty big (100M rows), so I want this to be as fast as possible (every other option I have tried takes minutes to complete). In total I have about 3,000 videos, 500 detectors, and about 15 keyframes per video.

If this can't be done in under 2 seconds, I'm also open to restructuring the database schema; there will be no inserts/deletes on the database at all.
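Since the data is read-only, one restructuring this leaves open is to precompute the per-(video, detector) averages once, e.g. in a materialized view. This is a sketch of that idea, not part of the answer below, and the names are mine:

CREATE MATERIALIZED VIEW video_detector_avg AS
SELECT k.video_id, s.detector_id, avg(s.score) AS avg_score
FROM keyframe k
JOIN score s ON s.keyframe_id = k.id
GROUP BY k.video_id, s.detector_id;

CREATE INDEX ON video_detector_avg (detector_id, video_id);

/* The interactive query then only sums ~5 rows per video: */
SELECT video_id, sum(avg_score) AS score
FROM video_detector_avg
WHERE detector_id = ANY(ARRAY[1,2,3,4,5])
GROUP BY video_id
ORDER BY score DESC
LIMIT 10;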

EDIT

Thanks to GabrielsMessanger, I have an answer; this is the query plan:

EXPLAIN (analyze, verbose)
SELECT
    v_id, sum(fd_avg_score)
FROM (
    SELECT 
        v.id as v_id, k.id as k_id, d.id as d_id,
        avg(s.score) as fd_avg_score
    FROM
        video v
        JOIN keyframe k ON k.video_id = v.id
        JOIN score s ON s.keyframe_id = k.id
        JOIN detector d ON d.id = s.detector_id
    WHERE
        d.id = ANY(ARRAY[1,2,3,4,5]) /*here goes detector's array*/
    GROUP BY
        v.id,
        k.id,
        d.id
) sub
GROUP BY
    v_id
;

"GroupAggregate  (cost=1865513.09..1910370.09 rows=200 width=12) (actual time=52141.684..52908.198 rows=2991 loops=1)"
"  Output: v.id, sum((avg(s.score)))"
"  Group Key: v.id"
"  ->  GroupAggregate  (cost=1865513.09..1893547.46 rows=1121375 width=20) (actual time=52141.623..52793.184 rows=1121375 loops=1)"
"        Output: v.id, k.id, d.id, avg(s.score)"
"        Group Key: v.id, k.id, d.id"
"        ->  Sort  (cost=1865513.09..1868316.53 rows=1121375 width=20) (actual time=52141.613..52468.062 rows=1121375 loops=1)"
"              Output: v.id, k.id, d.id, s.score"
"              Sort Key: v.id, k.id, d.id"
"              Sort Method: external merge  Disk: 37232kB"
"              ->  Hash Join  (cost=11821.18..1729834.13 rows=1121375 width=20) (actual time=120.706..51375.777 rows=1121375 loops=1)"
"                    Output: v.id, k.id, d.id, s.score"
"                    Hash Cond: (k.video_id = v.id)"
"                    ->  Hash Join  (cost=11736.89..1711527.49 rows=1121375 width=20) (actual time=119.862..51141.066 rows=1121375 loops=1)"
"                          Output: k.id, k.video_id, s.score, d.id"
"                          Hash Cond: (s.keyframe_id = k.id)"
"                          ->  Nested Loop  (cost=4186.70..1673925.96 rows=1121375 width=16) (actual time=50.878..50034.247 rows=1121375 loops=1)"
"                                Output: s.score, s.keyframe_id, d.id"
"                                ->  Seq Scan on public.detector d  (cost=0.00..11.08 rows=5 width=4) (actual time=0.011..0.079 rows=5 loops=1)"
"                                      Output: d.id, d.concepts"
"                                      Filter: (d.id = ANY ('{1,2,3,4,5}'::integer[]))"
"                                      Rows Removed by Filter: 492"
"                                ->  Bitmap Heap Scan on public.score s  (cost=4186.70..332540.23 rows=224275 width=16) (actual time=56.040..9961.040 rows=224275 loops=5)"
"                                      Output: s.detector_id, s.keyframe_id, s.score"
"                                      Recheck Cond: (s.detector_id = d.id)"
"                                      Rows Removed by Index Recheck: 34169904"
"                                      Heap Blocks: exact=192845 lossy=928530"
"                                      ->  Bitmap Index Scan on score_index  (cost=0.00..4130.63 rows=224275 width=0) (actual time=49.748..49.748 rows=224275 loops=5)"
"                                            Index Cond: (s.detector_id = d.id)"
"                          ->  Hash  (cost=3869.75..3869.75 rows=224275 width=8) (actual time=68.924..68.924 rows=224275 loops=1)"
"                                Output: k.id, k.video_id"
"                                Buckets: 16384  Batches: 4  Memory Usage: 2205kB"
"                                ->  Seq Scan on public.keyframe k  (cost=0.00..3869.75 rows=224275 width=8) (actual time=0.003..33.662 rows=224275 loops=1)"
"                                      Output: k.id, k.video_id"
"                    ->  Hash  (cost=46.91..46.91 rows=2991 width=4) (actual time=0.834..0.834 rows=2991 loops=1)"
"                          Output: v.id"
"                          Buckets: 1024  Batches: 1  Memory Usage: 106kB"
"                          ->  Seq Scan on public.video v  (cost=0.00..46.91 rows=2991 width=4) (actual time=0.005..0.417 rows=2991 loops=1)"
"                                Output: v.id"
"Planning time: 2.136 ms"
"Execution time: 52914.840 ms"

1 Answer:

Answer 0 (score: 1)

Disclaimer:

My final answer is based on the comments and an extended chat discussion with the author. One thing to note: each keyframe_id is assigned to exactly one video.

Original answer:

Is it as simple as the following query?:

SELECT
    v_id, sum(fd_avg_score)
FROM (
    SELECT 
        v.id as v_id, k.id as k_id, s.detector_id as d_id,
        avg(s.score) as fd_avg_score
    FROM
        video v
        JOIN keyframe k ON k.video_id = v.id
        JOIN score s ON s.keyframe_id = k.id
    WHERE
        s.detector_id = ANY(ARRAY[1,2,3,4,5]) /*here goes detector's array*/
    GROUP BY
        v.id,
        k.id,
        detector_id
) sub
GROUP BY
    v_id
ORDER BY
    sum(fd_avg_score) DESC /*without this, LIMIT would return 10 arbitrary videos rather than the top-scoring 10*/
LIMIT 10
;

First, in the subquery, we join videos with keyframes and keyframes with scores. We compute the average score per video, per keyframe, per detector (as you described). Finally, in the main query, we sum those avg_scores per video.
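As an aside, because each keyframe_id belongs to exactly one video (see the disclaimer above), the join to video can be dropped entirely and the grouping done on keyframe.video_id directly. A sketch of that variant (mine, not the answer's):

SELECT
    video_id, sum(fd_avg_score) AS score
FROM (
    SELECT
        k.video_id, k.id AS k_id, s.detector_id AS d_id,
        avg(s.score) AS fd_avg_score
    FROM
        keyframe k
        JOIN score s ON s.keyframe_id = k.id
    WHERE
        s.detector_id = ANY(ARRAY[1,2,3,4,5])
    GROUP BY
        k.video_id, k.id, s.detector_id
) sub
GROUP BY video_id
ORDER BY score DESC
LIMIT 10;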

Performance

As the author stated, he has PRIMARY KEYs on the id columns of every table, and also a composite index on score(detector_id, keyframe_id). That should be enough for this query to run fast.

However, while testing, the author needed further optimization. So, two things (a sketch of both steps follows the list):

  1. Remember to always VACUUM ANALYZE your tables, especially after inserting 100M rows (as with the score table). So, at the very least, run VACUUM ANALYZE score.
  2. To optimize further, we can change the composite index on score(detector_id, keyframe_id) into a covering index on score(detector_id, keyframe_id, score). That may allow PostgreSQL to use an Index Only Scan while computing the averages.
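A sketch of both steps; the new index name is my choice, and score_index is the old index's name as it appears in the plan above:

CREATE INDEX score_detector_keyframe_score_idx
    ON score (detector_id, keyframe_id, score);
DROP INDEX IF EXISTS score_index; /*the old (detector_id, keyframe_id) index*/
VACUUM ANALYZE score;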