如何优化这个组明智的max sql查询?

时间:2015-04-20 12:56:46

标签: sql postgresql postgresql-9.3

这就是我想要的。它基本上通过daily_statistics抓取user_id表和组中的所有记录。同时,它执行以下操作:

  1. 用户的值按最近的
  2. 分组
  3. attachment_ids表示为数组,因此我可以确定用户拥有的附件数
  4. 结果:

     user_id | country_id |       time_at       | assumed_gender |    attachment_ids
    ---------+------------+---------------------+----------------+----------------------
       21581 |        172 | 2015-04-18 17:55:00 |                | [5942]
       21610 |        140 | 2015-04-18 19:55:00 | male           | [5940]
       22044 |        174 | 2015-04-18 21:55:00 | female         | [12312313, 12312313]
    
       21353 |        174 | 2015-04-18 20:59:00 | male           | [5938]
       21573 |        246 | 2015-04-18 21:57:00 | male           | [5936]
    (5 rows)
    

    以下查询执行缓慢。像17秒的东西。

      SELECT
        ds.user_id,
        max(case when id=maxid then country_id end) AS country_id,
        max(case when id=maxid then time_at end) AS time_at,
        max(case when id=maxid then properties->'assumed_gender' end) AS assumed_gender,
        json_agg(to_json(attachment_id)) AS attachment_ids
      FROM daily_statistics ds JOIN (
          SELECT u.id as user_id, (
            SELECT ds2.id FROM daily_statistics ds2 WHERE ds2.user_id=u.id AND ds2.metric = 'participation' AND ds2.status = 'active' AND ds2.campaign_id = 39
            ORDER BY ds2.id DESC LIMIT 1
          ) AS maxid FROM users u
          WHERE u.properties -> 'provider' IN ('twitter')
      ) mu ON (ds.user_id=mu.user_id)
      WHERE ds.campaign_id = 39 AND ds.metric = 'participation' AND ds.status = 'active'
      GROUP BY ds.user_id;
    

    问题在于group wise max语句。有没有办法优化此查询并获得相同的输出?我在考虑使用某种横向连接?但是,我无法获得每个用户attachment_id的数量。

    编辑:对于9k ++的记录,表大小为2m行:执行此查询大约需要25秒。

    foobar_production=> EXPLAIN ANALYZE SELECT
    foobar_production->     ds.user_id,
    foobar_production->     max(case when id=maxid then country_id end) AS country_id,
    foobar_production->     max(case when id=maxid then time_at end) AS time_at,
    foobar_production->     max(case when id=maxid then properties->'assumed_gender' end) AS assumed_gender,
    foobar_production->     json_agg(to_json(attachment_id)) AS attachment_ids
    foobar_production->   FROM daily_statistics ds JOIN (
    foobar_production(>       SELECT u.id as user_id, (
    foobar_production(>         SELECT ds2.id FROM daily_statistics ds2 WHERE ds2.user_id=u.id AND ds2.metric = 'participation' AND ds2.status = 'active' AND ds2.campaign_id = 4742
    foobar_production(>         ORDER BY ds2.id DESC LIMIT 1
    foobar_production(>       ) AS maxid FROM users u
    foobar_production(>       WHERE u.properties -> 'provider' IN ('twitter')
    foobar_production(>   ) mu ON (ds.user_id=mu.user_id)
    foobar_production->   WHERE ds.campaign_id = 4742 AND ds.metric = 'participation' AND ds.status = 'active'
    foobar_production->   GROUP BY ds.user_id;
                                                                                            QUERY PLAN                                             
    ------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
     HashAggregate  (cost=2063.07..2063.08 rows=1 width=103) (actual time=25155.963..25156.859 rows=775 loops=1)
       ->  Nested Loop  (cost=0.98..1883.99 rows=2 width=103) (actual time=0.744..382.699 rows=2787 loops=1)
             ->  Index Scan using index_daily_statistics_on_campaign_id_and_type on daily_statistics ds  (cost=0.56..1621.73 rows=31 width=99) (actual time=0.107..33.513 rows=9751 loops=1)
                   Index Cond: (campaign_id = 4742)
                   Filter: (((metric)::text = 'participation'::text) AND ((status)::text = 'active'::text))
             ->  Index Scan using index_users_on_id_and_type on users u  (cost=0.42..8.45 rows=1 width=4) (actual time=0.024..0.024 rows=0 loops=9751)
                   Index Cond: (id = ds.user_id)
                   Filter: ((properties -> 'provider'::text) = 'twitter'::text)
                   Rows Removed by Filter: 1
       SubPlan 1
         ->  Limit  (cost=29.83..29.84 rows=1 width=4) (actual time=2.953..2.954 rows=1 loops=2787)
               ->  Sort  (cost=29.83..29.84 rows=1 width=4) (actual time=2.951..2.951 rows=1 loops=2787)
                     Sort Key: ds2.id
                     Sort Method: top-N heapsort  Memory: 25kB
                     ->  Bitmap Heap Scan on daily_statistics ds2  (cost=25.80..29.82 rows=1 width=4) (actual time=2.381..2.702 rows=105 loops=2787)
                           Recheck Cond: ((user_id = u.id) AND (campaign_id = 4742))
                           Filter: (((metric)::text = 'participation'::text) AND ((status)::text = 'active'::text))
                           ->  BitmapAnd  (cost=25.80..25.80 rows=1 width=0) (actual time=2.365..2.365 rows=0 loops=2787)
                                 ->  Bitmap Index Scan on index_daily_statistics_on_user_id  (cost=0.00..5.60 rows=156 width=0) (actual time=0.072..0.072 rows=292 loops=2787)
                                       Index Cond: (user_id = u.id)
                                 ->  Bitmap Index Scan on index_daily_statistics_on_campaign_id_and_type  (cost=0.00..19.95 rows=453 width=0) (actual time=2.241..2.241 rows=9751 loops=2787)
                                       Index Cond: (campaign_id = 4742)
       SubPlan 2
         ->  Limit  (cost=29.83..29.84 rows=1 width=4) (actual time=2.879..2.880 rows=1 loops=2787)
               ->  Sort  (cost=29.83..29.84 rows=1 width=4) (actual time=2.876..2.876 rows=1 loops=2787)
                     Sort Key: ds2_1.id
                     Sort Method: top-N heapsort  Memory: 25kB
                     ->  Bitmap Heap Scan on daily_statistics ds2_1  (cost=25.80..29.82 rows=1 width=4) (actual time=2.241..2.585 rows=105 loops=2787)
                           Recheck Cond: ((user_id = u.id) AND (campaign_id = 4742))
                           Filter: (((metric)::text = 'participation'::text) AND ((status)::text = 'active'::text))
                           ->  BitmapAnd  (cost=25.80..25.80 rows=1 width=0) (actual time=2.222..2.222 rows=0 loops=2787)
                                 ->  Bitmap Index Scan on index_daily_statistics_on_user_id  (cost=0.00..5.60 rows=156 width=0) (actual time=0.062..0.062 rows=292 loops=2787)
                                       Index Cond: (user_id = u.id)
                                 ->  Bitmap Index Scan on index_daily_statistics_on_campaign_id_and_type  (cost=0.00..19.95 rows=453 width=0) (actual time=2.124..2.124 rows=9751 loops=2787)
                                       Index Cond: (campaign_id = 4742)
       SubPlan 3
         ->  Limit  (cost=29.83..29.84 rows=1 width=4) (actual time=3.030..3.030 rows=1 loops=2787)
               ->  Sort  (cost=29.83..29.84 rows=1 width=4) (actual time=3.018..3.018 rows=1 loops=2787)
                     Sort Key: ds2_2.id
                     Sort Method: top-N heapsort  Memory: 25kB
                     ->  Bitmap Heap Scan on daily_statistics ds2_2  (cost=25.80..29.82 rows=1 width=4) (actual time=2.407..2.755 rows=105 loops=2787)
                           Recheck Cond: ((user_id = u.id) AND (campaign_id = 4742))
                           Filter: (((metric)::text = 'participation'::text) AND ((status)::text = 'active'::text))
                           ->  BitmapAnd  (cost=25.80..25.80 rows=1 width=0) (actual time=2.390..2.390 rows=0 loops=2787)
                                 ->  Bitmap Index Scan on index_daily_statistics_on_user_id  (cost=0.00..5.60 rows=156 width=0) (actual time=0.121..0.121 rows=292 loops=2787)
                                       Index Cond: (user_id = u.id)
                                 ->  Bitmap Index Scan on index_daily_statistics_on_campaign_id_and_type  (cost=0.00..19.95 rows=453 width=0) (actual time=2.233..2.233 rows=9751 loops=2787)
                                       Index Cond: (campaign_id = 4742)
     Total runtime: 25158.063 ms
    (49 rows)
    
    
    foobar_production=> \d daily_statistics;
                                           Table "public.daily_statistics"
        Column     |            Type             |                           Modifiers
    ---------------+-----------------------------+---------------------------------------------------------------
     id            | integer                     | not null default nextval('daily_statistics_id_seq'::regclass)
     type          | character varying(255)      |
     metric        | character varying(255)      |
     campaign_id   | integer                     |
     user_id       | integer                     |
     country_id    | integer                     |
     attachment_id | integer                     |
     time_at       | timestamp without time zone |
     properties    | hstore                      |
     status        | character varying(255)      | default 'active'::character varying
    Indexes:
        "daily_statistics_pkey" PRIMARY KEY, btree (id)
        "index_daily_statistics_on_attachment_id" btree (attachment_id)
        "index_daily_statistics_on_campaign_id_and_type" btree (campaign_id, type)
        "index_daily_statistics_on_country_id" btree (country_id)
        "index_daily_statistics_on_id" btree (id)
        "index_daily_statistics_on_metric" btree (metric)
        "index_daily_statistics_on_properties" gin (properties)
        "index_daily_statistics_on_status" btree (status)
        "index_daily_statistics_on_time_at" btree (time_at)
        "index_daily_statistics_on_user_id" btree (user_id)
    

    我们将不胜感激。

1 个答案:

答案 0 :(得分:1)

你似乎有2个部分:

  1. 第一个是获取用户的最新统计信息和
  2. 另一个正为用户累积所有attachement_id
  3. 两者都针对特定类型的统计信息。 如果您对users感兴趣,我会从头开始。

    使用此查询搜索最新条目:

    SELECT u.id,
           ds.country_id,
           ds.time_at,
           ds.properties->'assumed_gender' AS assumed_gender
      FROM users u
      JOIN LATERAL (
          SELECT * FROM daily_statistics
           WHERE user_id=u.id
             AND campaign_id = 39
             AND metric = 'participation'
             AND status = 'active'
           ORDER BY id DESC LIMIT 1
        ) ds ON true
     WHERE u.properties -> 'provider' IN ('twitter');
    

    我在这里使用LATERAL syntax,这对于这类查询很有用。

    聚合不会从中受益,因此需要另一个子查询。

    我最终得到了以下最终查询:

    SELECT u.id,
           ds.country_id,
           ds.time_at,
           ds.properties->'assumed_gender' AS assumed_gender,
           g.attachment_ids
      FROM users u
      JOIN LATERAL (
        SELECT * FROM daily_statistics
         WHERE user_id=u.id
           AND campaign_id = 39
           AND metric = 'participation'
           AND status = 'active'
         ORDER BY id DESC LIMIT 1
        ) ds ON true
      JOIN (
        SELECT user_id, json_agg(to_json(attachment_id)) AS attachment_ids
          FROM daily_statistics
         WHERE campaign_id = 39
           AND metric = 'participation'
           AND status = 'active'
         GROUP BY user_id
        ) g ON g.user_id=u.id
     WHERE u.properties -> 'provider' IN ('twitter');
    

    我假设,那个指数:

    CREATE INDEX i_ds_campaign4status
     ON daily_statistics(campaign_id, user_id, id)
     WHERE status='active';
    

    会有所帮助。这取决于您的数据,但是,如果您的所有统计信息都是active,则取决于WHERE条款。


    编辑:根据提供的计划,第二个查询会从使用聚合的联接中受益,因为它会减少LATERAL部分的迭代次数。 我会坚持这种做法。