PostgreSQL / Python - Get the last N rows without duplicates

Time: 2015-05-10 17:50:40

Tags: python postgresql

Is there any way to do this?

E.g., if my table contains the following rows:

id | username | profile_photo
---+----------+--------------
 1 |     juan | urlphoto/juan
 2 |   nestor | urlphoto/nestor
 3 |    pablo | urlphoto/pablo
 4 |    pablo | urlphoto/pablo

and I want to get the last 2 (two) distinct rows:

id 2 -> nestor | urlphoto/nestor
id 3 -> pablo  | urlphoto/pablo

Thanks for your time.

Solution:

The workaround is to insert an item only if it is not already among the n most recent rows:

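import psycopg2
import psycopg2.extras

# n (how many recent rows to check) and user_id (the candidate to insert)
# are assumed to be defined by the caller.
db = psycopg2.connect("")

cursor = db.cursor(cursor_factory=psycopg2.extras.RealDictCursor)
# Pass n as a query parameter; a bare "LIMIT n" in the SQL string is not substituted.
cursor.execute("SELECT * FROM users ORDER BY id DESC LIMIT %s;", (n,))
recent_ids = [item['user_id'] for item in cursor.fetchall()]

if user_id not in recent_ids:
    cursor.execute("INSERT..")
    db.commit()
cursor.close()
db.close()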

2 answers:

Answer 0: (score: 0)

If you don't care about the final row order, go with:

SELECT min(id), username, profile_photo 
FROM oh_my_table
GROUP BY username, profile_photo
ORDER BY min(id) DESC 
LIMIT 2

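Answer 1: (score: 0)

You didn't describe what constitutes a duplicate row (there are no duplicates in your example, since every row is unique thanks to id), but I assume you want rows that are distinct on all columns except id, and that you don't care which of a duplicate's several possible ids is returned.

Let's start with some test data: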

CREATE UNLOGGED TABLE profile_photos (id int, username text, profile_photo text);
Time: 417.014 ms

INSERT INTO profile_photos
SELECT g.id, r.username, 'urlphoto/' || r.username
FROM generate_series(1, 10000000) g (id)
CROSS JOIN substr(md5(g.id::text), 0, 8) r (username);
INSERT 0 10000000
Time: 24497.335 ms

I will test two possible solutions; here are the two indexes, one for each solution:

CREATE INDEX id_btree ON profile_photos USING btree (id);
CREATE INDEX
Time: 8139.347 ms

CREATE INDEX username_profile_photo_id_btree ON profile_photos USING btree (username, profile_photo, id DESC);
CREATE INDEX
Time: 81667.411 ms

VACUUM ANALYZE profile_photos;
VACUUM
Time: 1338.034 ms

The first solution is the one given by Sami and Clément (their queries are basically the same):

SELECT min(id), username, profile_photo 
FROM profile_photos
GROUP BY username, profile_photo
ORDER BY min(id) DESC 
LIMIT 2;

   min    | username |  profile_photo   
----------+----------+------------------
 10000000 | d1ca3aa  | urlphoto/d1ca3aa
  9999999 | 283f427  | urlphoto/283f427
(2 rows)
Time: 5088.611 ms

The result looks correct, but if any of these users posted a profile photo earlier, this query can produce unwanted results. Let's simulate that:

UPDATE profile_photos
SET (username, profile_photo) = ('d1ca3aa', 'urlphoto/d1ca3aa')
WHERE id = 1;
UPDATE 1
Time: 1.313 ms

SELECT min(id), username, profile_photo 
FROM profile_photos
GROUP BY username, profile_photo
ORDER BY min(id) DESC 
LIMIT 2;

   min   | username |  profile_photo   
---------+----------+------------------
 9999999 | 283f427  | urlphoto/283f427
 9999998 | facf1f3  | urlphoto/facf1f3
(2 rows)
Time: 5032.213 ms

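So the query ignores anything new the user may have added: min(id) ranks each group by its oldest row, so user d1ca3aa is now ranked by id 1 and drops out of the result. That doesn't look like what you want, so I suggest replacing min(id) with max(id):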

SELECT max(id), username, profile_photo 
FROM profile_photos
GROUP BY username, profile_photo
ORDER BY max(id) DESC 
LIMIT 2;

   max    | username |  profile_photo   
----------+----------+------------------
 10000000 | d1ca3aa  | urlphoto/d1ca3aa
  9999999 | 283f427  | urlphoto/283f427
(2 rows)
Time: 5068.507 ms

The result is correct now, but the query looks slow. The query plan is:

                                                                                         QUERY PLAN                                                                                          
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=655369.97..655369.98 rows=2 width=29) (actual time=6215.284..6215.285 rows=2 loops=1)
   ->  Sort  (cost=655369.97..678809.36 rows=9375755 width=29) (actual time=6215.282..6215.282 rows=2 loops=1)
         Sort Key: (max(id))
         Sort Method: top-N heapsort  Memory: 25kB
         ->  GroupAggregate  (cost=0.56..561612.42 rows=9375755 width=29) (actual time=0.104..4945.534 rows=9816449 loops=1)
               ->  Index Only Scan using username_profile_photo_id_btree on profile_photos  (cost=0.56..392855.43 rows=9999925 width=29) (actual time=0.089..1849.036 rows=10000000 loops=1)
                     Heap Fetches: 0
 Total runtime: 6215.344 ms
(8 rows)
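
To make the min(id) versus max(id) difference concrete, here is a minimal sketch on a five-row table. It assumes SQLite (Python's built-in sqlite3 module) as a stand-in for PostgreSQL so that it runs without a server; the usernames and the helper last_two are made up for illustration and are not part of the original answer.

```python
import sqlite3

# SQLite stands in for PostgreSQL here (an assumption made so the sketch runs
# without a server); these GROUP BY queries are valid in both.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE profile_photos (id INTEGER, username TEXT, profile_photo TEXT)"
)
conn.executemany(
    "INSERT INTO profile_photos VALUES (?, ?, ?)",
    [
        (1, "dana", "urlphoto/dana"),  # an old post by dana
        (2, "alan", "urlphoto/alan"),
        (3, "bea", "urlphoto/bea"),
        (4, "carl", "urlphoto/carl"),
        (5, "dana", "urlphoto/dana"),  # dana posts again: newest row
    ],
)


def last_two(aggregate):
    """Run the GROUP BY query with min(id) or max(id) (hypothetical helper)."""
    if aggregate not in ("min", "max"):
        raise ValueError("aggregate must be 'min' or 'max'")
    query = f"""
        SELECT {aggregate}(id), username, profile_photo
        FROM profile_photos
        GROUP BY username, profile_photo
        ORDER BY {aggregate}(id) DESC
        LIMIT 2
    """
    return conn.execute(query).fetchall()


with_min = last_two("min")  # dana is ranked by id 1 and disappears
with_max = last_two("max")  # dana is ranked by id 5 and comes first
print(with_min)
print(with_max)
```

With min(id) the newest poster is pushed behind everyone else because the group is ranked by its oldest row; with max(id) each group is ranked by its most recent row.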

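The thing to note here is that there is no legitimate use of an aggregate that would require GROUP BY: in this case GROUP BY is used to filter out duplicates, and the only aggregate is a workaround to pick any one of them. Postgres has an extension to standard SQL, DISTINCT ON, that lets you discard duplicates over a set of columns: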

SELECT *
FROM (    
    SELECT DISTINCT ON (username, profile_photo) *
    FROM profile_photos
) X
ORDER BY id DESC
LIMIT 2;

    id    | username |  profile_photo   
----------+----------+------------------
 10000000 | d1ca3aa  | urlphoto/d1ca3aa
  9999999 | 283f427  | urlphoto/283f427
(2 rows)
Time: 3779.723 ms

This is a bit faster, and here is why:

                                                                                         QUERY PLAN                                                                                          
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=630370.16..630370.17 rows=2 width=29) (actual time=4921.031..4921.031 rows=2 loops=1)
   ->  Sort  (cost=630370.16..653809.55 rows=9375755 width=29) (actual time=4921.030..4921.030 rows=2 loops=1)
         Sort Key: profile_photos.id
         Sort Method: top-N heapsort  Memory: 25kB
         ->  Unique  (cost=0.56..442855.06 rows=9375755 width=29) (actual time=0.114..4220.410 rows=9816449 loops=1)
               ->  Index Only Scan using username_profile_photo_id_btree on profile_photos  (cost=0.56..392855.43 rows=9999925 width=29) (actual time=0.111..2040.601 rows=10000000 loops=1)
                     Heap Fetches: 0
 Total runtime: 4921.081 ms
(8 rows)
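
The DISTINCT ON step can be read as "sort, then keep the first row of every group". The sketch below emulates that in plain Python. One assumption is worth stating loudly: the subquery above has no ORDER BY, so PostgreSQL does not guarantee which row of each duplicate group survives; the sketch assumes the (username, profile_photo, id DESC) order of the index, which is what the plan above happens to use. The sample rows and the function name are made up for illustration.

```python
from itertools import groupby

# Hypothetical sample rows: (id, username, profile_photo).
rows = [
    (1, "dana", "urlphoto/dana"),
    (2, "alan", "urlphoto/alan"),
    (3, "bea", "urlphoto/bea"),
    (4, "carl", "urlphoto/carl"),
    (5, "dana", "urlphoto/dana"),
]


def distinct_on_then_last(rows, n):
    """Emulate DISTINCT ON (username, profile_photo), then ORDER BY id DESC LIMIT n."""

    def pair(row):
        return (row[1], row[2])

    # Assumed ordering: username, profile_photo, id DESC (same as the index).
    ordered = sorted(rows, key=lambda r: (r[1], r[2], -r[0]))
    # DISTINCT ON keeps the first row of each group in that ordering,
    # i.e. the row with the highest id for every (username, profile_photo).
    firsts = [next(group) for _, group in groupby(ordered, key=pair)]
    # Outer query: newest ids first, keep n of them.
    return sorted(firsts, key=lambda r: r[0], reverse=True)[:n]


latest = distinct_on_then_last(rows, 2)
print(latest)
```

If the guaranteed behaviour matters, adding ORDER BY username, profile_photo, id DESC inside the subquery makes the choice of surviving row explicit rather than dependent on the plan.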

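What if we could somehow fetch the last row with a simple ORDER BY id DESC LIMIT 1, and then look from the end of the table for another row that is not a duplicate of it?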

WITH first AS (
    SELECT *
    FROM profile_photos
    ORDER BY id DESC
    LIMIT 1
)
SELECT *
FROM first
UNION ALL
(SELECT *
FROM profile_photos p
WHERE EXISTS (
    SELECT 1
    FROM first
    WHERE (first.username, first.profile_photo) <> (p.username, p.profile_photo))
ORDER BY id DESC
LIMIT 1);

    id    | username |  profile_photo   
----------+----------+------------------
 10000000 | d1ca3aa  | urlphoto/d1ca3aa
  9999999 | 283f427  | urlphoto/283f427
(2 rows)
Time: 1.217 ms

This is very fast, but it is hand-crafted to produce only two rows. Let's replace it with something more "automatic":

WITH RECURSIVE last (id, username, profile_photo, a) AS (
    (SELECT id, username, profile_photo, ARRAY[ROW(username, profile_photo)] a
    FROM profile_photos
    ORDER BY id DESC
    LIMIT 1)
    UNION ALL
    (SELECT older.id, older.username, older.profile_photo, last.a || ROW(older.username, older.profile_photo)
    FROM last
    JOIN profile_photos older ON last.id > older.id AND NOT ROW(older.username, older.profile_photo) = ANY(last.a)
    WHERE array_length(a, 1) < 10
    ORDER BY id DESC
    LIMIT 1)
)
SELECT id, username, profile_photo
FROM last;

    id    | username |  profile_photo   
----------+----------+------------------
 10000000 | d1ca3aa  | urlphoto/d1ca3aa
  9999999 | 283f427  | urlphoto/283f427
  9999998 | facf1f3  | urlphoto/facf1f3
  9999997 | 305ebab  | urlphoto/305ebab
  9999996 | 74ab43a  | urlphoto/74ab43a
  9999995 | 23f2458  | urlphoto/23f2458
  9999994 | 6b465af  | urlphoto/6b465af
  9999993 | 33ee85a  | urlphoto/33ee85a
  9999992 | c0b9ef4  | urlphoto/c0b9ef4
  9999991 | b63d5bf  | urlphoto/b63d5bf
(10 rows)
Time: 2706.837 ms

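This is faster than the full-scan queries above, but as you can see in the query plan below, for every row it produces it has to scan the index on id again.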

                                                                                      QUERY PLAN                                                                                       
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 CTE Scan on last  (cost=6.52..6.74 rows=11 width=68) (actual time=0.104..4439.807 rows=10 loops=1)
   CTE last
     ->  Recursive Union  (cost=0.43..6.52 rows=11 width=61) (actual time=0.098..4439.780 rows=10 loops=1)
           ->  Limit  (cost=0.43..0.47 rows=1 width=29) (actual time=0.095..0.095 rows=1 loops=1)
                 ->  Index Scan Backward using id_btree on profile_photos  (cost=0.43..333219.47 rows=9999869 width=29) (actual time=0.093..0.093 rows=1 loops=1)
           ->  Limit  (cost=0.43..0.58 rows=1 width=61) (actual time=443.965..443.966 rows=1 loops=10)
                 ->  Nested Loop  (cost=0.43..1406983.38 rows=9510977 width=61) (actual time=443.964..443.964 rows=1 loops=10)
                       Join Filter: ((last_1.id > older.id) AND (ROW(older.username, older.profile_photo) <> ALL (last_1.a)))
                       Rows Removed by Join Filter: 8
                       ->  Index Scan Backward using id_btree on profile_photos older  (cost=0.43..333219.47 rows=9999869 width=29) (actual time=0.008..167.755 rows=1000010 loops=10)
                       ->  WorkTable Scan on last last_1  (cost=0.00..0.25 rows=3 width=36) (actual time=0.000..0.000 rows=0 loops=10000102)
                             Filter: (array_length(a, 1) < 10)
                             Rows Removed by Filter: 1
 Total runtime: 4439.907 ms
(14 rows)

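Since Postgres 9.3 there is a new kind of join, LATERAL JOIN. It lets you make join decisions at the row level (i.e. it works "for each row"). We can use it to implement the following logic: "as long as we do not have N rows yet, for each produced row, check whether there is a non-duplicate row older than the last one, and if there is, add it to the result":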

WITH RECURSIVE last (id, username, profile_photo, a) AS (
    (SELECT id, username, profile_photo, ARRAY[ROW(username, profile_photo)] a
    FROM profile_photos
    ORDER BY id DESC
    LIMIT 1)
    UNION ALL
    (SELECT older.id, older.username, older.profile_photo, last.a || ROW(older.username, older.profile_photo)
    FROM last
    CROSS JOIN LATERAL (
        SELECT *
        FROM profile_photos older
        WHERE last.id > older.id AND NOT ROW(older.username, older.profile_photo) = ANY(last.a)
        ORDER BY id DESC
        LIMIT 1
    ) older
    WHERE array_length(a, 1) < 10
    ORDER BY id DESC
    LIMIT 1)
)
SELECT id, username, profile_photo
FROM last;

    id    | username |  profile_photo   
----------+----------+------------------
 10000000 | d1ca3aa  | urlphoto/d1ca3aa
  9999999 | 283f427  | urlphoto/283f427
  9999998 | facf1f3  | urlphoto/facf1f3
  9999997 | 305ebab  | urlphoto/305ebab
  9999996 | 74ab43a  | urlphoto/74ab43a
  9999995 | 23f2458  | urlphoto/23f2458
  9999994 | 6b465af  | urlphoto/6b465af
  9999993 | 33ee85a  | urlphoto/33ee85a
  9999992 | c0b9ef4  | urlphoto/c0b9ef4
  9999991 | b63d5bf  | urlphoto/b63d5bf
(10 rows)
Time: 1.966 ms

Now this is fast... until N gets too large.

                                                                                        QUERY PLAN                                                                                        
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 CTE Scan on last  (cost=18.61..18.83 rows=11 width=68) (actual time=0.074..0.359 rows=10 loops=1)
   CTE last
     ->  Recursive Union  (cost=0.43..18.61 rows=11 width=61) (actual time=0.070..0.346 rows=10 loops=1)
           ->  Limit  (cost=0.43..0.47 rows=1 width=29) (actual time=0.067..0.068 rows=1 loops=1)
                 ->  Index Scan Backward using id_btree on profile_photos  (cost=0.43..333219.47 rows=9999869 width=29) (actual time=0.065..0.065 rows=1 loops=1)
           ->  Limit  (cost=1.79..1.79 rows=1 width=61) (actual time=0.026..0.026 rows=1 loops=10)
                 ->  Sort  (cost=1.79..1.80 rows=3 width=61) (actual time=0.025..0.025 rows=1 loops=10)
                       Sort Key: older.id
                       Sort Method: quicksort  Memory: 25kB
                       ->  Nested Loop  (cost=0.43..1.77 rows=3 width=61) (actual time=0.020..0.021 rows=1 loops=10)
                             ->  WorkTable Scan on last last_1  (cost=0.00..0.25 rows=3 width=36) (actual time=0.001..0.001 rows=1 loops=10)
                                   Filter: (array_length(a, 1) < 10)
                                   Rows Removed by Filter: 0
                             ->  Limit  (cost=0.43..0.49 rows=1 width=29) (actual time=0.017..0.017 rows=1 loops=9)
                                   ->  Index Scan Backward using id_btree on profile_photos older  (cost=0.43..161076.14 rows=3170326 width=29) (actual time=0.016..0.016 rows=1 loops=9)
                                         Index Cond: (last_1.id > id)
                                         Filter: (ROW(username, profile_photo) <> ALL (last_1.a))
                                         Rows Removed by Filter: 0
 Total runtime: 0.439 ms
(19 rows)
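
The logic of the recursive LATERAL query can be sketched in plain Python as a backward walk that stops as soon as N distinct rows have been collected. This is only an illustration of the idea, not what PostgreSQL executes; the sample table, the visited list and both function names are assumptions introduced here.

```python
def newest_first(table, visited):
    """Yield rows by descending id, recording every id that was actually read."""
    for row in sorted(table, key=lambda r: r[0], reverse=True):
        visited.append(row[0])
        yield row


def last_n_distinct(rows_newest_first, n):
    """Collect the n newest rows whose (username, profile_photo) was not seen yet."""
    seen = set()  # plays the role of the array "a" in the recursive CTE
    result = []
    for row_id, username, profile_photo in rows_newest_first:
        pair = (username, profile_photo)
        if pair in seen:
            continue  # duplicate of a newer row: skip it
        seen.add(pair)
        result.append((row_id, username, profile_photo))
        if len(result) == n:
            break  # stop early, like the array_length(a, 1) < N guard
    return result


# Hypothetical sample rows: (id, username, profile_photo).
table = [
    (1, "dana", "urlphoto/dana"),
    (2, "alan", "urlphoto/alan"),
    (3, "bea", "urlphoto/bea"),
    (4, "carl", "urlphoto/carl"),
    (5, "carl", "urlphoto/carl"),
    (6, "dana", "urlphoto/dana"),
]

visited = []
result = last_n_distinct(newest_first(table, visited), 3)
print(result)
print(visited)
```

Only the rows down to id 3 are read, which mirrors why the LATERAL plan touches just a handful of index entries. The sketch uses a set for constant-time lookups; the CTE's array a has no such lookup, which is presumably part of why the query degrades as N grows.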