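Complex query to return a list of email conversations

Time: 2014-06-18 09:26:51

Tags: postgresql psql

I have a complex query that I cannot get to work in this SQL Fiddle.

In the application I work on, we sync users' Gmail with our database. We store the emails in an emails table, and we also have a replies table, in which we store the References header that lists all of an email's parent messages.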

For example, if I have an email like this:

id  | subject     | message_id
---------------------------------------------------------------------------------------------
1   | howzitgoin  | 53856b1448c89_23fa9605badd015951@3a139e8c-0b81-42c2-8e59-133c262e96a9.mail

There are no records in the replies table for it.

Now, if we import a reply to this email, like this:

    id  | subject         | message_id
    ---------------------------------------------------------------------------------------------
    2   | RE: howzitgoin  | CAEBV8YTu_A6LtP_uGuQ-QSVj3zojWUiwcjGZpsPPEz1Pj3_i1A@mail.gmail.com

We store the following in the replies table:

    email_id | message_id
------------------------------------------------------------------------------------------
       2     | 53856b1448c89_23fa9605badd015951@3a139e8c-0b81-42c2-8e59-133c262e96a9.mail

If we then receive another reply:

id  | subject         | message_id
---------------------------------------------------------------------------------------------
3   | RE: howzitgoin  | 53856b88a2a09_23fa9605badd01601b@3a139e8c-0b81-42c2-8e59-133c262e96a9.mail

We store the following in the replies table:

email_id | message_id
---------------------------------------------------------------------------------------------
   3     | 53856b1448c89_23fa9605badd015951@3a139e8c-0b81-42c2-8e59-133c262e96a9.mail
   3     | CAEBV8YTu_A6LtP_uGuQ-QSVj3zojWUiwcjGZpsPPEz1Pj3_i1A@mail.gmail.com
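To make the layout above concrete, here is a minimal sketch of the two tables with the three sample emails. The column types, the foreign key and the `created_at` default are assumptions for illustration; they are not taken from the real schema.

```sql
-- Hypothetical DDL: column types and constraints are assumptions.
CREATE TABLE emails (
    id         serial PRIMARY KEY,
    subject    text,
    message_id text,
    folder     text,
    created_at timestamp without time zone DEFAULT now()
);

-- One row per (email, ancestor message id) from the References header.
CREATE TABLE replies (
    email_id   integer NOT NULL REFERENCES emails (id),
    message_id text    NOT NULL
);

INSERT INTO emails (id, subject, message_id) VALUES
    (1, 'howzitgoin',     '53856b1448c89_23fa9605badd015951@3a139e8c-0b81-42c2-8e59-133c262e96a9.mail'),
    (2, 'RE: howzitgoin', 'CAEBV8YTu_A6LtP_uGuQ-QSVj3zojWUiwcjGZpsPPEz1Pj3_i1A@mail.gmail.com'),
    (3, 'RE: howzitgoin', '53856b88a2a09_23fa9605badd01601b@3a139e8c-0b81-42c2-8e59-133c262e96a9.mail');

INSERT INTO replies (email_id, message_id) VALUES
    (2, '53856b1448c89_23fa9605badd015951@3a139e8c-0b81-42c2-8e59-133c262e96a9.mail'),
    (3, '53856b1448c89_23fa9605badd015951@3a139e8c-0b81-42c2-8e59-133c262e96a9.mail'),
    (3, 'CAEBV8YTu_A6LtP_uGuQ-QSVj3zojWUiwcjGZpsPPEz1Pj3_i1A@mail.gmail.com');

-- List every email together with the ancestors recorded for it.
SELECT e.id, e.subject, p.id AS parent_id
FROM emails e
LEFT JOIN replies r ON r.email_id = e.id
LEFT JOIN emails  p ON p.message_id = r.message_id
ORDER BY e.id, p.id;
```

With these rows the final query gives one row for email 1 with no parent, one row for email 2 (parent 1) and two rows for email 3 (parents 1 and 2).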

After much effort, I came up with this query:

    WITH "ranked_replies" AS 
    (
        SELECT "r"."email_id", "r"."message_id", "rnk" 
        FROM (SELECT *, rank() OVER (PARTITION BY "message_id" ORDER BY "email_id" DESC) AS "rnk" FROM "replies") AS "r" 
        INNER JOIN "emails" 
        ON ("emails"."message_id" = "r"."message_id") 
    ), 
    "count_of_replies" AS 
    (
        SELECT "email_id", count(*) AS "count", count(*) AS "thread_count" 
        FROM "ranked_replies" 
        GROUP BY "email_id"
    ) 
    SELECT DISTINCT "emails".*, "thread_count" 
    FROM "emails" 
    LEFT JOIN "count_of_replies" 
    ON ("emails"."id" = "count_of_replies"."email_id") 
    WHERE 
    (
        ("folder" = 'INBOX') 
        AND 
        (
            ("emails"."message_id" NOT IN (SELECT "message_id" FROM "ranked_replies" WHERE ("rnk" != 1))) 

            OR ("emails"."message_id" IS NULL)
        ) 
        AND ("emails"."id" NOT IN (SELECT "email_id" FROM "ranked_replies" WHERE ("rnk" != 1)))
    ) 
    ORDER BY "created_at" DESC LIMIT 50 OFFSET 0

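The problem is that it does not return the email thread with the subject "not returning".

The reason is this part of the WHERE clause:

("emails"."message_id" NOT IN (SELECT "message_id" FROM "ranked_replies" WHERE ("rnk" != 1)))

This excludes the root email with the subject "not returning", because it has 2 rows in ranked_replies, with ranks 1 and 2.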

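I want a query that:

  1. Shows emails that have no replies (i.e. that are not in a thread).
  2. Shows the top of each thread; if a thread has several end nodes, I only need one.
  3. Shows only emails in the current folder (the inbox).

Referring to the SQL Fiddle example: it should return emails #5 (highest rank in thread 1), #8 (highest rank in thread 2), #9 (not in a thread) and #10 (not the highest rank, but the only one of its thread in the inbox).

I am having trouble with #10.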

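3 answers:

Answer 0: (score: 2)

The "not returning" email is not returned because the top email has replies, and the lower emails are not in "INBOX".

As a side note: since you describe this as an "application", I think you could improve it by transferring the metadata of all mails (at least up to a certain number) to the client and doing the sorting/filtering there. Depending on the user base etc., this may well scale better than the database (and feel faster to the user). Back to your question:

I am quite unsure why you use rank(), so I dropped it. If you keep using it because of some other requirement that I overlooked, you can do this: apply the rank to a subselect of "replies" that only handles rows whose reply is in the current folder.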
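A minimal sketch of that rank() variant, assuming the original "emails" and "replies" tables from the question (where "replies"."message_id" holds the message id of an ancestor). It is only an illustration of the idea, not a tested replacement for the full query:

```sql
-- Sketch: rank only those replies whose own email sits in the current folder,
-- so replies from other folders no longer take part in the ranking.
WITH "ranked_replies" AS (
    SELECT r."email_id", r."message_id",
           rank() OVER (PARTITION BY r."message_id"
                        ORDER BY r."email_id" DESC) AS "rnk"
    FROM "replies" r
    INNER JOIN "emails" e ON e."id" = r."email_id"
    WHERE e."folder" = 'INBOX'
)
SELECT *
FROM "ranked_replies"
ORDER BY "message_id", "rnk";
```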

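Maybe you want to skip the selection of the "threadid" in my solution and solve this by storing that id at insert time. Or create a unique id for every thread yourself.

Since I don't know how Gmail displays mails, I assume you need the following:

  1. Show emails that have no replies (i.e. that are not in a thread).
  2. Show the top of each thread; if a thread has several end nodes, I assume you only need one.
  3. Only care about emails in the current folder.

I created this SQL Fiddle to do that. There I also changed your database model so that a reply references the message it replies to by its primary key (id) instead of by its message_id. Because that is a numeric sequence, it can be used to resolve the thread tree, which is what I did.

Here is the solution: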

    WITH "thread" AS 
    ( -- select the uppermost id per thread
        SELECT r."email_id", min("reply_to_id") AS "threadid"
        FROM "replies" r
        INNER JOIN "emails" e ON r.email_id = e.id 
        -- create tree only for current folder
        AND e.folder = 'INBOX'
        GROUP BY r."email_id"
    ),
    "lastmail" AS
    ( -- select the highest email per thread
      SELECT t."threadid", max(r."email_id") AS "lastmail"
      FROM replies r
      INNER JOIN thread t ON t.threadid = r.reply_to_id
      GROUP BY t."threadid"
    ),
    "count_of_replies" AS
    (SELECT r."email_id", count(r.*) AS "thread_count"
     FROM replies r
      INNER JOIN thread t ON t.threadid = r.reply_to_id
      GROUP BY r."email_id")
    SELECT DISTINCT "emails".*, "thread_count"
    FROM "emails" 
    LEFT JOIN "count_of_replies" 
    ON ("emails"."id" = "count_of_replies"."email_id") 
    WHERE 
    (
       -- only from current folder
        ("folder" = 'INBOX') AND
        (
          -- the ones that are in no thread
          ("emails"."id" NOT IN (SELECT "email_id" FROM "thread"
                                 UNION ALL
                                 SELECT "threadid" from "thread"))
          OR
          -- the ones that are top in their thread
          ("emails"."id" IN (SELECT "lastmail" FROM "lastmail"))
        )
    ) 
    ORDER BY "created_at" DESC LIMIT 50 OFFSET 0
    

Answer 1: (score: 1)

Since the comments suggest that you can change the schema and that you are looking for a Gmail-style listing, consider the following:

CREATE TABLE emails
(
  id serial PRIMARY KEY,
  thread_id serial NOT NULL,
  parent_id int REFERENCES emails (id),
  subject text,
  created_at timestamp without time zone,
  folder text,
  message_id text
);

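With this schema, you can get the latest post of each thread and the number of emails in each thread without any joins: a simple scan with window functions will do.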
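One way such a scan might look, assuming the emails table defined above. The aliases thread_count and rn are illustrative names, and the tie-break on id is an assumption for messages sharing a timestamp:

```sql
-- Sketch: newest email per thread plus the thread size, in a single pass.
SELECT id, thread_id, subject, created_at, folder, thread_count
FROM (
    SELECT e.*,
           count(*)     OVER (PARTITION BY thread_id) AS thread_count,
           row_number() OVER (PARTITION BY thread_id
                              ORDER BY created_at DESC, id DESC) AS rn
    FROM emails e
) AS t
WHERE rn = 1          -- keep only the newest row of each thread
ORDER BY created_at DESC
LIMIT 50;
```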

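When your app does not set thread_id, it is auto-incremented, which starts a new thread. This also lets you split a thread in two later on, if needed.

If you want to optimize the schema further, add two extra columns, is_latest and num_messages: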

CREATE TABLE emails
(
  id serial PRIMARY KEY,
  parent_id int REFERENCES emails (id),
  thread_id serial NOT NULL,
  is_latest boolean NOT NULL default true,
  num_messages int NOT NULL default 1,
  subject text,
  created_at timestamp without time zone,
  folder text,
  message_id text
);

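Then, use:

  1. A before insert trigger on rows where parent_id is not null or num_messages = 0, which selects the latest row of the thread and sets new.num_messages to that row's num_messages + 1.

  2. An after insert trigger that updates the other rows in the thread and applies the needed side effects, i.e. sets is_latest and num_messages.

  3. Before and after triggers on update and delete statements, to keep the two columns consistent where appropriate, e.g. when deleting an email, changing its date, or moving it and its siblings to another thread.

Note: do not put side effects that touch other rows into before triggers, and take care not to create trigger loops in the update triggers.

With this optimized schema, you can put a partial index on created_at where is_latest, and collect all the needed information without any joins, aggregates or window functions.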
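The insert triggers and the partial index could be sketched roughly as follows. This is a simplified illustration under stated assumptions, not a complete implementation: the function, trigger and index names are made up, messages are assumed to arrive in chronological order, only the parent_id is not null case is handled, the after trigger only clears is_latest, and the update/delete triggers are left out.

```sql
-- Before insert: a reply joins its parent's thread and continues its counter.
CREATE FUNCTION emails_before_insert() RETURNS trigger AS $$
BEGIN
    IF NEW.parent_id IS NOT NULL THEN
        SELECT p.thread_id INTO NEW.thread_id
        FROM emails p
        WHERE p.id = NEW.parent_id;

        SELECT l.num_messages + 1 INTO NEW.num_messages
        FROM emails l
        WHERE l.thread_id = NEW.thread_id AND l.is_latest;
    END IF;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER emails_before_insert
    BEFORE INSERT ON emails
    FOR EACH ROW EXECUTE PROCEDURE emails_before_insert();

-- After insert: the older rows of the thread stop being the latest one.
CREATE FUNCTION emails_after_insert() RETURNS trigger AS $$
BEGIN
    UPDATE emails
    SET is_latest = false
    WHERE thread_id = NEW.thread_id
      AND id <> NEW.id
      AND is_latest;
    RETURN NULL;  -- the return value of an AFTER row trigger is ignored
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER emails_after_insert
    AFTER INSERT ON emails
    FOR EACH ROW EXECUTE PROCEDURE emails_after_insert();

-- Partial index: only the newest row of each thread is indexed.
CREATE INDEX emails_latest_created_at_idx
    ON emails (created_at DESC)
    WHERE is_latest;

-- The listing then needs no joins, aggregates or window functions.
SELECT id, thread_id, subject, created_at, num_messages
FROM emails
WHERE is_latest
ORDER BY created_at DESC
LIMIT 50;
```

Note that is_latest here is tracked per thread, not per folder, so a folder-specific listing such as the inbox view in the question would need the flag to be maintained per folder as well.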

Answer 2: (score: 0)

ALTER TABLE emails
        ADD COLUMN thread_id INTEGER REFERENCES emails(id)
        , ADD COLUMN previous_id INTEGER REFERENCES emails(id)
        ;
        -- Initially, all messages are in their own private thread.
UPDATE emails SET thread_id = id;


WITH www AS (
        SELECT DISTINCT email_id AS email_id
        -- Find the oldest and newest reference per id
        , first_value(e.id) OVER w AS mmin
        , last_value(e.id) OVER w AS mmax
        FROM replies r
        JOIN emails e ON e.message_id = r.message_id
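        -- Note: with the default window frame, last_value() only sees rows up to
        -- the current row; add ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED
        -- FOLLOWING to the window if the newest reference is wanted on every row.
        WINDOW w AS (PARTITION BY r.email_id ORDER BY e.created_at ASC)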
        )
UPDATE emails dst
SET thread_id = www.mmin
        , previous_id = www.mmax
FROM www
WHERE dst.id = www.email_id
        ;


SELECT  id, thread_id, previous_id
        , rank() over (PARTITION BY thread_id ORDER BY created_at) AS rnk
        , subject, created_at, folder ,message_id
FROM emails
ORDER BY thread_id,created_at,id;

Result:

ALTER TABLE
UPDATE 12
UPDATE 8
UPDATE 8
 id | thread_id | previous_id | rnk |       subject       |         created_at         | folder |                                 message_id                                 
----+-----------+-------------+-----+---------------------+----------------------------+--------+----------------------------------------------------------------------------
  1 |         1 |             |   1 | howzitgoin          | 2014-06-22 16:53:56.168109 |        | 53856b1448c89_23fa9605badd015951@3a139e8c-0b81-42c2-8e59-133c262e96a9.mail
  2 |         1 |           1 |   2 | Re: howzitgoin      | 2014-06-22 17:03:56.168109 | INBOX  | CAEBV8YTu_A6LtP_uGuQ-QSVj3zojWUiwcjGZpsPPEz1Pj3_i1A@mail.gmail.com
  3 |         1 |           1 |   3 | Re: howzitgoin      | 2014-06-22 17:13:56.168109 | INBOX  | 53856b88a2a09_23fa9605badd01601b@3a139e8c-0b81-42c2-8e59-133c262e96a9.mail
  4 |         1 |           1 |   4 | Re: howzitgoin      | 2014-06-22 17:23:56.168109 | INBOX  | CAEBV8YT6vx6buOKUga4f=bcNGq_=WzwiqEzm2FWm3HoLZ8SbJA@mail.gmail.com
  5 |         1 |           3 |   5 | Re: howzitgoin      | 2014-06-22 17:33:56.168109 | INBOX  | CAEBV8YRGwkNSb3cxquS5abSCmnwLn37GxCpg74mQe7=3SC5cdQ@mail.gmail.com
  6 |         6 |             |   1 | thready mercury     | 2014-06-22 17:43:56.168109 | INBOX  | 1401340841.22951.YahooMailNeo@web171602.mail.ir2.yahoo.com
  7 |         6 |           6 |   2 | RE: thready mercury | 2014-06-22 17:53:56.168109 |        | 5386c3c3f364d_23ff44def2de8849cf@55cedd07-f558-4cc4-b5d2-046dc7642b91.mail
  8 |         6 |           7 |   3 | RE: thready mercury | 2014-06-22 18:03:56.168109 | INBOX  | 1401340888.34275.YahooMailNeo@web171605.mail.ir2.yahoo.com
  9 |         9 |             |   1 | Not part of thread  | 2014-06-22 16:43:56.168109 | INBOX  | CAEBV8YRGwkNSb3cxquS5abSCmnwLn37GxCpg74mQe7=3SC5cdQ@mail.gmail.com
 10 |        10 |             |   1 | not returning       | 2014-06-22 16:03:56.168109 | INBOX  | CAEBV8YQ58oXAm9wqLz18J5RsV2fN9u__bnsp_Z8qhdEJpmt-EQ@mail.gmail.com
 11 |        10 |          10 |   2 | RE: not returning   | 2014-06-22 16:13:56.168109 |        | 538f82fc7e661_23f9fbee7c84825682@3b0c9abe-8e87-410f-895e-c132ff5f4be3.mail
 12 |        10 |          10 |   3 | RE: not returning   | 2014-06-22 16:23:56.168109 |        | 539024ecaa478_23fb766d5ca209988b@24b867e3-62f2-491b-a58a-660c87d0be57.mail
(12 rows)

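Update: it appears the OP only wants the newest email of each thread (he should have put that in the question). That is, for each thread, the one with the maximal rank(), or the one that has no lead() value: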

WITH xxx AS (
SELECT  id, thread_id, previous_id
        , rank() over (PARTITION BY thread_id ORDER BY created_at) AS rnk
        , lead(id) over (PARTITION BY thread_id ORDER BY created_at ) AS nxt
        , subject, created_at, folder ,message_id
        FROM emails
        )
SELECT id, thread_id, previous_id, rnk, subject, created_at, folder ,message_id
FROM xxx
WHERE nxt IS NULL
ORDER BY thread_id,created_at,id;

Result:

 id | thread_id | previous_id | rnk |       subject       |         created_at         | folder |                                 message_id                                 
----+-----------+-------------+-----+---------------------+----------------------------+--------+----------------------------------------------------------------------------
  5 |         1 |           2 |   5 | Re: howzitgoin      | 2014-06-25 12:37:58.263205 | INBOX  | CAEBV8YRGwkNSb3cxquS5abSCmnwLn37GxCpg74mQe7=3SC5cdQ@mail.gmail.com
  8 |         6 |           6 |   3 | RE: thready mercury | 2014-06-25 13:07:58.263205 | INBOX  | 1401340888.34275.YahooMailNeo@web171605.mail.ir2.yahoo.com
  9 |         9 |             |   1 | Not part of thread  | 2014-06-25 11:47:58.263205 | INBOX  | CAEBV8YRGwkNSb3cxquS5abSCmnwLn37GxCpg74mQe7=3SC5cdQ@mail.gmail.com
 12 |        10 |          10 |   3 | RE: not returning   | 2014-06-25 11:27:58.263205 |        | 539024ecaa478_23fb766d5ca209988b@24b867e3-62f2-491b-a58a-660c87d0be57.mail
(4 rows)

Note that the rnk field here is effectively the highest rank in the thread, i.e. the number of messages in the thread.
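The result above still contains #12, which is not in the inbox, while the question asks for #10. A possible variation, assuming the emails table with thread_id populated as above, is to filter on the folder inside the CTE so that lead() only looks at INBOX rows:

```sql
WITH xxx AS (
    SELECT id, thread_id, subject, created_at, folder
         , lead(id) OVER (PARTITION BY thread_id ORDER BY created_at) AS nxt
    FROM emails
    WHERE folder = 'INBOX'   -- filter first, so lead() only sees INBOX rows
)
SELECT id, thread_id, subject, created_at, folder
FROM xxx
WHERE nxt IS NULL            -- no later INBOX message in the same thread
ORDER BY thread_id;
```

With the twelve rows listed in the first result, this should leave #5, #8, #9 and #10. A rank() computed inside this CTE would count only the INBOX messages of a thread, so the full thread size would have to be computed before the folder filter is applied.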