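I have a complex query that is not working in this SQL Fiddle.
In the application I am working on, we sync a user's Gmail with our database. We store the emails in an emails table, and we also have a replies table that stores the References header, which lists all of an email's parent messages.
For example, if I have an email like this: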
id | subject | message_id
---------------------------------------------------------------------------------------------
1 | howzitgoin | 53856b1448c89_23fa9605badd015951@3a139e8c-0b81-42c2-8e59-133c262e96a9.mail
There would be no records in the replies table.
Now, if we import a reply to this email, like so:
id | subject | message_id
---------------------------------------------------------------------------------------------
2 | RE: howzitgoin | CAEBV8YTu_A6LtP_uGuQ-QSVj3zojWUiwcjGZpsPPEz1Pj3_i1A@mail.gmail.com
We would store the following in the replies table:
email_id | message_id
------------------------------------------------------------------------------------------
2 | 53856b1448c89_23fa9605badd015951@3a139e8c-0b81-42c2-8e59-133c262e96a9.mail
If we then receive a further reply in the same thread:
id | subject | message_id
---------------------------------------------------------------------------------------------
3 | RE: howzitgoin | 53856b88a2a09_23fa9605badd01601b@3a139e8c-0b81-42c2-8e59-133c262e96a9.mail
We would store the following in the replies table:
email_id | message_id
---------------------------------------------------------------------------------------------
3 | 53856b1448c89_23fa9605badd015951@3a139e8c-0b81-42c2-8e59-133c262e96a9.mail
3 | CAEBV8YTu_A6LtP_uGuQ-QSVj3zojWUiwcjGZpsPPEz1Pj3_i1A@mail.gmail.com
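To make the mapping concrete, here is a rough sketch of how a References header could be turned into replies rows. It is not our actual import code: the function name reply_rows is hypothetical, and it assumes the header is a whitespace-separated list of message ids wrapped in angle brackets.

```python
# Message ids taken from the example emails above.
ROOT_ID = "53856b1448c89_23fa9605badd015951@3a139e8c-0b81-42c2-8e59-133c262e96a9.mail"
FIRST_REPLY_ID = "CAEBV8YTu_A6LtP_uGuQ-QSVj3zojWUiwcjGZpsPPEz1Pj3_i1A@mail.gmail.com"


def reply_rows(email_id, references_header):
    """Return one (email_id, message_id) row per id listed in the header.

    Hypothetical helper: assumes ids are separated by whitespace and
    wrapped in angle brackets, e.g. "<a@x> <b@y>".
    """
    message_ids = [token.strip("<>") for token in references_header.split()]
    return [(email_id, message_id) for message_id in message_ids]


# Email 3 references both the root email and the first reply.
header = "<" + ROOT_ID + "> <" + FIRST_REPLY_ID + ">"
rows = reply_rows(3, header)
```

For email 3 this yields the two rows shown in the table above, one per parent message.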
After much effort, I came up with this query:
WITH "ranked_replies" AS
(
SELECT "r"."email_id", "r"."message_id", "rnk"
FROM (SELECT *, rank() OVER (PARTITION BY "message_id" ORDER BY "email_id" DESC) AS "rnk" FROM "replies") AS "r"
INNER JOIN "emails"
ON ("emails"."message_id" = "r"."message_id")
),
"count_of_replies" AS
(
SELECT "email_id", count(*) AS "count", count(*) AS "thread_count"
FROM "ranked_replies"
GROUP BY "email_id"
)
SELECT DISTINCT "emails".*, "thread_count"
FROM "emails"
LEFT JOIN "count_of_replies"
ON ("emails"."id" = "count_of_replies"."email_id")
WHERE
(
("folder" = 'INBOX')
AND
(
("emails"."message_id" NOT IN (SELECT "message_id" FROM "ranked_replies" WHERE ("rnk" != 1)))
OR ("emails"."message_id" IS NULL)
)
AND ("emails"."id" NOT IN (SELECT "email_id" FROM "ranked_replies" WHERE ("rnk" != 1)))
)
ORDER BY "created_at" DESC LIMIT 50 OFFSET 0
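The problem is that it does not return the email with the subject "not returning".
The reason is this part of the WHERE clause:
("emails"."message_id" NOT IN (SELECT "message_id" FROM "ranked_replies" WHERE ("rnk" != 1)))
This excludes the root email with the subject "not returning", because it has 2 rows in ranked_replies, with ranks 1 and 2.
I would like a query that does the following. Referring to the SQL Fiddle example, it should return emails #5 (highest rank in thread 1), #8 (highest rank in thread 2), #9 (not in a thread) and #10 (not the highest rank, but the only email of its thread that is in the inbox).
I am having trouble with #10.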
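Answer 0 (score: 2)
The "not returning" email is not returned because the top email has replies, while the lower emails are not in the INBOX.
As a side note: since you describe this as an "application", I suppose you could improve it by transferring the metadata of all mails (at least up to a certain number) to the client and doing the sorting and filtering there. Depending on the user base and so on, this will most likely scale better than the database (and feel faster to the user). Back to your question:
I am not sure why you use rank(), so I dropped it. If you keep it because of other requirements I overlooked, you could do this: apply the rank in a subselect on "replies" that only handles rows whose reply is in the current folder.
Maybe you want to skip the selection of the "threadid" in my solution and solve this by storing that id at insert time, or by creating a unique id per thread yourself.
Since I do not know how Gmail displays mails, I assume you want the following:
I created this SQL Fiddle to do that. In it, I also changed your database model so that a reply references the message it replies to by its primary key (id) instead of its message_id. Since the id is a numeric sequence, it can be used to resolve the thread tree, which is what I did.
Here is the solution: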
WITH "thread" AS
( -- select the uppermost id per thread
SELECT r."email_id", min("reply_to_id") AS "threadid"
FROM "replies" r
INNER JOIN "emails" e ON r.email_id = e.id
-- create tree only for current folder
AND e.folder = 'INBOX'
GROUP BY r."email_id"
),
"lastmail" AS
( -- select the highest email per thread
SELECT t."threadid", max(r."email_id") AS "lastmail"
FROM replies r
INNER JOIN thread t ON t.threadid = r.reply_to_id
GROUP BY t."threadid"
),
"count_of_replies" AS
(SELECT r."email_id", count(r.*) AS "thread_count"
FROM replies r
INNER JOIN thread t ON t.threadid = r.reply_to_id
GROUP BY r."email_id")
SELECT DISTINCT "emails".*, "thread_count"
FROM "emails"
LEFT JOIN "count_of_replies"
ON ("emails"."id" = "count_of_replies"."email_id")
WHERE
(
-- only from current folder
("folder" = 'INBOX') AND
(
-- the ones that are in no thread
("emails"."id" NOT IN (SELECT "email_id" FROM "thread"
UNION ALL
SELECT "threadid" from "thread"))
OR
-- the ones that are top in their thread
("emails"."id" IN (SELECT "lastmail" FROM "lastmail"))
)
)
ORDER BY "created_at" DESC LIMIT 50 OFFSET 0
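For a quick sanity check, the filtering part of the query can be run against a tiny made-up data set. This is only a sketch under several assumptions: Python's sqlite3 module stands in for PostgreSQL, the rows are invented, the count_of_replies CTE is left out, and the lastmail column is renamed to last_id (a hypothetical name) to keep table and column names distinct.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE emails (id INTEGER PRIMARY KEY, subject TEXT, created_at TEXT, folder TEXT);
CREATE TABLE replies (email_id INTEGER, reply_to_id INTEGER);

-- Invented sample rows: emails 1-3 form one thread whose root is not in
-- the inbox, 9 stands alone, 10 is an inbox root whose only reply is not.
INSERT INTO emails VALUES
    (1,  'howzitgoin',        '2014-06-22 16:53', NULL),
    (2,  'Re: howzitgoin',    '2014-06-22 17:03', 'INBOX'),
    (3,  'Re: howzitgoin',    '2014-06-22 17:13', 'INBOX'),
    (9,  'Not part of thread', '2014-06-22 16:43', 'INBOX'),
    (10, 'not returning',     '2014-06-22 16:03', 'INBOX'),
    (11, 'RE: not returning', '2014-06-22 16:13', NULL);
INSERT INTO replies VALUES (2, 1), (3, 1), (3, 2), (11, 10);
""")

QUERY = """
WITH thread AS (
    -- uppermost id per thread, built only from replies in the current folder
    SELECT r.email_id, min(r.reply_to_id) AS threadid
    FROM replies r
    INNER JOIN emails e ON r.email_id = e.id AND e.folder = 'INBOX'
    GROUP BY r.email_id
),
lastmail AS (
    -- highest email id per thread
    SELECT t.threadid, max(r.email_id) AS last_id
    FROM replies r
    INNER JOIN thread t ON t.threadid = r.reply_to_id
    GROUP BY t.threadid
)
SELECT emails.id
FROM emails
WHERE folder = 'INBOX'
  AND (
        emails.id NOT IN (SELECT email_id FROM thread
                          UNION ALL
                          SELECT threadid FROM thread)
     OR emails.id IN (SELECT last_id FROM lastmail)
  )
ORDER BY created_at DESC
"""

ids = [row[0] for row in conn.execute(QUERY)]
```

With these rows, the root email 10 is kept because none of its replies are in the inbox, so it never enters the thread CTE.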
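Answer 1 (score: 1)
As the comments suggest you are able to change the schema, and since, as I read it, you are after a Gmail-style listing, consider the following: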
CREATE TABLE emails
(
id serial PRIMARY KEY,
thread_id serial NOT NULL,
parent_id int REFERENCES emails (id),
subject text,
created_at timestamp without time zone,
folder text,
message_id text
);
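With this schema, you can get the latest post of each thread, along with the number of emails in each thread, without any joins: a simple scan with window functions will do.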
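A minimal sketch of such a scan follows. It assumes Python's sqlite3 module (SQLite 3.25 or later for window functions) as a stand-in for PostgreSQL, uses invented sample rows, and introduces the aliases num_messages and pos purely for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE emails (
    id         INTEGER PRIMARY KEY,
    thread_id  INTEGER NOT NULL,
    parent_id  INTEGER REFERENCES emails (id),
    subject    TEXT,
    created_at TEXT,
    folder     TEXT
);
-- Invented sample rows: one three-message thread and one standalone email.
INSERT INTO emails VALUES
    (1, 1, NULL, 'howzitgoin',     '2014-06-22 16:53', 'INBOX'),
    (2, 1, 1,    'Re: howzitgoin', '2014-06-22 17:03', 'INBOX'),
    (3, 1, 2,    'Re: howzitgoin', '2014-06-22 17:13', 'INBOX'),
    (4, 4, NULL, 'standalone',     '2014-06-22 16:43', 'INBOX');
""")

LATEST_PER_THREAD = """
SELECT id, thread_id, subject, num_messages
FROM (
    SELECT id, thread_id, subject, created_at,
           -- size of the thread, repeated on every row of the thread
           count(*)     OVER (PARTITION BY thread_id) AS num_messages,
           -- 1 marks the newest email of each thread
           row_number() OVER (PARTITION BY thread_id
                              ORDER BY created_at DESC, id DESC) AS pos
    FROM emails
) AS ranked
WHERE pos = 1
ORDER BY created_at DESC
"""

rows = conn.execute(LATEST_PER_THREAD).fetchall()
```

The table is read once, and each thread contributes exactly one row carrying its message count.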
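When your app does not set thread_id, it auto-increments, which starts a new thread. This also lets you split a thread in two later on, should you need to.
If you want to optimize the schema further, add two extra columns, is_latest and num_messages: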
CREATE TABLE emails
(
id serial PRIMARY KEY,
parent_id int REFERENCES emails (id),
thread_id serial NOT NULL,
is_latest boolean NOT NULL default true,
num_messages int NOT NULL default 1,
subject text,
created_at timestamp without time zone,
folder text,
message_id text
);
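Then, use:
A before insert trigger on rows where parent_id is not null or num_messages = 0, which selects the latest row of the thread and sets new.num_messages to that row's num_messages + 1.
An after insert trigger, which updates the other rows in the thread and applies the required side effects, i.e. sets is_latest and num_messages.
Before and after triggers on update and delete statements, which keep the two columns consistent where appropriate, e.g. when an email is deleted, its date is changed, or it and its siblings are moved to another thread.
Note: do not put side effects that affect other rows in the before triggers, and be careful not to create trigger loops in the update triggers.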
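With this optimized schema, you can put a partial index on created_at where is_latest, and gather all the needed information without using any joins, aggregates or window functions.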
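Here is a rough sketch of the idea, not a finished implementation. It assumes Python's sqlite3 module in place of PostgreSQL, so booleans become 0/1 and, because SQLite triggers cannot assign to NEW, a single AFTER INSERT trigger does all the bookkeeping. The names emails_latest_idx and emails_after_insert are hypothetical, the rows are invented, and emails are assumed to arrive in chronological order.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE emails (
    id           INTEGER PRIMARY KEY,
    parent_id    INTEGER REFERENCES emails (id),
    thread_id    INTEGER NOT NULL,
    is_latest    INTEGER NOT NULL DEFAULT 1,
    num_messages INTEGER NOT NULL DEFAULT 1,
    subject      TEXT,
    created_at   TEXT,
    folder       TEXT
);

-- Partial index: covers only the rows flagged as latest in their thread.
CREATE INDEX emails_latest_idx ON emails (created_at) WHERE is_latest = 1;

-- Simplified bookkeeping for replies (assumes chronological inserts).
CREATE TRIGGER emails_after_insert AFTER INSERT ON emails
WHEN NEW.parent_id IS NOT NULL
BEGIN
    -- the new row is the latest one, so unflag its siblings
    UPDATE emails SET is_latest = 0
    WHERE thread_id = NEW.thread_id AND id <> NEW.id;
    -- refresh the message count on every row of the thread
    UPDATE emails
    SET num_messages = (SELECT count(*) FROM emails
                        WHERE thread_id = NEW.thread_id)
    WHERE thread_id = NEW.thread_id;
END;
""")

conn.executemany(
    "INSERT INTO emails (id, parent_id, thread_id, subject, created_at, folder)"
    " VALUES (?, ?, ?, ?, ?, ?)",
    [
        (1, None, 1, "howzitgoin", "2014-06-22 16:53", "INBOX"),
        (2, 1, 1, "Re: howzitgoin", "2014-06-22 17:03", "INBOX"),
        (3, None, 3, "standalone", "2014-06-22 16:43", "INBOX"),
    ],
)

# The listing needs no join, aggregate or window function.
listing = conn.execute(
    "SELECT id, subject, num_messages FROM emails"
    " WHERE is_latest = 1 ORDER BY created_at DESC LIMIT 50"
).fetchall()
```

Update and delete handling is left out here; it would need the additional triggers described above.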
Answer 2 (score: 0)
ALTER TABLE emails
ADD COLUMN thread_id INTEGER REFERENCES emails(id)
, ADD COLUMN previous_id INTEGER REFERENCES emails(id)
;
-- Initially, all messages are in their own private thread.
UPDATE emails SET thread_id = id;
WITH www AS (
SELECT DISTINCT email_id AS email_id
-- Find the oldest and newest reference per id
, first_value(e.id) OVER w AS mmin
, last_value(e.id) OVER w AS mmax
FROM replies r
JOIN emails e ON e.message_id = r.message_id
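-- Caveat: with the default window frame, last_value() only sees rows up to
-- the current one, so DISTINCT can leave several candidate rows per email_id.
-- Adding ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING to the
-- window would make mmax the newest reference deterministically.
WINDOW w AS (PARTITION BY r.email_id ORDER BY e.created_at ASC)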
)
UPDATE emails dst
SET thread_id = www.mmin
, previous_id = www.mmax
FROM www
WHERE dst.id = www.email_id
;
SELECT id, thread_id, previous_id
, rank() over (PARTITION BY thread_id ORDER BY created_at) AS rnk
, subject, created_at, folder ,message_id
FROM emails
ORDER BY thread_id,created_at,id;
Result:
ALTER TABLE
UPDATE 12
UPDATE 8
UPDATE 8
id | thread_id | previous_id | rnk | subject | created_at | folder | message_id
----+-----------+-------------+-----+---------------------+----------------------------+--------+----------------------------------------------------------------------------
1 | 1 | | 1 | howzitgoin | 2014-06-22 16:53:56.168109 | | 53856b1448c89_23fa9605badd015951@3a139e8c-0b81-42c2-8e59-133c262e96a9.mail
2 | 1 | 1 | 2 | Re: howzitgoin | 2014-06-22 17:03:56.168109 | INBOX | CAEBV8YTu_A6LtP_uGuQ-QSVj3zojWUiwcjGZpsPPEz1Pj3_i1A@mail.gmail.com
3 | 1 | 1 | 3 | Re: howzitgoin | 2014-06-22 17:13:56.168109 | INBOX | 53856b88a2a09_23fa9605badd01601b@3a139e8c-0b81-42c2-8e59-133c262e96a9.mail
4 | 1 | 1 | 4 | Re: howzitgoin | 2014-06-22 17:23:56.168109 | INBOX | CAEBV8YT6vx6buOKUga4f=bcNGq_=WzwiqEzm2FWm3HoLZ8SbJA@mail.gmail.com
5 | 1 | 3 | 5 | Re: howzitgoin | 2014-06-22 17:33:56.168109 | INBOX | CAEBV8YRGwkNSb3cxquS5abSCmnwLn37GxCpg74mQe7=3SC5cdQ@mail.gmail.com
6 | 6 | | 1 | thready mercury | 2014-06-22 17:43:56.168109 | INBOX | 1401340841.22951.YahooMailNeo@web171602.mail.ir2.yahoo.com
7 | 6 | 6 | 2 | RE: thready mercury | 2014-06-22 17:53:56.168109 | | 5386c3c3f364d_23ff44def2de8849cf@55cedd07-f558-4cc4-b5d2-046dc7642b91.mail
8 | 6 | 7 | 3 | RE: thready mercury | 2014-06-22 18:03:56.168109 | INBOX | 1401340888.34275.YahooMailNeo@web171605.mail.ir2.yahoo.com
9 | 9 | | 1 | Not part of thread | 2014-06-22 16:43:56.168109 | INBOX | CAEBV8YRGwkNSb3cxquS5abSCmnwLn37GxCpg74mQe7=3SC5cdQ@mail.gmail.com
10 | 10 | | 1 | not returning | 2014-06-22 16:03:56.168109 | INBOX | CAEBV8YQ58oXAm9wqLz18J5RsV2fN9u__bnsp_Z8qhdEJpmt-EQ@mail.gmail.com
11 | 10 | 10 | 2 | RE: not returning | 2014-06-22 16:13:56.168109 | | 538f82fc7e661_23f9fbee7c84825682@3b0c9abe-8e87-410f-895e-c132ff5f4be3.mail
12 | 10 | 10 | 3 | RE: not returning | 2014-06-22 16:23:56.168109 | | 539024ecaa478_23fb766d5ca209988b@24b867e3-62f2-491b-a58a-660c87d0be57.mail
(12 rows)
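Update: it looks like the OP only wants the latest email in each thread (he should have put that in the question). For each thread, that is the one with the maximum rank(), or the one without a lead() value: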
WITH xxx AS (
SELECT id, thread_id, previous_id
, rank() over (PARTITION BY thread_id ORDER BY created_at) AS rnk
, lead(id) over (PARTITION BY thread_id ORDER BY created_at ) AS nxt
, subject, created_at, folder ,message_id
FROM emails
)
SELECT id, thread_id, previous_id, rnk, subject, created_at, folder ,message_id
FROM xxx
WHERE nxt IS NULL
ORDER BY thread_id,created_at,id;
Result:
id | thread_id | previous_id | rnk | subject | created_at | folder | message_id
----+-----------+-------------+-----+---------------------+----------------------------+--------+----------------------------------------------------------------------------
5 | 1 | 2 | 5 | Re: howzitgoin | 2014-06-25 12:37:58.263205 | INBOX | CAEBV8YRGwkNSb3cxquS5abSCmnwLn37GxCpg74mQe7=3SC5cdQ@mail.gmail.com
8 | 6 | 6 | 3 | RE: thready mercury | 2014-06-25 13:07:58.263205 | INBOX | 1401340888.34275.YahooMailNeo@web171605.mail.ir2.yahoo.com
9 | 9 | | 1 | Not part of thread | 2014-06-25 11:47:58.263205 | INBOX | CAEBV8YRGwkNSb3cxquS5abSCmnwLn37GxCpg74mQe7=3SC5cdQ@mail.gmail.com
12 | 10 | 10 | 3 | RE: not returning | 2014-06-25 11:27:58.263205 | | 539024ecaa478_23fb766d5ca209988b@24b867e3-62f2-491b-a58a-660c87d0be57.mail
(4 rows)
Note that the rnk field is effectively the highest rank in the thread, which equals the number of messages in the thread.
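If only INBOX rows should be listed, as the question's expected result (#5, #8, #9, #10) suggests, one possible variation is to count the thread first and apply the folder filter before lead(). This is a sketch under assumptions: Python's sqlite3 module stands in for PostgreSQL, thread_id is taken as already populated as in the first result table, blank folders are treated as NULL, and the CTE names counted and inbox are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE emails (id INTEGER PRIMARY KEY, thread_id INTEGER,"
    " created_at TEXT, folder TEXT)"
)
# ids, thread ids, times and folders copied from the first result table.
conn.executemany("INSERT INTO emails VALUES (?, ?, ?, ?)", [
    (1, 1, "2014-06-22 16:53", None),
    (2, 1, "2014-06-22 17:03", "INBOX"),
    (3, 1, "2014-06-22 17:13", "INBOX"),
    (4, 1, "2014-06-22 17:23", "INBOX"),
    (5, 1, "2014-06-22 17:33", "INBOX"),
    (6, 6, "2014-06-22 17:43", "INBOX"),
    (7, 6, "2014-06-22 17:53", None),
    (8, 6, "2014-06-22 18:03", "INBOX"),
    (9, 9, "2014-06-22 16:43", "INBOX"),
    (10, 10, "2014-06-22 16:03", "INBOX"),
    (11, 10, "2014-06-22 16:13", None),
    (12, 10, "2014-06-22 16:23", None),
])

QUERY = """
WITH counted AS (
    -- count every message of the thread, whatever its folder
    SELECT id, thread_id, created_at, folder,
           count(*) OVER (PARTITION BY thread_id) AS thread_count
    FROM emails
),
inbox AS (
    -- lead() now only looks at the INBOX rows of each thread
    SELECT id, thread_id, thread_count,
           lead(id) OVER (PARTITION BY thread_id ORDER BY created_at) AS nxt
    FROM counted
    WHERE folder = 'INBOX'
)
SELECT id, thread_id, thread_count
FROM inbox
WHERE nxt IS NULL
ORDER BY thread_id
"""

rows = conn.execute(QUERY).fetchall()
```

Here thread 10 is represented by email 10, its only INBOX message, while thread_count still reports all three messages in that thread.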