I saw this question on meta: https://meta.stackexchange.com/questions/33101/how-does-so-query-comments
I wanted to set the record straight and ask the question in a proper technical way.
Say I have 2 tables:
Posts: id, content, parent_id (null for questions, question_id for answer)
Comments: id, body, is_deleted, post_id, upvotes, date
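Note: I think this is how SO's schema is set up: answers have a parent_id that points to the question, and for questions it is null. Questions and answers are stored in the same table.
How do I pull out comments stackoverflow style in a very efficient way, with minimal round trips?
Rules:
If any denormalization makes things better, what is it? Which indexes are critical?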
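Answer 0 (score: 4)
I wouldn't bother to filter the comments using SQL (which may surprise you, because I'm an SQL advocate). Just fetch them all sorted by CommentId, and filter them in application code.
It's actually pretty infrequent that a given post has more than five comments, so filtering is rarely needed. In StackOverflow's October data dump, 78% of posts have zero or one comment, and 97% have five or fewer. Only 20 posts have >= 50 comments, and only two posts have over 100 comments.
So writing complex SQL to do that kind of filtering would add complexity to every query for posts. I'm all for using clever SQL when appropriate, but this would be penny-wise and pound-foolish.
You could do it this way: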
SELECT q.PostId, a.PostId, c.CommentId
FROM Posts q
LEFT OUTER JOIN Posts a
  ON (a.ParentId = q.PostId)
LEFT OUTER JOIN Comments c
  ON (c.PostId IN (q.PostId, a.PostId))
WHERE q.PostId = 1234
ORDER BY q.PostId, a.PostId, c.CommentId;
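But this gives you redundant copies of the q and a columns, which matters because those columns include text blobs. The extra cost of copying redundant text from the RDBMS to the application becomes significant.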
So it is likely better not to do this in a single joined query. Instead, given that the client is viewing the question with PostId = 1234, do the following:
SELECT c.PostId, c.Text
FROM Comments c
JOIN (SELECT 1234 AS PostId
      UNION ALL
      SELECT a.PostId FROM Posts a WHERE a.ParentId = 1234) p
  ON (c.PostId = p.PostId);
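And then sort through them in application code, collecting them by the referenced post and filtering out extra comments beyond the five most interesting ones per post.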
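A minimal sketch of that application-side step, in Python. It assumes the comment rows arrive as (post_id, comment_id, upvotes, text) tuples, which means the query would also have to select the comment id and upvotes. The answer does not define "most interesting", so the ranking used here (most upvotes first, ties broken by lowest comment id) is only an assumption, and the function name is hypothetical.

```python
from collections import defaultdict


def top_comments_per_post(rows, limit=5):
    """Group comment rows by post and keep at most `limit` per post.

    rows: iterable of (post_id, comment_id, upvotes, text) tuples.
    Returns {post_id: (kept_rows, hidden_count)}.
    """
    by_post = defaultdict(list)
    for row in rows:
        by_post[row[0]].append(row)

    result = {}
    for post_id, comments in by_post.items():
        # Assumed ranking: highest upvotes first, ties broken by lowest comment_id.
        ranked = sorted(comments, key=lambda r: (-r[2], r[1]))
        kept = ranked[:limit]
        result[post_id] = (kept, len(comments) - len(kept))
    return result


rows = [
    (1234, 1, 0, "first"),
    (1234, 2, 7, "popular"),
    (1234, 3, 2, "ok"),
    (1234, 4, 0, "meh"),
    (1234, 5, 1, "fine"),
    (1234, 6, 3, "good"),
    (1240, 7, 0, "only comment on an answer"),
]
grouped = top_comments_per_post(rows)
kept, hidden = grouped[1234]
print([c[1] for c in kept], hidden)  # [2, 6, 3, 5, 1] 1
```

The hidden count is what a "show N more comments" link would need, so no second query is required for it.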
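I tested both queries against a MySQL 5.1 database loaded with StackOverflow's October data dump. The first query takes about 50 seconds. The second query is pretty much instantaneous (after I had pre-cached the indexes for the Posts and Comments tables).
The bottom line is that insisting on fetching all the data you need with a single SQL query is an artificial requirement (probably based on the misconception that the round trip of issuing a query against an RDBMS is overhead that must be minimized at any cost). Often a single query is a *less* efficient solution. Do you try to write all your application code in a single function? :-)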
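To make the two-round-trip idea concrete, here is a small sketch using Python's built-in sqlite3 module with an in-memory database. It is not the MySQL setup measured above: the Body column, the index names, and the sample rows are assumptions made up for illustration. The second statement is the derived-table query from this answer, with the literal 1234 replaced by a bound parameter and the comments ordered by CommentId.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Posts (PostId INTEGER PRIMARY KEY, ParentId INTEGER, Body TEXT);
    CREATE TABLE Comments (CommentId INTEGER PRIMARY KEY, PostId INTEGER, Text TEXT);
    -- Hypothetical index names; both lookups below go through these columns.
    CREATE INDEX ix_posts_parent ON Posts (ParentId);
    CREATE INDEX ix_comments_post ON Comments (PostId);
    INSERT INTO Posts VALUES (1234, NULL, 'question'),
                             (1235, 1234, 'answer'),
                             (2000, NULL, 'unrelated question');
    INSERT INTO Comments VALUES (1, 1234, 'on the question'),
                                (2, 1235, 'on the answer'),
                                (3, 2000, 'on another thread');
""")

question_id = 1234

# Round trip 1: the question and its answers, each text blob transferred once.
posts = conn.execute(
    "SELECT PostId, ParentId, Body FROM Posts"
    " WHERE PostId = ? OR ParentId = ? ORDER BY PostId",
    (question_id, question_id),
).fetchall()

# Round trip 2: only the comments that belong to those posts.
comments = conn.execute(
    """
    SELECT c.PostId, c.CommentId, c.Text
    FROM Comments c
    JOIN (SELECT ? AS PostId
          UNION ALL
          SELECT a.PostId FROM Posts a WHERE a.ParentId = ?) p
      ON c.PostId = p.PostId
    ORDER BY c.CommentId
    """,
    (question_id, question_id),
).fetchall()

print(posts)
print(comments)
```

Neither result set repeats the post bodies per comment, which is the redundancy the three-way join suffers from.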
Answer 1 (score: 1)
Use:
WITH post_hierarchy AS (
  SELECT p.id,
         p.content,
         p.parent_id,
         1 AS post_level
    FROM POSTS p
   WHERE p.parent_id IS NULL
  UNION ALL
  SELECT p.id,
         p.content,
         p.parent_id,
         ph.post_level + 1 AS post_level
    FROM POSTS p
    JOIN post_hierarchy ph ON ph.id = p.parent_id)
SELECT ph.id,
       ph.post_level,
       c.upvotes,
       c.body
  FROM COMMENTS c
  JOIN post_hierarchy ph ON ph.id = c.post_id
ORDER BY ph.post_level, c.date
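Things to be aware of:
Adding TOP 5 to what I posted will only return the first five rows overall, based on the ORDER BY clause, not five rows per post.
Answer 2 (score: 1)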
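The real problem is not the query but the schema, especially the clustered indexes. The comment ordering requirement is ambiguous as you defined it (is it only 5 per answer?). I interpreted the requirements as "pull 5 comments per post (answer or question)", giving preference to upvoted ones, then to newer ones. I know this is not how SO comments are shown, but you have to define your requirements more precisely.
Here is my query: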
declare @postId int;
set @postId = ?;
with cteQuestionAndReponses as (
  select post_id
  from Posts
  where post_id = @postId
  union all
  select post_id
  from Posts
  where parent_id = @postId)
select * from
cteQuestionAndReponses p
outer apply (
  select count(*) as CommentsCount
  from Comments c
  where is_deleted = 0
    and c.post_id = p.post_id) as cc
outer apply (
  select top(5) *
  from Comments c
  where is_deleted = 0
    and p.post_id = c.post_id
  order by upvotes desc, date desc
) as c
On my test tables, with 14k posts and 67k comments, the query returns a post's comments in 7 ms:
Table 'Comments'. Scan count 12, logical reads 50, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Posts'. Scan count 1, logical reads 5, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
SQL Server Execution Times:
CPU time = 0 ms, elapsed time = 7 ms.
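For engines without OUTER APPLY, the same "five per post plus a total count" shape can be approximated with correlated subqueries. The sketch below swaps in that technique and runs it on SQLite through Python's sqlite3 module, so it is not the T-SQL plan measured above. The sample rows and the index name are assumptions for illustration, and unlike OUTER APPLY it returns no row for a post that has zero comments.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Comments (
        comment_id INTEGER PRIMARY KEY,
        post_id    INTEGER NOT NULL,
        is_deleted INTEGER NOT NULL DEFAULT 0,
        upvotes    INTEGER NOT NULL DEFAULT 0,
        date       TEXT    NOT NULL
    );
    -- Hypothetical secondary index mirroring the key order used in this answer.
    CREATE INDEX ix_comments_rank
        ON Comments (is_deleted, post_id, upvotes, date, comment_id);
""")

# Post 1 gets seven live comments and one deleted one; post 2 gets a single comment.
rows = [(1, 0, n, f"2024-01-{n + 1:02d}") for n in range(7)]
rows.append((1, 1, 99, "2024-02-01"))  # deleted, must never be shown
rows.append((2, 0, 0, "2024-01-01"))
conn.executemany(
    "INSERT INTO Comments (post_id, is_deleted, upvotes, date) VALUES (?, ?, ?, ?)",
    rows,
)

top5 = conn.execute("""
    SELECT c.post_id, c.upvotes,
           (SELECT COUNT(*) FROM Comments n
             WHERE n.is_deleted = 0 AND n.post_id = c.post_id) AS comments_count
    FROM Comments c
    WHERE c.comment_id IN (
        -- Correlated per-post "top 5": upvoted first, then newer.
        SELECT t.comment_id FROM Comments t
         WHERE t.is_deleted = 0 AND t.post_id = c.post_id
         ORDER BY t.upvotes DESC, t.date DESC
         LIMIT 5)
    ORDER BY c.post_id, c.upvotes DESC, c.date DESC
""").fetchall()

for row in top5:
    print(row)
```

Each returned row carries the per-post total of non-deleted comments, so the application can tell how many comments were left out.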
Here is the schema I tested with:
create table Posts (
  post_id int identity (1,1) not null
  , content varchar(max) not null
  , parent_id int null -- (null for questions, question_id for answer)
  , constraint fkPostsParent_id
      foreign key (parent_id)
      references Posts(post_id)
  , constraint pkPostsId primary key nonclustered (post_id)
);
create clustered index cdxPosts on
  Posts(parent_id, post_id);
go
create table Comments (
  comment_id int identity(1,1) not null
  , body varchar(max) not null
  , is_deleted bit not null default 0
  , post_id int not null
  , upvotes int not null default 0
  , date datetime not null default getutcdate()
  , constraint pkComments primary key nonclustered (comment_id)
  , constraint fkCommentsPostId
      foreign key (post_id)
      references Posts(post_id)
);
create clustered index cdxComments on
  Comments (is_deleted, post_id, upvotes, date, comment_id);
go
Here is my test data generation:
-- one question per row of master..spt_values
insert into Posts (content)
select 'Lorem Ipsum'
from master..spt_values;
-- 0 to 9 answers per question
insert into Posts (content, parent_id)
select 'Ipsum Lorem', post_id
from Posts p
cross apply (
  -- abs() keeps the TOP argument non-negative; checksum() can return negative values
  select top(abs(checksum(newid(), p.post_id)) % 10) Number
  from master..spt_values) as r
where parent_id is NULL;
-- 0 to 9 comments per post (question or answer)
insert into Comments (body, is_deleted, post_id, upvotes, date)
select 'Sit Amet'
  -- 5% deleted comments
  , case when abs(checksum(newid(), p.post_id, r.Number)) % 100 >= 95 then 1 else 0 end
  , p.post_id
  -- up to 9 upvotes
  , abs(checksum(newid(), p.post_id, r.Number)) % 10
  -- comments up to 1 year old
  , dateadd(minute, -abs(checksum(newid(), p.post_id, r.Number) % 525600), getutcdate())
from Posts p
cross apply (
  select top(abs(checksum(newid(), p.post_id)) % 10) Number
  from master..spt_values) as r;