Question

I saw this question on meta: https://meta.stackexchange.com/questions/33101/how-does-so-query-comments

I wanted to set the record straight and ask the question in a proper technical way.

Say I have 2 tables:

Posts
 id
 content
 parent_id           (null for questions, question_id for answer)  

Comments
 id 
 body 
 is_deleted
 post_id 
 upvotes 
 date

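Note: I think this is how the SO schema is set up. Answers have a parent_id that points to the question, and questions have null there. Questions and answers are stored in the same table.

How do I pull out comments Stack Overflow style, very efficiently and with minimal round trips?

The rules:

A single query should pull out all the comments needed to render a page with multiple posts
Only 5 comments need to be pulled per answer, with preference given to upvoted ones
Enough information must be returned to tell the user that there are more comments beyond those 5 (and the actual count, e.g. 2 more comments)
Sorting is really hairy for comments, as you can see in the comments on this question. The rule is: display comments by date, HOWEVER if a comment has an upvote it gets preferential treatment and is displayed as well, at the bottom of the list. (This is very hard to express in SQL.)

If any denormalizations make things better, what are they? Which indexes are critical?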

Answer 1

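I wouldn't bother filtering the comments in SQL (which may surprise you, because I'm an SQL advocate). Just fetch them all sorted by CommentId, and filter them in application code.

It is actually quite rare for a given post to have more than five comments, so that they would need filtering at all. In StackOverflow's October data dump, 78% of posts have zero or one comment, and 97% have five or fewer. Only 20 posts have >= 50 comments, and only two posts have more than 100 comments.

So writing complex SQL to do that kind of filtering would add complexity to every query for posts. I'm all for clever SQL when it is appropriate, but this would be penny-wise and pound-foolish.

You could do it this way: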

SELECT q.PostId, a.PostId, c.CommentId
FROM Posts q
LEFT OUTER JOIN Posts a
  ON (a.ParentId = q.PostId)
LEFT OUTER JOIN Comments c
  ON (c.PostId IN (q.PostId, a.PostId))
WHERE q.PostId = 1234
ORDER BY q.PostId, a.PostId, c.CommentId;

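But this gives you redundant copies of the q and a columns, which matters because those columns include text blobs. The extra cost of copying redundant text from the RDBMS to the application becomes significant.

So it is probably better not to fetch everything in one joined query. Instead, assuming the client is viewing the question with PostId = 1234, fetch the comments on their own: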

SELECT c.PostId, c.Text
FROM Comments c
JOIN (SELECT 1234 AS PostId UNION ALL 
    SELECT a.PostId FROM Posts a WHERE a.ParentId = 1234) p
  ON (c.PostId = p.PostId);

Then sort them in application code, group them by the post they refer to, and for each post filter out any comments beyond the five most interesting ones.
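The application-side step described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not how Stack Overflow does it: the `Score` and `CommentId` columns in the SELECT list, the in-memory SQLite database, the helper names `demo_db` and `comments_for_page`, and the ranking rule (higher score first, then older comment first) are all hypothetical choices made for the example.

```python
import sqlite3
from collections import defaultdict


def comments_for_page(conn, question_id, limit=5):
    """Fetch every comment for a question and its answers, then trim per post."""
    rows = conn.execute(
        """
        SELECT c.PostId, c.CommentId, c.Score, c.Text
        FROM Comments c
        JOIN (SELECT ? AS PostId UNION ALL
              SELECT a.PostId FROM Posts a WHERE a.ParentId = ?) p
          ON c.PostId = p.PostId
        ORDER BY c.CommentId
        """,
        (question_id, question_id),
    ).fetchall()

    # Group the flat result set by the post each comment belongs to.
    by_post = defaultdict(list)
    for post_id, comment_id, score, text in rows:
        by_post[post_id].append((comment_id, score, text))

    page = {}
    for post_id, comments in by_post.items():
        # Assumed ranking: higher score first, then older comment first.
        keep = sorted(comments, key=lambda c: (-c[1], c[0]))[:limit]
        keep.sort(key=lambda c: c[0])  # display the survivors in posting order
        page[post_id] = {"shown": keep, "hidden": len(comments) - len(keep)}
    return page


def demo_db():
    """Build a tiny hypothetical data set: one question, one answer."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE Posts (PostId INTEGER PRIMARY KEY, ParentId INTEGER);
        CREATE TABLE Comments (CommentId INTEGER PRIMARY KEY, PostId INTEGER,
                               Score INTEGER, Text TEXT);
        INSERT INTO Posts VALUES (1234, NULL), (1235, 1234), (9999, NULL);
    """)
    # Seven comments on the question (two upvoted), one on the answer,
    # and one on an unrelated post that must not appear in the result.
    rows = [(i, 1234, 1 if i in (6, 7) else 0, f"q{i}") for i in range(1, 8)]
    rows += [(8, 1235, 0, "a1"), (9, 9999, 0, "other")]
    conn.executemany("INSERT INTO Comments VALUES (?, ?, ?, ?)", rows)
    return conn


page = comments_for_page(demo_db(), 1234)
```

With this sample data, `page[1234]["hidden"]` is 2, which is the number an application could show as "2 more comments".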

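I tested these two queries against a MySQL 5.1 database loaded with StackOverflow's October data dump. The first query takes about 50 seconds. The second query is pretty much instantaneous (after I had pre-cached the indexes for the Posts and Comments tables).

The bottom line is that insisting on fetching all the data you need with a single SQL query is an artificial requirement (probably based on the misconception that the round trip of issuing a query against an RDBMS is overhead that must be minimized at any cost). Often a single query is the less efficient solution. Do you try to write all your application code in a single function? :-)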

Answer 2

Use:

WITH post_hierarchy AS (
  SELECT p.id,
         p.content,
         p.parent_id,
         1 AS post_level
    FROM POSTS p
   WHERE p.parent_id IS NULL
  UNION ALL
  SELECT p.id,
         p.content,
         p.parent_id,
         ph.post_level + 1 AS post_level
    FROM POSTS p
    JOIN post_hierarchy ph ON ph.id = p.parent_id)  
SELECT ph.id, 
       ph.post_level,
       c.upvotes,
       c.body
  FROM COMMENTS c
  JOIN post_hierarchy ph ON ph.id = c.post_id
ORDER BY ph.post_level, c.date

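Things to be aware of:

StackOverflow displays the first 5 comments, regardless of whether they were upvoted. Subsequent comments are displayed immediately
You can't enforce a limit of 5 comments per post without devoting a SELECT to each post. Adding TOP 5 to what I posted would only return the first five rows overall, based on the ORDER BY clause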
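To make the second point concrete, here is a minimal Python sketch of the "one SELECT per post" approach. It runs against an in-memory SQLite database, so `LIMIT` stands in for `TOP`; the helper name `first_comments_per_post` and the sample rows are hypothetical, and the table follows the column names from the question.

```python
import sqlite3


def first_comments_per_post(conn, post_ids, limit=5):
    """Run one small SELECT per post; LIMIT plays the role of TOP here."""
    result = {}
    for post_id in post_ids:
        result[post_id] = conn.execute(
            "SELECT id, body FROM comments "
            "WHERE post_id = ? AND is_deleted = 0 "
            "ORDER BY date LIMIT ?",
            (post_id, limit),
        ).fetchall()
    return result


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE comments (id INTEGER PRIMARY KEY, body TEXT, "
    "is_deleted INTEGER, post_id INTEGER, upvotes INTEGER, date TEXT)"
)
# Hypothetical data: seven comments on post 1, two comments on post 2.
conn.executemany(
    "INSERT INTO comments (body, is_deleted, post_id, upvotes, date) "
    "VALUES (?, 0, ?, 0, ?)",
    [(f"c{i}", 1 if i < 7 else 2, f"2009-11-{i + 1:02d}") for i in range(9)],
)

per_post = first_comments_per_post(conn, [1, 2])
```

The cap is applied per post because each statement only ever sees one post's comments. The price is one round trip per post on the page.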

Answer 3

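The real problem is not the query but the schema, especially the clustered indexes. The comment ordering requirement is ambiguous as you have defined it (is it only 5 per answer?). I have interpreted the requirements as "pull 5 comments per post (answer or question), giving preference to upvoted ones, then to newer ones". I know this is not how SO comments are displayed, but you will have to define your requirements more precisely.

Here is my query: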

declare @postId int;
set @postId = ?;

with cteQuestionAndReponses as (
  select post_id
  from Posts
  where post_id = @postId
  union all
  select post_id
  from Posts
  where parent_id = @postId)
select * from
cteQuestionAndReponses p
outer apply (
  select count(*) as CommentsCount
  from Comments c 
  where is_deleted = 0
  and c.post_id = p.post_id) as cc
outer apply (
  select top(5) *
  from Comments c 
  where is_deleted = 0
  and p.post_id = c.post_id
  order by upvotes desc, date desc
  ) as c

On my test tables of 14k posts and 67k comments, the query returns the posts in 7 ms:

Table 'Comments'. Scan count 12, logical reads 50, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Posts'. Scan count 1, logical reads 5, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 7 ms.
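For readers without SQL Server at hand, the same interpretation (5 comments per post, upvoted first, then newer, plus a per-post count) can be sketched with window functions instead of OUTER APPLY. This is a swapped-in technique, not the query measured above. It assumes a SQLite build with window function support (3.25 or later) driven from Python, and the sample rows are hypothetical. Unlike OUTER APPLY, it returns no row for a post that has no comments.

```python
import sqlite3

TOP_COMMENTS_SQL = """
WITH page_posts AS (
    SELECT post_id FROM Posts WHERE post_id = :post_id
    UNION ALL
    SELECT post_id FROM Posts WHERE parent_id = :post_id
),
ranked AS (
    SELECT c.post_id, c.comment_id, c.upvotes, c.date,
           ROW_NUMBER() OVER (PARTITION BY c.post_id
                              ORDER BY c.upvotes DESC, c.date DESC) AS rn,
           COUNT(*) OVER (PARTITION BY c.post_id) AS CommentsCount
    FROM Comments c
    JOIN page_posts p ON p.post_id = c.post_id
    WHERE c.is_deleted = 0
)
SELECT post_id, comment_id, upvotes, CommentsCount
FROM ranked
WHERE rn <= 5
ORDER BY post_id, rn
"""

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Posts (post_id INTEGER PRIMARY KEY, parent_id INTEGER);
    CREATE TABLE Comments (comment_id INTEGER PRIMARY KEY, is_deleted INTEGER,
                           post_id INTEGER, upvotes INTEGER, date TEXT);
    INSERT INTO Posts VALUES (1, NULL), (2, 1);
""")
# Post 1: seven live comments with 0..6 upvotes, plus one deleted comment.
# Post 2 (an answer to post 1): a single live comment.
rows = [(0, 1, n, f"2009-11-{n + 1:02d}") for n in range(7)]
rows += [(1, 1, 99, "2009-11-20"), (0, 2, 3, "2009-11-21")]
conn.executemany(
    "INSERT INTO Comments (is_deleted, post_id, upvotes, date) "
    "VALUES (?, ?, ?, ?)",
    rows,
)

top = conn.execute(TOP_COMMENTS_SQL, {"post_id": 1}).fetchall()
```

Each returned row carries `CommentsCount`, so the application can compute how many comments remain hidden for a post by subtracting the number of rows it received for that post.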

Here is the schema I tested with:

create table Posts (
 post_id int identity (1,1) not null
 , content varchar(max) not null
 , parent_id int null -- (null for questions, question_id for answer) 
 , constraint fkPostsParent_id 
    foreign key (parent_id)
    references Posts(post_id)
 , constraint pkPostsId primary key nonclustered (post_id)
);
create clustered index cdxPosts on 
  Posts(parent_id, post_id);
go

create table Comments (
 comment_id int identity(1,1) not null
 , body varchar(max) not null
 , is_deleted bit not null default 0
 , post_id int not null
 , upvotes int not null default 0
 , date datetime not null default getutcdate()
 , constraint pkComments primary key nonclustered (comment_id)
 , constraint fkCommentsPostId
    foreign key (post_id)
    references Posts(post_id)
 );
create clustered index cdxComments on 
  Comments (is_deleted, post_id,  upvotes, date, comment_id);
go

And here is my test data generation:

insert into Posts (content)
select 'Lorem Ipsum' 
from master..spt_values;

insert into Posts (content, parent_id)
select 'Ipsum Lorem', post_id
from Posts p
cross apply (
  -- abs() keeps the TOP argument non-negative; checksum() can return negative values
  select top(abs(checksum(newid(), p.post_id)) % 10) Number
  from master..spt_values) as r
where parent_id is NULL  

insert into Comments (body, is_deleted, post_id, upvotes, date)
select 'Sit Amet'
  -- roughly 5% deleted comments (remainders 95..99)
  , case when abs(checksum(newid(), p.post_id, r.Number)) % 100 >= 95 then 1 else 0 end
  , p.post_id
  -- up to 10 upvotes
  , abs(checksum(newid(), p.post_id, r.Number)) % 10
  -- up to 1 year old posts
  , dateadd(minute, -abs(checksum(newid(), p.post_id, r.Number) % 525600), getutcdate()) 
from Posts p
cross apply (
  select top(abs(checksum(newid(), p.post_id)) % 10) Number
  from master..spt_values) as r

How do you query comments Stack Overflow style?
