Question

Suppose we have a table named actions(date, uid, pid, action, description). A sample of the table looks like this:

Table: actions
date          uid    pid   action       description
'2018-10-19'  1234   12    'view'
'2018-10-19'  1234   12    'report'     'SPAM'
'2018-10-19'  5678   23    'reaction'   'LOVE'

There is also a table named reviewers(date, rid, pid). Reviewers are the people who remove posts. Reviewers are not users. A sample of the table looks like this:

Table: reviewers
date          rid    pid
'2018-10-19'  567    12
'2018-10-19'  890    45
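
To try the queries in this thread, the two sample tables can be recreated with a short setup script. This is only a sketch: the column types are assumptions (the question does not give them), and the blank description on the 'view' row is assumed to be NULL.

```sql
-- Assumed column types; the question only shows sample values.
CREATE TABLE actions (
    date        DATE,
    uid         INTEGER,
    pid         INTEGER,
    action      VARCHAR(20),
    description VARCHAR(20)   -- assumed NULL when no description is given
);

CREATE TABLE reviewers (
    date DATE,
    rid  INTEGER,
    pid  INTEGER
);

-- Sample rows copied from the question
INSERT INTO actions (date, uid, pid, action, description) VALUES
    ('2018-10-19', 1234, 12, 'view',     NULL),
    ('2018-10-19', 1234, 12, 'report',   'SPAM'),
    ('2018-10-19', 5678, 23, 'reaction', 'LOVE');

INSERT INTO reviewers (date, rid, pid) VALUES
    ('2018-10-19', 567, 12),
    ('2018-10-19', 890, 45);
```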

How much of the content that users view (take any action on) each day is actually spam?

Would the following work?

Case 1: "viewing" means any action

select u.date, (count(distinct r.pid) / count(distinct uu.pid)) * 100
from actions u join actions uu
on u.pid = uu.pid
inner join reviewers r
on u.pid = r.pid
where u.description = 'SPAM'
group by 1

Case 2: "viewing" means action = 'view'

    select u.date, (count(distinct r.pid) / count(distinct uu.pid)) * 100
    from actions u join actions uu       
    on u.pid = uu.pid
    inner join reviewers r
    on u.pid = r.pid
    where u.description = 'SPAM'
    and uu.action = 'VIEW'
    group by 1

Answer 1

You don't need two joins. If I understand correctly:

select u.date, 
       avg(case when u.description = 'SPAM' then 1.0 else 0 end)
from actions u left join
     reviewers r
     on u.pid = r.pid
group by u.date;

Hmm... you need to aggregate before joining. So this might be better:

select u.date, 
       avg(u.is_spam * 1.0)
from (select date, uid, pid,
             max(case when u.description = 'SPAM' then 1 else 0 end) as is_spam
      from actions u
      group by date, uid, pid
     ) u left join
     reviewers r
     on u.pid = r.pid
group by u.date;
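
To see why aggregating first matters, here is a small sketch that compares the row-level average with the pre-aggregated one on the question's sample rows. The sample data is inlined as a CTE so the statement runs on its own (SQLite / PostgreSQL syntax), the blank description is assumed to be NULL, and the name per_post is made up for illustration. The reviewers join is left out because, with the sample data, it does not change the row counts.

```sql
WITH actions(date, uid, pid, action, description) AS (
    -- sample rows from the question; NULL description is an assumption
    VALUES ('2018-10-19', 1234, 12, 'view',     NULL),
           ('2018-10-19', 1234, 12, 'report',   'SPAM'),
           ('2018-10-19', 5678, 23, 'reaction', 'LOVE')
),
per_post AS (
    -- one row per (date, uid, pid), flagged if any of its actions is a SPAM report
    SELECT date, uid, pid,
           MAX(CASE WHEN description = 'SPAM' THEN 1 ELSE 0 END) AS is_spam
    FROM actions
    GROUP BY date, uid, pid
)
-- averages over the 3 action rows: 1 of 3 is a SPAM report
SELECT 'row level' AS method, date,
       AVG(CASE WHEN description = 'SPAM' THEN 1.0 ELSE 0 END) AS spam_share
FROM actions
GROUP BY date
UNION ALL
-- averages over the 2 (uid, pid) pairs: 1 of 2 is flagged
SELECT 'aggregated first', date, AVG(is_spam * 1.0)
FROM per_post
GROUP BY date;
```

On the sample, the row-level version gives about 0.33 because post 12 contributes two action rows, while the pre-aggregated version gives 0.5 because each (uid, pid) pair counts once.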

Answer 2

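I'm not sure why reviewers needs to be considered, or whether a pid can appear more than once in that table, but I think this does what you need (50.0% on the sample):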

select 
    count(distinct (case when a.description = 'SPAM' and r.pid IS NOT NULL then a.pid end)) * 100.0
     /
    count(distinct a.pid)
from actions a
left join (
     select distinct pid from reviewers 
     ) r on r.pid = a.pid
;
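
Since the question asks for a daily figure, a per-day variant of the same idea might look like the sketch below. It assumes that grouping by the action date is what "daily" means here, and it inlines the question's sample rows as CTEs (SQLite / PostgreSQL syntax, blank description assumed NULL) so it can be run as is.

```sql
WITH actions(date, uid, pid, action, description) AS (
    VALUES ('2018-10-19', 1234, 12, 'view',     NULL),
           ('2018-10-19', 1234, 12, 'report',   'SPAM'),
           ('2018-10-19', 5678, 23, 'reaction', 'LOVE')
),
reviewers(date, rid, pid) AS (
    VALUES ('2018-10-19', 567, 12),
           ('2018-10-19', 890, 45)
)
SELECT a.date,
       -- distinct posts reported as SPAM and present in reviewers ...
       COUNT(DISTINCT CASE WHEN a.description = 'SPAM' AND r.pid IS NOT NULL
                           THEN a.pid END) * 100.0
       -- ... divided by all distinct posts acted on that day
       / COUNT(DISTINCT a.pid) AS spam_pct
FROM actions a
LEFT JOIN (SELECT DISTINCT pid FROM reviewers) r   -- dedupe so the join cannot fan out
       ON r.pid = a.pid
GROUP BY a.date;
-- 2018-10-19 | 50.0
```

Only post 12 is both reported as spam and listed in reviewers, out of two distinct posts (12 and 23), which gives 50.0 for the single sample day.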
