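I have a SQLite database containing nearly 500,000 rows of access log information. I'm using it to get aggregate information such as "how many times each IP visited the site" or "what percentage of hits are POSTs".
I wrote a SQL query that collects the number of times each IP address hit the site, keeping only the addresses whose hit count is greater than 1% of the total number of hits.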
select ip_address, count(ip_address)
from records
group by ip_address
having count(ip_address) > (select count(ip_address) from records) * .01
This returns about 7 significant IP addresses. How can I add a single "All Others" row to the result set?
I tried UNIONing with the logical inverse:
select "All Others", count(ip_address)
from records
group by ip_address
having count(ip_address) < (select count(ip_address) from records) * .01
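But this returns multiple "All Others" rows, and the counts are sequential.
Answer 0 (score: 1)
Of course, use union all
... but that alone doesn't answer the question.
The problem is that the second query returns multiple rows (just as the first query does) because the group by
is on the IP address, of which there are many. That is, there is one result tuple per group, regardless of what the select output clause does.
The desired result is probably best obtained by summing the counts in an outer select.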
-- union all
select 'All Others', sum(t.ct)
from (
select count(ip_address) as ct
from records
group by ip_address
-- note: <=, and not <, is inverse of >
having count(ip_address) <= (select count(ip_address) from records) * .01
) t
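Of course, if the 'total' and 'found' counts are known, then 'others' is simply 'total' - 'found'.
The counts being sequential, while an interesting observation, is irrelevant. Remember that when no order by
is applied to the final result set, SQL may return rows in any order (an order by inside a sub-select
is not strictly guaranteed to be preserved).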
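The outer-sum approach can be checked end to end on a toy table. The following is a minimal sketch using Python's built-in sqlite3 module; the in-memory table contents are made-up sample data, and the threshold is raised from 1% to 20% (an assumption made only so that such a small sample produces both frequent and rare addresses).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table records (ip_address text)")
# Hypothetical sample data: two busy addresses and four rare ones (13 hits total).
hits = ["10.0.0.1"] * 5 + ["10.0.0.2"] * 4 + ["10.0.0.3", "10.0.0.4", "10.0.0.5", "10.0.0.6"]
conn.executemany("insert into records values (?)", [(ip,) for ip in hits])

# 20% threshold instead of 1% so the small sample splits into both groups.
query = """
select ip_address, count(ip_address) as ct
from records
group by ip_address
having count(ip_address) > (select count(ip_address) from records) * .2
union all
select 'All Others', sum(t.ct)
from (
    select count(ip_address) as ct
    from records
    group by ip_address
    having count(ip_address) <= (select count(ip_address) from records) * .2
) t
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
```

The inner select still yields one row per rare address; it is the outer sum that collapses them into a single "All Others" row.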
Answer 1 (score: 1)
Could you use a variable to hold this information?
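-- T-SQL style variable syntax; SQLite has no DECLARE
DECLARE @num INT
-- sum the per-IP counts so the subquery yields a single value
SET @num = (select sum(t.ct) from (
select count(*) as ct
from records
group by ip_address
having count(*) > (select count(ip_address) from records) * .01) t)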
Then run the regular query:
select ip_address, count(ip_address)
from records
group by ip_address
having count(ip_address) > (select count(ip_address) from records) * .01
UNION ALL
select 'All Others', count(ip_address)-@num
from records
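SQLite itself does not provide DECLARE or @variables, so one way to apply the same idea there is to hold the intermediate values in the host program. The sketch below assumes Python's built-in sqlite3 module; the function name top_ips_with_others, the fraction parameter, and the sample rows are hypothetical and only serve the illustration.

```python
import sqlite3

def top_ips_with_others(conn, fraction=0.01):
    """Return (label, hits) pairs for frequent IPs plus one 'All Others' row."""
    # 'total' plays the role of the variable: it is computed once and reused.
    total = conn.execute("select count(ip_address) from records").fetchone()[0]
    top = conn.execute(
        """
        select ip_address, count(ip_address)
        from records
        group by ip_address
        having count(ip_address) > ? * ?
        """,
        (total, fraction),
    ).fetchall()
    # 'others' is simply 'total' minus the hits already accounted for.
    found = sum(count for _, count in top)
    return top + [("All Others", total - found)]

conn = sqlite3.connect(":memory:")
conn.execute("create table records (ip_address text)")
# Hypothetical sample data: one busy address and four rare ones (10 hits total).
sample = ["10.0.0.1"] * 6 + ["10.0.0.2", "10.0.0.3", "10.0.0.4", "10.0.0.5"]
conn.executemany("insert into records values (?)", [(ip,) for ip in sample])
result = top_ips_with_others(conn, fraction=0.5)
print(result)
```

This costs two queries instead of one, but it avoids writing the inverse of the having condition at all.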
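Answer 2 (score: 0)
Without CTEs, this is probably the best option (I'm not sure what SQLite allows). Using not in
saves you from having to write the inverse of your condition, which in other cases can be more complicated because of nulls or floating-point math considerations: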
select ip_address, count(ip_address)
from records
group by ip_address
having count(ip_address) > (select count(ip_address) from records) * .01
union all
select 'All others', count(*)
from records
where ip_address not in (
select ip_address /* assuming non-null ip_address */
from records
group by ip_address
having count(ip_address) > (select count(ip_address) from records) * .01
)
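The "assuming non-null ip_address" comment matters because a NULL compared with not in evaluates to NULL rather than true, so rows with a NULL address are silently left out of the "All others" count. The following is a small sketch of that behavior, assuming Python's built-in sqlite3 module, made-up sample rows, and a 50% threshold chosen only to suit the tiny data set.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table records (ip_address text)")
# Hypothetical sample data: one busy address, two rare ones, and two NULL rows.
sample = ["10.0.0.1"] * 6 + ["10.0.0.2", "10.0.0.3"] + [None, None]
conn.executemany("insert into records values (?)", [(ip,) for ip in sample])

others_sql = """
select count(*)
from records
where ip_address not in (
    select ip_address
    from records
    group by ip_address
    having count(ip_address) > (select count(ip_address) from records) * .5
)
"""
# Rows counted as 'All others' by the not in form.
others = conn.execute(others_sql).fetchone()[0]
# Rows that not in skips because their ip_address is NULL.
null_rows = conn.execute(
    "select count(*) from records where ip_address is null"
).fetchone()[0]
print(others, null_rows)
```

If NULL addresses can occur and should be counted, they would need to be handled explicitly, for example with an additional `or ip_address is null` condition.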
Otherwise, with a CTE:
with topPercent as (
select ip_address, count(ip_address) as addr_cnt
from records
group by ip_address
having count(ip_address) > (select count(ip_address) from records) * .01
)
select ip_address, addr_cnt from topPercent
union all
select 'All others', count(ip_address) - (select coalesce(sum(addr_cnt), 0) from topPercent)
from records
If analytic (window) functions are available, a third option may be the fastest:
select case when pct > 0.01 then ip_address else 'All others' end, sum(addr_cnt)
from (
select ip_address, addr_cnt, addr_cnt * 1.0 / sum(addr_cnt) over () as pct
from (
select ip_address, count(ip_address) as addr_cnt
from records
group by ip_address
) T1
) T2
group by case when pct > 0.01 then ip_address else 'All others' end
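The same bucketing idea can be tried on SQLite builds that lack window function support by swapping `sum(addr_cnt) over ()` for a scalar subquery. This is a variant of the query above and not the window-function form itself; the sketch assumes Python's built-in sqlite3 module, made-up sample rows, and a 50% threshold chosen only to suit the tiny data set.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("create table records (ip_address text)")
# Hypothetical sample data: one busy address and three rare ones (10 hits total).
sample = ["10.0.0.1"] * 7 + ["10.0.0.2", "10.0.0.3", "10.0.0.4"]
conn.executemany("insert into records values (?)", [(ip,) for ip in sample])

bucketed_sql = """
select case when addr_cnt > total * .5 then ip_address else 'All others' end as label,
       sum(addr_cnt) as hits
from (
    select ip_address,
           count(ip_address) as addr_cnt,
           -- scalar subquery stands in for sum(addr_cnt) over ()
           (select count(ip_address) from records) as total
    from records
    group by ip_address
) T1
group by case when addr_cnt > total * .5 then ip_address else 'All others' end
order by hits desc
"""
buckets = conn.execute(bucketed_sql).fetchall()
print(buckets)
```

Relabeling each per-IP row first and then grouping by the label means the condition is written only once, with no inverse needed.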