Suppose I have 30 billion rows with multiple columns, and I want to efficiently find the top N most frequent values for each column independently, ideally with the most elegant SQL possible. For example, if I have
FirstName LastName FavoriteAnimal FavoriteBook
--------- -------- -------------- ------------
Ferris Freemont Possum Ubik
Nancy Freemont Lemur Housekeeping
Nancy Drew Penguin Ubik
Bill Ribbits Lemur Dhalgren
and I wanted the top 1, then the result would be:
FirstName LastName FavoriteAnimal FavoriteBook
--------- -------- -------------- ------------
Nancy Freemont Lemur Ubik
I can probably come up with ways to do this, but I'm not sure they would be optimal, which matters at 30 billion rows; the SQL could also end up large and ugly, and might use too much temporary space.
Using Oracle.
Answer 0 (score: 5)
This should only make one pass over the table. You can use the analytic version of count() to get the frequency of each value independently:
select firstname, count(*) over (partition by firstname) as c_fn,
lastname, count(*) over (partition by lastname) as c_ln,
favoriteanimal, count(*) over (partition by favoriteanimal) as c_fa,
favoritebook, count(*) over (partition by favoritebook) as c_fb
from my_table;
FIRSTN C_FN LASTNAME C_LN FAVORIT C_FA FAVORITEBOOK C_FB
------ ---- -------- ---- ------- ---- ------------ ----
Bill 1 Ribbits 1 Lemur 2 Dhalgren 1
Ferris 1 Freemont 2 Possum 1 Ubik 2
Nancy 2 Freemont 2 Lemur 2 Housekeeping 1
Nancy 2 Drew 1 Penguin 1 Ubik 2
You can then use that as a CTE (or subquery factoring, I think, in Oracle terms) and pull out just the highest-frequency value for each column:
with tmp_tab as (
select /*+ MATERIALIZE */
firstname, count(*) over (partition by firstname) as c_fn,
lastname, count(*) over (partition by lastname) as c_ln,
favoriteanimal, count(*) over (partition by favoriteanimal) as c_fa,
favoritebook, count(*) over (partition by favoritebook) as c_fb
from my_table)
select (select firstname from (
select firstname,
row_number() over (partition by null order by c_fn desc) as r_fn
from tmp_tab
) where r_fn = 1) as firstname,
(select lastname from (
select lastname,
row_number() over (partition by null order by c_ln desc) as r_ln
from tmp_tab
) where r_ln = 1) as lastname,
(select favoriteanimal from (
select favoriteanimal,
row_number() over (partition by null order by c_fa desc) as r_fa
from tmp_tab
) where r_fa = 1) as favoriteanimal,
(select favoritebook from (
select favoritebook,
row_number() over (partition by null order by c_fb desc) as r_fb
from tmp_tab
) where r_fb = 1) as favoritebook
from dual;
FIRSTN LASTNAME FAVORIT FAVORITEBOOK
------ -------- ------- ------------
Nancy Freemont Lemur Ubik
You make a pass over the CTE for each column, but that still only hits the real table once (thanks to the materialize hint). You may also want to add order by clauses to control what happens when there are ties.
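For example, the firstname branch could break ties deterministically by adding a second sort key; this is only a sketch against the tmp_tab CTE defined above, and the extra key (firstname itself) is just one arbitrary choice:

select firstname from (
  select firstname,
         row_number() over (partition by null order by c_fn desc, firstname) as r_fn
  from tmp_tab
) where r_fn = 1

Without a tiebreaker, which of several equally frequent values gets returned is effectively arbitrary.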
This is conceptually similar to what Thilo, ysth and others have suggested, except that you're letting Oracle keep track of all the counts.
Edit: hmm, the explain plan shows it is doing four full table scans; may need to think about this some more...
Edit 2: adding the (undocumented) MATERIALIZE hint to the CTE seems to fix that; it creates a temporary table to hold the intermediate results and does only one full table scan. The explain plan cost is higher, though, at least on this small sample data set. Interested in any comments on the downsides.
Answer 1 (score: 2)
The best I've managed so far with pure Oracle SQL is similar to what @AlexPoole did. I used count(A) rather than count(*) to push nulls to the bottom.
with
NUM_ROWS_RETURNED as (
select 4 as NUM from dual
),
SAMPLE_DATA as (
select /*+ materialize */
A,B,C,D,E
from (
select 1 as A, 1 as B, 4 as C, 1 as D, 4 as E from dual
union all select 1 , -2 , 3 , 2 , 3 from dual
union all select 1 , -2 , 2 , 2 , 3 from dual
union all select null , 1 , 1 , 3 , 2 from dual
union all select null , 2 , 4 , null , 2 from dual
union all select null , 1 , 3 , null , 2 from dual
union all select null , 1 , 2 , null , 1 from dual
union all select null , 1 , 4 , null , 1 from dual
union all select null , 1 , 3 , 3 , 1 from dual
union all select null , 1 , 4 , 3 , 1 from dual
)
),
RANKS as (
select /*+ materialize */
rownum as RANKED
from
SAMPLE_DATA
where
rownum <= (select min(NUM) from NUM_ROWS_RETURNED)
)
select
r.RANKED,
max(case when A_RANK = r.RANKED then A else null end) as A,
max(case when B_RANK = r.RANKED then B else null end) as B,
max(case when C_RANK = r.RANKED then C else null end) as C,
max(case when D_RANK = r.RANKED then D else null end) as D,
max(case when E_RANK = r.RANKED then E else null end) as E
from (
select
A, dense_rank() over (order by A_COUNTS desc) as A_RANK,
B, dense_rank() over (order by B_COUNTS desc) as B_RANK,
C, dense_rank() over (order by C_COUNTS desc) as C_RANK,
D, dense_rank() over (order by D_COUNTS desc) as D_RANK,
E, dense_rank() over (order by E_COUNTS desc) as E_RANK
from (
select
A, count(A) over (partition by A) as A_COUNTS,
B, count(B) over (partition by B) as B_COUNTS,
C, count(C) over (partition by C) as C_COUNTS,
D, count(D) over (partition by D) as D_COUNTS,
E, count(E) over (partition by E) as E_COUNTS
from
SAMPLE_DATA
)
)
cross join
RANKS r
group by
r.RANKED
order by
r.RANKED
/
Which gives:
RANKED| A| B| C| D| E
------|----|----|----|----|----
1| 1| 1| 4| 3| 1
2|null| -2| 3| 2| 2
3|null| 2| 2| 1| 3
4|null|null| 1|null| 4
With the plan:
--------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes | Cost (%CPU)| Time |
--------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 93 | 57 (20)| 00:00:01 |
| 1 | TEMP TABLE TRANSFORMATION | | | | | |
| 2 | LOAD AS SELECT | | | | | |
| 3 | VIEW | | 10 | 150 | 20 (0)| 00:00:01 |
| 4 | UNION-ALL | | | | | |
| 5 | FAST DUAL | | 1 | | 2 (0)| 00:00:01 |
| 6 | FAST DUAL | | 1 | | 2 (0)| 00:00:01 |
| 7 | FAST DUAL | | 1 | | 2 (0)| 00:00:01 |
| 8 | FAST DUAL | | 1 | | 2 (0)| 00:00:01 |
| 9 | FAST DUAL | | 1 | | 2 (0)| 00:00:01 |
| 10 | FAST DUAL | | 1 | | 2 (0)| 00:00:01 |
| 11 | FAST DUAL | | 1 | | 2 (0)| 00:00:01 |
| 12 | FAST DUAL | | 1 | | 2 (0)| 00:00:01 |
| 13 | FAST DUAL | | 1 | | 2 (0)| 00:00:01 |
| 14 | FAST DUAL | | 1 | | 2 (0)| 00:00:01 |
| 15 | LOAD AS SELECT | | | | | |
|* 16 | COUNT STOPKEY | | | | | |
| 17 | VIEW | | 10 | | 2 (0)| 00:00:01 |
| 18 | TABLE ACCESS FULL | SYS_TEMP_0FD9| 10 | 150 | 2 (0)| 00:00:01 |
| 19 | SORT AGGREGATE | | 1 | | | |
| 20 | FAST DUAL | | 1 | | 2 (0)| 00:00:01 |
| 21 | SORT GROUP BY | | 1 | 93 | 33 (34)| 00:00:01 |
| 22 | MERGE JOIN CARTESIAN | | 100 | 9300 | 32 (32)| 00:00:01 |
| 23 | VIEW | | 10 | 800 | 12 (84)| 00:00:01 |
| 24 | WINDOW SORT | | 10 | 800 | 12 (84)| 00:00:01 |
| 25 | WINDOW SORT | | 10 | 800 | 12 (84)| 00:00:01 |
| 26 | WINDOW SORT | | 10 | 800 | 12 (84)| 00:00:01 |
| 27 | WINDOW SORT | | 10 | 800 | 12 (84)| 00:00:01 |
| 28 | WINDOW SORT | | 10 | 800 | 12 (84)| 00:00:01 |
| 29 | VIEW | | 10 | 800 | 7 (72)| 00:00:01 |
| 30 | WINDOW SORT | | 10 | 150 | 7 (72)| 00:00:01 |
| 31 | WINDOW SORT | | 10 | 150 | 7 (72)| 00:00:01 |
| 32 | WINDOW SORT | | 10 | 150 | 7 (72)| 00:00:01 |
| 33 | WINDOW SORT | | 10 | 150 | 7 (72)| 00:00:01 |
| 34 | WINDOW SORT | | 10 | 150 | 7 (72)| 00:00:01 |
| 35 | VIEW | | 10 | 150 | 2 (0)| 00:00:01 |
| 36 | TABLE ACCESS FULL| SYS_TEMP_0FD9| 10 | 150 | 2 (0)| 00:00:01 |
| 37 | BUFFER SORT | | 10 | 130 | 33 (34)| 00:00:01 |
| 38 | VIEW | | 10 | 130 | 2 (0)| 00:00:01 |
| 39 | TABLE ACCESS FULL | SYS_TEMP_0FD9| 10 | 130 | 2 (0)| 00:00:01 |
--------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
16 - filter( (SELECT MIN(4) FROM "SYS"."DUAL" "DUAL")>=ROWNUM)
But against one of the real tables it looks like this (for a slightly modified query):
----------------------------------------------------------------------------------------------------------------------------------------------------------
| Id | Operation | Name | Rows | Bytes |TempSpc| Cost (%CPU)| Time | Pstart| Pstop | TQ |IN-OUT| PQ Distrib |
----------------------------------------------------------------------------------------------------------------------------------------------------------
| 0 | SELECT STATEMENT | | 1 | 422 | | 6026M (1)|999:59:59 | | | | | |
| 1 | TEMP TABLE TRANSFORMATION | | | | | | | | | | | |
| 2 | LOAD AS SELECT | | | | | | | | | | | |
|* 3 | COUNT STOPKEY | | | | | | | | | | | |
| 4 | PX COORDINATOR | | | | | | | | | | | |
| 5 | PX SEND QC (RANDOM) | :TQ10000 | 10 | | | 2 (0)| 00:00:01 | | | Q1,00 | P->S | QC (RAND) |
|* 6 | COUNT STOPKEY | | | | | | | | | Q1,00 | PCWC | |
| 7 | PX BLOCK ITERATOR | | 10 | | | 2 (0)| 00:00:01 | 1 | 115 | Q1,00 | PCWC | |
| 8 | INDEX FAST FULL SCAN | IDX | 10 | | | 2 (0)| 00:00:01 | 1 | 115 | Q1,00 | PCWP | |
| 9 | SORT GROUP BY | | 1 | 422 | | 6026M (1)|999:59:59 | | | | | |
| 10 | MERGE JOIN CARTESIAN | | 22G| 8997G| | 6024M (1)|999:59:59 | | | | | |
| 11 | VIEW | | 2289M| 872G| | 1443M (1)|999:59:59 | | | | | |
| 12 | WINDOW SORT | | 2289M| 872G| 970G| 1443M (1)|999:59:59 | | | | | |
| 13 | WINDOW SORT | | 2289M| 872G| 970G| 1443M (1)|999:59:59 | | | | | |
| 14 | WINDOW SORT | | 2289M| 872G| 970G| 1443M (1)|999:59:59 | | | | | |
| 15 | WINDOW SORT | | 2289M| 872G| 970G| 1443M (1)|999:59:59 | | | | | |
| 16 | WINDOW SORT | | 2289M| 872G| 970G| 1443M (1)|999:59:59 | | | | | |
| 17 | WINDOW SORT | | 2289M| 872G| 970G| 1443M (1)|999:59:59 | | | | | |
| 18 | VIEW | | 2289M| 872G| | 248M (1)|829:16:06 | | | | | |
| 19 | WINDOW SORT | | 2289M| 162G| 198G| 248M (1)|829:16:06 | | | | | |
| 20 | WINDOW SORT | | 2289M| 162G| 198G| 248M (1)|829:16:06 | | | | | |
| 21 | WINDOW SORT | | 2289M| 162G| 198G| 248M (1)|829:16:06 | | | | | |
| 22 | WINDOW SORT | | 2289M| 162G| 198G| 248M (1)|829:16:06 | | | | | |
| 23 | WINDOW SORT | | 2289M| 162G| 198G| 248M (1)|829:16:06 | | | | | |
| 24 | WINDOW SORT | | 2289M| 162G| 198G| 248M (1)|829:16:06 | | | | | |
| 25 | PARTITION RANGE ALL| | 2289M| 162G| | 3587K (4)| 11:57:36 | 1 | 115 | | | |
| 26 | TABLE ACCESS FULL | LARGE_TABLE | 2289M| 162G| | 3587K (4)| 11:57:36 | 1 | 115 | | | |
| 27 | BUFFER SORT | | 10 | 130 | | 6026M (1)|999:59:59 | | | | | |
| 28 | VIEW | | 10 | 130 | | 2 (0)| 00:00:01 | | | | | |
| 29 | TABLE ACCESS FULL | SYS_TEMP_0FD9| 10 | 130 | | 2 (0)| 00:00:01 | | | | | |
----------------------------------------------------------------------------------------------------------------------------------------------------------
Predicate Information (identified by operation id):
---------------------------------------------------
3 - filter(ROWNUM<=10)
6 - filter(ROWNUM<=10)
I could speed things up by using from LARGE_TABLE sample (0.01), at the risk of skewing the results. For a table with 2 billion rows, this returns an answer after 53 minutes.
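For instance, an estimate for a single column from a 0.01% row sample might look like the sketch below; some_col is a placeholder column name, and the counts are sample counts rather than true totals:

select * from (
  select some_col, count(*) as sample_freq
  from LARGE_TABLE sample (0.01)
  group by some_col
  order by count(*) desc
) where rownum <= 10;

Whether the sample is representative enough depends entirely on how skewed the column's value distribution is.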
Answer 2 (score: 1)
You can't.
There's no trick here, just raw work.
Simply put, you have to visit every row in the table, count the occurrences of each value for every column you're interested in, and then sort those results to find the ones with the highest counts.
For a single column it's easy:
SELECT col, count(*) FROM table GROUP BY col ORDER BY count(*) DESC
and take the first row.
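In Oracle, "take the first row" would be a rownum wrapper (or FETCH FIRST on 12c and later); a minimal sketch, with the_table and col as placeholder names:

select col, freq from (
  select col, count(*) as freq
  from the_table
  group by col
  order by count(*) desc
) where rownum = 1;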
N columns means N table scans.
If you write the logic yourself and go through the table once, you end up counting every instance of every value for each column.
If you have 30 billion rows with 30 billion distinct values, you get to store all of them, each with a count of 1, and you do that for every column you care about.
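For what it's worth, that single pass can be expressed in plain SQL with UNPIVOT (11g and later), assuming the columns share a compatible datatype; this is only a sketch and untested at anything like this scale:

select col_name, col_value, cnt
from (
  select col_name, col_value, count(*) as cnt,
         row_number() over (partition by col_name order by count(*) desc) as rn
  from my_table
  unpivot (col_value for col_name in
           (firstname, lastname, favoriteanimal, favoritebook))
  group by col_name, col_value
)
where rn = 1;

It reads the table once, but the unpivot feeds roughly four times the row count into the aggregation, so the sorting is still the expensive part.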
If this information is important to you, you're better off capturing it as the data comes in and tracking it incrementally. But that's a different question.
Answer 3 (score: 1)
Assuming there are not too many distinct values in each column, you would need to do something like the following.
For a single column, the SQL would be:
select value from (
select value, count(*) from the_table
group by value
order by count(*) desc
) where rownum < 2
However, if you just combine several of these into one big SQL, I think it will scan the table multiple times (once per column), which is exactly what you don't want. Can you get an execution plan for that?
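For reference, a plan can be generated along these lines; the statement shown is just a placeholder for whichever candidate query you want to check:

explain plan for
select col, count(*) from the_table group by col order by count(*) desc;

select * from table(dbms_xplan.display);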
So you may have to write a program to do it, either on the server (PL/SQL or Java, if available) or as a client program.
Answer 4 (score: 0)
Loop over the records, keeping an in-memory count of how many times each value occurs for each column of interest.
Every so often (every X records, or whenever the accumulated data hits a fixed memory limit), loop over those in-memory counts, add them to the corresponding counts in some disk-based storage, and clear the in-memory data.
The details depend on the programming language you're using.
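A minimal PL/SQL sketch of this idea for a single column; my_table.firstname, the value_counts staging table, and the flush threshold are all assumptions, and a real version would cover every column and probably bulk-fetch:

-- assumed staging table:
--   create table value_counts (
--     column_name varchar2(30), field_value varchar2(100), cnt number,
--     primary key (column_name, field_value));
declare
  type t_counts is table of pls_integer index by varchar2(100);
  l_counts t_counts;
  l_key    varchar2(100);
  l_cnt    pls_integer;
  l_seen   pls_integer := 0;

  procedure flush is
  begin
    l_key := l_counts.first;
    while l_key is not null loop
      l_cnt := l_counts(l_key);
      merge into value_counts t
      using (select 'FIRSTNAME' as column_name, l_key as field_value,
                    l_cnt as cnt from dual) s
      on (t.column_name = s.column_name and t.field_value = s.field_value)
      when matched then update set t.cnt = t.cnt + s.cnt
      when not matched then insert (column_name, field_value, cnt)
                            values (s.column_name, s.field_value, s.cnt);
      l_key := l_counts.next(l_key);
    end loop;
    l_counts.delete;  -- clear the in-memory counts after each flush
    commit;
  end;
begin
  for rec in (select firstname from my_table where firstname is not null) loop
    if l_counts.exists(rec.firstname) then
      l_counts(rec.firstname) := l_counts(rec.firstname) + 1;
    else
      l_counts(rec.firstname) := 1;
    end if;
    l_seen := l_seen + 1;
    if l_seen >= 1000000 then  -- flush every X records
      flush;
      l_seen := 0;
    end if;
  end loop;
  flush;  -- flush whatever is left over
end;
/

The top N for the column then comes from a simple order-by query against value_counts.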
Answer 5 (score: 0)
Below I present a naive approach. I suspect it is completely infeasible for data sets much beyond a few hundred thousand rows, but perhaps a guru can use it as the basis for a more suitable answer.
How current do the query results need to be? You could select the results of the "group by" portion of the query below into some kind of cache, perhaps nightly.
Then you could run the final select against that.
Another possibility would be to create a trigger on the table in question that updates a "counter" table on every insert/update/delete.
The counter table would look something like this:
field_value count
Nancy 2
Bill 1
Ferris 1
You would need a counter table like this for each field you want to count.
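A sketch of such a trigger for the firstname column; my_table, firstname_counts, and the column sizes are assumptions, and note that a row-level trigger like this serializes concurrent sessions on popular values:

-- assumed: create table firstname_counts
--   (field_value varchar2(100) primary key, cnt number not null);
create or replace trigger trg_firstname_counts
after insert or update of firstname or delete on my_table
for each row
begin
  -- count the new value (insert, or update that changes firstname)
  if :new.firstname is not null and (inserting or updating) then
    merge into firstname_counts c
    using (select :new.firstname as field_value from dual) s
    on (c.field_value = s.field_value)
    when matched then update set c.cnt = c.cnt + 1
    when not matched then insert (field_value, cnt) values (s.field_value, 1);
  end if;
  -- uncount the old value (delete, or update that changes firstname)
  if :old.firstname is not null and (deleting or updating) then
    update firstname_counts
       set cnt = cnt - 1
     where field_value = :old.firstname;
  end if;
end;
/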
In short, I think you need a way to keep an eye on this data indirectly. I don't see any way around the actual counting being expensive, but if you have a way of tracking changes incrementally, you only have to do the heavy lifting once; then your cache plus whatever is new should give you what you need. The naive single-column query against the sample data follows.
select firstname, freq
from (
  select firstname, count(*) as freq
  from (
    select 'Ferris' as firstname, 'Freemont' as lastname,
           'Possum' as favoriteanimal, 'Ubik' as favoritebook from dual
    union all select 'Nancy', 'Freemont', 'Lemur', 'Housekeeping' from dual
    union all select 'Nancy', 'Drew', 'Penguin', 'Ubik' from dual
    union all select 'Bill', 'Ribbits', 'Lemur', 'Dhalgren' from dual
  ) sample_data
  group by firstname
  order by count(*) desc
)
where rownum = 1;