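While preparing a batch process, I need to partition groups of records so that I can run parallel streams of a job. The records come from a table that may have millions of rows. My goal is to break these records up (by primary key) into (approximately) equal-sized chunks that can then be processed in parallel. I want to choose the chunk size dynamically. Notably, there may be gaps in the primary key sequence.
In other words, given this table, I want a query that takes a predicate for the number of chunks and returns a result set with the first and last sequence of each chunk: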
seq    | name  |
-------|-------|
1      | john  |
2      | joe   |
3      | joe   |
4      | joe   |
5      | joe   |
567    | kent  |
568    | katie |
20000  | sue   |
200016 | jill  |
200027 | bill  |
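I would get the following results, where (number of chunks) -> (first-seq,last-seq):
(2) -> (1,5),(567,200027)
(5) -> (1,2),(3,4),(5,567),(568,20000),(200016,200027)
Or, as a result set, something like this (when asking for 5 chunks):
first_seq | last_seq |
----------|----------|
1         | 2        |
3         | 4        |
5         | 567      |
568       | 20000    |
200016    | 200027   |
I assume some kind of window function is in order here, but I have no idea how to approach this. Can anyone help me with the query?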
Answer 0: (score: 2)
The NTILE function might work for Oracle (I'm not sure about DB2):
SELECT seq, ntile( 2 ) over (order by seq) chunk_num
FROM my_table
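(where 2 is the number of chunks)
Or, to get the results in the layout you described:
SELECT chunk_num, MIN(seq) AS first_seq, MAX(seq) AS last_seq FROM (
SELECT seq, ntile( 2 ) over (order by seq) chunk_num
FROM my_table
)
GROUP BY chunk_num
If the number of chunks does not evenly divide the number of rows, the extra rows are placed in the lower-numbered chunks.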
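As a rough illustration of the NTILE behaviour described above, here is a minimal Python sketch that runs the same query shape against the question's sample data in an in-memory SQLite database. SQLite is only a stand-in (an assumption, since the answer targets Oracle) and needs version 3.25 or later for window functions; the helper name chunk_ranges is hypothetical.

```python
import sqlite3

# Sample seq values from the question; the gaps in the key are intentional.
SEQS = [1, 2, 3, 4, 5, 567, 568, 20000, 200016, 200027]


def chunk_ranges(num_chunks):
    """Return (chunk_num, first_seq, last_seq, row_count) for each NTILE chunk."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE my_table (seq INTEGER PRIMARY KEY)")
    conn.executemany("INSERT INTO my_table (seq) VALUES (?)", [(s,) for s in SEQS])
    rows = conn.execute(
        """
        SELECT chunk_num, MIN(seq), MAX(seq), COUNT(*)
        FROM (
            SELECT seq, NTILE(?) OVER (ORDER BY seq) AS chunk_num
            FROM my_table
        ) AS t
        GROUP BY chunk_num
        ORDER BY chunk_num
        """,
        (num_chunks,),
    ).fetchall()
    conn.close()
    return rows


if __name__ == "__main__":
    for n in (2, 3, 5):
        print(n, chunk_ranges(n))
```

With 3 chunks over the 10 sample rows, the chunk sizes come out as 4, 3, 3, so the one extra row lands in chunk 1.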
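Answer 1: (score: 1)
I think this should work on most database systems.
1) I put chunk in the select list to make the output more verbose; the same goes for the order by.
2) The sequence is split into 10 chunks by the ... (10 / (num_rows + ... part of the query.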
...
select MIN(seq) as first_seq, MAX(seq) as last_seq, chunk from
/* - Basic grouping formula (pseudo): #row_chunk_number = round-up( ( #total_num_chunks / #total_num_rows ) x #current_row_num )
- The +0.0 converts field values to floats.
- floor() + 1 is used in place of rounding up, since I'm not sure ceil() exists on all DB systems. Because row_num starts at 0, chunk numbers stay within 1..10.
*/
(select seq, floor(((10 / (num_rows + 0.0)) + 0.0) * (row_num + 0.0)) + 1 as chunk from
(select
seq,
/*`row_num` is the row number in the sequence range - achieved by iteratively counting all sequences smaller than current (assuming seq is unique and numeric).*/
(select COUNT(*) from table1 as b where b.seq < a.seq) as row_num,
/*`num_rows` is the number of rows in the sequence range - added to inner query to prevent cluttering the actual math calc in the outer query (same performance).*/
(select COUNT(*) from table1 ) as num_rows
/*dat1 is a derived table of seq (id), num_rows (total number rows) and row_num (row number)*/
from table1 as a) as dat) as dat1
group by chunk
order by chunk
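To make the arithmetic in the query above easier to follow, here is a small Python sketch of the same grouping formula applied to the question's sample data. It is an illustration under assumptions, not the answer's own code: the function name formula_chunks is hypothetical, and integer division stands in for floor() on the float expression.

```python
# Sample seq values from the question, already sorted ascending.
SEQS = [1, 2, 3, 4, 5, 567, 568, 20000, 200016, 200027]


def formula_chunks(seqs, num_chunks):
    """Group sorted seq values using chunk = floor(num_chunks * row_num / num_rows) + 1."""
    num_rows = len(seqs)
    ranges = {}
    # row_num is 0-based, like the COUNT(*) of smaller seq values in the SQL.
    for row_num, seq in enumerate(seqs):
        # Assumption: integer division gives the same result as floor() on the
        # float expression for realistic row counts.
        chunk = (num_chunks * row_num) // num_rows + 1
        first, _ = ranges.get(chunk, (seq, seq))
        ranges[chunk] = (first, seq)
    return [(chunk, first, last) for chunk, (first, last) in sorted(ranges.items())]


if __name__ == "__main__":
    for n in (4, 5):
        print(n, formula_chunks(SEQS, n))
```

One difference worth noting: with 4 chunks over these 10 rows the formula produces chunk sizes 3, 2, 3, 2, whereas NTILE(4) would give 3, 3, 2, 2. Both are roughly even, but the chunk boundaries are not identical.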