Partitioning a SQL result set into chunks

Date: 2015-02-11 01:31:52

Tags: sql

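While preparing a batch process, I need to partition groups of records so that the job can run as parallel streams. The records come from a table that may have millions of rows. My goal is to break these records (by primary key) into chunks of (approximately) equal size that can then be processed in parallel. I would like to choose the chunk size dynamically. Note that there may be gaps in the primary key sequence.

In other words, given this table, a predicate for the number of chunks, and a result set that provides the first and last sequence of each chunk: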

seq    | name   |
-------|--------|
1      | john   |
2      | joe    |
3      | joe    |
4      | joe    |
5      | joe    |
567    | kent   |
568    | katie  |
20000  | sue    |
200016 | jill   |
200027 | bill   |

I would get the following results, where (number of chunks) -> (first-seq,last-seq):

(2) -> (1,5),(567,200027)
(5) -> (1,2),(3,4),(5,567),(568,20000),(200016,200027)

Or, as a result set, something like this (when asking for 5 chunks):

 first_seq | last_seq |
-----------|----------|
  1        | 2        |
  3        | 4        |
  5        | 567      |
  568      | 20000    |
  200016   | 200027   |

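I assume some kind of window function is in order here, but I don't know how to approach this. Can anyone help me with a query?

2 Answers:

Answer 0: (score: 2)

The NTILE function should work for this in Oracle (I'm not sure about DB2):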

SELECT seq, ntile( 2 ) over (order by seq) chunk_num
  FROM my_table

(where 2 is the number of chunks)

Or, to get the results in the layout you describe:

SELECT chunk_num, MIN(seq) AS first_seq, MAX(seq) AS last_seq FROM (
  SELECT seq, ntile( 2 ) over (order by seq) chunk_num
    FROM my_table
  ) t
  GROUP BY chunk_num
  ORDER BY chunk_num

If the number of chunks does not divide the number of rows evenly, the extra rows are placed in the lower-numbered chunks.
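To make the remainder rule concrete, here is a minimal Python sketch that emulates NTILE followed by MIN/MAX per chunk on the sample rows from the question. The names `ntile_bounds` and `SAMPLE_SEQS` are made up for this illustration; it only mimics the bucket sizes and is not how any database implements NTILE.

```python
def ntile_bounds(seqs, chunks):
    """Emulate NTILE(chunks) OVER (ORDER BY seq), then MIN/MAX per chunk."""
    ordered = sorted(seqs)
    # Every chunk gets `base` rows; the first `extra` chunks get one more.
    base, extra = divmod(len(ordered), chunks)
    bounds = []
    start = 0
    for chunk_num in range(1, chunks + 1):
        size = base + (1 if chunk_num <= extra else 0)
        if size == 0:
            break  # more chunks requested than rows: remaining chunks are empty
        block = ordered[start:start + size]
        bounds.append((block[0], block[-1]))
        start += size
    return bounds


# Sample primary keys from the question (hypothetical constant name).
SAMPLE_SEQS = [1, 2, 3, 4, 5, 567, 568, 20000, 200016, 200027]

print(ntile_bounds(SAMPLE_SEQS, 2))
print(ntile_bounds(SAMPLE_SEQS, 5))
print(ntile_bounds(SAMPLE_SEQS, 3))  # 10 rows in 3 chunks: sizes 4, 3, 3
```

With 3 chunks, the single leftover row lands in chunk 1, which is the "lower-numbered chunks" behaviour described above.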

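Answer 1: (score: 1)

I think this should work on most database systems.

1) I have included chunk in the field list to make the output more verbose; the same goes for the order by.

2) The sequence is split into 10 chunks with ... (10 / (num_rows + ...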

select MIN(seq) as first_seq, MAX(seq) as last_seq, chunk from
        /*- Basic grouping formula (pseudocode): #row_chunk_number = floor( ( #total_num_chunks / #total_num_rows ) x #current_row_num ) + 1
          - The + 0.0 converts the integer field values to floats so the division is not truncated.
          - row_num starts at 0, so floor() + 1 yields chunk numbers 1 to 10; it stands in for rounding up because I'm not sure ceil() exists on all DB systems.
        */
        (select seq, floor(((10 / (num_rows + 0.0)) + 0.0) * (row_num + 0.0)) + 1 as chunk from
        (select 
            seq,
            /* `row_num` is the zero-based row number in the sequence range, obtained by counting all sequences smaller than the current one (assuming seq is unique and numeric). */
            (select COUNT(*) from table1 as b where b.seq < a.seq) as row_num,
            /* `num_rows` is the number of rows in the sequence range, computed in the inner query to avoid cluttering the actual calculation in the outer query (same performance). */
            (select COUNT(*) from table1 ) as num_rows
        /* dat1 is a derived table of seq (id), num_rows (total number of rows) and row_num (row number) */
        from table1 as a) as dat) as dat1 
group by chunk
order by chunk
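As a way to try this approach on the sample rows from the question, the sketch below runs a variant of the query against an in-memory SQLite database from Python. Several things here are assumptions and not part of the answer: the use of SQLite, the helper name `chunk_bounds`, the `:chunks` bind parameter replacing the hard-coded 10, and integer division in place of the `+ 0.0` float conversion and `floor()`.

```python
import sqlite3

# Sample primary keys from the question (hypothetical constant name).
SAMPLE_SEQS = [1, 2, 3, 4, 5, 567, 568, 20000, 200016, 200027]

# Variant of the query above: :chunks replaces the literal 10, and SQLite's
# integer division replaces floor() on floats (an assumption of this sketch).
QUERY = """
select min(seq) as first_seq, max(seq) as last_seq, chunk from
    (select seq, (:chunks * row_num) / num_rows + 1 as chunk from
        (select seq,
                (select count(*) from table1 as b where b.seq < a.seq) as row_num,
                (select count(*) from table1) as num_rows
         from table1 as a) as dat) as dat1
group by chunk
order by chunk
"""


def chunk_bounds(seqs, chunks):
    """Load seqs into a throwaway table and return (first_seq, last_seq) per chunk."""
    conn = sqlite3.connect(":memory:")
    conn.execute("create table table1 (seq integer primary key)")
    conn.executemany("insert into table1 (seq) values (?)", [(s,) for s in seqs])
    rows = conn.execute(QUERY, {"chunks": chunks}).fetchall()
    conn.close()
    return [(first, last) for first, last, _chunk in rows]


print(chunk_bounds(SAMPLE_SEQS, 2))
print(chunk_bounds(SAMPLE_SEQS, 5))
```

For these ten rows, both calls reproduce the chunk boundaries asked for in the question. Note that the correlated COUNT(*) subquery is evaluated once per row, so on a table with millions of rows it may be considerably slower than a window function where one is available.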