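Apache Spark has an option to split data into multiple files with the bucketBy command. For example, if I have 100 million user IDs, I can split the table into 32 different files, where some type of hashing algorithm is used to distribute the data across the files and look it up.
Can Postgres somehow split a table into a fixed number of partitions? If it is not a native feature, can it still be done, for example by generating a hash, turning the hash into a number, and taking the modulus % 32 as the partition range?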
Answer 0 (score: 1)
An example using modulo:
A short partitioning setup:
db=# create table p(i int);
CREATE TABLE
db=# create table p1 ( check (mod(i,3)=0) ) inherits (p);
CREATE TABLE
db=# create table p2 ( check (mod(i,3)=1) ) inherits (p);
CREATE TABLE
db=# create table p3 ( check (mod(i,3)=2) ) inherits (p);
CREATE TABLE
db=# create rule pir3 AS ON insert to p where mod(i,3) = 2 do instead insert into p3 values (new.*);
CREATE RULE
db=# create rule pir2 AS ON insert to p where mod(i,3) = 1 do instead insert into p2 values (new.*);
CREATE RULE
db=# create rule pir1 AS ON insert to p where mod(i,3) = 0 do instead insert into p1 values (new.*);
CREATE RULE
Check:
db=# insert into p values (1),(2),(3),(4),(5);
INSERT 0 0
db=# select * from p;
i
---
3
1
4
2
5
(5 rows)
db=# select * from p1;
i
---
3
(1 row)
db=# select * from p2;
i
---
1
4
(2 rows)
db=# select * from p3;
i
---
2
5
(2 rows)
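Writing 32 child tables and 32 rules by hand gets tedious, so the same pattern can be generated in a loop. The following is only a sketch under assumptions: the parent table `b`, the child names `b_0` .. `b_31`, and the rule names `b_ins_0` .. `b_ins_31` are hypothetical, and the bucket count of 32 comes from the question. Note that `mod()` in Postgres keeps the sign of the dividend, so negative values of `i` match none of the checks or rules and would stay in the parent table.

```sql
-- Hypothetical parent table, separate from p above so the rules do not interfere.
CREATE TABLE b (i int);

DO $$
DECLARE
    n constant int := 32;  -- number of buckets (assumption, taken from the question)
BEGIN
    FOR r IN 0 .. n - 1 LOOP
        -- Child table holding the rows where mod(i, n) = r
        EXECUTE format(
            'CREATE TABLE %I (CHECK (mod(i, %s) = %s)) INHERITS (b)',
            'b_' || r, n, r);
        -- Rule redirecting matching inserts on b into that child
        EXECUTE format(
            'CREATE RULE %I AS ON INSERT TO b WHERE mod(i, %s) = %s ' ||
            'DO INSTEAD INSERT INTO %I VALUES (NEW.*)',
            'b_ins_' || r, n, r, 'b_' || r);
    END LOOP;
END
$$;
```

After running it, `insert into b values (33);` should end up in `b_1`, which can be checked with `select * from b_1;` in the same way as for p1, p2, and p3.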
https://www.postgresql.org/docs/current/static/tutorial-inheritance.html https://www.postgresql.org/docs/current/static/ddl-partitioning.html
And a demonstration that the partitioning works:
db=# explain analyze select * from p where mod(i,3) = 2;
                                             QUERY PLAN
----------------------------------------------------------------------------------------------------
 Append  (cost=0.00..48.25 rows=14 width=4) (actual time=0.013..0.015 rows=2 loops=1)
   ->  Seq Scan on p  (cost=0.00..0.00 rows=1 width=4) (actual time=0.004..0.004 rows=0 loops=1)
         Filter: (mod(i, 3) = 2)
   ->  Seq Scan on p3  (cost=0.00..48.25 rows=13 width=4) (actual time=0.009..0.011 rows=2 loops=1)
         Filter: (mod(i, 3) = 2)
 Planning time: 0.203 ms
 Execution time: 0.052 ms
(7 rows)
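On PostgreSQL 11 and later, hash partitioning is also available natively through declarative partitioning, which is probably closer to what bucketBy does. The sketch below is an assumption-laden illustration, not part of the setup above: the table `users`, the column `user_id`, and the partition names `users_h0` .. `users_h3` are hypothetical, and it uses 4 buckets for brevity (for 32, use `MODULUS 32` with remainders 0 through 31). Postgres applies its own internal hash function to the key here, so rows are not placed by a plain `user_id % 4`.

```sql
-- Requires PostgreSQL 11 or later; all names are hypothetical.
CREATE TABLE users (
    user_id bigint NOT NULL
) PARTITION BY HASH (user_id);

-- One partition per remainder of the hashed key.
CREATE TABLE users_h0 PARTITION OF users FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE users_h1 PARTITION OF users FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE users_h2 PARTITION OF users FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE users_h3 PARTITION OF users FOR VALUES WITH (MODULUS 4, REMAINDER 3);

-- Rows inserted into the parent are routed automatically; no rules are needed.
INSERT INTO users SELECT generate_series(1, 1000);

-- Show how many rows landed in each partition.
SELECT tableoid::regclass AS part, count(*)
FROM users
GROUP BY 1
ORDER BY 1;
```

With this approach an equality filter such as `where user_id = 42` lets the planner prune to a single partition, whereas the inheritance setup above only excludes children when the query repeats the `mod(i,3)` expression from the check constraint.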