I am trying to run the following query. Its size may be pushing the database's limits, but tables of a similar size are working fine.
I know there is a way to partition the query using the HASHAMP, HASHBUCKET and HASHROW functions, but I don't know how to do it.
The query is simple: I just check whether the main_acct_product_id value exists in table b.
Some information about the tables in the query:
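For context, a common use of the three hash functions mentioned above is not to split the query itself but to check how evenly a column would distribute rows across AMPs. A sketch against the tables in this question (HASHROW computes the row hash, HASHBUCKET maps it to a hash bucket, HASHAMP maps the bucket to an AMP):

```sql
-- Skew check: roughly equal row_cnt per AMP means access_destino
-- hashes well as a join / primary index column.
SELECT HASHAMP(HASHBUCKET(HASHROW(access_destino))) AS amp_no,
       COUNT(*) AS row_cnt
FROM   graph_total_3
GROUP BY 1
ORDER BY row_cnt DESC;
```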
sel count(*) from graph_total_3
678,336,354
top 5 of graph_total_3
id_phone destino WEIGHT DIR access_method_id access_destino operador producto operador_destino
2615071884 2615628271 0,42800 0,417000 T2615071884 T2615628271 A aa II
1150421872 1159393065 343,200 0,424000 T1150421872 T1159393065 B bb LI
2914076292 2914735291 0,16500 1,003,000 T2914076292 T2914735291 C ar OJ
2914735291 2914076292 0,16500 -0,003000 T2914735291 T2914076292 A tm JA
2804535124 2804454795 0,39600 1,000,000 T2804535124 T2804454795 B ma UE
primary key(id_phone, destino);
sel count(*) from producto
26,473,287
top 5 of producto
Access_Method_Id Main_Acct_Product_Id
T2974002818 PR_PPAL_AHORRO
T3875943432 PR_PPAL_ACTIVA
T2616294339 PR_PPAL_ACTIVA
T3516468805 PR_PPAL_ACTIVA
T2616818855 PR_PPAL_ACTIVA
primary key(Access_Method_Id);
SHOW TABLE
show table producto
CREATE MULTISET VOLATILE TABLE MARBEL.producto ,NO FALLBACK ,
CHECKSUM = DEFAULT,
LOG
(
Access_Method_Id VARCHAR(50) CHARACTER SET LATIN NOT CASESPECIFIC,
Main_Acct_Product_Id CHAR(16) CHARACTER SET LATIN NOT CASESPECIFIC)
PRIMARY INDEX ( Access_Method_Id )
ON COMMIT PRESERVE ROWS;
show table graph_total_3
CREATE MULTISET VOLATILE TABLE MARBEL.graph_total_3 ,NO FALLBACK ,
CHECKSUM = DEFAULT,
LOG
(
id_phone VARCHAR(21) CHARACTER SET LATIN NOT CASESPECIFIC,
destino VARCHAR(21) CHARACTER SET LATIN NOT CASESPECIFIC,
WEIGHT DECIMAL(10,5),
DIR DECIMAL(7,6),
access_method_id VARCHAR(22) CHARACTER SET LATIN NOT CASESPECIFIC,
access_destino VARCHAR(22) CHARACTER SET LATIN NOT CASESPECIFIC,
operador VARCHAR(8) CHARACTER SET UNICODE NOT CASESPECIFIC,
producto VARCHAR(16) CHARACTER SET LATIN NOT CASESPECIFIC,
operador_destino VARCHAR(8) CHARACTER SET UNICODE NOT CASESPECIFIC)
PRIMARY INDEX ( id_phone ,destino )
ON COMMIT PRESERVE ROWS;
QUERY
create multiset volatile table graph_total_final as
(
select a.* , coalesce(b.main_acct_product_id,'NO MOV') as producto_destino
from graph_total_3 a
left join producto b on a.access_destino=b.access_method_id
)
with data primary index (id_phone, destino)
on commit preserve rows;
EXPLAIN
This query is optimized using type 1 profile bootstrap, profileid -/.
1) First, we create the table header.
2) Next, we do an all-AMPs RETRIEVE step from MARBEL.a by way of an
all-rows scan with no residual conditions into Spool 2 (all_amps),
which is redistributed by the hash code of (
MARBEL.a.access_destino) to all AMPs. Then we do a SORT to order
Spool 2 by row hash. The result spool file will not be cached in
memory. The size of Spool 2 is estimated with high confidence to
be 678,343,248 rows (55,624,146,336 bytes). The estimated time
for this step is 2 minutes and 41 seconds.
3) We do an all-AMPs JOIN step from Spool 2 (Last Use) by way of a
RowHash match scan, which is joined to MARBEL.b by way of a
RowHash match scan. Spool 2 and MARBEL.b are left outer joined
using a merge join, with condition(s) used for non-matching on
left table ("NOT (access_destino IS NULL)"), with a join condition
of ("access_destino = MARBEL.b.Access_Method_Id"). The result
goes into Spool 1 (all_amps), which is redistributed by the hash
code of (MARBEL.a.id_phone, MARBEL.a.destino) to all AMPs. Then
we do a SORT to order Spool 1 by row hash. The result spool file
will not be cached in memory. The size of Spool 1 is estimated
with index join confidence to be 25,085,452,093 rows (
2,232,605,236,277 bytes). The estimated time for this step is 1
hour and 45 minutes.
4) We do an all-AMPs MERGE into MARBEL.graph_total_final from Spool 1
(Last Use).
5) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> No rows are returned to the user as the result of statement 1.
EXPLAIN 2
After running:
DIAGNOSTIC HELPSTATS ON FOR SESSION;
EXPLAIN
create multiset volatile table graph_total_final as
(
select a.* , coalesce(b.main_acct_product_id,'NO MOVISTAR') as producto_destino
from graph_total_3 a
left join producto b on a.access_destino=b.access_method_id
)
with data primary index (id_phone, destino, access_destino)
on commit preserve rows;
This query is optimized using type 1 profile bootstrap, profileid -/.
1) First, we create the table header.
2) Next, we do an all-AMPs RETRIEVE step from MARBEL.a by way of an
all-rows scan with no residual conditions into Spool 2 (all_amps),
which is redistributed by the hash code of (
MARBEL.a.access_destino) to all AMPs. Then we do a SORT to order
Spool 2 by row hash. The result spool file will not be cached in
memory. The size of Spool 2 is estimated with high confidence to
be 678,343,248 rows (55,624,146,336 bytes). The estimated time
for this step is 2 minutes and 41 seconds.
3) We do an all-AMPs JOIN step from Spool 2 (Last Use) by way of a
RowHash match scan, which is joined to MARBEL.b by way of a
RowHash match scan. Spool 2 and MARBEL.b are left outer joined
using a merge join, with condition(s) used for non-matching on
left table ("NOT (access_destino IS NULL)"), with a join condition
of ("access_destino = MARBEL.b.Access_Method_Id"). The result
goes into Spool 1 (all_amps), which is redistributed by the hash
code of (MARBEL.a.id_phone, MARBEL.a.destino,
MARBEL.a.access_destino) to all AMPs. Then we do a SORT to order
Spool 1 by row hash. The result spool file will not be cached in
memory. The size of Spool 1 is estimated with index join
confidence to be 25,085,452,093 rows (2,232,605,236,277 bytes).
The estimated time for this step is 1 hour and 45 minutes.
4) We do an all-AMPs MERGE into MARBEL.graph_total_final from Spool 1
(Last Use).
5) Finally, we send out an END TRANSACTION step to all AMPs involved
in processing the request.
-> No rows are returned to the user as the result of statement 1.
BEGIN RECOMMENDED STATS ->
6) "COLLECT STATISTICS MARBEL.producto COLUMN ACCESS_METHOD_ID".
(HighConf)
7) "COLLECT STATISTICS MARBEL.graph_total_3 COLUMN ACCESS_DESTINO".
(HighConf)
<- END RECOMMENDED STATS
Answer 0 (score: 3)
These are volatile tables, which means you created them in your current session and you control their definitions.
If you change the primary index of MARBEL.graph_total_3 to access_destino, you get a direct AMP-local join without any preparation steps (and you don't need to collect statistics, because that will not change the plan; it only brings the estimated numbers closer to reality).
The table might be skewed with the new PI, but if you look at the EXPLAIN you will see that the spool would be hashed on access_destino anyway.
If MARBEL.producto.Access_Method_Id is actually unique, you should also define that PI as unique. That will improve the estimates as well.
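A minimal sketch of this advice, assuming the session can afford to rebuild the volatile tables (the `_v2` names are illustrative):

```sql
-- Rebuild graph_total_3 with access_destino as the PI so the join
-- to producto is AMP-local: both sides hash on the join column.
CREATE MULTISET VOLATILE TABLE graph_total_3_v2 AS
( SELECT * FROM graph_total_3 )
WITH DATA PRIMARY INDEX (access_destino)
ON COMMIT PRESERVE ROWS;

-- If Access_Method_Id really is unique, declare the PI unique too,
-- which also improves the optimizer's row estimates.
CREATE MULTISET VOLATILE TABLE producto_v2 AS
( SELECT * FROM producto )
WITH DATA UNIQUE PRIMARY INDEX (Access_Method_Id)
ON COMMIT PRESERVE ROWS;
```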
Answer 1 (score: 2)
Right off the bat, two things strike me as odd.
I would suggest avoiding select a.*, ... unless you really need every column from table A. That reduces the amount of data that has to be carried in spool.
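For example, instead of a.* the join could name only the columns needed downstream (a sketch; keep whichever columns actually matter to you):

```sql
SELECT a.id_phone,
       a.destino,
       a.WEIGHT,
       a.DIR,
       COALESCE(b.main_acct_product_id, 'NO MOV') AS producto_destino
FROM   graph_total_3 a
LEFT JOIN producto b
       ON a.access_destino = b.access_method_id;
```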
The second thing that looks suspicious is this sentence in step 3: The size of Spool 1 is estimated with index join confidence to be 25,085,452,093 rows. Are you sure table B is unique on the access_method_id column? If it is not, you may be inadvertently creating a Cartesian product. (25 billion rows - really!)
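That uniqueness suspicion is cheap to verify before rerunning the big join:

```sql
-- Any rows returned here mean access_method_id is NOT unique in B,
-- and each duplicate multiplies the matching rows coming from A.
SELECT access_method_id,
       COUNT(*) AS cnt
FROM   producto
GROUP BY access_method_id
HAVING COUNT(*) > 1;
```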
Also, please tell us the demographics of your A and B tables (i.e. the primary indexes, and whether the tables are partitioned).
Update (after looking at the additional information): The only other thing I can think of (especially if your Teradata environment is not particularly beefy, with lots of disk space) is to make sure your data is compressed as much as possible. That saves space (even while the data sits in spool) and reduces the amount of spool space required.
Here is a candidate for compression in table B:
Main_Acct_Product_Id CHAR(16) CHARACTER SET LATIN NOT CASESPECIFIC COMPRESS ('PR_PPAL_AHORRO', 'PR_PPAL_ACTIVA', <continue with list for about the 200 most frequently occurring main acct product ids>)
Doing this compresses each 16-byte string down to a few bits, at no extra CPU cost.
Similarly, do the same for the following columns in table A:
operador VARCHAR(8) CHARACTER SET UNICODE NOT CASESPECIFIC COMPRESS ('A','B','C', <other more frequently occurring operador ids>),
producto VARCHAR(16) CHARACTER SET LATIN NOT CASESPECIFIC COMPRESS ('aa','bb','ar', <other more frequently occurring producto ids>),
operador_destino VARCHAR(8) CHARACTER SET UNICODE NOT CASESPECIFIC COMPRESS ('II','LI','OJ', <other more frequently occurring operador_destino ids>)
Consider storing id_phone and destino as INT, or BIGINT if INT is not big enough. A BIGINT takes 8 bytes, whereas stored as VARCHAR they cost 10-12 bytes; when you have hundreds of millions of rows, every byte saved helps. You can also compress the WEIGHT and DIR columns - for example, if 0.0000 is the most frequently occurring weight/dir value, you can specify COMPRESS (0.0000) and gain space. All COMPRESS clauses must be specified when the table is created.
access_method_id and access_destino appear to be just id_phone with a 'T' prefix; see if you can strip the first letter and store them as integers too. All of this should save a significant amount of space and hopefully reduce the spool space needed to execute the query.
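Putting these space-saving suggestions together, a rebuilt table A could look roughly like this. This is a sketch only: the COMPRESS value lists are illustrative (the real lists should come from the most frequent values in your data), and it assumes your Teradata release allows COMPRESS on volatile tables:

```sql
CREATE MULTISET VOLATILE TABLE graph_total_3_slim ,NO FALLBACK
(
  id_phone         BIGINT,          -- was VARCHAR(21), 'T' prefix stripped
  destino          BIGINT,
  WEIGHT           DECIMAL(10,5) COMPRESS (0.00000),
  DIR              DECIMAL(7,6)  COMPRESS (0.000000),
  operador         VARCHAR(8)  CHARACTER SET UNICODE COMPRESS ('A','B','C'),
  producto         VARCHAR(16) CHARACTER SET LATIN   COMPRESS ('aa','bb','ar'),
  operador_destino VARCHAR(8)  CHARACTER SET UNICODE COMPRESS ('II','LI','OJ')
)
PRIMARY INDEX (id_phone, destino)
ON COMMIT PRESERVE ROWS;

-- access_method_id / access_destino are dropped here: they can be
-- derived when needed, e.g. 'T' || TRIM(id_phone).
```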
Finally, I don't know about partitioning a query via hashamp/bucket/row (I partition tables, not queries) - Teradata should execute all queries in parallel anyway.