Spool error in Teradata

Asked: 2014-01-22 13:08:14

Tags: sql, teradata

I am trying to run the query below. Its size may be pushing the database's limits, but tables of a similar size are working.

I know there is a way to partition the query using the HASHAMP, HASHBUCKET and HASHROW functions, but I do not know how to do it.
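
For reference, the usual approach is to add a filter on the row hash of the join column so the statement can be run in slices. Below is a minimal sketch, assuming graph_total_final has already been created empty (e.g. with the same CREATE ... WITH NO DATA) and using an arbitrary split factor of 4; it would be run once for each remainder 0 through 3:

-- HASHROW gives the row hash of the value, HASHBUCKET maps it to a hash
-- bucket, and HASHAMP maps the bucket to its owning AMP; the MOD filter
-- restricts each run to roughly a quarter of the rows.
insert into graph_total_final
select a.*, coalesce(b.main_acct_product_id, 'NO MOV') as producto_destino
from graph_total_3 a
left join producto b on a.access_destino = b.access_method_id
where HASHAMP(HASHBUCKET(HASHROW(a.access_destino))) MOD 4 = 0;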

The query is simple: I am just checking whether the main_acct_product_id variable exists on table b.

Some information about the tables in the query:

sel count(*) from graph_total_3
678.336.354

top 5 of graph_total_3
id_phone    destino     WEIGHT   DIR        access_method_id  access_destino  operador  producto  operador_destino
2615071884  2615628271  0,42800  0,417000   T2615071884       T2615628271     A         aa        II
1150421872  1159393065  343,200  0,424000   T1150421872       T1159393065     B         bb        LI
2914076292  2914735291  0,16500  1,003000   T2914076292       T2914735291     C         ar        OJ
2914735291  2914076292  0,16500  -0,003000  T2914735291       T2914076292     A         tm        JA
2804535124  2804454795  0,39600  1,000000   T2804535124       T2804454795     B         ma        UE

primary key(id_phone, destino);

sel count(*) from producto
26.473.287

top 5 of producto
    Access_Method_Id    Main_Acct_Product_Id
    T2974002818         PR_PPAL_AHORRO  
    T3875943432         PR_PPAL_ACTIVA  
    T2616294339         PR_PPAL_ACTIVA  
    T3516468805         PR_PPAL_ACTIVA  
    T2616818855         PR_PPAL_ACTIVA  

primary key(Access_Method_Id);

SHOW TABLE

show table producto

CREATE MULTISET VOLATILE TABLE MARBEL.producto ,NO FALLBACK ,
     CHECKSUM = DEFAULT,
     LOG
     (
      Access_Method_Id VARCHAR(50) CHARACTER SET LATIN NOT CASESPECIFIC,
      Main_Acct_Product_Id CHAR(16) CHARACTER SET LATIN NOT CASESPECIFIC)
PRIMARY INDEX ( Access_Method_Id )
ON COMMIT PRESERVE ROWS;

show table graph_total_3

CREATE MULTISET VOLATILE TABLE MARBEL.graph_total_3 ,NO FALLBACK ,
     CHECKSUM = DEFAULT,
     LOG
     (
      id_phone VARCHAR(21) CHARACTER SET LATIN NOT CASESPECIFIC,
      destino VARCHAR(21) CHARACTER SET LATIN NOT CASESPECIFIC,
      WEIGHT DECIMAL(10,5),
      DIR DECIMAL(7,6),
      access_method_id VARCHAR(22) CHARACTER SET LATIN NOT CASESPECIFIC,
      access_destino VARCHAR(22) CHARACTER SET LATIN NOT CASESPECIFIC,
      operador VARCHAR(8) CHARACTER SET UNICODE NOT CASESPECIFIC,
      producto VARCHAR(16) CHARACTER SET LATIN NOT CASESPECIFIC,
      operador_destino VARCHAR(8) CHARACTER SET UNICODE NOT CASESPECIFIC)
PRIMARY INDEX ( id_phone ,destino )
ON COMMIT PRESERVE ROWS;

QUERY

create multiset volatile table graph_total_final as
(
select  a.* ,  coalesce(b.main_acct_product_id,'NO MOV') as producto_destino
from graph_total_3 a
left join producto b on a.access_destino=b.access_method_id
)
with data primary index (id_phone, destino)
on commit preserve rows;

EXPLAIN

     This query is optimized using type 1 profile bootstrap, profileid -/. 
      1) First, we create the table header. 
      2) Next, we do an all-AMPs RETRIEVE step from MARBEL.a by way of an
         all-rows scan with no residual conditions into Spool 2 (all_amps),
         which is redistributed by the hash code of (
         MARBEL.a.access_destino) to all AMPs.  Then we do a SORT to order
         Spool 2 by row hash.  The result spool file will not be cached in
         memory.  The size of Spool 2 is estimated with high confidence to
         be 678,343,248 rows (55,624,146,336 bytes).  The estimated time
         for this step is 2 minutes and 41 seconds. 
      3) We do an all-AMPs JOIN step from Spool 2 (Last Use) by way of a
         RowHash match scan, which is joined to MARBEL.b by way of a
         RowHash match scan.  Spool 2 and MARBEL.b are left outer joined
         using a merge join, with condition(s) used for non-matching on
         left table ("NOT (access_destino IS NULL)"), with a join condition
         of ("access_destino = MARBEL.b.Access_Method_Id").  The result
         goes into Spool 1 (all_amps), which is redistributed by the hash
         code of (MARBEL.a.id_phone, MARBEL.a.destino) to all AMPs.  Then
         we do a SORT to order Spool 1 by row hash.  The result spool file
         will not be cached in memory.  The size of Spool 1 is estimated
         with index join confidence to be 25,085,452,093 rows (
         2,232,605,236,277 bytes).  The estimated time for this step is 1
         hour and 45 minutes. 
      4) We do an all-AMPs MERGE into MARBEL.graph_total_final from Spool 1
         (Last Use). 
      5) Finally, we send out an END TRANSACTION step to all AMPs involved
         in processing the request.
      -> No rows are returned to the user as the result of statement 1. 

EXPLAIN 2

After running:

DIAGNOSTIC HELPSTATS ON FOR SESSION;
EXPLAIN
create multiset volatile table graph_total_final as
(
select  a.* ,  coalesce(b.main_acct_product_id,'NO MOVISTAR') as producto_destino
from graph_total_3 a
left join producto b on a.access_destino=b.access_method_id
)
with data primary index (id_phone, destino, access_destino)
on commit preserve rows;

 This query is optimized using type 1 profile bootstrap, profileid -/. 
  1) First, we create the table header. 
  2) Next, we do an all-AMPs RETRIEVE step from MARBEL.a by way of an
     all-rows scan with no residual conditions into Spool 2 (all_amps),
     which is redistributed by the hash code of (
     MARBEL.a.access_destino) to all AMPs.  Then we do a SORT to order
     Spool 2 by row hash.  The result spool file will not be cached in
     memory.  The size of Spool 2 is estimated with high confidence to
     be 678,343,248 rows (55,624,146,336 bytes).  The estimated time
     for this step is 2 minutes and 41 seconds. 
  3) We do an all-AMPs JOIN step from Spool 2 (Last Use) by way of a
     RowHash match scan, which is joined to MARBEL.b by way of a
     RowHash match scan.  Spool 2 and MARBEL.b are left outer joined
     using a merge join, with condition(s) used for non-matching on
     left table ("NOT (access_destino IS NULL)"), with a join condition
     of ("access_destino = MARBEL.b.Access_Method_Id").  The result
     goes into Spool 1 (all_amps), which is redistributed by the hash
     code of (MARBEL.a.id_phone, MARBEL.a.destino,
     MARBEL.a.access_destino) to all AMPs.  Then we do a SORT to order
     Spool 1 by row hash.  The result spool file will not be cached in
     memory.  The size of Spool 1 is estimated with index join
     confidence to be 25,085,452,093 rows (2,232,605,236,277 bytes). 
     The estimated time for this step is 1 hour and 45 minutes. 
  4) We do an all-AMPs MERGE into MARBEL.graph_total_final from Spool 1
     (Last Use). 
  5) Finally, we send out an END TRANSACTION step to all AMPs involved
     in processing the request.
  -> No rows are returned to the user as the result of statement 1. 
     BEGIN RECOMMENDED STATS ->
  6) "COLLECT STATISTICS MARBEL.producto COLUMN ACCESS_METHOD_ID". 
     (HighConf)
  7) "COLLECT STATISTICS MARBEL.graph_total_3 COLUMN ACCESS_DESTINO". 
     (HighConf)
     <- END RECOMMENDED STATS

2 Answers:

Answer 0 (score: 3)

These are volatile tables, which means you created them within the current session and are in control of their definition.

When you change the primary index of MARBEL.graph_total_3 to access_destino, you get a direct AMP-local join without any preparation step (and you do not need to collect statistics, as this does not change the plan; it only brings the estimated numbers closer to reality).
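
Since graph_total_3 is a volatile table, the change amounts to rebuilding it. A minimal sketch (graph_total_3_v2 is a hypothetical name; point the query at it once the copy is verified):

create multiset volatile table graph_total_3_v2 as
(
  select * from graph_total_3  -- copy the data unchanged
)
with data primary index (access_destino)  -- PI now matches the join column
on commit preserve rows;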

The table will probably be skewed due to the new PI, but when you look at the Explain you will see that the spool now gets its PI on access_destino.

If MARBEL.producto.Access_Method_Id is actually unique, you should also define the PI as unique. That will improve the estimates, too.
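
A sketch of that change (producto_u is a hypothetical name); note that the statement fails if duplicates exist, which doubles as a uniqueness check:

create multiset volatile table producto_u as
(
  select * from producto
)
with data unique primary index (access_method_id)
on commit preserve rows;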

Answer 1 (score: 2)

Right off the bat, two things strike me as odd.

I would suggest avoiding select a.*, ... unless you really need to pull every column from table A. That will reduce the amount of data that has to be held in spool.
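
For example, if only the phone pair and the two measures were actually needed downstream (an assumption; keep whichever columns you really use), the select would shrink to something like:

select a.id_phone, a.destino, a.WEIGHT, a.DIR,
       coalesce(b.main_acct_product_id, 'NO MOV') as producto_destino
from graph_total_3 a
left join producto b on a.access_destino = b.access_method_id;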

The second thing that looks suspicious is this sentence in step 3: The size of Spool 1 is estimated with index join confidence to be 25,085,452,093 rows. Are you sure table B is unique on the access_method_id column? If it is not, you may be inadvertently creating a Cartesian product (25 billion rows - really!).
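
A quick sketch of such a uniqueness check; if it returns any rows, access_method_id has duplicates and the join will fan out:

select access_method_id, count(*) as cnt
from producto
group by access_method_id
having count(*) > 1;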

Also, please tell us the demographics of your A & B tables (i.e. the primary indexes, and whether the tables are partitioned).

Update (after seeing the additional information): The only other thing I can think of (especially if your Teradata environment is not particularly powerful, with lots of disk space) is to make sure your data is compressed as much as possible. This saves space (even while the data sits in spool) and reduces the amount of spool space needed.

Here is a candidate for compression in table B:

Main_Acct_Product_Id CHAR(16) CHARACTER SET LATIN NOT CASESPECIFIC COMPRESS ('PR_PPAL_AHORRO', 'PR_PPAL_ACTIVA', <continue with the list of roughly the 200 most frequently occurring main_acct_product_ids>)

By doing this, each of those 16-byte strings gets compressed down to a few bits, at no cost in CPU time.

Similarly, do the same for the following columns in table A:

      operador VARCHAR(8) CHARACTER SET UNICODE NOT CASESPECIFIC COMPRESS ('A', 'B', 'C', <other frequently occurring operador ids>),
      producto VARCHAR(16) CHARACTER SET LATIN NOT CASESPECIFIC COMPRESS ('aa', 'bb', 'ar', <other frequently occurring producto ids>),
      operador_destino VARCHAR(8) CHARACTER SET UNICODE NOT CASESPECIFIC COMPRESS ('II', 'LI', 'OJ', <other frequently occurring operador_destino ids>)

Consider storing id_phone & destino as INT, or BIGINT if INT is not big enough. A BIGINT takes 8 bytes, whereas stored in a VARCHAR you spend 10-12 bytes; when you have hundreds of millions of rows, every byte saved helps. You can also compress the WEIGHT and DIR columns: for example, if 0.0000 is the most frequently occurring weight/dir, you can specify COMPRESS (0.0000) and gain space. All COMPRESS clauses must be specified when the table is created.
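
A minimal sketch of such a redefinition, abbreviated to the changed columns (this assumes every id_phone/destino value is purely numeric, and that your Teradata release supports COMPRESS on DECIMAL columns):

CREATE MULTISET VOLATILE TABLE graph_total_3_slim ,NO FALLBACK
     (
      id_phone BIGINT,
      destino BIGINT,
      WEIGHT DECIMAL(10,5) COMPRESS (0.00000),   -- assumes 0 is the most common weight
      DIR DECIMAL(7,6) COMPRESS (0.000000))      -- assumes 0 is the most common dir
PRIMARY INDEX ( id_phone ,destino )
ON COMMIT PRESERVE ROWS;

insert into graph_total_3_slim
select cast(id_phone as bigint), cast(destino as bigint), WEIGHT, DIR
from graph_total_3;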

access_method_id and access_destino appear to be just id_phone with a 'T' prefix; see whether you can strip the first letter and store them as integers as well. All of this should save a considerable amount of space, and hopefully reduce the spool space needed to execute the query.
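
A sketch of stripping the prefix; this assumes every value really does start with 'T' followed by digits, since the CAST fails on anything else:

select cast(substr(access_destino, 2) as bigint) as access_destino_num
from graph_total_3;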

Finally, I do not know about partitioning a query via hashamp/bucket/row (I know about partitioning tables, not queries) - Teradata should execute every query in parallel anyway.