OLAP function processing - why is running M times over N/M partitions faster than running once over N records?

Time: 2017-07-20 14:23:55

Tags: sql teradata sql-optimization

I have a (very large) table like this:

CREATE SET TABLE LOAN 
  ( LoanNumber VARCHAR(100),
    LoanBalance DECIMAL(18,4),
    RecTimeStamp TIMESTAMP(0)
  )
PRIMARY INDEX (LoanNumber)
PARTITION BY RANGE_N
  ( RecTimeStamp BETWEEN 
        TIMESTAMP '2017-01-01 00:00:00+00:00' 
    AND TIMESTAMP '2017-12-31 23:59:59+00:00' 
    EACH INTERVAL '1' DAY 
  );

This table is usually summarized into snapshots. For example, the month-end snapshot for April would be:

-- Pretend there is just 2017 data there
CREATE SET TABLE LOAN_APRIL AS 
  ( SELECT * 
      FROM LOAN
     WHERE RecTimeStamp <= DATE '2017-04-30'
   QUALIFY row_number() OVER
             ( PARTITION BY LoanNumber 
                   ORDER BY RecTimeStamp DESC
             ) = 1
  ) WITH DATA
PRIMARY INDEX (LoanNumber);

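This usually takes a long time to run. I was experimenting yesterday and found that I get very good execution times by splitting it up like this: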

CREATE SET TABLE LOAN_APRIL_TMP
  ( LoanNumber VARCHAR(100),
    LoanBalance DECIMAL(18,4),
    RecTimeStamp TIMESTAMP(0)
  )
PRIMARY INDEX (LoanNumber);

CREATE SET TABLE LOAN_APRIL
  ( LoanNumber VARCHAR(100),
    LoanBalance DECIMAL(18,4),
    RecTimeStamp TIMESTAMP(0)
  )
PRIMARY INDEX (LoanNumber);

INSERT INTO LOAN_APRIL_TMP
    SELECT * 
      FROM LOAN
     WHERE RecTimeStamp BETWEEN DATE '2017-01-01' AND DATE '2017-01-31'
   QUALIFY row_number() OVER
             ( PARTITION BY LoanNumber 
                   ORDER BY RecTimeStamp DESC
             ) = 1;

INSERT INTO LOAN_APRIL_TMP
    SELECT * 
      FROM LOAN
     WHERE RecTimeStamp BETWEEN DATE '2017-02-01' AND DATE '2017-02-28'
   QUALIFY row_number() OVER
             ( PARTITION BY LoanNumber 
                   ORDER BY RecTimeStamp DESC
             ) = 1;

INSERT INTO LOAN_APRIL_TMP
    SELECT * 
      FROM LOAN
     WHERE RecTimeStamp BETWEEN DATE '2017-03-01' AND DATE '2017-03-31'
   QUALIFY row_number() OVER
             ( PARTITION BY LoanNumber 
                   ORDER BY RecTimeStamp DESC
             ) = 1;

INSERT INTO LOAN_APRIL_TMP
    SELECT * 
      FROM LOAN
     WHERE RecTimeStamp BETWEEN DATE '2017-04-01' AND DATE '2017-04-30'
   QUALIFY row_number() OVER
             ( PARTITION BY LoanNumber 
                   ORDER BY RecTimeStamp DESC
             ) = 1;

INSERT INTO LOAN_APRIL
    SELECT * 
      FROM LOAN_APRIL_TMP
   QUALIFY row_number() OVER
             ( PARTITION BY LoanNumber 
                   ORDER BY RecTimeStamp DESC
             ) = 1;

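I just ran these one after another, so they were not executed in parallel. Today I am going to experiment with loading each segment in parallel.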

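Also, on a broader note, I am having trouble finding enough technical documentation to answer these kinds of questions. Is there a good resource for this? I know much of it is proprietary, but there must be something that describes how these functions are implemented, at least at a high level.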

1 answer:

Answer 0 (score: 2)

There may be multiple reasons. You should check DBQL to see the actual difference in resource usage.

  • The data for the first Select is spread across more partitions than the data for the smaller Selects.

  • Explain might show that the spool will not be cached in memory for the large Select, while it will be for the individual Selects.

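  • VarChars in the ORDER BY are expanded to Chars of the defined size. If LoanNumber really is a VarChar(100) (which I doubt), this also increases the spool (but that is a common problem for other queries against this table, too).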
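A DBQL comparison along these lines could be sketched as follows. This assumes query logging is enabled for your user and that you have read access to the `DBC.QryLogV` view. The column names are the usual DBQL ones but can differ between releases, and the `LIKE` filter is only an illustrative way to find the statements in question.

```sql
-- Sketch: compare CPU, I/O and spool of today's LOAN_APRIL loads.
-- Assumes DBQL logging is active and DBC.QryLogV is readable.
SELECT QueryID,
       StartTime,
       FirstRespTime,
       AMPCPUTime,      -- total AMP CPU seconds
       TotalIOCount,    -- logical I/Os
       SpoolUsage,      -- peak spool in bytes
       SUBSTR(QueryText, 1, 80) AS QueryTextStart
  FROM DBC.QryLogV
 WHERE UserName = USER
   AND CAST(StartTime AS DATE) = CURRENT_DATE
   AND QueryText LIKE '%LOAN_APRIL%'   -- illustrative filter only
 ORDER BY StartTime;
```

Summing the rows for the four monthly Inserts plus the final Insert and comparing them with the single large statement shows whether the gain comes from CPU, I/O, or spool. For the spool-caching point, prefix each statement with EXPLAIN and compare the plan text.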

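One drawback of OLAP functions is that they need two spools, i.e. the spool size is doubled. If this table has many columns or large rows, it may be more efficient to run ROW_NUMBER against only the table's PK and then join back, like this: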

CREATE SET TABLE LOAN_APRIL_TMP
  ( LoanNumber VARCHAR(100),
    RecTimeStamp TIMESTAMP(0)
  )
PRIMARY INDEX (LoanNumber) -- same PPI as source table to facilitate fast join back
PARTITION BY RANGE_N
  ( RecTimeStamp BETWEEN 
        TIMESTAMP '2017-01-01 00:00:00+00:00' 
    AND TIMESTAMP '2017-12-31 23:59:59+00:00' 
    EACH INTERVAL '1' DAY 
  );

INSERT INTO LOAN_APRIL_TMP
SELECT LoanNumber, RecTimeStamp -- no other columns
FROM LOAN
WHERE RecTimeStamp <= DATE '2017-04-30'
QUALIFY row_number() OVER
             ( PARTITION BY LoanNumber 
                   ORDER BY RecTimeStamp DESC
             ) = 1
;

INSERT INTO LOAN_APRIL
SELECT l.* -- now get all columns
FROM LOAN AS l
JOIN LOAN_APRIL_TMP AS tmp
  ON l.LoanNumber = tmp.LoanNumber
 AND l.RecTimeStamp = tmp.RecTimeStamp;
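
One thing worth verifying with this approach: ROW_NUMBER keeps exactly one row per LoanNumber, but the join back returns every LOAN row that matches on LoanNumber and RecTimeStamp. If two rows share the latest timestamp and differ in another column, both end up in LOAN_APRIL, because a SET table only removes fully identical rows. A minimal check, assuming one row per loan is the intended result:

```sql
-- Sketch: list loans that ended up with more than one snapshot row.
-- An empty result means the join back did not fan out on timestamp ties.
SELECT LoanNumber,
       COUNT(*) AS RowCnt
  FROM LOAN_APRIL
 GROUP BY LoanNumber
HAVING COUNT(*) > 1;
```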