Oracle SQL: Using CUBE

Date: 2018-04-18 14:00:26

Tags: sql group-by oracle12c

My goal:

I want to explicitly count every possible combination by using CUBE.

The query I use:

select         col_3,
                col_4, 
                col_5, 
                col_6,
                col_7, 
                col_8,
                col_9,
                count(distinct col_1)    count_distinct
from            tmp_test_data
group           by cube (
                         col_3,
                         col_4,
                         col_5,
                         col_6,
                         col_7,
                         col_8,
                         col_9
                        )
order by        col_3,
                col_4,
                col_5,
                col_6,
                col_7,
                col_8,
                col_9
                ;

Sample data (the full table has 100K rows):

col3 col4 col5 col6 col7 col8 col9 count_distinct   
2    3    1    1    1    1    1    12
2    3    1    1    1    1         12
2    3    1    1    1    2    1    1
2    3    1    1    1    2    2    8
2    3    1    1    1    2         9
2    3    1    1    1         1    13
2    3    1    1    1         2    8
2    3    1    1    1              21
...

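The problem I face: using count(distinct col_1) hurts query performance (~10 minutes), while count(col_1) is quite fast (~10 seconds). Looking at the explain plan, the distinct count seems to force 64 'SORT GROUP BY ROLLUP' operations.

Explain plans:

count(col_1)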

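Plan hash value: 3126999781

--------------------------------------------------------------------------------------
| Id  | Operation            | Name          | Rows  | Bytes | Cost (%CPU)| Time     |
--------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |               |   288 |  8640 |  1316   (3)| 00:00:01 |
|   1 |  SORT GROUP BY       |               |   288 |  8640 |  1316   (3)| 00:00:01 |
|   2 |   GENERATE CUBE      |               |   288 |  8640 |  1316   (3)| 00:00:01 |
|   3 |    SORT GROUP BY     |               |   288 |  8640 |  1316   (3)| 00:00:01 |
|   4 |     TABLE ACCESS FULL| TMP_TEST_DATA |   668K|    19M|  1296   (1)| 00:00:01 |
--------------------------------------------------------------------------------------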

count(distinct col_1)

Plan hash value: 1939696204

---------------------------------------------------------------------------------------------------------
| Id  | Operation                  | Name                       | Rows  | Bytes | Cost (%CPU)| Time     |
---------------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT           |                            |   288 | 29952 | 50234   (4)| 00:00:02 |
|   1 |  TEMP TABLE TRANSFORMATION |                            |       |       |            |          |
|   2 |   LOAD AS SELECT           | SYS_TEMP_0FD9E9C98_2ACFFE0 |       |       |            |          |
|   3 |    TABLE ACCESS FULL       | TMP_TEST_DATA              |   668K|    19M|  1296   (1)| 00:00:01 |
|   4 |   LOAD AS SELECT           | SYS_TEMP_0FD9E9C9A_2ACFFE0 |       |       |            |          |
|   5 |    SORT GROUP BY ROLLUP    |                            |   288 |  8640 |   765   (4)| 00:00:01 |
|   6 |     TABLE ACCESS FULL      | SYS_TEMP_0FD9E9C98_2ACFFE0 |   668K|    19M|   745   (1)| 00:00:01 |
|   7 |   LOAD AS SELECT           | SYS_TEMP_0FD9E9C9A_2ACFFE0 |       |       |            |          |
|   8 |    SORT GROUP BY ROLLUP    |                            |   204 |  6120 |   765   (4)| 00:00:01 |
|   9 |     TABLE ACCESS FULL      | SYS_TEMP_0FD9E9C98_2ACFFE0 |   668K|    19M|   745   (1)| 00:00:01 |
...
| 190 |   LOAD AS SELECT           | SYS_TEMP_0FD9E9C9A_2ACFFE0 |       |       |            |          |
| 191 |    SORT GROUP BY ROLLUP    |                            |     3 |    90 |   765   (4)| 00:00:01 |
| 192 |     TABLE ACCESS FULL      | SYS_TEMP_0FD9E9C98_2ACFFE0 |   668K|    19M|   745   (1)| 00:00:01 |
| 193 |   LOAD AS SELECT           | SYS_TEMP_0FD9E9C9A_2ACFFE0 |       |       |            |          |
| 194 |    SORT GROUP BY ROLLUP    |                            |     2 |    60 |   765   (4)| 00:00:01 |
| 195 |     TABLE ACCESS FULL      | SYS_TEMP_0FD9E9C98_2ACFFE0 |   668K|    19M|   745   (1)| 00:00:01 |
| 196 |   SORT ORDER BY            |                            |   288 | 29952 |     3  (34)| 00:00:01 |
| 197 |    VIEW                    |                            |   288 | 29952 |     2   (0)| 00:00:01 |
| 198 |     TABLE ACCESS FULL      | SYS_TEMP_0FD9E9C9A_2ACFFE0 |   288 |  8640 |     2   (0)| 00:00:01 |
---------------------------------------------------------------------------------------------------------

Is there any way to improve this?

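1 answer:

Answer 0 (score: 2)

Short answer

No. If you really need exact count_distinct results, I see no way to improve it.

If you can live with an approximation, the function APPROX_COUNT_DISTINCT might be an option.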

select          col_3,
                col_4, 
                col_5, 
                col_6,
                col_7, 
                col_8,
                col_9,
                approx_count_distinct(col_1)    approx_count_distinct
from            t
group           by cube (
                         col_3,
                         col_4,
                         col_5,
                         col_6,
                         col_7,
                         col_8,
                         col_9
                        )
order by        col_3,
                col_4,
                col_5,
                col_6,
                col_7,
                col_8,
                col_9
                ;

Longer answer

I created this test table

CREATE TABLE t AS
SELECT round(abs(dbms_random.normal)*10,0) AS col_1,
       round(dbms_random.VALUE(2,3),0) AS col_3,
       round(dbms_random.VALUE(3,4),0) AS col_4,
       round(dbms_random.VALUE(1,2),0) AS col_5,
       round(dbms_random.VALUE(1,2),0) AS col_6,
       round(dbms_random.VALUE(1,2),0) AS col_7,
       round(dbms_random.VALUE(1,2),0) AS col_8,
       round(dbms_random.VALUE(1,2),0) AS col_9
  FROM xmltable('1 to 20000');

Set statistics_level to ALL to collect detailed execution plan statistics

ALTER SESSION SET statistics_level = 'ALL';
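
Once statistics_level is ALL, the row-source statistics of the last statement run in the session can be shown with DBMS_XPLAN.DISPLAY_CURSOR. A minimal sketch, assuming it is run in the same session right after the query of interest. The 'ALLSTATS LAST' format is an assumption about how the plans in this answer were produced; the answer itself does not say.

```sql
-- Display the plan of the last statement executed in this session,
-- including the Starts, E-Rows, A-Rows, A-Time and Buffers columns.
SELECT *
  FROM TABLE(dbms_xplan.display_cursor(format => 'ALLSTATS LAST'));
```

In SQL*Plus, SERVEROUTPUT should be switched off first, otherwise the "last statement" is the internal DBMS_OUTPUT call rather than your query.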

Running the original query against table t instead of tmp_test_data

select          col_3,
                col_4, 
                col_5, 
                col_6,
                col_7, 
                col_8,
                col_9,
                count(distinct col_1)    count_distinct
from            t
group           by cube (
                         col_3,
                         col_4,
                         col_5,
                         col_6,
                         col_7,
                         col_8,
                         col_9
                        )
order by        col_3,
                col_4,
                col_5,
                col_6,
                col_7,
                col_8,
                col_9
                ;

produced this result

     COL_3      COL_4      COL_5      COL_6      COL_7      COL_8      COL_9 COUNT_DISTINCT
---------- ---------- ---------- ---------- ---------- ---------- ---------- --------------
         2          3          1          1          1          1          1             27
         2          3          1          1          1          1          2             24
         2          3          1          1          1          1                        31
...
                                                                           2             40
                                                                                         41

2.187 rows selected. 

and this execution plan.

---------------------------------------------------------------------------------------------------
| Id  | Operation                                |Starts | E-Rows | A-Rows |   A-Time   | Buffers |
---------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                         |     1 |        |   2187 |00:00:01.85 |      87 |
|   1 |  TEMP TABLE TRANSFORMATION               |     1 |        |   2187 |00:00:01.85 |      87 |
|   2 |   LOAD AS SELECT (CURSOR DURATION MEMORY)|     1 |        |      0 |00:00:00.07 |      86 |
|   3 |    HASH GROUP BY                         |     1 |    464 |   3224 |00:00:00.02 |      85 |
|   4 |     TABLE ACCESS FULL                    |     1 |  20000 |  20000 |00:00:00.01 |      85 |
|   5 |   SORT ORDER BY                          |     1 |     16 |   2187 |00:00:01.77 |       0 |
|   6 |    VIEW                                  |     1 |    408 |   2187 |00:00:01.75 |       0 |
|   7 |     VIEW                                 |     1 |    408 |   2187 |00:00:01.73 |       0 |
|   8 |      UNION-ALL                           |     1 |        |   2187 |00:00:01.72 |       0 |
|   9 |       SORT GROUP BY ROLLUP               |     1 |     16 |    192 |00:00:00.03 |       0 |
...
| 133 |       SORT GROUP BY ROLLUP               |     1 |      3 |      6 |00:00:00.03 |       0 |
| 134 |        TABLE ACCESS FULL                 |     1 |    464 |   3224 |00:00:00.01 |       0 |
| 135 |       SORT GROUP BY ROLLUP               |     1 |      2 |      3 |00:00:00.02 |       0 |
| 136 |        TABLE ACCESS FULL                 |     1 |    464 |   3224 |00:00:00.01 |       0 |
---------------------------------------------------------------------------------------------------

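The interesting columns are A-Rows (actual number of rows), A-Time (actual time spent) and Buffers (number of logical reads). We see that the query took 1.85 seconds with 87 logical I/Os. All 64 SORT GROUP BY ROLLUP operations together took 1.75 seconds, about 0.03 seconds per operation. Oracle needs to evaluate the number of distinct values of col_1 combination by combination. There is no shortcut as there is for COUNT(col_1). That is why it is costly.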
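
To see how many grouping combinations are involved: a CUBE over 7 columns produces 2^7 = 128 grouping sets. The following sketch counts them on the test table t with GROUPING_ID; the aliases gid, grouping_sets and result_rows are made up for illustration.

```sql
-- Count the grouping sets and the result rows produced by the CUBE.
SELECT COUNT(DISTINCT gid) AS grouping_sets,
       COUNT(*)            AS result_rows
  FROM (
         -- GROUPING_ID returns one bit per column: 1 = column is rolled up
         SELECT GROUPING_ID(col_3, col_4, col_5, col_6,
                            col_7, col_8, col_9) AS gid
           FROM t
          GROUP BY CUBE (col_3, col_4, col_5, col_6,
                         col_7, col_8, col_9)
       );
```

On t, grouping_sets should come out as 128 and result_rows should match the 2187 rows reported above. A single ROLLUP pass can cover several of these grouping sets at once, which is presumably why the plan needs 64 rollups rather than 128 separate aggregations.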

However, we can easily come up with an alternative query

WITH
   combi AS (
      SELECT col_3, 
             col_4, 
             col_5, 
             col_6, 
             col_7, 
             col_8, 
             col_9
        FROM t
       GROUP BY CUBE (
                   col_3, 
                   col_4, 
                   col_5, 
                   col_6, 
                   col_7, 
                   col_8, 
                   col_9
                )
   ),
   fullset AS (
      SELECT t.col_1,
             combi.col_3, 
             combi.col_4, 
             combi.col_5, 
             combi.col_6, 
             combi.col_7, 
             combi.col_8, 
             combi.col_9
        FROM combi
        JOIN t
          ON     (t.col_3 = combi.col_3 or combi.col_3 is null)
             AND (t.col_4 = combi.col_4 or combi.col_4 is null)
             AND (t.col_5 = combi.col_5 or combi.col_5 is null)
             AND (t.col_6 = combi.col_6 or combi.col_6 is null)
             AND (t.col_7 = combi.col_7 or combi.col_7 is null)
             AND (t.col_8 = combi.col_8 or combi.col_8 is null)
             AND (t.col_9 = combi.col_9 or combi.col_9 is null)
   )
SELECT col_3, 
       col_4, 
       col_5, 
       col_6, 
       col_7, 
       col_8, 
       col_9,
       COUNT(DISTINCT col_1) as count_distinct_col_1
  FROM fullset
 GROUP BY col_3, 
          col_4, 
          col_5, 
          col_6, 
          col_7, 
          col_8, 
          col_9
 ORDER BY col_3, 
          col_4, 
          col_5, 
          col_6, 
          col_7, 
          col_8, 
          col_9;

which produces the same result

     COL_3      COL_4      COL_5      COL_6      COL_7      COL_8      COL_9 COUNT_DISTINCT_COL_1
---------- ---------- ---------- ---------- ---------- ---------- ---------- --------------------
         2          3          1          1          1          1          1                   27
         2          3          1          1          1          1          2                   24
         2          3          1          1          1          1                              31
...
                                                                           2                   40
                                                                                               41

2.187 rows selected.

with fewer lines in the execution plan.

-------------------------------------------------------------------------------------------------
| Id  | Operation                 | Name      | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
-------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT          |           |      1 |        |   2187 |00:00:41.58 |     185K|
|   1 |  SORT GROUP BY            |           |      1 |     16 |   2187 |00:00:41.58 |     185K|
|   2 |   VIEW                    | VM_NWVW_1 |      1 |    464 |  67812 |00:00:41.54 |     185K|
|   3 |    HASH GROUP BY          |           |      1 |    464 |  67812 |00:00:41.54 |     185K|
|   4 |     NESTED LOOPS          |           |      1 |   2500 |   2560K|00:00:31.77 |     185K|
|   5 |      VIEW                 |           |      1 |     16 |   2187 |00:00:00.37 |      85 |
|   6 |       SORT GROUP BY       |           |      1 |     16 |   2187 |00:00:00.36 |      85 |
|   7 |        GENERATE CUBE      |           |      1 |     16 |  16384 |00:00:00.27 |      85 |
|   8 |         SORT GROUP BY     |           |      1 |     16 |    128 |00:00:00.20 |      85 |
|   9 |          TABLE ACCESS FULL| T         |      1 |  20000 |  20000 |00:00:00.10 |      85 |
|* 10 |      TABLE ACCESS FULL    | T         |   2187 |    156 |   2560K|00:00:13.09 |     185K|
-------------------------------------------------------------------------------------------------

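Let's look at operation 5. We produce all 2187 combinations in 0.37 seconds and need 85 logical I/Os to read the whole table t. Then we access the full table t again for each of these 2187 combinations (see operations 4 and 10). The complete join takes 31.77 seconds. The remaining group by operations take 9.77 seconds, and the final sort only 0.04 seconds.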
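
The figures in the plan can be cross-checked with simple arithmetic. Each row of t matches 2^7 = 128 combinations (for every column, either the row's own value or the subtotal NULL), and each of the 2187 probes is assumed to cost the same 85 buffers as one full scan of t:

```sql
SELECT 20000 * 128 AS joined_rows,    -- 2560000, the 2560K A-Rows of operation 4
       2187 * 85   AS probe_buffers   -- 185895, the 185K Buffers of operation 10
  FROM dual;
```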

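This alternative query looks simple, but it is much slower because of the additional I/O required by the join in the named queries combi and fullset.

The original query is better in terms of both I/O and runtime. Its execution plan looks extensive, but it is efficient. In the end, it is the DISTINCT in COUNT(DISTINCT col_1) that drives the complexity. It is just one word, but a completely different algorithm. So if exact results are important, I do not see how to improve the original query. However, if approximate results are good enough, the function APPROX_COUNT_DISTINCT might be an option.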

select          col_3,
                col_4, 
                col_5, 
                col_6,
                col_7, 
                col_8,
                col_9,
                approx_count_distinct(col_1)    approx_count_distinct
from            t
group           by cube (
                         col_3,
                         col_4,
                         col_5,
                         col_6,
                         col_7,
                         col_8,
                         col_9
                        )
order by        col_3,
                col_4,
                col_5,
                col_6,
                col_7,
                col_8,
                col_9
                ;

The result is similar

     COL_3      COL_4      COL_5      COL_6      COL_7      COL_8      COL_9 APPROX_COUNT_DISTINCT
---------- ---------- ---------- ---------- ---------- ---------- ---------- ---------------------
         2          3          1          1          1          1          1                    27
         2          3          1          1          1          1          2                    24
         2          3          1          1          1          1                               31
...
                                                                           2                    40
                                                                                                41

2.187 rows selected. 

but the execution plan is more complex.

----------------------------------------------------------------------------------------------------
| Id  | Operation                                | Starts | E-Rows | A-Rows |   A-Time   | Buffers |
----------------------------------------------------------------------------------------------------
|   0 | SELECT STATEMENT                         |      1 |        |   2187 |00:00:09.88 |      87 |
|   1 |  TEMP TABLE TRANSFORMATION               |      1 |        |   2187 |00:00:09.88 |      87 |
|   2 |   LOAD AS SELECT (CURSOR DURATION MEMORY)|      1 |        |      0 |00:00:00.33 |      86 |
|   3 |    TABLE ACCESS FULL                     |      1 |  20000 |  20000 |00:00:00.08 |      85 |
|   4 |   LOAD AS SELECT (CURSOR DURATION MEMORY)|      1 |        |      0 |00:00:00.16 |       0 |
|   5 |    SORT GROUP BY ROLLUP APPROX           |      1 |     16 |    192 |00:00:00.16 |       0 |
|   6 |     TABLE ACCESS FULL                    |      1 |  20000 |  20000 |00:00:00.07 |       0 |
...
| 190 |   LOAD AS SELECT (CURSOR DURATION MEMORY)|      1 |        |      0 |00:00:00.14 |       0 |
| 191 |    SORT GROUP BY ROLLUP APPROX           |      1 |      3 |      6 |00:00:00.14 |       0 |
| 192 |     TABLE ACCESS FULL                    |      1 |  20000 |  20000 |00:00:00.07 |       0 |
| 193 |   LOAD AS SELECT (CURSOR DURATION MEMORY)|      1 |        |      0 |00:00:00.14 |       0 |
| 194 |    SORT GROUP BY ROLLUP APPROX           |      1 |      2 |      3 |00:00:00.14 |       0 |
| 195 |     TABLE ACCESS FULL                    |      1 |  20000 |  20000 |00:00:00.07 |       0 |
| 196 |   SORT ORDER BY                          |      1 |     16 |   2187 |00:00:00.01 |       0 |
| 197 |    VIEW                                  |      1 |     16 |   2187 |00:00:00.01 |       0 |
| 198 |     TABLE ACCESS FULL                    |      1 |     16 |   2187 |00:00:00.01 |       0 |
----------------------------------------------------------------------------------------------------

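And on this small test table the query is slower than the original. I would expect it to be faster on larger data sets. So, if 100% accuracy is not required, I suggest you give APPROX_COUNT_DISTINCT a try.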
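
Before switching, it may be worth checking how far the approximation deviates on your own data. The sketch below compares both functions for a plain GROUP BY on two columns of t. The names exact_counts, approx_counts, exact_cnt, approx_cnt and diff are illustrative, and the join assumes col_3 and col_4 contain no NULLs (true for the test table).

```sql
WITH
   exact_counts AS (
      -- exact number of distinct col_1 values per group
      SELECT col_3, col_4, COUNT(DISTINCT col_1) AS exact_cnt
        FROM t
       GROUP BY col_3, col_4
   ),
   approx_counts AS (
      -- approximate number of distinct col_1 values per group
      SELECT col_3, col_4, APPROX_COUNT_DISTINCT(col_1) AS approx_cnt
        FROM t
       GROUP BY col_3, col_4
   )
SELECT e.col_3,
       e.col_4,
       e.exact_cnt,
       a.approx_cnt,
       a.approx_cnt - e.exact_cnt AS diff
  FROM exact_counts e
  JOIN approx_counts a
    ON a.col_3 = e.col_3
   AND a.col_4 = e.col_4
 ORDER BY e.col_3, e.col_4;
```

If diff stays within what your reporting can tolerate, the same comparison can be repeated on a sample of the real table before changing the CUBE query.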

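Runtime overhead of statistics_level ALL

To get the actual number of rows and the actual time spent in the execution plans, I ran all queries with statistics_level ALL. This causes a significant performance overhead (as expected, see also Jonathan Lewis' blog about gather_plan_statistics). With statistics_level set to TYPICAL, all queries run faster. Here are the runtimes in seconds, including the time to print the results on the client: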

Query                  Runtime with 'ALL'  Runtime with 'TYPICAL' 
----------------       ------------------  ----------------------
Original (good)                     2.615                   0.977
Alternative (bad)                  41.773                   4.991
Approx_Count_Distinct              10.600                   1.113