Why is the "number of output rows" metric in the Apache Spark UI greater than the size of the table when the table is used multiple times?

Asked: 2018-04-10 17:35:46

Tags: apache-spark apache-spark-sql spark-webui

I first ran into this behavior in query 47 of the TPCDS benchmark.

For reference, this is the query:

--q47.sql--                                                                                                                                                     

  with v1 as(                                                                                                                                                    
  select i_category, i_brand,                                                                                                                                    
         s_store_name, s_company_name,                                                                                                                           
         d_year, d_moy,                                                                                                                                          
         sum(ss_sales_price) sum_sales,                                                                                                                          
         avg(sum(ss_sales_price)) over                                                                                                                           
           (partition by i_category, i_brand,                                                                                                                    
                      s_store_name, s_company_name, d_year)                                                                                                      
           avg_monthly_sales,                                                                                                                                    
         rank() over                                                                                                                                             
           (partition by i_category, i_brand,                                                                                                                    
                      s_store_name, s_company_name                                                                                                               
            order by d_year, d_moy) rn                                                                                                                           
  from item, store_sales, date_dim, store                                                                                                                        
  where ss_item_sk = i_item_sk and                                                                                                                               
        ss_sold_date_sk = d_date_sk and                                                                                                                          
        ss_store_sk = s_store_sk and                                                                                                                             
        (                                                                                                                                                        
          d_year = 1999 or                                                                                                                                       
          ( d_year = 1999-1 and d_moy =12) or                                                                                                                    
          ( d_year = 1999+1 and d_moy =1)                                                                                                                        
        )                                                                                                                                                        
  group by i_category, i_brand,                                                                                                                                  
           s_store_name, s_company_name,                                                                                                                         
           d_year, d_moy),                                                                                                                                       
  v2 as(                                                                                                                                                         
  select v1.i_category, v1.i_brand, v1.s_store_name, v1.s_company_name, v1.d_year,                                                                               
                      v1.d_moy, v1.avg_monthly_sales ,v1.sum_sales, v1_lag.sum_sales psum,                                                                       
                      v1_lead.sum_sales nsum                                                                                                                     
  from v1, v1 v1_lag, v1 v1_lead                                                                                                                                 
  where v1.i_category = v1_lag.i_category and                                                                                                                    
        v1.i_category = v1_lead.i_category and                                                                                                                   
        v1.i_brand = v1_lag.i_brand and                                                                                                                          
        v1.i_brand = v1_lead.i_brand and                                                                                                                         
        v1.s_store_name = v1_lag.s_store_name and                                                                                                                
        v1.s_store_name = v1_lead.s_store_name and                                                                                                               
        v1.s_company_name = v1_lag.s_company_name and                                                                                                            
        v1.s_company_name = v1_lead.s_company_name and                                                                                                           
        v1.rn = v1_lag.rn + 1 and                                                                                                                                
        v1.rn = v1_lead.rn - 1)                                                                                                                                  
  select * from v2                                                                                                                                               
  where  d_year = 1999 and                                                                                                                                       
         avg_monthly_sales > 0 and                                                                                                                               
         case when avg_monthly_sales > 0 then abs(sum_sales - avg_monthly_sales) / avg_monthly_sales else null end > 0.1                                         
  order by sum_sales - avg_monthly_sales, 3                                                                                                                      
  limit 100

We can see that table v1 is used three times in the query:

...
from v1, v1 v1_lag, v1 v1_lead
...

The graphs in the web UI are the following:

[image: query plan graphs from the web UI]

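As we can see in the left graph, the value of "number of output rows" for table store_sales is 2,879,789, which is equal to the size of the table.

However, in the right graph, "number of output rows" for the same table is 5,759,578, exactly twice that value, and this number propagates to the following nodes of the plan, such as Filter.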

We can get the same result with a simpler query.

  // create a temp table for tests
  Seq(1, 2, 3).toDF("id").createOrReplaceTempView("t")
  // execute the query
  spark.sql("""
    with v1 as (
      select id from t group by id)
    select v1.id, v11.id id1, v12.id id2
    from v1, v1 v11, v1 v12
    where v1.id = v11.id and
          v1.id = v12.id + 1
  """).count

The graph for this query is the following:

[image: query plan graph from the web UI]

As we can see, "number of output rows" is twice the size of the table. Moreover, if we add table v1 once more, "number of output rows" is three times the size of the table, and so on.

For example, if we change the query like this

  ...
  select v1.id, v11.id id1, v12.id id2, v13.id id3
  from v1, v1 v11, v1 v12, v1 v13
  where v1.id = v11.id and
        v1.id = v12.id + 1 and
        v1.id = v13.id
  ...

"number of output rows" becomes 9.

It is worth mentioning that if we use table v1 only twice, "number of output rows" is equal to the table size.

So, with a query like this

  ...
  select v1.id, v11.id id1
  from v1, v1 v11
  where v1.id = v11.id
  ...

"number of output rows" becomes 3.
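To cross-check the numbers shown in the web UI, the same metric can be read from the executed plan in code. The following is only a sketch under assumptions: it relies on Spark 2.x developer APIs (queryExecution.executedPlan and the per-operator metrics map), which may change between versions, and it assumes the UI label "number of output rows" corresponds to the metric key "numOutputRows". The names df, rows and rowsPerNode are illustrative.

```scala
import org.apache.spark.sql.SparkSession

// Local session for the experiment; inside spark-shell, getOrCreate() returns the existing session.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("number-of-output-rows-check")
  .getOrCreate()
import spark.implicits._

// Same test table and three-way self-join as in the simplified query.
Seq(1, 2, 3).toDF("id").createOrReplaceTempView("t")
val df = spark.sql("""
  with v1 as (
    select id from t group by id)
  select v1.id, v11.id id1, v12.id id2
  from v1, v1 v11, v1 v12
  where v1.id = v11.id and
        v1.id = v12.id + 1
""")

// Print the physical plan to see how often each subtree actually appears in it.
df.explain()

// collect() is used instead of count() because count() builds and runs a separate plan,
// so the metrics of df's own executed plan would stay empty.
val rows = df.collect()

// Assumption: operators expose the UI's "number of output rows" under the key "numOutputRows".
val rowsPerNode: Seq[(String, Long)] = df.queryExecution.executedPlan.collect {
  case node if node.metrics.contains("numOutputRows") =>
    node.nodeName -> node.metrics("numOutputRows").value
}
rowsPerNode.foreach { case (name, n) => println(s"$name: numOutputRows = $n") }
```

Comparing this listing with the output of explain() shows which operators carry the inflated value. On newer Spark versions with adaptive query execution the plan tree is wrapped differently, so the traversal may need adjusting.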

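In these cases I would expect Spark either to load the table as many times as it is needed, or to load it once and reuse it wherever it is needed, but it seems that both of my assumptions are wrong.

So, why is the number of output rows higher than the size of the table?

I have tested this in Spark 2.2 and 2.3.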

0 answers