How does count distinct work in Apache Spark SQL

Date: 2019-07-17 02:01:26

Tags: apache-spark apache-spark-sql

I am trying to count the number of distinct entities over different date ranges.

I need to understand how Spark performs this operation.

val distinct_daily_cust_12month = sqlContext.sql(s"""select distinct day_id, txn_type, customer_id
from ${db_name}.fact_customer
where day_id >= '${start_last_12month}' and day_id <= '${start_date}'
and txn_type not in (6, 99)""")

val category_mapping = sqlContext.sql(s"select * from datalake.category_mapping");

val daily_cust_12month_ds = distinct_daily_cust_12month
  .join(broadcast(category_mapping), distinct_daily_cust_12month("txn_type") === category_mapping("id"))
  .select("category", "sub_category", "customer_id", "day_id")

daily_cust_12month_ds.createOrReplaceTempView("daily_cust_12month_ds")

val total_cust_metrics = sqlContext.sql(s"""select 'total' as category,
count(distinct(case when day_id='${start_date}' then customer_id end)) as yest,
count(distinct(case when day_id>='${start_week}' and day_id<='${end_week}' then customer_id end)) as week,
count(distinct(case when day_id>='${start_month}' and day_id<='${start_date}' then customer_id end)) as mtd,
count(distinct(case when day_id>='${start_last_month}' and day_id<='${end_last_month}' then customer_id end)) as ltd,
count(distinct(case when day_id>='${start_last_6month}' and day_id<='${start_date}' then customer_id end)) as lsm,
count(distinct(case when day_id>='${start_last_12month}' and day_id<='${start_date}' then customer_id end)) as ltm
from daily_cust_12month_ds
""")

There are no errors, but it takes a lot of time. I would like to know whether there is a better way to do this in Spark.

1 Answer:

Answer 0 (score: 2)

Count distinct works by hash-partitioning the data, counting the distinct elements within each partition, and then summing those counts. In general this is a heavy operation because of the full shuffle, and there is no silver bullet for it in Spark or, most likely, in any fully distributed system; operations with distinct are inherently hard to solve in a distributed setting.
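As an illustration, an exact count distinct is roughly equivalent to the following two-step DataFrame job (a minimal sketch, assuming a hypothetical DataFrame df with a column foo; this is not Spark's actual internal code):

// distinct() deduplicates the rows, which triggers the hash-partitioning shuffle on foo;
// count() then sums the per-partition row counts into a single number.
val distinctFooCount = df
  .select("foo")
  .distinct()
  .count()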

In some cases there are faster ways to achieve it:

  • If an approximation is acceptable, approx_count_distinct is usually much faster, because it is based on HyperLogLog and the amount of data to shuffle is much smaller than with the exact implementation (see the sketch after this list).
  • If you can design your pipeline so that the source data is already partitioned in a way that no duplicates exist across partitions, the slow step of hash-partitioning the DataFrame is not needed.
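For the first point, here is a minimal sketch of the approximate variant applied to the view from the question (assuming the same daily_cust_12month_ds view and the ${start_date}/${start_month} variables defined earlier; the 0.01 relative error passed for mtd is only an illustrative setting):

// approx_count_distinct takes an optional maximum relative error (the default is about 5%);
// 'yest' uses the default, while 'mtd' is computed with a tighter 1% bound as an illustration.
val approx_cust_metrics = sqlContext.sql(s"""select 'total' as category,
approx_count_distinct(case when day_id='${start_date}' then customer_id end) as yest,
approx_count_distinct(case when day_id>='${start_month}' and day_id<='${start_date}' then customer_id end, 0.01) as mtd
from daily_cust_12month_ds
""")

The remaining date ranges from the original query can be swapped in the same way; whether the estimation error is acceptable depends on the use case.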

P.S. To understand how count distinct works, you can always use explain:

df.select(countDistinct("foo")).explain()

Example output:

== Physical Plan ==
*(3) HashAggregate(keys=[], functions=[count(distinct foo#3)])
+- Exchange SinglePartition
   +- *(2) HashAggregate(keys=[], functions=[partial_count(distinct foo#3)])
      +- *(2) HashAggregate(keys=[foo#3], functions=[])
         +- Exchange hashpartitioning(foo#3, 200)
            +- *(1) HashAggregate(keys=[foo#3], functions=[])
               +- LocalTableScan [foo#3]