Joining big data in Spark Streaming

Asked: 2019-02-14 02:11:46

Tags: sql apache-spark join apache-spark-sql

We have a large customer table with 7 million records, and we are trying to process transaction data coming from a Kafka stream (500,000 messages per batch).

During processing, we need to join the transaction data with the customer data. This currently takes about 10 seconds, and the requirement is to bring it down to 5 seconds. Because the customer table is so large, we cannot use a broadcast join. Are there any other optimizations we can make?
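For context, the streaming side currently looks roughly like the sketch below (Spark 1.6-era API, reconstructed from the physical plan further down; the txnStream DStream, the parseTxn helper, and the Txn case class are illustrative assumptions, not our actual code):

    import org.apache.spark.sql.hive.HiveContext

    // Transaction record shape, inferred from the LogicalRDD columns in the plan below.
    case class Txn(key: String, custId: String,
                   mktOptOutFlag: String, thirdPartyOptOutFlag: String)

    val hiveContext = new HiveContext(sc)
    import hiveContext.implicits._

    // Customer profile (~7M rows), filtered on the opt-out flags as in the plan.
    val custProfile = hiveContext
      .sql("SELECT rowkey, no_mkt_opto_flag, thrd_party_ads_opto_flag " +
           "FROM db_localhost.ext_sub_cust_profile")
      .where($"no_mkt_opto_flag" === "N" && $"thrd_party_ads_opto_flag" === "N")

    // txnStream: DStream[String] from Kafka; parseTxn: String => Txn (assumed helpers).
    txnStream.foreachRDD { rdd =>
      val txnDF  = rdd.map(parseTxn).toDF()                        // ~500k rows per batch
      val joined = txnDF.join(custProfile, $"custId" === $"rowkey")
      joined.count()                                               // count(1) in the plan below
    }

As the physical plan shows, the SortMergeJoin re-shuffles and re-sorts both sides, including the 7-million-row customer scan, on every micro-batch.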

== Parsed Logical Plan ==
Aggregate [(count(1),mode=Complete,isDistinct=false) AS count#119L]
+- Join Inner, Some((custId#110 = rowkey#0))
   :- Subquery custProfile
   :  +- Project [rowkey#0,no_mkt_opto_flag#5,thrd_party_ads_opto_flag#4]
   :     +- Filter ((no_mkt_opto_flag#5 = N) && (thrd_party_ads_opto_flag#4 = N))
   :        +- Subquery jz_view_sub_cust_profile
   :           +- Project [rowkey#0,thrd_party_ads_opto_flag#4,no_mkt_opto_flag#5]
   :              +- MetastoreRelation db_localhost, ext_sub_cust_profile, None
   +- LogicalRDD [key#109,custId#110,mktOptOutFlag#117,thirdPartyOptOutFlag#118], MapPartitionsRDD[190] at rddToDataFrameHolder at custStream.scala:166

== Analyzed Logical Plan ==
count: bigint
Aggregate [(count(1),mode=Complete,isDistinct=false) AS count#119L]
+- Join Inner, Some((custId#110 = rowkey#0))
   :- Subquery custProfile
   :  +- Project [rowkey#0,no_mkt_opto_flag#5,thrd_party_ads_opto_flag#4]
   :     +- Filter ((no_mkt_opto_flag#5 = N) && (thrd_party_ads_opto_flag#4 = N))
   :        +- Subquery jz_view_sub_cust_profile
   :           +- Project [rowkey#0,thrd_party_ads_opto_flag#4,no_mkt_opto_flag#5]
   :              +- MetastoreRelation db_localhost, ext_sub_cust_profile, None
   +- LogicalRDD [key#109,custId#110,mktOptOutFlag#117,thirdPartyOptOutFlag#118], MapPartitionsRDD[190] at rddToDataFrameHolder at custStream.scala:166

== Optimized Logical Plan ==
Aggregate [(count(1),mode=Complete,isDistinct=false) AS count#119L]
+- Project
   +- Join Inner, Some((custId#110 = rowkey#0))
      :- Project [rowkey#0]
      :  +- Filter ((no_mkt_opto_flag#5 = N) && (thrd_party_ads_opto_flag#4 = N))
      :     +- MetastoreRelation db_localhost, ext_sub_cust_profile, None
      +- Project [custId#110]
         +- LogicalRDD [key#109,custId#110,mktOptOutFlag#117,thirdPartyOptOutFlag#118], MapPartitionsRDD[190] at rddToDataFrameHolder at custStream.scala:166

== Physical Plan ==
TungstenAggregate(key=[], functions=[(count(1),mode=Final,isDistinct=false)], output=[count#119L])
+- TungstenExchange SinglePartition, None
   +- TungstenAggregate(key=[], functions=[(count(1),mode=Partial,isDistinct=false)], output=[count#122L])
      +- Project
         +- SortMergeJoin [rowkey#0], [custId#110]
            :- Sort [rowkey#0 ASC], false, 0
            :  +- TungstenExchange hashpartitioning(rowkey#0,200), None
            :     +- Project [rowkey#0]
            :        +- Filter ((no_mkt_opto_flag#5 = N) && (thrd_party_ads_opto_flag#4 = N))
            :           +- HiveTableScan [rowkey#0,no_mkt_opto_flag#5,thrd_party_ads_opto_flag#4], MetastoreRelation db_localhost, ext_sub_cust_profile, None
            +- Sort [custId#110 ASC], false, 0
               +- TungstenExchange hashpartitioning(custId#110,200), None
                  +- Project [custId#110]
                     +- Scan ExistingRDD[key#109,custId#110,mktOptOutFlag#117,thirdPartyOptOutFlag#118]

1 Answer:

Answer 0 (score: 1)

  1. Assuming the customer data is constant across micro-batches, partition the customer data on customerId with a hash partitioner and cache it as an RDD/DataFrame.
  2. Since the transaction data comes from Kafka, that data can also be partitioned on the same key with a hash partitioner when it is published to Kafka: https://www.javaworld.com/article/3066873/big-data/big-data-messaging-with-kafka-part-2.html

This should reduce the time taken to join the two datasets, provided the partitioning key is the same in both datasets (the transaction data and the customer data). A sketch of the approach follows.
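A minimal sketch, assuming Spark 1.6-era RDD APIs and reusing the custProfile, txnStream, parseTxn, and Txn names assumed in the question; the partition count of 200 is only an example and must match on both sides:

    import org.apache.spark.HashPartitioner

    val numPartitions = 200

    // 1. Key the (static) customer profile by customerId, hash-partition it once, cache it.
    val custByKey = custProfile.rdd
      .map(row => (row.getAs[String]("rowkey"), row))
      .partitionBy(new HashPartitioner(numPartitions))
      .cache()

    // 2. On the producer side, key each Kafka record by customerId so the default
    //    partitioner routes all messages for a customer to the same topic partition
    //    (see the linked article). On the Spark side, key each micro-batch the same way:
    txnStream.foreachRDD { rdd =>
      val txnByKey = rdd
        .map(parseTxn)
        .map(txn => (txn.custId, txn))
        .partitionBy(new HashPartitioner(numPartitions))

      // Both RDDs now share the same partitioner, so the join creates narrow
      // dependencies: only the 500k-row batch is shuffled (by its partitionBy),
      // while the cached 7M-row customer side stays in place.
      val joined = txnByKey.join(custByKey)
      println(joined.count())
    }

The key point is that both sides use the same join key (custId / rowkey) and the same number of partitions; if either differs, Spark will re-shuffle one side to align them.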