Spark shuffle taking too long

Date: 2019-12-24 14:02:18

Tags: apache-spark apache-spark-sql

Here is my situation.

Scenario 1: I have a DataFrame from which I select 5 columns (about 1 TB of data in total), join it with another DataFrame (6 GB), and write the result to S3. This process takes 25 minutes.

Scenario 2: Same as scenario 1, but with 5 more columns in the select (so 10 columns selected in total); everything else is identical. This process takes more than 4 hours, and I don't know why.

Any ideas? I know there is more shuffle data now, but should it be this bad? Where should I look for clues?

Sample code

    // Scenario 1 selects 5 columns, scenario 2 selects 10
    val df1 = spark.read.parquet("s3://path1").select(/* 5 cols */) // (10 cols)
    val filterIds = List("1", "2", "3", "4", "5", "6", "7", "8")
    val df1_1 = df1.filter(!col("tier1_id").isin(filterIds: _*))
    val df2 = spark.read.parquet("s3://path2").toDF("col1", "col2").filter(col("col1").isNotNull)
    val df1_j_df2 = df1_1.join(df2, df1_1.col("col1") === df2.col("col1"))
    df1_j_df2.write.mode(SaveMode.Overwrite).parquet(config.getString("s3://path3") + "data_dt=" + processDate + "/")
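
One thing that might be worth ruling out (my own suggestion, not something from the question): the plans below show a SortMergeJoin, which shuffles both sides, and only the 1 TB side gets wider between the two scenarios. If the 6 GB, two-column df2 fits in memory, an explicit broadcast hint would remove the shuffle of df1 entirely. A minimal sketch, reusing the names above:

    import org.apache.spark.sql.functions.broadcast

    // Force a BroadcastHashJoin instead of the SortMergeJoin: df2 is shipped to every
    // executor and the 1 TB df1_1 is never shuffled. Only viable if df2 really fits
    // in executor (and driver) memory, which is an assumption here.
    val df1_j_df2_bc = df1_1.join(broadcast(df2), df1_1.col("col1") === df2.col("col1"))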

Explain plan from the real code

/* select 5 columns */
== Physical Plan ==
*(5) SortMergeJoin [mdn_ssp_raw#95], [mdn#111], Inner
:- *(2) Sort [mdn_ssp_raw#95 ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(mdn_ssp_raw#95, 4000)
:     +- *(1) Project [mdn#13 AS mdn_ssp_raw#95, url#14, application#20, tier1_id#15, tier2_id#16, CASE WHEN (isnull(cast(hits#24 as int)) || (length(trim(cast(cast(hits#24 as int) as string), None)) = 0)) THEN -1 ELSE cast(hits#24 as int) END AS hits#73, CASE WHEN (isnull(cast(bytes_down#26 as int)) || (length(trim(cast(cast(bytes_down#26 as int) as string), None)) = 0)) THEN -1 ELSE cast(bytes_down#26 as int) END AS bytes_down#84, cast(bytes_up#25 as int) AS bytes_up#62, service_provider_id#17, timestamp#12]
:        +- *(1) Filter (((tier1_id#15 INSET (1195)) && NOT tier1_id#15 IN (1002,1003,1184,1185,1188,1063,1064,1065)) && isnotnull(mdn#13))
:           +- *(1) FileScan parquet [timestamp#12,mdn#13,url#14,tier1_id#15,tier2_id#16,service_provider_id#17,application#20,hits#24,bytes_up#25,bytes_down#26] Batched: true, Format: Parquet, Location: InMemoryFileIndex[s3://path1/processing/browsing/ssp/ssp_nation_aggregat..., PartitionFilters: [], PushedFilters: [In(tier1_id, [1195,1193,1003,1098,1055,1048,1146,1167,1040,1020,1079,1017,1136,1154,1115,1050,11..., ReadSchema: struct<timestamp:string,mdn:string,url:string,tier1_id:string,tier2_id:string,service_provider...
+- *(4) Sort [mdn#111 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(mdn#111, 4000)
      +- *(3) Project [_COL_0#106 AS MACDEVID#110, _COL_1#107 AS mdn#111]
         +- *(3) Filter isnotnull(_COL_1#107)
            +- *(3) FileScan parquet [_COL_0#106,_COL_1#107] Batched: true, Format: Parquet, Location: InMemoryFileIndex[s3://path2/poc3/id_op_device_mdn_ssp], PartitionFilters: [], PushedFilters: [IsNotNull(_COL_1)], ReadSchema: struct<_COL_0:string,_COL_1:string>


/* select 10 columns */

== Physical Plan ==
*(5) SortMergeJoin [mdn_ssp_raw#110], [mdn#131], Inner
:- *(2) Sort [mdn_ssp_raw#110 ASC NULLS FIRST], false, 0
:  +- Exchange hashpartitioning(mdn_ssp_raw#110, 4000)
:     +- *(1) Project [mdn#13 AS mdn_ssp_raw#110, url#14, application#20, tier1_id#15, tier2_id#16, CASE WHEN (isnull(cast(hits#24 as int)) || (length(trim(cast(cast(hits#24 as int) as string), None)) = 0)) THEN -1 ELSE cast(hits#24 as int) END AS hits#78, CASE WHEN (isnull(cast(bytes_down#26 as int)) || (length(trim(cast(cast(bytes_down#26 as int) as string), None)) = 0)) THEN -1 ELSE cast(bytes_down#26 as int) END AS bytes_down#94, cast(bytes_up#25 as int) AS bytes_up#62, service_provider_id#17, timestamp#12, device_id#18, os_id#19, ntc_id#32, ad_id#33, session_id#31]
:        +- *(1) Filter (((tier1_id#15 INSET (1195)) && NOT tier1_id#15 IN (1002,1003,1184,1185,1188,1063,1064,1065)) && isnotnull(mdn#13))
:           +- *(1) FileScan parquet [timestamp#12,mdn#13,url#14,tier1_id#15,tier2_id#16,service_provider_id#17,device_id#18,os_id#19,application#20,hits#24,bytes_up#25,bytes_down#26,session_id#31,ntc_id#32,ad_id#33] Batched: true, Format: Parquet, Location: InMemoryFileIndex[s3://path1/processing/browsing/ssp/ssp_nation_aggregat..., PartitionFilters: [], PushedFilters: [In(tier1_id, [1195,1193,1003,1098,1055,1048,1146,1167,1040,1020,1079,1017,1136,1154,1115,1050,11..., ReadSchema: struct<timestamp:string,mdn:string,url:string,tier1_id:string,tier2_id:string,service_provider...
+- *(4) Sort [mdn#131 ASC NULLS FIRST], false, 0
   +- Exchange hashpartitioning(mdn#131, 4000)
      +- *(3) Project [_COL_0#126 AS MACDEVID#130, _COL_1#127 AS mdn#131]
         +- *(3) Filter isnotnull(_COL_1#127)
            +- *(3) FileScan parquet [_COL_0#126,_COL_1#127] Batched: true, Format: Parquet, Location: InMemoryFileIndex[s3://path2/poc3/id_op_device_mdn_ssp], PartitionFilters: [], PushedFilters: [IsNotNull(_COL_1)], ReadSchema: struct<_COL_0:string,_COL_1:string>
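
The two plans are identical apart from the wider Project in stage (1), so the next place to look is how much data each stage actually writes to shuffle and spills, which the Spark UI shows per stage on the Stages tab. As a rough sketch, the same numbers can also be logged from the application with a stage listener (assuming a Spark 2.x API, where stageInfo.taskMetrics holds the aggregated metrics for the stage):

    import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}

    // Print shuffle write and spill per completed stage; compare the 5-column
    // run against the 10-column run to see whether the extra columns explain the gap.
    spark.sparkContext.addSparkListener(new SparkListener {
      override def onStageCompleted(stage: SparkListenerStageCompleted): Unit = {
        val m = stage.stageInfo.taskMetrics
        if (m != null) {
          val mb = 1024L * 1024L
          println(s"stage ${stage.stageInfo.stageId}: " +
            s"shuffleWrite=${m.shuffleWriteMetrics.bytesWritten / mb} MB, " +
            s"memorySpill=${m.memoryBytesSpilled / mb} MB, " +
            s"diskSpill=${m.diskBytesSpilled / mb} MB")
        }
      }
    })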

spark-submit

 --conf spark.sql.parquet.fs.optimized.committer.optimization-enabled=true 
 --executor-memory 28g 
 --executor-cores 5 
 --driver-memory 28g  
 --driver-cores 5  
 --conf spark.executor.memoryOverhead=2800  
 --conf "spark.dynamicAllocation.minExecutors=30" 
 --conf "spark.dynamicAllocation.maxExecutors=300" 
 --conf "spark.shuffle.compress=true"   
 --conf spark.sql.shuffle.partitions=4000   
 --conf spark.serializer=org.apache.spark.serializer.KryoSerializer 
 --conf spark.sql.parquet.mergeSchema=true 
 --conf spark.sql.parquet.filterPushdown=true 
 --conf spark.sql.parquet.compression.codec=gzip 
 --conf spark.dynamicAllocation.enabled=true 
 --conf spark.sql.hive.metastorePartitionPruning=true 
 --conf spark.speculation=true 
 --conf "spark.shuffle.service.enabled=true" 
 --conf "spark.shuffle.spill.compress=true"  
 --conf spark.default.parallelism=1000  
 --conf spark.memory.storageFraction=0.1   
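
Nothing in these options jumps out as the cause of the 5-column vs 10-column gap, but two of them are worth re-checking for overall runtime (my own observation, not something established by the question): spark.sql.parquet.mergeSchema=true makes Spark merge the schemas of all input part-files at planning time, which is only needed if the files genuinely differ, and gzip is one of the slower Parquet codecs for the final write. If neither is required, the corresponding flags would look like this:

 --conf spark.sql.parquet.mergeSchema=false
 --conf spark.sql.parquet.compression.codec=snappy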

0 Answers:

No answers yet.