Here is my situation.
Scenario 1: I have a dataframe from which I select 5 columns (about 1 TB of data in total), join it with another dataframe (6 GB), and write the result to S3. This takes 25 minutes.
Scenario 2: same as Scenario 1, but the select has 5 more columns (10 columns in total); everything else is identical. This takes more than 4 hours, and I don't understand why.
Any ideas? I know there is more shuffle data now, but should it really be this much worse? Where should I look for clues?
Sample code:
val df1 = spark.read.parquet("s3://path1").select(5 cols) // (10 cols in scenario 2)
val filterIds = List("1", "2", "3", "4", "5", "6", "7", "8")
val df1_1 = df1.filter(!col("tier1_id").isin(filterIds: _*))
val df2 = spark.read.parquet("s3://path2").toDF("col1", "col2").filter(col("col1").isNotNull)
val df1_j_df2 = df1_1.join(df2, df1_1.col("col1") === df2.col("col1"))
df1_j_df2.write.mode(SaveMode.Overwrite).parquet(config.getString("s3://path3") + "data_dt=" + processDate + "/")
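To narrow down whether the wider projection slows the scan itself or mainly inflates the shuffle feeding the join, one quick check (a sketch, not part of the job above; the scratch path is a placeholder) is to materialize df1_1 alone with each projection and compare the timings:

// Sketch: scan + filter + write only, without the join.
// Run once with the 5-column select and once with the 10-column select.
val start = System.nanoTime()
df1_1.write.mode(SaveMode.Overwrite).parquet("s3://scratch-path/scan_only_check/") // placeholder path
println(s"scan-only write took ${(System.nanoTime() - start) / 1e9} s")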
Explain plans from the actual code:
/* select 5 columns */
== Physical Plan ==
*(5) SortMergeJoin [mdn_ssp_raw#95], [mdn#111], Inner
:- *(2) Sort [mdn_ssp_raw#95 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(mdn_ssp_raw#95, 4000)
: +- *(1) Project [mdn#13 AS mdn_ssp_raw#95, url#14, application#20, tier1_id#15, tier2_id#16, CASE WHEN (isnull(cast(hits#24 as int)) || (length(trim(cast(cast(hits#24 as int) as string), None)) = 0)) THEN -1 ELSE cast(hits#24 as int) END AS hits#73, CASE WHEN (isnull(cast(bytes_down#26 as int)) || (length(trim(cast(cast(bytes_down#26 as int) as string), None)) = 0)) THEN -1 ELSE cast(bytes_down#26 as int) END AS bytes_down#84, cast(bytes_up#25 as int) AS bytes_up#62, service_provider_id#17, timestamp#12]
: +- *(1) Filter (((tier1_id#15 INSET (1195)) && NOT tier1_id#15 IN (1002,1003,1184,1185,1188,1063,1064,1065)) && isnotnull(mdn#13))
: +- *(1) FileScan parquet [timestamp#12,mdn#13,url#14,tier1_id#15,tier2_id#16,service_provider_id#17,application#20,hits#24,bytes_up#25,bytes_down#26] Batched: true, Format: Parquet, Location: InMemoryFileIndex[s3://path1/processing/browsing/ssp/ssp_nation_aggregat..., PartitionFilters: [], PushedFilters: [In(tier1_id, [1195,1193,1003,1098,1055,1048,1146,1167,1040,1020,1079,1017,1136,1154,1115,1050,11..., ReadSchema: struct<timestamp:string,mdn:string,url:string,tier1_id:string,tier2_id:string,service_provider...
+- *(4) Sort [mdn#111 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(mdn#111, 4000)
+- *(3) Project [_COL_0#106 AS MACDEVID#110, _COL_1#107 AS mdn#111]
+- *(3) Filter isnotnull(_COL_1#107)
+- *(3) FileScan parquet [_COL_0#106,_COL_1#107] Batched: true, Format: Parquet, Location: InMemoryFileIndex[s3://path2/poc3/id_op_device_mdn_ssp], PartitionFilters: [], PushedFilters: [IsNotNull(_COL_1)], ReadSchema: struct<_COL_0:string,_COL_1:string>
/* select 10 columns */
== Physical Plan ==
*(5) SortMergeJoin [mdn_ssp_raw#110], [mdn#131], Inner
:- *(2) Sort [mdn_ssp_raw#110 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(mdn_ssp_raw#110, 4000)
: +- *(1) Project [mdn#13 AS mdn_ssp_raw#110, url#14, application#20, tier1_id#15, tier2_id#16, CASE WHEN (isnull(cast(hits#24 as int)) || (length(trim(cast(cast(hits#24 as int) as string), None)) = 0)) THEN -1 ELSE cast(hits#24 as int) END AS hits#78, CASE WHEN (isnull(cast(bytes_down#26 as int)) || (length(trim(cast(cast(bytes_down#26 as int) as string), None)) = 0)) THEN -1 ELSE cast(bytes_down#26 as int) END AS bytes_down#94, cast(bytes_up#25 as int) AS bytes_up#62, service_provider_id#17, timestamp#12, device_id#18, os_id#19, ntc_id#32, ad_id#33, session_id#31]
: +- *(1) Filter (((tier1_id#15 INSET (1195)) && NOT tier1_id#15 IN (1002,1003,1184,1185,1188,1063,1064,1065)) && isnotnull(mdn#13))
: +- *(1) FileScan parquet [timestamp#12,mdn#13,url#14,tier1_id#15,tier2_id#16,service_provider_id#17,device_id#18,os_id#19,application#20,hits#24,bytes_up#25,bytes_down#26,session_id#31,ntc_id#32,ad_id#33] Batched: true, Format: Parquet, Location: InMemoryFileIndex[s3://path1/processing/browsing/ssp/ssp_nation_aggregat..., PartitionFilters: [], PushedFilters: [In(tier1_id, [1195,1193,1003,1098,1055,1048,1146,1167,1040,1020,1079,1017,1136,1154,1115,1050,11..., ReadSchema: struct<timestamp:string,mdn:string,url:string,tier1_id:string,tier2_id:string,service_provider...
+- *(4) Sort [mdn#131 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(mdn#131, 4000)
+- *(3) Project [_COL_0#126 AS MACDEVID#130, _COL_1#127 AS mdn#131]
+- *(3) Filter isnotnull(_COL_1#127)
+- *(3) FileScan parquet [_COL_0#126,_COL_1#127] Batched: true, Format: Parquet, Location: InMemoryFileIndex[s3://path2/poc3/id_op_device_mdn_ssp], PartitionFilters: [], PushedFilters: [IsNotNull(_COL_1)], ReadSchema: struct<_COL_0:string,_COL_1:string>
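The two plans are identical apart from the extra columns carried through the first Project and the Exchange hashpartitioning(..., 4000), so comparing the shuffle volume of the two runs is a natural first step. Besides the Spark UI stage page, a small listener can print shuffle read/write bytes per stage (a sketch using the standard SparkListener API; register it before running the job):

import org.apache.spark.scheduler.{SparkListener, SparkListenerStageCompleted}

// Sketch: log shuffle write/read volume per stage so the 5-column
// and 10-column runs can be compared directly.
spark.sparkContext.addSparkListener(new SparkListener {
  override def onStageCompleted(stage: SparkListenerStageCompleted): Unit = {
    val m = stage.stageInfo.taskMetrics
    println(f"stage ${stage.stageInfo.stageId}: " +
      f"shuffle write ${m.shuffleWriteMetrics.bytesWritten / 1e9}%.2f GB, " +
      f"shuffle read ${m.shuffleReadMetrics.totalBytesRead / 1e9}%.2f GB")
  }
})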
spark-submit:
--conf spark.sql.parquet.fs.optimized.committer.optimization-enabled=true
--executor-memory 28g
--executor-cores 5
--driver-memory 28g
--driver-cores 5
--conf spark.executor.memoryOverhead=2800
--conf "spark.dynamicAllocation.minExecutors=30"
--conf "spark.dynamicAllocation.maxExecutors=300"
--conf "spark.shuffle.compress=true"
--conf spark.sql.shuffle.partitions=4000
--conf spark.serializer=org.apache.spark.serializer.KryoSerializer
--conf spark.sql.parquet.mergeSchema=true
--conf spark.sql.parquet.filterPushdown=true
--conf spark.sql.parquet.compression.codec=gzip
--conf spark.dynamicAllocation.enabled=true
--conf spark.sql.hive.metastorePartitionPruning=true
--conf spark.speculation=true
--conf "spark.shuffle.service.enabled=true"
--conf "spark.shuffle.spill.compress=true"
--conf spark.default.parallelism=1000
--conf spark.memory.storageFraction=0.1
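For reference, a few of these settings expressed in code rather than spark-submit flags, which makes it easier to vary spark.sql.shuffle.partitions between the 5-column and 10-column test runs (a sketch; the app name is a placeholder and the values are copied from the flags above):

import org.apache.spark.sql.SparkSession

// Sketch: build a session with the same SQL/shuffle settings as the submit flags.
val spark = SparkSession.builder()
  .appName("join-width-comparison") // placeholder name
  .config("spark.sql.shuffle.partitions", "4000")
  .config("spark.sql.parquet.mergeSchema", "true")
  .config("spark.sql.parquet.filterPushdown", "true")
  .config("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .getOrCreate()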