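I am trying to translate the SQL below into PySpark using two different syntaxes, but the two versions give different outputs, and neither matches the output of the SQL. I cannot see where the actual difference between these versions lies.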
select count(*) from (
select afpo.charg as Batch_Number,
mara1.matkl as Material_Group,
mara1.zzmanu_stg as Mfg_Stage_Code,
mkpf.budat as WCB_261_Posting_Date,
mch1.hsdat as Manufacturing_Date
from
opssup_dev_wrk_sap.src_sap_afpo afpo
inner join opssup_dev_wrk_sap.src_sap_mara mara1 on afpo.matnr=mara1.matnr
inner join opssup_dev_wrk_sap.src_sap_mseg mseg on afpo.aufnr=mseg.aufnr
inner join opssup_dev_wrk_sap.src_sap_mkpf mkpf on mseg.mblnr=mkpf.mblnr
inner join opssup_dev_wrk_sap.src_sap_mara mara on mseg.matnr=mara.matnr
inner join opssup_dev_wrk_sap.src_sap_mch1 mch1 on afpo.charg=mch1.charg
where mara.zzmanu_stg='WCB'
and mseg.bwart='261')
-- This returns 2505 rows.
Execution plan for the SQL query above:
*(15) Project [charg#72 AS Batch_Number#327407, matkl#126 AS Material_Group#327408, zzmanu_stg#275 AS Mfg_Stage_Code#327409, budat#511 AS WCB_261_Posting_Date#327410, hsdat#571 AS Manufacturing_Date#327411]
+- *(15) SortMergeJoin [charg#72], [charg#543], Inner
:- *(12) Sort [charg#72 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(charg#72, 200)
: +- *(11) Project [charg#72, matkl#126, zzmanu_stg#275, budat#511]
: +- *(11) BroadcastHashJoin [matnr#321], [matnr#327416], Inner, BuildRight, false
: :- *(11) Project [charg#72, matkl#126, zzmanu_stg#275, matnr#321, budat#511]
: : +- *(11) SortMergeJoin [mblnr#313], [mblnr#505], Inner
: : :- *(7) Sort [mblnr#313 ASC NULLS FIRST], false, 0
: : : +- Exchange hashpartitioning(mblnr#313, 200)
: : : +- *(6) Project [charg#72, matkl#126, zzmanu_stg#275, mblnr#313, matnr#321]
: : : +- *(6) ...
I converted this SQL to PySpark as follows:
afpo_df = sqlContext.table(sap_source_schema + ".src_sap_afpo").alias('afpo_df')
mara1_df = sqlContext.table(sap_source_schema + ".src_sap_mara").alias('mara1_df')
mseg_df = sqlContext.table(sap_source_schema + ".src_sap_mseg").alias('mseg_df')
mkpf_df = sqlContext.table(sap_source_schema + ".src_sap_mkpf").alias('mkpf_df')
mara_df = sqlContext.table(sap_source_schema + ".src_sap_mara").alias('mara_df')
mch1_df = sqlContext.table(sap_source_schema + ".src_sap_mch1").alias('mch1_df')
temp12_df = afpo_df \
.join(mara1_df,(afpo_df.matnr==mara1_df.matnr)) \
.join(mseg_df,(afpo_df.aufnr==mseg_df.aufnr)) \
.join(mkpf_df,(mseg_df.mblnr==mkpf_df.mblnr)) \
.join(mara_df,(mseg_df.matnr==mara_df.matnr)) \
.join(mch1_df,(afpo_df.charg==mch1_df.charg)) \
.filter("mseg_df.bwart=='261' AND mara_df.zzmanu_stg=='WCB'") \
.select(afpo_df.charg.alias('Batch_Number'),mara1_df.matkl.alias('Material_Group'),mara1_df.zzmanu_stg.alias('Mfg_Stage_Code'), \
mkpf_df.budat.alias('WCB_261_Posting_Date'),mch1_df.hsdat.alias('Manufacturing_Date'))
target_df = temp12_df
print(target_df.count())
This returns about 13L rows (roughly 1.3 million, reading "L" as lakh).
The corresponding query plan for the code above:
== Physical Plan ==
*(15) Project [charg#72 AS Batch_Number#322732, matkl#126 AS Material_Group#322733, zzmanu_stg#275 AS Mfg_Stage_Code#322734, budat#511 AS WCB_261_Posting_Date#322735, hsdat#571 AS Manufacturing_Date#322736]
+- *(15) BroadcastNestedLoopJoin BuildRight, Inner
:- *(15) Project [charg#72, matkl#126, zzmanu_stg#275, budat#511, hsdat#571]
: +- *(15) SortMergeJoin [charg#72], [charg#543], Inner
: :- *(11) Sort [charg#72 ASC NULLS FIRST], false, 0
: : +- Exchange hashpartitioning(charg#72, 200)
: : +- *(10) Project [charg#72, matkl#126, zzmanu_stg#275, budat#511]
: : +- *(10) SortMergeJoin [mblnr#313], [mblnr#505], Inner
: : :- *(7) Sort [mblnr#313 ASC NULLS FIRST], false, 0
: : : +- Exchange hashpartitioning(mblnr#313, 200)
: : : +- *(6) Project [charg#72, matkl#126, zzmanu_stg#275, mblnr#313]
: : : +- *(6) SortMergeJoin [aufnr#14, matnr#116], [aufnr#368, matnr#321], Inner
: : : :- *(3) Sort [aufnr#14 ASC NULLS FIRST, matnr#116 ASC NULLS FIRST], false, 0
: : : : +- Exchange hashpartitioning(aufnr#14, matnr#116, 200)
: : : : +- *(2) Project [aufnr#14, charg#72, matnr#116, matkl#126, zzmanu_stg#275]
: : : : +- *(2) BroadcastHashJoin [matnr#33], [matnr#116], Inner, BuildRight, false
: : : : :- *(2) Project [aufnr#14, matnr#33, charg#72]
: : : : : +- *(2) Filter ((isnotnull(matnr#33) && isnotnull(aufnr#14)) && isnotnull(charg#72))
: : : : : +- *(2) FileScan parquet opssup_dev_wrk_sap.src_sap_afpo[aufnr#14,matnr#33,charg#72] Batched: true, Format: Parquet, Location: InMemoryFileIndex[s3://amgen-edl-ois-opssup-shr-bkt/dev/west2/wrk/sap/src_sap_afpo], PartitionFilters: [], PushedFilters: [IsNotNull(matnr), IsNotNull(aufnr), IsNotNull(charg)], ReadSchema: struct<aufnr:string,matnr:string,charg:string>
: : : : +- BroadcastExchange HashedRelationBroadcastMode(ArrayBuffer(input[0, string, true]))
: : : : +- *(1) Project [matnr#116, matkl#126, zzmanu_stg#275]
: : : : +- *(1) Filter isnotnull(matnr#116)
: : : : +- *(1) FileScan parquet opssup_dev_wrk_sap.src_sap_mara[matnr#116,matkl#126,zzmanu_stg#275] Batched: true, Format: Parquet, Location: InMemoryFileIndex[s3://amgen-edl-ois-opssup-shr-bkt/dev/west2/wrk/sap/src_sap_mara], PartitionFilters: [], PushedFilters: [IsNotNull(matnr)], ReadSchema: struct<matnr:string,matkl:string,zzmanu_stg:string>
: : : +- *(5) Sort [aufnr#368 ASC NULLS FIRST, matnr#321 ASC NULLS FIRST], false, 0
: : : +- Exchange hashpartitioning(aufnr#368, matnr#321, 200)
: : : +- *(4) Project [mblnr#313, matnr#321, aufnr#368]
: : : +- *(4) Filter ((((isnotnull(bwart#319) && (bwart#319 = 261)) && isnotnull(matnr#321)) && isnotnull(aufnr#368)) && isnotnull(mblnr#313))
: : : +- *(4) FileScan parquet opssup_dev_wrk_sap.src_sap_mseg[mblnr#313,bwart#319,matnr#321,aufnr#368] Batched: true, Format: Parquet, Location: InMemoryFileIndex[s3://amgen-edl-ois-opssup-shr-bkt/dev/west2/wrk/sap/src_sap_mseg], PartitionFilters: [], PushedFilters: [IsNotNull(bwart), EqualTo(bwart,261), IsNotNull(matnr), IsNotNull(aufnr), IsNotNull(mblnr)], ReadSchema: struct<mblnr:string,bwart:string,matnr:string,aufnr:string>
: : +- *(9) Sort [mblnr#505 ASC NULLS FIRST], false, 0
: : +- Exchange hashpartitioning(mblnr#505, 200)
: : +- *(8) Project [mblnr#505, budat#511]
: : +- *(8) Filter isnotnull(mblnr#505)
: : +- *(8) FileScan parquet opssup_dev_wrk_sap.src_sap_mkpf[mblnr#505,budat#511] Batched: true, Format: Parquet, Location: InMemoryFileIndex[s3://amgen-edl-ois-opssup-shr-bkt/dev/west2/wrk/sap/src_sap_mkpf], PartitionFilters: [], PushedFilters: [IsNotNull(mblnr)], ReadSchema: struct<mblnr:string,budat:string>
: +- *(13) Sort [charg#543 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(charg#543, 200)
: +- *(12) Project [charg#543, hsdat#571]
: +- *(12) Filter isnotnull(charg#543)
: +- *(12) FileScan parquet opssup_dev_wrk_sap.src_sap_mch1[charg#543,hsdat#571] Batched: true, Format: Parquet, Location: InMemoryFileIndex[s3://amgen-edl-ois-opssup-shr-bkt/dev/west2/wrk/sap/src_sap_mch1], PartitionFilters: [], PushedFilters: [IsNotNull(charg)], ReadSchema: struct<charg:string,hsdat:string>
+- BroadcastExchange IdentityBroadcastMode
+- *(14) Project
+- *(14) Filter (isnotnull(zzmanu_stg#318210) && (zzmanu_stg#318210 = WCB))
+- *(14) FileScan parquet opssup_dev_wrk_sap.src_sap_mara[zzmanu_stg#318210] Batched: true, Format: Parquet, Location: InMemoryFileIndex[s3://amgen-edl-ois-opssup-shr-bkt/dev/west2/wrk/sap/src_sap_mara], PartitionFilters: [], PushedFilters: [IsNotNull(zzmanu_stg), EqualTo(zzmanu_stg,WCB)], ReadSchema: struct<zzmanu_stg:string>
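One thing that stands out in this plan is the BroadcastNestedLoopJoin at the top: it lists no join keys, and the src_sap_mara scan feeding it projects no columns at all, so it looks like every row of the left side is paired with every 'WCB' row of src_sap_mara. As a small aid for spotting this in long plans, here is a sketch that scans pasted explain() output for join operators that carry no equi-join keys. The function name keyless_joins is hypothetical, and the sample text is an abbreviated copy of the first lines of the plan above.

```python
def keyless_joins(plan_text):
    """Return the plan lines whose join operator has no equi-join keys."""
    # Spark plans a join without equality keys as one of these two operators.
    markers = ("BroadcastNestedLoopJoin", "CartesianProduct")
    return [
        line.strip()
        for line in plan_text.splitlines()
        if any(marker in line for marker in markers)
    ]


# Abbreviated excerpt of the physical plan shown above.
sample = """\
*(15) Project [charg#72 AS Batch_Number#322732]
+- *(15) BroadcastNestedLoopJoin BuildRight, Inner
:- *(15) SortMergeJoin [charg#72], [charg#543], Inner
"""

print(keyless_joins(sample))
# prints ['+- *(15) BroadcastNestedLoopJoin BuildRight, Inner']
```

An empty result does not prove the joins are right; it only rules out this particular symptom.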
I then tried again with:
afpo_df = sqlContext.table(sap_source_schema + ".src_sap_afpo").alias('afpo_df')
mara1_df = sqlContext.table(sap_source_schema + ".src_sap_mara").alias('mara1_df')
mseg_df = sqlContext.table(sap_source_schema + ".src_sap_mseg").alias('mseg_df')
mkpf_df = sqlContext.table(sap_source_schema + ".src_sap_mkpf").alias('mkpf_df')
mara_df = sqlContext.table(sap_source_schema + ".src_sap_mara").alias('mara_df')
mch1_df = sqlContext.table(sap_source_schema + ".src_sap_mch1").alias('mch1_df')
temp12_df = afpo_df \
.join(mara1_df,"matnr") \
.join(mseg_df,"aufnr") \
.join(mkpf_df,"mblnr") \
.join(mara_df,"matnr") \
.join(mch1_df,"charg") \
.filter("mseg_df.bwart=='261' AND mara_df.zzmanu_stg=='WCB'") \
.select(afpo_df.charg.alias('Batch_Number'),mara1_df.matkl.alias('Material_Group'),mara1_df.zzmanu_stg.alias('Mfg_Stage_Code'), \
mkpf_df.budat.alias('WCB_261_Posting_Date'),mch1_df.hsdat.alias('Manufacturing_Date'))
target_df = temp12_df
print(target_df.count())
This returns 1804 rows.
Execution plan for the code above:
== Physical Plan ==
*(15) Project [charg#72 AS Batch_Number#301751, matkl#126 AS Material_Group#301752, zzmanu_stg#275 AS Mfg_Stage_Code#301753, budat#511 AS WCB_261_Posting_Date#301754, hsdat#571 AS Manufacturing_Date#301755]
+- *(15) SortMergeJoin [charg#72], [charg#543], Inner
:- *(12) Sort [charg#72 ASC NULLS FIRST], false, 0
: +- Exchange hashpartitioning(charg#72, 200)
: +- *(11) Project [charg#72, matkl#126, zzmanu_stg#275, budat#511]
: +- *(11) BroadcastHashJoin [matnr#33], [matnr#300069], Inner, BuildRight, false
: :- *(11) Project [matnr#33, charg#72, matkl#126, zzmanu_stg#275, budat#511]
: : +- *(11) SortMergeJoin [mblnr#313], [mblnr#505], Inner
: : :- *(7) Sort [mblnr#313 ASC NULLS FIRST], false, 0
: : : +- Exchange hashpartitioning(mblnr#313, 200)
: : : +- *(6) Project [matnr#33, charg#72, matkl#126, zzmanu_stg#275, mblnr#313]
: : : +- *(6) SortMergeJoin [aufnr#14], [aufnr#368], Inner
: : : :- *(3) Sort [aufnr#14 ASC NULLS FIRST], false, 0
: : : : +- Exchange hashpartitioning(aufnr#14, 200)
: : : : +- *(2) Project [matnr#33, aufnr#14, charg#72, matkl#126, zzmanu_stg#275]
: : : : +- *(2) BroadcastHashJoin [matnr#33], [matnr#116], Inner, BuildRight, false
: : : : :- *(2) Project [aufnr#14, matnr#33, charg#72]
: : : : : +- *(2) Filter ((isnotnull(matnr#33) && isnotnull(aufnr#14)) && isnotnull(charg#72))
: : : : : +- *(2) FileScan parquet opssup_dev_wrk_sap.src_sap_afpo[aufnr#14,matnr#33,charg#72] Batched: true, Format: Parquet, Location: InMemoryFileIndex[s3://amgen-edl-ois-opssup-shr-bkt/dev/west2/wrk/sap/src_sap_afpo], PartitionFilters: [], PushedFilters: [IsNotNull(matnr), IsNotNull(aufnr), IsNotNull(charg)], ReadSchema: struct<aufnr:string,matnr:string,charg:string>
: : : : +- BroadcastExchange HashedRelationBroadcastMode(ArrayBuffer(input[0, string, true]))
: : : : +- *(1) Project [matnr#116, matkl#126, zzmanu_stg#275]
: : : : +- *(1) Filter isnotnull(matnr#116)
: : : : +- *(1) FileScan parquet opssup_dev_wrk_sap.src_sap_mara[matnr#116,matkl#126,zzmanu_stg#275] Batched: true, Format: Parquet, Location: InMemoryFileIndex[s3://amgen-edl-ois-opssup-shr-bkt/dev/west2/wrk/sap/src_sap_mara], PartitionFilters: [], PushedFilters: [IsNotNull(matnr)], ReadSchema: struct<matnr:string,matkl:string,zzmanu_stg:string>
: : : +- *(5) Sort [aufnr#368 ASC NULLS FIRST], false, 0
: : : +- Exchange hashpartitioning(aufnr#368, 200)
: : : +- *(4) Project [mblnr#313, aufnr#368]
: : : +- *(4) Filter (((isnotnull(bwart#319) && (bwart#319 = 261)) && isnotnull(aufnr#368)) && isnotnull(mblnr#313))
: : : +- *(4) FileScan parquet opssup_dev_wrk_sap.src_sap_mseg[mblnr#313,bwart#319,aufnr#368] Batched: true, Format: Parquet, Location: InMemoryFileIndex[s3://amgen-edl-ois-opssup-shr-bkt/dev/west2/wrk/sap/src_sap_mseg], PartitionFilters: [], PushedFilters: [IsNotNull(bwart), EqualTo(bwart,261), IsNotNull(aufnr), IsNotNull(mblnr)], ReadSchema: struct<mblnr:string,bwart:string,aufnr:string>
: : +- *(9) Sort [mblnr#505 ASC NULLS FIRST], false, 0
: : +- Exchange hashpartitioning(mblnr#505, 200)
: : +- *(8) Project [mblnr#505, budat#511]
: : +- *(8) Filter isnotnull(mblnr#505)
: : +- *(8) FileScan parquet opssup_dev_wrk_sap.src_sap_mkpf[mblnr#505,budat#511] Batched: true, Format: Parquet, Location: InMemoryFileIndex[s3://amgen-edl-ois-opssup-shr-bkt/dev/west2/wrk/sap/src_sap_mkpf], PartitionFilters: [], PushedFilters: [IsNotNull(mblnr)], ReadSchema: struct<mblnr:string,budat:string>
: +- BroadcastExchange HashedRelationBroadcastMode(ArrayBuffer(input[0, string, true]))
: +- *(10) Project [matnr#300069]
: +- *(10) Filter ((isnotnull(zzmanu_stg#300228) && (zzmanu_stg#300228 = WCB)) && isnotnull(matnr#300069))
: +- *(10) FileScan parquet opssup_dev_wrk_sap.src_sap_mara[matnr#300069,zzmanu_stg#300228] Batched: true, Format: Parquet, Location: InMemoryFileIndex[s3://amgen-edl-ois-opssup-shr-bkt/dev/west2/wrk/sap/src_sap_mara], PartitionFilters: [], PushedFilters: [IsNotNull(zzmanu_stg), EqualTo(zzmanu_stg,WCB), IsNotNull(matnr)], ReadSchema: struct<matnr:string,zzmanu_stg:string>
+- *(14) Sort [charg#543 ASC NULLS FIRST], false, 0
+- Exchange hashpartitioning(charg#543, 200)
+- *(13) Project [charg#543, hsdat#571]
+- *(13) Filter isnotnull(charg#543)
+- *(13) FileScan parquet opssup_dev_wrk_sap.src_sap_mch1[charg#543,hsdat#571] Batched: true, Format: Parquet, Location: InMemoryFileIndex[s3://amgen-edl-ois-opssup-shr-bkt/dev/west2/wrk/sap/src_sap_mch1], PartitionFilters: [], PushedFilters: [IsNotNull(charg)], ReadSchema: struct<charg:string,hsdat:string>
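Comparing the three plans, the joins do not appear to have the same shape. In the SQL plan the second src_sap_mara scan is joined on matnr#321, which is the matnr of src_sap_mseg. In the first PySpark plan, matnr#321 is instead matched against matnr#116 from the first src_sap_mara scan inside the afpo/mseg join, and the second scan is attached by a BroadcastNestedLoopJoin with no keys. In the second PySpark plan, the second scan is joined on matnr#33, which is the matnr of src_sap_afpo. The sketch below reproduces those three shapes over made-up toy rows to show that they are enough to give three different counts. All row values are hypothetical, and SQLite stands in for Spark only to illustrate the join semantics.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
create table afpo (aufnr text, matnr text, charg text);
create table mara (matnr text, matkl text, zzmanu_stg text);
create table mseg (mblnr text, matnr text, aufnr text, bwart text);
create table mkpf (mblnr text, budat text);
create table mch1 (charg text, hsdat text);

insert into afpo values ('O1','M1','B1'), ('O2','M2','B2');
insert into mara values ('M1','G1','FIN'), ('M2','G2','WCB'),
                        ('M3','G3','WCB'), ('M4','G4','WCB');
insert into mseg values ('D1','M2','O1','261'), ('D2','M3','O1','261'),
                        ('D3','M1','O1','261'), ('D4','M1','O2','261'),
                        ('D5','M2','O1','101');
insert into mkpf values ('D1','20200101'), ('D2','20200102'), ('D3','20200103'),
                        ('D4','20200104'), ('D5','20200105');
insert into mch1 values ('B1','20191201'), ('B2','20191215');
""")

# Same query skeleton; only the mseg condition and the second mara join vary.
TEMPLATE = """
select count(*)
from afpo
inner join mara as mara1 on afpo.matnr = mara1.matnr
inner join mseg on {mseg_on}
inner join mkpf on mseg.mblnr = mkpf.mblnr
{mara_join}
inner join mch1 on afpo.charg = mch1.charg
where mara.zzmanu_stg = 'WCB' and mseg.bwart = '261'
"""

shapes = {
    # Original SQL: second mara joined on mseg.matnr.
    "sql": ("afpo.aufnr = mseg.aufnr",
            "inner join mara as mara on mseg.matnr = mara.matnr"),
    # First PySpark plan: matnr folded into the mseg join, mara cross joined.
    "first_attempt": ("afpo.aufnr = mseg.aufnr and mara1.matnr = mseg.matnr",
                      "cross join mara as mara"),
    # Second PySpark plan: second mara joined on afpo.matnr.
    "second_attempt": ("afpo.aufnr = mseg.aufnr",
                       "inner join mara as mara on afpo.matnr = mara.matnr"),
}

counts = {}
for name, (mseg_on, mara_join) in shapes.items():
    query = TEMPLATE.format(mseg_on=mseg_on, mara_join=mara_join)
    counts[name] = conn.execute(query).fetchone()[0]

print(counts)
# prints {'sql': 2, 'first_attempt': 3, 'second_attempt': 1}
```

The toy counts have nothing to do with the real 2505, 13L and 1804; the point is only that each shape filters and multiplies rows differently.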
Why does this happen, and what is the best way to convert the SQL query above to PySpark?
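One approach that may keep the two src_sap_mara roles apart is to refer to every column through its alias string, col("alias.column"), instead of through the DataFrame attribute. In the first plan, mara1_df.matnr and mara_df.matnr both seem to have been resolved to the same attribute (matnr#116), which is what this tries to avoid. The following is a minimal sketch under that assumption and has not been run against these tables. The function name build_batch_query and its parameters are hypothetical, and the aliases simply mirror the ones in the SQL.

```python
def build_batch_query(session, schema):
    """Build the five-column result of the SQL using alias-qualified columns.

    session: a SparkSession or SQLContext (both expose .table()).
    schema:  the database name, e.g. the value of sap_source_schema.
    """
    # Imported here so the module still loads on a machine without Spark.
    from pyspark.sql.functions import col

    def table(name, alias):
        return session.table(schema + "." + name).alias(alias)

    afpo = table("src_sap_afpo", "afpo")
    mara1 = table("src_sap_mara", "mara1")  # first role: joined to afpo
    mseg = table("src_sap_mseg", "mseg")
    mkpf = table("src_sap_mkpf", "mkpf")
    mara = table("src_sap_mara", "mara")    # second role: joined to mseg
    mch1 = table("src_sap_mch1", "mch1")

    return (
        afpo
        .join(mara1, col("afpo.matnr") == col("mara1.matnr"))
        .join(mseg, col("afpo.aufnr") == col("mseg.aufnr"))
        .join(mkpf, col("mseg.mblnr") == col("mkpf.mblnr"))
        .join(mara, col("mseg.matnr") == col("mara.matnr"))
        .join(mch1, col("afpo.charg") == col("mch1.charg"))
        .where((col("mara.zzmanu_stg") == "WCB") & (col("mseg.bwart") == "261"))
        .select(
            col("afpo.charg").alias("Batch_Number"),
            col("mara1.matkl").alias("Material_Group"),
            col("mara1.zzmanu_stg").alias("Mfg_Stage_Code"),
            col("mkpf.budat").alias("WCB_261_Posting_Date"),
            col("mch1.hsdat").alias("Manufacturing_Date"),
        )
    )
```

It would be called as build_batch_query(sqlContext, sap_source_schema).count(), and the result of explain() can then be compared with the SQL plan: the second src_sap_mara scan should be joined on the mseg matnr, with no BroadcastNestedLoopJoin.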