I have the following two DataFrames (on PySpark 2.1.0):
kpi = kpi.alias("kpi")
kpi.show()
+-----+------+-------+-----+-----+------+--------------------+
|label|season|segment| kind|brand|parent|                 kpi|
+-----+------+-------+-----+-----+------+--------------------+
|world|   all|  total|world|  all|  null|Map(PY -> 1234, B...|
+-----+------+-------+-----+-----+------+--------------------+
and
mtd = mtd.alias("mtd")
mtd.show()
+-----+------+-------+-----+-----+------+---------------+
|label|season|segment| kind|brand|parent|            mtd|
+-----+------+-------+-----+-----+------+---------------+
|world|   all|  total|world|  all|  null|[123, 245, 522]|
+-----+------+-------+-----+-----+------+---------------+
I want to join them on the columns label, season, segment, kind, brand, and parent, using this code:
kpi.join(mtd, on=["label", "season", "segment", "kind", "brand", "parent"]).show()
But I get the following exception:
Traceback (most recent call last): File "/Users/luciawi001/PycharmProjects/daily_sales/main.py", line 124, in kpi.alias("kpi").join(mtd.alias("mtd"), on=["label", "season", "segment", "kind", "brand"]).show() File "/Users/luciawi001/Development/spark/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/dataframe.py", line 318, in show print(self._jdf.showString(n, 20)) File "/usr/local/lib/python2.7/site-packages/py4j/java_gateway.py", line 1133, in __call__ answer, self.gateway_client, self.target_id, self.name) File "/Users/luciawi001/Development/spark/spark-2.1.0-bin-hadoop2.7/python/pyspark/sql/utils.py", line 69, in deco raise AnalysisException(s.split(': ', 1)[1], stackTrace) pyspark.sql.utils.AnalysisException: u'Detected cartesian product for INNER join between logical plans Aggregate [world AS label#1951, all AS season#1956, total AS segment#1962, world AS kind#1969, all AS brand#1977, null AS parent#1986, map(PY, sum(PY_ACT_PCS_TOT#657), MP, sum(MP_PCS_TOT#662), FCST, sum(QTY_FORECAST_VAL#655), BO, cast(sum(BACKORDER_QTY#632) as decimal(38,10)), MTD, sum(QTY_EXP_RESULT#664), NET_SALES, sum(QTY_EXP_RESULT#664)) AS pieces#1934, map(PY, round(CheckOverflow((sum(PY_ACT_NET_SALES#658) / sum(PY_ACT_PCS_TOT#657)), DecimalType(38,18)), 2), MP, round(CheckOverflow((sum(MP_NET_SALES#663) / sum(MP_PCS_TOT#662)), DecimalType(38,18)), 2), FCST, round(CheckOverflow((sum(TRN_FORECAST_VAL#656) / sum(QTY_FORECAST_VAL#655)), DecimalType(38,18)), 2), BO, round(CheckOverflow((cast(sum(NS_BO1#675) as decimal(25,3)) / sum(BACKORDER_QTY#632)), DecimalType(38,20)), 2), MTD, round(CheckOverflow((sum(NET_TRN_EXP_RESULT#665) / sum(QTY_EXP_RESULT#664)), DecimalType(38,18)), 2), NET_SALES, round(sum(NET_TRN_EXP_RESULT#665), 2), FCST_NET_SALES, round(sum(TRN_FORECAST_VAL#656), 2)) AS economics#1947] +- Project [PY_ACT_PCS_TOT#657, MP_PCS_TOT#662, QTY_FORECAST_VAL#655, BACKORDER_QTY#632, QTY_EXP_RESULT#664, PY_ACT_NET_SALES#658, MP_NET_SALES#663, TRN_FORECAST_VAL#656, NS_BO1#675, 
NET_TRN_EXP_RESULT#665] +- Join Inner, (MAT_ID_MATERIAL#622 = MAT_ID_MATERIAL#279) :- Project [MAT_ID_MATERIAL#622, PY_ACT_PCS_TOT#657, MP_PCS_TOT#662, QTY_FORECAST_VAL#655, BACKORDER_QTY#632, QTY_EXP_RESULT#664, PY_ACT_NET_SALES#658, MP_NET_SALES#663, TRN_FORECAST_VAL#656, NS_BO1#675, NET_TRN_EXP_RESULT#665] : +- Join Inner, (DC_ID_DISTR_CHANNEL#620 = DC_ID_DISTR_CHANNEL#260) : :- Project [DC_ID_DISTR_CHANNEL#620, MAT_ID_MATERIAL#622, PY_ACT_PCS_TOT#657, MP_PCS_TOT#662, QTY_FORECAST_VAL#655, BACKORDER_QTY#632, QTY_EXP_RESULT#664, PY_ACT_NET_SALES#658, MP_NET_SALES#663, TRN_FORECAST_VAL#656, NS_BO1#675, NET_TRN_EXP_RESULT#665] : : +- Join Inner, (SO_ID_SALES_ORG#619 = SO_ID_SALES_ORG#221) : : :- Project [SO_ID_SALES_ORG#619, DC_ID_DISTR_CHANNEL#620, MAT_ID_MATERIAL#622, PY_ACT_PCS_TOT#657, MP_PCS_TOT#662, QTY_FORECAST_VAL#655, BACKORDER_QTY#632, QTY_EXP_RESULT#664, PY_ACT_NET_SALES#658, MP_NET_SALES#663, TRN_FORECAST_VAL#656, NS_BO1#675, NET_TRN_EXP_RESULT#665] : : : +- Join Inner, (CS_ID_CUSTOMER#621 = CS_ID_CUSTOMER#2) : : : :- Project [CS_ID_CUSTOMER#621, SO_ID_SALES_ORG#619, DC_ID_DISTR_CHANNEL#620, MAT_ID_MATERIAL#622, PY_ACT_PCS_TOT#657, MP_PCS_TOT#662, QTY_FORECAST_VAL#655, BACKORDER_QTY#632, QTY_EXP_RESULT#664, PY_ACT_NET_SALES#658, MP_NET_SALES#663, TRN_FORECAST_VAL#656, NS_BO1#675, NET_TRN_EXP_RESULT#665] : : : : +- Repartition 200, true : : : : +- Project [SO_ID_SALES_ORG#619, DC_ID_DISTR_CHANNEL#620, CS_ID_CUSTOMER#621, MAT_ID_MATERIAL#622, BACKORDER_QTY#632, QTY_FORECAST_VAL#655, TRN_FORECAST_VAL#656, PY_ACT_PCS_TOT#657, PY_ACT_NET_SALES#658, MP_PCS_TOT#662, MP_NET_SALES#663, QTY_EXP_RESULT#664, NET_TRN_EXP_RESULT#665, NS_BO1#675] : : : : +- Filter (isnotnull(FLAG_AGGR_XBU#683) && (FLAG_AGGR_XBU#683 = X)) : : : : +- 
Relation[DATE_NO#617,DATE_ID#618,SO_ID_SALES_ORG#619,DC_ID_DISTR_CHANNEL#620,CS_ID_CUSTOMER#621,MAT_ID_MATERIAL#622,CS_ID_CUST_SALES#623,SC_ID_SCENARIO#624,SC_ID_SCENARIO_QTY_VAL#625,CR_ID_CURRENCY#626,ORDERED_QTY#627,CONFIRMED_QTY#628,DELIVERABLE_QTY#629,SHIPPED_QTY#630,INVOICED_QTY#631,BACKORDER_QTY#632,BACKORDER_QTY_PREV#633,OOH_QTY#634,FWD_ORDER_QTY#635,PRICE_LIST#636,INVOICE_DISCOUNT#637,ACCRUALS#638,CONF_DEL_QTY#639,NET_TRN_INVOICED#640,... 43 more fields] JDBCRelation(DM_DAILY_SALES_QTY_TURNOVER) [numPartitions=1] : : : +- Repartition 200, true : : : +- Project [CS_ID_CUSTOMER#2] : : : +- Relation[CS_CUSTOMER_CODE#0,CS_DESCRIPTION#1,CS_ID_CUSTOMER#2,CS_MANDT#3,CT_COD_COUNTRY#4,CT_DESCRIPTION#5,TP_COD_TYPOLOGY#6,TP_DESCRIPTION#7,TP_DW_DESCRIPTION#8,H1_COD_HIER01#9,H1_DESCRIPTION_H01#10,H2_COD_HIER02#11,H2_DESCRIPTION_H02#12,H3_COD_HIER03#13,H3_DESCRIPTION_H03#14,H4_COD_HIER04#15,H4_DESCRIPTION_H04#16,DW_AUDIT_UPD#17,CS_CITY#18,CS_POSTAL_CODE#19,TP_COD_MC#20,TP_DESCRIPTION_MC#21,CS_DELETION_FLAG#22,CS_ORDER_BLOCK#23,... 
79 more fields] JDBCRelation(DIM_CUSTOMER_MOVALL) [numPartitions=1] : : +- Repartition 200, true : : +- Project [SO_ID_SALES_ORG#221] : : +- Filter (isnotnull(SO_FLAG_SECTOR#233) && (SO_FLAG_SECTOR#233 = X)) : : +- Relation[SO_ID_SALES_ORG#221,SO_COD_SALES_ORG#222,SO_DESCRIPTION#223,SO_DW_DESCRIPTION#224,SO_CUTOFF_DATE#225,CT_COD_COUNTRY#226,CT_DESCRIPTION#227,CO_COD_COMPANY#228,CO_DESCRIPTION#229,OC_COD_OPER_CONCERN#230,OC_DESCRIPTION#231,SO_DM_SALES_LOADDATE#232,SO_FLAG_SECTOR#233,SO_FLAG_SO_TYPE#234,SO_COD_CURRENCY#235,OC_COD_CURRENCY#236] JDBCRelation(DIM_SALES_ORG_MOV) [numPartitions=1] : +- Repartition 200, true : +- Project [DC_ID_DISTR_CHANNEL#260] : +- Filter (isnotnull(DC_COD_CHANNEL#264) && (DC_COD_CHANNEL#264 = 1)) : +- Relation[DC_ID_DISTR_CHANNEL#260,DC_COD_DISTR_CHANNEL#261,DC_DW_DESCRIPTION#262,DC_DW_CHANNEL#263,DC_COD_CHANNEL#264,DC_DES_CHANNEL#265] JDBCRelation(DIM_DISTRIBUTION_CHANNEL_MOV) [numPartitions=1] +- Repartition 200, true +- Project [MAT_ID_MATERIAL#279] +- Filter ((isnotnull(BR_COD_BRAND#286) && (BR_COD_BRAND#286 = 13)) && (isnotnull(GC_COD_COMM_GROUP#300) && (GC_COD_COMM_GROUP#300 = 01))) +- Relation[MAT_ID_MATERIAL#279,MAT_DESCRIPTION#280,MAT_MATERIAL_CODE#281,MAT_FRONT_REAR#282,MAT_REF_MAT_NUM#283,MAT_GESTIONAL_STATUS#284,MAT_DW_FLAG_FITTIZIO#285,BR_COD_BRAND#286,BR_DESCRIPTION#287,BTS_COD_TREAD_PATTERN#288,BTS_DESCRIPTION#289,BU_COD_BU#290,BU_DESCRIPTION#291,BC_COD_BARCODE#292,BC_DESCRIPTION#293,CAL_COD_DIAMETER#294,CAL_DESCRIPTION#295,CL_COD_LOG_CATEGORY#296,CL_DESCRIPTION#297,CRD_COD_NOM_WIDTH#298,CRD_DESCRIPTION#299,GC_COD_COMM_GROUP#300,GC_DESCRIPTION_EN#301,IP5_COD_IP5#302,... 
138 more fields] JDBCRelation(DIM_MATERIAL_MOVALL) [numPartitions=1] and Aggregate [collect_list(map(date, DW_DATE_RIF_FORMATTED#1836, val, cast(QTY_EXP_RESULT#2181 as string)), 0, 0) AS trend_pieces#2193, collect_list(map(date, DW_DATE_RIF_FORMATTED#1836, val, cast(NET_TRN_EXP_RESULT_UNS#2183 as string)), 0, 0) AS trend_economics#2195] +- Aggregate [DW_DATE_RIF_FORMATTED#1836], [DW_DATE_RIF_FORMATTED#1836, sum(QTY_EXP_RESULT#1335) AS QTY_EXP_RESULT#2181, sum(NET_TRN_EXP_RESULT_UNS#1382) AS NET_TRN_EXP_RESULT_UNS#2183] +- Project [QTY_EXP_RESULT#1335, NET_TRN_EXP_RESULT_UNS#1382, format_date_string(DW_DATE_RIF#1222) AS DW_DATE_RIF_FORMATTED#1836] +- Join Inner, (MAT_ID_MATERIAL#1177 = MAT_ID_MATERIAL#279) :- Project [MAT_ID_MATERIAL#1177, DW_DATE_RIF#1222, QTY_EXP_RESULT#1335, NET_TRN_EXP_RESULT_UNS#1382] : +- Join Inner, (DC_ID_DISTR_CHANNEL#1175 = DC_ID_DISTR_CHANNEL#260) : :- Project [DC_ID_DISTR_CHANNEL#1175, MAT_ID_MATERIAL#1177, DW_DATE_RIF#1222, QTY_EXP_RESULT#1335, NET_TRN_EXP_RESULT_UNS#1382] : : +- Join Inner, (SO_ID_SALES_ORG#1174 = SO_ID_SALES_ORG#221) : : :- Project [SO_ID_SALES_ORG#1174, DC_ID_DISTR_CHANNEL#1175, MAT_ID_MATERIAL#1177, DW_DATE_RIF#1222, QTY_EXP_RESULT#1335, NET_TRN_EXP_RESULT_UNS#1382] : : : +- Join Inner, (CS_ID_CUSTOMER#1176 = CS_ID_CUSTOMER#2) : : : :- Project [CS_ID_CUSTOMER#1176, SO_ID_SALES_ORG#1174, DC_ID_DISTR_CHANNEL#1175, MAT_ID_MATERIAL#1177, DW_DATE_RIF#1222, CheckOverflow((CheckOverflow((CheckOverflow((CheckOverflow((CheckOverflow((CheckOverflow((QTY_CONFIRMED_NO_DEL#1204 + QTY_DELIVERIES_NO_ISSUED#1214), DecimalType(38,10)) + QTY_DELIVERY_IN_BILLING#1203), DecimalType(38,10)) + QTY_INVOICED#1202), DecimalType(38,10)) - QTY_CREDIT_BLOCK#1205), DecimalType(38,10)) + QTY_CREDIT_NOTE#1217), DecimalType(38,10)) + QTY_DEBIT_NOTE#1220), DecimalType(38,10)) AS QTY_EXP_RESULT#1335, CheckOverflow((CheckOverflow((CheckOverflow((CheckOverflow((CheckOverflow((CheckOverflow((CheckOverflow((CheckOverflow((NET_TRN_CONFIRMED_NO_DEL#1199 + 
NET_TRN_DELIVERIES_NO_ISSUED#1213), DecimalType(38,10)) + NET_TRN_DELIVERIES_IN_BILLING#1197), DecimalType(38,10)) + NET_TRN_INVOICED#1195), DecimalType(38,10)) - NET_TRN_CREDIT_BLOCK#1201), DecimalType(38,10)) + REQ_NET_TRN_CREDIT_NOTE#1216), DecimalType(38,10)) + REQ_NET_TRN_DEBIT_NOTE#1219), DecimalType(38,10)) + ACCRUALS_CALC_NO_BLOCK#1231), DecimalType(38,10)) / CheckOverflow((CheckOverflow((CheckOverflow((CheckOverflow((CheckOverflow((CheckOverflow((QTY_CONFIRMED_NO_DEL#1204 + QTY_DELIVERIES_NO_ISSUED#1214), DecimalType(38,10)) + QTY_DELIVERY_IN_BILLING#1203), DecimalType(38,10)) + QTY_INVOICED#1202), DecimalType(38,10)) - QTY_CREDIT_BLOCK#1205), DecimalType(38,10)) + QTY_CREDIT_NOTE#1217), DecimalType(38,10)) + QTY_DEBIT_NOTE#1220), DecimalType(38,10))), DecimalType(38,18)) AS NET_TRN_EXP_RESULT_UNS#1382] : : : : +- Repartition 200, true : : : : +- Project [SO_ID_SALES_ORG#1174, DC_ID_DISTR_CHANNEL#1175, CS_ID_CUSTOMER#1176, MAT_ID_MATERIAL#1177, NET_TRN_INVOICED#1195, NET_TRN_DELIVERIES_IN_BILLING#1197, NET_TRN_CONFIRMED_NO_DEL#1199, NET_TRN_CREDIT_BLOCK#1201, QTY_INVOICED#1202, QTY_DELIVERY_IN_BILLING#1203, QTY_CONFIRMED_NO_DEL#1204, QTY_CREDIT_BLOCK#1205, NET_TRN_DELIVERIES_NO_ISSUED#1213, QTY_DELIVERIES_NO_ISSUED#1214, REQ_NET_TRN_CREDIT_NOTE#1216, QTY_CREDIT_NOTE#1217, REQ_NET_TRN_DEBIT_NOTE#1219, QTY_DEBIT_NOTE#1220, DW_DATE_RIF#1222, ACCRUALS_CALC_NO_BLOCK#1231] : : : : +- Filter ((((((((DW_DATE_RIF#1222 >= 20160401) && (DW_DATE_RIF#1222 = 20170401)) && (isnotnull(FLAG_AGGR_XBU#1240) && (FLAG_AGGR_XBU#1240 = Y))) && isnotnull(CS_ID_CUSTOMER#1176)) && isnotnull(SO_ID_SALES_ORG#1174)) && ((isnotnull(DC_ID_DISTR_CHANNEL#1175) && isnotnull(DW_DATE_RIF#1222)) && (DW_DATE_RIF#1222 >= 20170401))) && isnotnull(MAT_ID_MATERIAL#1177)) : : : : +- 
Relation[DATE_ID#1173,SO_ID_SALES_ORG#1174,DC_ID_DISTR_CHANNEL#1175,CS_ID_CUSTOMER#1176,MAT_ID_MATERIAL#1177,ORDERED_QTY#1178,CONFIRMED_QTY#1179,DELIVERABLE_QTY#1180,SHIPPED_QTY#1181,INVOICED_QTY#1182,BACKORDER_QTY#1183,OOH_QTY#1184,FWD_ORDER_QTY#1185,DW_AUDIT_ID#1186,PRICE_LIST#1187,CR_ID_CURRENCY#1188,INVOICE_DISCOUNT#1189,ACCRUALS#1190,DW_LAST_UPD#1191,BACKORDER_QTY_PREV#1192,CONF_DEL_QTY#1193,TRN_INVOICED#1194,NET_TRN_INVOICED#1195,TRN_DELIVERIES_IN_BILLING#1196,... 44 more fields] JDBCRelation(F_DAILY_SALES_QTY_TURNOVER) [numPartitions=1] : : : +- Repartition 200, true : : : +- Project [CS_ID_CUSTOMER#2] : : : +- Relation[CS_CUSTOMER_CODE#0,CS_DESCRIPTION#1,CS_ID_CUSTOMER#2,CS_MANDT#3,CT_COD_COUNTRY#4,CT_DESCRIPTION#5,TP_COD_TYPOLOGY#6,TP_DESCRIPTION#7,TP_DW_DESCRIPTION#8,H1_COD_HIER01#9,H1_DESCRIPTION_H01#10,H2_COD_HIER02#11,H2_DESCRIPTION_H02#12,H3_COD_HIER03#13,H3_DESCRIPTION_H03#14,H4_COD_HIER04#15,H4_DESCRIPTION_H04#16,DW_AUDIT_UPD#17,CS_CITY#18,CS_POSTAL_CODE#19,TP_COD_MC#20,TP_DESCRIPTION_MC#21,CS_DELETION_FLAG#22,CS_ORDER_BLOCK#23,... 
79 more fields] JDBCRelation(DIM_CUSTOMER_MOVALL) [numPartitions=1] : : +- Repartition 200, true : : +- Project [SO_ID_SALES_ORG#221] : : +- Filter (isnotnull(SO_FLAG_SECTOR#233) && (SO_FLAG_SECTOR#233 = X)) : : +- Relation[SO_ID_SALES_ORG#221,SO_COD_SALES_ORG#222,SO_DESCRIPTION#223,SO_DW_DESCRIPTION#224,SO_CUTOFF_DATE#225,CT_COD_COUNTRY#226,CT_DESCRIPTION#227,CO_COD_COMPANY#228,CO_DESCRIPTION#229,OC_COD_OPER_CONCERN#230,OC_DESCRIPTION#231,SO_DM_SALES_LOADDATE#232,SO_FLAG_SECTOR#233,SO_FLAG_SO_TYPE#234,SO_COD_CURRENCY#235,OC_COD_CURRENCY#236] JDBCRelation(DIM_SALES_ORG_MOV) [numPartitions=1] : +- Repartition 200, true : +- Project [DC_ID_DISTR_CHANNEL#260] : +- Filter (isnotnull(DC_COD_CHANNEL#264) && (DC_COD_CHANNEL#264 = 1)) : +- Relation[DC_ID_DISTR_CHANNEL#260,DC_COD_DISTR_CHANNEL#261,DC_DW_DESCRIPTION#262,DC_DW_CHANNEL#263,DC_COD_CHANNEL#264,DC_DES_CHANNEL#265] JDBCRelation(DIM_DISTRIBUTION_CHANNEL_MOV) [numPartitions=1] +- Repartition 200, true +- Project [MAT_ID_MATERIAL#279] +- Filter ((isnotnull(BR_COD_BRAND#286) && (BR_COD_BRAND#286 = 13)) && (isnotnull(GC_COD_COMM_GROUP#300) && (GC_COD_COMM_GROUP#300 = 01))) +- Relation[MAT_ID_MATERIAL#279,MAT_DESCRIPTION#280,MAT_MATERIAL_CODE#281,MAT_FRONT_REAR#282,MAT_REF_MAT_NUM#283,MAT_GESTIONAL_STATUS#284,MAT_DW_FLAG_FITTIZIO#285,BR_COD_BRAND#286,BR_DESCRIPTION#287,BTS_COD_TREAD_PATTERN#288,BTS_DESCRIPTION#289,BU_COD_BU#290,BU_DESCRIPTION#291,BC_COD_BARCODE#292,BC_DESCRIPTION#293,CAL_COD_DIAMETER#294,CAL_DESCRIPTION#295,CL_COD_LOG_CATEGORY#296,CL_DESCRIPTION#297,CRD_COD_NOM_WIDTH#298,CRD_DESCRIPTION#299,GC_COD_COMM_GROUP#300,GC_DESCRIPTION_EN#301,IP5_COD_IP5#302,... 138 more fields] JDBCRelation(DIM_MATERIAL_MOVALL) [numPartitions=1] Join condition is missing or trivial. Use the CROSS JOIN syntax to allow cartesian products between these relations.;'
I also tried the following syntax, with explicit equality conditions on the aliased columns, but nothing changed:
from pyspark.sql.functions import col
kpi.join(mtd, on=[col("kpi.label") == col("mtd.label"), col("kpi.season") == col("mtd.season"), col("kpi.segment") == col("mtd.segment"), col("kpi.kind") == col("mtd.kind"), col("kpi.brand") == col("mtd.brand"), col("kpi.parent") == col("mtd.parent")])
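One interesting thing is that if I perform an outer join instead, the exception goes away, but no match is found in the right DataFrame even though a matching row exists: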
kpi.join(mtd, on=[col("kpi.label") == col("mtd.label"), col("kpi.season") == col("mtd.season"), col("kpi.segment") == col("mtd.segment"), col("kpi.kind") == col("mtd.kind"), col("kpi.brand") == col("mtd.brand"), col("kpi.parent") == col("mtd.parent")], how="outer").show()
+-----+------+-------+-----+-----+------+--------------------+---------------+
|label|season|segment| kind|brand|parent|                 kpi|            mtd|
+-----+------+-------+-----+-----+------+--------------------+---------------+
|world|   all|  total|world|  all|  null|Map(PY -> 1234, B...|           null|
+-----+------+-------+-----+-----+------+--------------------+---------------+
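One possible reason for the unmatched row, offered here as an assumption rather than a confirmed diagnosis: `parent` is null on both sides, and under SQL comparison semantics `null = null` evaluates to null (unknown) rather than true, so a join condition that includes `parent` would never be satisfied. The sketch below is a plain-Python model of that behavior and does not call Spark at all; the helper names `sql_eq`, `null_safe_eq`, and `rows_match` are hypothetical and exist only for illustration. Spark SQL spells the null-safe comparison `<=>`.

```python
# Plain-Python model of SQL equality semantics (no Spark involved).
# All function names here are hypothetical, for illustration only.

def sql_eq(a, b):
    """Model of SQL '=': the result is unknown (None) if either side is NULL."""
    if a is None or b is None:
        return None
    return a == b

def null_safe_eq(a, b):
    """Model of null-safe equality ('<=>' in Spark SQL): NULL matches NULL."""
    if a is None and b is None:
        return True
    if a is None or b is None:
        return False
    return a == b

KEYS = ["label", "season", "segment", "kind", "brand", "parent"]

# Key columns copied from the two show() outputs in the question.
kpi_row = {"label": "world", "season": "all", "segment": "total",
           "kind": "world", "brand": "all", "parent": None}
mtd_row = dict(kpi_row)

def rows_match(left, right, eq):
    # A join keeps a pair of rows only if every key comparison is exactly True.
    return all(eq(left[k], right[k]) is True for k in KEYS)

strict_match = rows_match(kpi_row, mtd_row, sql_eq)
null_safe_match = rows_match(kpi_row, mtd_row, null_safe_eq)
print(strict_match, null_safe_match)
```

If this model applies, the strict comparison rejects the pair because of the null `parent`, while the null-safe comparison accepts it. That would be consistent with the outer join output above, but it does not by itself explain the cartesian product exception on the inner join.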
What can I do to investigate this further?
Thanks in advance!