I have a DataFrame that looks like this:
membershipAccountNbr cntryRetailChannelCustId
111590058 1010015900581000010101
214100897 1010041008972100010101
104100897 1010041008971000010101
and another one that looks like this:
membershipAccountNbr parentMembershipNbr
111590058 111590058
214100897 104100897
My goal is to end up with this:
membershipAccountNbr parentMembershipNbr parentCustId
111590058 111590058 1010015900581000010101
214100897 104100897 1010041008971000010101
I tried using joins, but they raise an ambiguity error. I'm new to PySpark, so please help.
Answer 0 (score: 1)
Assuming df1 is
+--------------------+------------------------+
|membershipAccountNbr|cntryRetailChannelCustId|
+--------------------+------------------------+
|           111590058|    10100159005810000...|
|           214100897|    10100410089721000...|
|           104100897|    10100410089710000...|
+--------------------+------------------------+
and df2 is
+--------------------+-------------------+
|membershipAccountNbr|parentMembershipNbr|
+--------------------+-------------------+
|           111590058|          111590058|
|           214100897|          104100897|
+--------------------+-------------------+
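Note that parentCustId is the customer ID of the parent account, so df1 has to be matched on parentMembershipNbr, not on membershipAccountNbr. Renaming df1's columns before the join also avoids the ambiguity error, because no column name appears twice. You can then run
from pyspark.sql.functions import col
# Rename df1's columns to the parent side; this also avoids duplicate column names.
parents = df1.select(
    col("membershipAccountNbr").alias("parentMembershipNbr"),
    col("cntryRetailChannelCustId").alias("parentCustId"),
)
# Left join keeps every row of df2 and looks up each parent's customer ID.
df2.join(parents, on="parentMembershipNbr", how="left").select(
    "membershipAccountNbr", "parentMembershipNbr", "parentCustId"
).show()
The result will look like this (row order may vary):
+--------------------+-------------------+--------------------+
|membershipAccountNbr|parentMembershipNbr|        parentCustId|
+--------------------+-------------------+--------------------+
|           111590058|          111590058|10100159005810000...|
|           214100897|          104100897|10100410089710000...|
+--------------------+-------------------+--------------------+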
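As a sanity check on which key the lookup should use, here is a minimal plain-Python sketch (not PySpark) built from the sample rows in the question. The names cust_id_by_account, memberships and result are hypothetical and exist only for illustration. The point is that parentCustId comes from looking up parentMembershipNbr, not membershipAccountNbr.

```python
# Sample rows from the question: account number -> customer ID (df1).
cust_id_by_account = {
    "111590058": "1010015900581000010101",
    "214100897": "1010041008972100010101",
    "104100897": "1010041008971000010101",
}

# Sample rows from the question:
# (membershipAccountNbr, parentMembershipNbr) pairs (df2).
memberships = [
    ("111590058", "111590058"),
    ("214100897", "104100897"),
]

# Look up the customer ID by the *parent* number. dict.get() returns None
# for a missing parent, which mirrors the null a left join would produce.
result = [
    (account, parent, cust_id_by_account.get(parent))
    for account, parent in memberships
]

for row in result:
    print(*row)
```

For account 214100897 this yields the customer ID of its parent 104100897, which matches the desired output in the question. Looking up by membershipAccountNbr instead would return the account's own customer ID.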