pyspark: get a column based on another record

Date: 2019-04-18 00:42:13

Tags: python join pyspark

I have a dataframe that looks like this

membershipAccountNbr            cntryRetailChannelCustId
111590058               1010015900581000010101
214100897               1010041008972100010101
104100897               1010041008971000010101

and another one that looks like this

membershipAccountNbr    parentMembershipNbr
111590058                   111590058
214100897                   104100897

My goal is to make it look like this:

membershipAccountNbr parentMembershipNbr parentCustId
111590058               111590058    1010015900581000010101
214100897               104100897    1010041008971000010101

I tried using joins, but they give an ambiguity error. I am new to PySpark, so please help.
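The ambiguity error usually comes from joining on a column expression and then referring to the shared column name without qualifying it. A hypothetical reproduction of that attempt, using the two frames above as df1 and df2:

# Both frames contain a membershipAccountNbr column, so after an expression
# join an unqualified reference to that name cannot be resolved.
joined = df1.join(df2, df1.membershipAccountNbr == df2.membershipAccountNbr)
joined.select("membershipAccountNbr")
# AnalysisException: Reference 'membershipAccountNbr' is ambiguous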

1 Answer:

Answer 0: (score: 1)

Assuming df1 is

+--------------------+------------------------+
|membershipAccountNbr|cntryRetailChannelCustId|
+--------------------+------------------------+
|           111590058|    10100159005810000...|
|           214100897|    10100410089721000...|
|           104100897|    10100410089710000...|
+--------------------+------------------------+

and df2 is

+--------------------+-------------------+
|membershipAccountNbr|parentMembershipNbr|
+--------------------+-------------------+
|           111590058|          111590058|
|           214100897|          104100897|
+--------------------+-------------------+
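For completeness, the two sample frames could be built with something like the following; this is only a sketch, assuming an existing SparkSession named spark and keeping every column as a string:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# account number -> country/retail-channel cust id
df1 = spark.createDataFrame(
    [
        ("111590058", "1010015900581000010101"),
        ("214100897", "1010041008972100010101"),
        ("104100897", "1010041008971000010101"),
    ],
    ["membershipAccountNbr", "cntryRetailChannelCustId"],
)

# account number -> parent account number
df2 = spark.createDataFrame(
    [
        ("111590058", "111590058"),
        ("214100897", "104100897"),
    ],
    ["membershipAccountNbr", "parentMembershipNbr"],
)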

Then, because the parent cust id has to be looked up from df1 via parentMembershipNbr, and both frames contain a membershipAccountNbr column, alias the two frames before joining (which avoids the ambiguous-reference error) and run

from pyspark.sql.functions import col

result = df2.alias("d2").join(
    df1.alias("d1"), col("d2.parentMembershipNbr") == col("d1.membershipAccountNbr"), "left"
).select(
    col("d2.membershipAccountNbr"),
    col("d2.parentMembershipNbr"),
    col("d1.cntryRetailChannelCustId").alias("parentCustId"),
)
result.show()

The result will look like this,

+--------------------+-------------------+--------------------+
|membershipAccountNbr|parentMembershipNbr|        parentCustId|
+--------------------+-------------------+--------------------+
|           111590058|          111590058|10100159005810000...|
|           214100897|          104100897|10100410089710000...|
+--------------------+-------------------+--------------------+
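Note that show() truncates values longer than 20 characters, which is why the cust ids end in "..."; to check the full 22-digit values you can disable truncation:

result.show(truncate=False)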