My DataFrame looks like this:
df = spark.createDataFrame(
    [
        (0, "departmentcode__50~#~p99189h8pk0__10483~#~prod_productcolor__Dustysalmon Pink", "departmentcode__50~#~p99189h8pk0__10483~#~prod_productcolor__Dustysalmon Blue"),
        (1, "departmentcode__10~#~p99189h8pk0__10484~#~prod_productcolor__Dustysalmon Black", "departmentcode__50~#~p99189h8pk0__10483~#~prod_productcolor__Dustysalmon Blue"),
        (2, "departmentcode__60~#~p99189h8pk0__10485~#~prod_productcolor__Dustysalmon White", "departmentcode__50~#~p99189h8pk0__10483~#~prod_productcolor__Dustysalmon Blue"),
        (3, "departmentcode__50~#~p99189h8pk0__10483~#~prod_productcolor__Dustysalmon Blue", "departmentcode__50~#~p99189h8pk0__10483~#~prod_productcolor__Dustysalmon Pink"),
    ],
    ["id", "left", "right"],
)
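I need to create a new DataFrame like the one below.
Here, the left and right values are swapped between id 0 and id 3. In that case I need a new column named new_id that holds the id of the swapped counterpart (3 for id 0, and 0 for id 3). For all other rows new_id is null, because no match is found.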
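+---+------------------------------------------------------------------------------+-----------------------------------------------------------------------------+------+
|id |left                                                                          |right                                                                        |new_id|
+---+------------------------------------------------------------------------------+-----------------------------------------------------------------------------+------+
|0  |departmentcode__50~#~p99189h8pk0__10483~#~prod_productcolor__Dustysalmon Pink |departmentcode__50~#~p99189h8pk0__10483~#~prod_productcolor__Dustysalmon Blue|3     |
|1  |departmentcode__10~#~p99189h8pk0__10484~#~prod_productcolor__Dustysalmon Black|departmentcode__50~#~p99189h8pk0__10483~#~prod_productcolor__Dustysalmon Blue|null  |
|2  |departmentcode__60~#~p99189h8pk0__10485~#~prod_productcolor__Dustysalmon White|departmentcode__50~#~p99189h8pk0__10483~#~prod_productcolor__Dustysalmon Blue|null  |
|3  |departmentcode__50~#~p99189h8pk0__10483~#~prod_productcolor__Dustysalmon Blue |departmentcode__50~#~p99189h8pk0__10483~#~prod_productcolor__Dustysalmon Pink|0     |
+---+------------------------------------------------------------------------------+-----------------------------------------------------------------------------+------+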
Answer 0 (score: 1)
You just need a left self-join with the following condition:
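from pyspark.sql import functions as f

# A df2 row matches a df1 row when its left/right values are df1's values swapped.
swapped = (f.col('df1.left') == f.col('df2.right')) & (f.col('df1.right') == f.col('df2.left'))
# Left join keeps every df1 row; rows without a swapped partner get new_id = null.
df.alias('df1').join(df.alias('df2'), on=swapped, how='left')\
    .select(f.col('df1.id'), f.col('df1.left'), f.col('df1.right'), f.col('df2.id').alias('new_id'))\
    .show(truncate=False)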
which should give you:
+---+------------------------------------------------------------------------------+-----------------------------------------------------------------------------+------+
|id |left |right |new_id|
+---+------------------------------------------------------------------------------+-----------------------------------------------------------------------------+------+
|2 |departmentcode__60~#~p99189h8pk0__10485~#~prod_productcolor__Dustysalmon White|departmentcode__50~#~p99189h8pk0__10483~#~prod_productcolor__Dustysalmon Blue|null |
|0 |departmentcode__50~#~p99189h8pk0__10483~#~prod_productcolor__Dustysalmon Pink |departmentcode__50~#~p99189h8pk0__10483~#~prod_productcolor__Dustysalmon Blue|3 |
|3 |departmentcode__50~#~p99189h8pk0__10483~#~prod_productcolor__Dustysalmon Blue |departmentcode__50~#~p99189h8pk0__10483~#~prod_productcolor__Dustysalmon Pink|0 |
|1 |departmentcode__10~#~p99189h8pk0__10484~#~prod_productcolor__Dustysalmon Black|departmentcode__50~#~p99189h8pk0__10483~#~prod_productcolor__Dustysalmon Blue|null |
+---+------------------------------------------------------------------------------+-----------------------------------------------------------------------------+------+
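To make the join condition easier to follow, here is a minimal plain-Python sketch of the same matching logic, without Spark. It is only an illustration, not part of the Spark solution: the short strings ("Pink", "Blue", and so on) are placeholders standing in for the long left/right values, and the names rows, id_by_pair and result are made up for this example.

```python
# Plain-Python illustration of the swapped-pair match (no Spark required).
# Placeholder strings stand in for the long left/right values in the question.
rows = [
    (0, "Pink", "Blue"),
    (1, "Black", "Blue"),
    (2, "White", "Blue"),
    (3, "Blue", "Pink"),
]

# Index each row id by its (left, right) pair; this plays the role of df2.
id_by_pair = {(left, right): row_id for row_id, left, right in rows}

# For each row, look up the row whose pair is the swap of this one.
# dict.get returns None when nothing matches, mirroring null from the left join.
result = [
    (row_id, left, right, id_by_pair.get((right, left)))
    for row_id, left, right in rows
]

for row in result:
    print(row)
```

One difference worth keeping in mind: a dict holds a single id per pair, whereas the Spark left join would emit one output row per matching df2 row if several rows shared the same swapped pair.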
I hope this answer helps.