Create a new dataframe by finding swapped ids in PySpark

Time: 2018-06-11 06:14:07

Tags: python-3.x apache-spark pyspark

My dataframe looks like this:

df = spark.createDataFrame(
    [
        (0, "departmentcode__50~#~p99189h8pk0__10483~#~prod_productcolor__Dustysalmon Pink", "departmentcode__50~#~p99189h8pk0__10483~#~prod_productcolor__Dustysalmon Blue"),
        (1, "departmentcode__10~#~p99189h8pk0__10484~#~prod_productcolor__Dustysalmon Black", "departmentcode__50~#~p99189h8pk0__10483~#~prod_productcolor__Dustysalmon Blue"),
        (2, "departmentcode__60~#~p99189h8pk0__10485~#~prod_productcolor__Dustysalmon White", "departmentcode__50~#~p99189h8pk0__10483~#~prod_productcolor__Dustysalmon Blue"),
        (3, "departmentcode__50~#~p99189h8pk0__10483~#~prod_productcolor__Dustysalmon Blue", "departmentcode__50~#~p99189h8pk0__10483~#~prod_productcolor__Dustysalmon Pink"),
    ],
    ["id", "left", "right"],
)

I need to create a new dataframe that looks like this:

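Here the left and right values are swapped between id 0 and id 3. In that case I need to create a new column called new_id, where new_id is the id of the row with the swapped values. (For id 0 it is 3, and for id 3 new_id is 0. For the rest it is null, that is, whenever no match is found.)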

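+---+------------------------------------------------------------------------------+-----------------------------------------------------------------------------+------+
|id |left                                                                          |right                                                                        |new_id|
+---+------------------------------------------------------------------------------+-----------------------------------------------------------------------------+------+
|0  |departmentcode__50~#~p99189h8pk0__10483~#~prod_productcolor__Dustysalmon Pink |departmentcode__50~#~p99189h8pk0__10483~#~prod_productcolor__Dustysalmon Blue|3     |
|1  |departmentcode__10~#~p99189h8pk0__10484~#~prod_productcolor__Dustysalmon Black|departmentcode__50~#~p99189h8pk0__10483~#~prod_productcolor__Dustysalmon Blue|null  |
|2  |departmentcode__60~#~p99189h8pk0__10485~#~prod_productcolor__Dustysalmon White|departmentcode__50~#~p99189h8pk0__10483~#~prod_productcolor__Dustysalmon Blue|null  |
|3  |departmentcode__50~#~p99189h8pk0__10483~#~prod_productcolor__Dustysalmon Blue |departmentcode__50~#~p99189h8pk0__10483~#~prod_productcolor__Dustysalmon Pink|0     |
+---+------------------------------------------------------------------------------+-----------------------------------------------------------------------------+------+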

1 Answer:

Answer 0: (score: 1)

All you need is a left self-join with the following condition:

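from pyspark.sql import functions as f

# Match each row (df1) with the row (df2) whose left/right values are swapped.
swapped = (f.col('df1.left') == f.col('df2.right')) & (f.col('df1.right') == f.col('df2.left'))
# The left join keeps every df1 row; new_id is null where no swapped row exists.
df.alias('df1').join(df.alias('df2'), on=swapped, how='left')\
    .select(f.col('df1.id'), f.col('df1.left'), f.col('df1.right'), f.col('df2.id').alias('new_id'))\
    .show(truncate=False)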

which should give you

+---+------------------------------------------------------------------------------+-----------------------------------------------------------------------------+------+
|id |left                                                                          |right                                                                        |new_id|
+---+------------------------------------------------------------------------------+-----------------------------------------------------------------------------+------+
|2  |departmentcode__60~#~p99189h8pk0__10485~#~prod_productcolor__Dustysalmon White|departmentcode__50~#~p99189h8pk0__10483~#~prod_productcolor__Dustysalmon Blue|null  |
|0  |departmentcode__50~#~p99189h8pk0__10483~#~prod_productcolor__Dustysalmon Pink |departmentcode__50~#~p99189h8pk0__10483~#~prod_productcolor__Dustysalmon Blue|3     |
|3  |departmentcode__50~#~p99189h8pk0__10483~#~prod_productcolor__Dustysalmon Blue |departmentcode__50~#~p99189h8pk0__10483~#~prod_productcolor__Dustysalmon Pink|0     |
|1  |departmentcode__10~#~p99189h8pk0__10484~#~prod_productcolor__Dustysalmon Black|departmentcode__50~#~p99189h8pk0__10483~#~prod_productcolor__Dustysalmon Blue|null  |
+---+------------------------------------------------------------------------------+-----------------------------------------------------------------------------+------+
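To see why the join condition produces these new_id values, the same lookup can be sketched in plain Python without a Spark session. This is only an illustration: the function name find_swapped_ids and the short strings "Pink", "Blue", etc. are hypothetical stand-ins for the long left/right values, and the sketch assumes each (left, right) pair occurs only once.

```python
# Plain-Python sketch of the lookup that the left self-join performs.
# Short color names stand in for the long left/right strings (assumption).
rows = [
    (0, "Pink", "Blue"),
    (1, "Black", "Blue"),
    (2, "White", "Blue"),
    (3, "Blue", "Pink"),
]


def find_swapped_ids(rows):
    """Return (id, left, right, new_id) tuples; new_id is None when no swap exists."""
    # Index every row by its (left, right) pair; assumes pairs are unique.
    by_pair = {(left, right): row_id for row_id, left, right in rows}
    # For each row, look up the row whose pair is reversed.
    return [
        (row_id, left, right, by_pair.get((right, left)))
        for row_id, left, right in rows
    ]


result = find_swapped_ids(rows)
for row in result:
    print(row)
```

Ids 0 and 3 point at each other and the other rows get None, which corresponds to null in the Spark output. If the same (left, right) pair could appear on several rows, the Spark join would return one output row per match, whereas this dictionary would keep only the last id, so the data may need deduplicating first.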

I hope the answer is helpful.