Update multiple columns based on two columns in a pyspark dataframe

Asked: 2018-05-30 04:52:47

Tags: python apache-spark pyspark

I have a dataframe in pyspark that looks like this.

+--------------------+--------------+------------+-----------+-----------+-----------+-----------+
|     serial_number  |     rest_id  |     value  |     body  |     legs  |     face  |     idle  |
+--------------------+--------------+------------+-----------+-----------+-----------+-----------+
| sn11               | rs1          | N          | Y         | N         | N         | acde      |
| sn1                | rs1          | N          | Y         | N         | N         | den       |
| sn1                | null         | Y          | N         | Y         | N         | can       |
| sn2                | rs2          | Y          | Y         | N         | N         | aeg       |
| null               | rs2          | N          | Y         | N         | Y         | ueg       |
+--------------------+--------------+------------+-----------+-----------+-----------+-----------+
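For anyone who wants to reproduce this, here is a minimal sketch of how the sample dataframe could be built (the SparkSession setup and the use of Python None for the null cells are assumptions of mine, not part of the original question):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

#sample rows matching the table above; None stands in for null
data = [("sn11", "rs1", "N", "Y", "N", "N", "acde"),
        ("sn1",  "rs1", "N", "Y", "N", "N", "den"),
        ("sn1",  None,  "Y", "N", "Y", "N", "can"),
        ("sn2",  "rs2", "Y", "Y", "N", "N", "aeg"),
        (None,   "rs2", "N", "Y", "N", "Y", "ueg")]
cols = ["serial_number", "rest_id", "value", "body", "legs", "face", "idle"]
df = spark.createDataFrame(data, cols)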

Now I want to update some of these columns based on checks against other column values.

Whenever any row with a given serial_number or rest_id has value set to Y, then value for every row sharing that serial_number or rest_id should also be updated to Y; otherwise the rows keep whatever value they already have.

I have done the following for the value column.

from pyspark.sql.functions import col, when

df.alias('a').join(df.filter(col('value') == 'Y').alias('b'), on=(col('a.serial_number') == col('b.serial_number')) | (col('a.rest_id') == col('b.rest_id')), how='left').withColumn('final_value', when(col('b.value').isNull(), col('a.value')).otherwise(col('b.value'))).select('a.serial_number', 'a.rest_id', 'a.body', 'a.legs', 'a.face', 'a.idle', 'final_value')

This gives me the result I want.

Now I want to repeat the same for the columns body, legs and face.

I could do the above for each column individually, which would mean three more join statements, but I want to update all four columns in a single statement.

How can I do that?

Expected result

+--------------------+--------------+------------+-----------+-----------+-----------+-----------+
|     serial_number  |     rest_id  |     value  |     body  |     legs  |     face  |     idle  |
+--------------------+--------------+------------+-----------+-----------+-----------+-----------+
| sn11               | rs1          | N          | Y         | N         | N         | acde      |
| sn1                | rs1          | Y          | Y         | Y         | N         | den       |
| sn1                | null         | Y          | Y         | Y         | N         | can       |
| sn2                | rs2          | Y          | Y         | N         | Y         | aeg       |
| null               | rs2          | Y          | Y         | N         | Y         | ueg       |
+--------------------+--------------+------------+-----------+-----------+-----------+-----------+

1 Answer:

Answer 0 (score: 1)

You should use window functions over both the serial_number and rest_id columns to check whether a Y exists for the column within each group (explained in the comments below).

#column names to loop over for the updates
columns = ["value","body","legs","face"]
from pyspark.sql import window as w
#window for serial number grouping, spanning the whole partition
windowSpec1 = w.Window.partitionBy('serial_number').rowsBetween(w.Window.unboundedPreceding, w.Window.unboundedFollowing)
#window for rest id grouping, spanning the whole partition
windowSpec2 = w.Window.partitionBy('rest_id').rowsBetween(w.Window.unboundedPreceding, w.Window.unboundedFollowing)

from pyspark.sql import functions as f
from pyspark.sql import types as t
#udf that checks whether Y appears in the list of values collected over a window
def containsUdf(x):
    return "Y" in x

containsUdfCall = f.udf(containsUdf, t.BooleanType())

#loop over the columns, collecting each column's values within both windows and setting the column to Y when either group contains a Y
for column in columns:
    df = df.withColumn(column, f.when(containsUdfCall(f.collect_list(column).over(windowSpec1)) | containsUdfCall(f.collect_list(column).over(windowSpec2)), "Y").otherwise(df[column]))

df.show(truncate=False)

which should give you

+-------------+-------+-----+----+----+----+----+
|serial_number|rest_id|value|body|legs|face|idle|
+-------------+-------+-----+----+----+----+----+
|sn2          |rs2    |Y    |Y   |N   |Y   |aeg |
|null         |rs2    |Y    |Y   |N   |Y   |ueg |
|sn11         |rs1    |N    |Y   |N   |N   |acde|
|sn1          |rs1    |Y    |Y   |Y   |N   |den |
|sn1          |null   |Y    |Y   |Y   |N   |can |
+-------------+-------+-----+----+----+----+----+
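As a side note on the design, the Python udf could arguably be replaced with the built-in array_contains function, which keeps the check inside the JVM. A hedged variant of the loop (not part of the original answer) could look like this:

#same loop, but using the built-in array_contains on the collected list instead of the udf
for column in columns:
    df = df.withColumn(column,
                       f.when(f.array_contains(f.collect_list(column).over(windowSpec1), "Y") |
                              f.array_contains(f.collect_list(column).over(windowSpec2), "Y"), "Y")
                        .otherwise(df[column]))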

I would suggest applying the window functions in two separate loops, because evaluating both window functions for every row at the same time may cause memory exceptions on large data; one way to split the work is sketched below.
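A minimal sketch of that two-pass idea, reusing the columns list, windows and udf defined above (the *_sn_flag / *_rest_flag helper columns are hypothetical names of mine; both flags are computed from the original values before any column is overwritten, so the final result should match the single-pass version):

#first pass: flag, per column, whether the serial_number group contains a Y
for column in columns:
    df = df.withColumn(column + "_sn_flag",
                       containsUdfCall(f.collect_list(column).over(windowSpec1)))

#second pass: flag the rest_id groups, combine both flags and drop the helper columns
for column in columns:
    df = df.withColumn(column + "_rest_flag",
                       containsUdfCall(f.collect_list(column).over(windowSpec2))) \
           .withColumn(column, f.when(f.col(column + "_sn_flag") | f.col(column + "_rest_flag"), "Y")
                                .otherwise(f.col(column))) \
           .drop(column + "_sn_flag", column + "_rest_flag")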