I have a dataframe in pyspark as shown below.
+-------------+-------+-----+----+----+----+----+
|serial_number|rest_id|value|body|legs|face|idle|
+-------------+-------+-----+----+----+----+----+
|sn11         |rs1    |N    |Y   |N   |N   |acde|
|sn1          |rs1    |N    |Y   |N   |N   |den |
|sn1          |null   |Y    |N   |Y   |N   |can |
|sn2          |rs2    |Y    |Y   |N   |N   |aeg |
|null         |rs2    |N    |Y   |N   |Y   |ueg |
+-------------+-------+-----+----+----+----+----+
Now I want to update some of the columns based on a check of certain column values. Whenever value is Y for any row of a given serial_number or rest_id, the value column of every row sharing that serial_number or rest_id should also be updated to Y; otherwise the rows keep whatever values they already have.
I have done the following:
from pyspark.sql.functions import col, when

df.alias('a') \
  .join(df.filter(col('value') == 'Y').alias('b'),
        on=(col('a.serial_number') == col('b.serial_number')) | (col('a.rest_id') == col('b.rest_id')),
        how='left') \
  .withColumn('final_value', when(col('b.value').isNull(), col('a.value')).otherwise(col('b.value'))) \
  .select('a.serial_number', 'a.rest_id', 'a.body', 'a.legs', 'a.face', 'a.idle', 'final_value')
I got the result I wanted.
Now I want to repeat the same for the columns body, legs and face. I could do the above for each of these columns individually, meaning 3 more join statements, but I would like to update all 4 columns in a single statement. How can I do that?
Expected result
+-------------+-------+-----+----+----+----+----+
|serial_number|rest_id|value|body|legs|face|idle|
+-------------+-------+-----+----+----+----+----+
|sn11         |rs1    |N    |Y   |N   |N   |acde|
|sn1          |rs1    |Y    |Y   |Y   |N   |den |
|sn1          |null   |Y    |Y   |Y   |N   |can |
|sn2          |rs2    |Y    |Y   |N   |Y   |aeg |
|null         |rs2    |Y    |Y   |N   |Y   |ueg |
+-------------+-------+-----+----+----+----+----+
Answer 0 (score: 1)
You should use window functions over the serial_number and rest_id columns to check whether a Y is present in each of the columns within that group (explained in the comments below).
#column names to loop over for the updates
columns = ["value","body","legs","face"]

from pyspark.sql import window as w
#window for serial number grouping
windowSpec1 = w.Window.partitionBy('serial_number').rowsBetween(w.Window.unboundedPreceding, w.Window.unboundedFollowing)
#window for rest id grouping
windowSpec2 = w.Window.partitionBy('rest_id').rowsBetween(w.Window.unboundedPreceding, w.Window.unboundedFollowing)

from pyspark.sql import functions as f
from pyspark.sql import types as t

#udf that checks whether Y is present in the list of values collected over the windows defined above
def containsUdf(x):
    return "Y" in x

containsUdfCall = f.udf(containsUdf, t.BooleanType())

#loop over the columns, collect each column's values within both windows,
#and set the column to Y whenever either group contains a Y
for column in columns:
    df = df.withColumn(column,
                       f.when(containsUdfCall(f.collect_list(column).over(windowSpec1)) |
                              containsUdfCall(f.collect_list(column).over(windowSpec2)), "Y")
                        .otherwise(df[column]))

df.show(truncate=False)
which should give you
+-------------+-------+-----+----+----+----+----+
|serial_number|rest_id|value|body|legs|face|idle|
+-------------+-------+-----+----+----+----+----+
|sn2 |rs2 |Y |Y |N |Y |aeg |
|null |rs2 |Y |Y |N |Y |ueg |
|sn11 |rs1 |N |Y |N |N |acde|
|sn1 |rs1 |Y |Y |Y |N |den |
|sn1 |null |Y |Y |Y |N |can |
+-------------+-------+-----+----+----+----+----+
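As a side note that is not part of the original answer: the Python udf above could probably be replaced with the built-in array_contains function from pyspark.sql.functions, which keeps the check inside the JVM instead of calling into Python for every row. A minimal sketch of that variant, reusing the columns list and the two windows defined above:

#sketch: same loop as above, but using the built-in array_contains instead of a Python udf
for column in columns:
    df = df.withColumn(column,
                       f.when(f.array_contains(f.collect_list(column).over(windowSpec1), "Y") |
                              f.array_contains(f.collect_list(column).over(windowSpec2), "Y"), "Y")
                        .otherwise(df[column]))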
I would suggest applying the window functions in two separate loops instead of both at once, since evaluating two window functions for every row at the same time may lead to memory exceptions on large datasets.
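One way to read that suggestion, as a sketch of my own rather than code from the original answer (the helper column suffix _sn_flag is made up for illustration): materialize the serial_number-window check into a helper column in a first loop, then combine it with the rest_id-window check in a second loop, so each row only evaluates one window function at a time while the result stays the same as the single-pass version.

#pass 1: for each column, store whether its serial_number group contains a Y
for column in columns:
    df = df.withColumn(column + "_sn_flag",
                       containsUdfCall(f.collect_list(column).over(windowSpec1)))

#pass 2: check the rest_id group, combine it with the stored flag, and drop the helper column
for column in columns:
    df = df.withColumn(column,
                       f.when(f.col(column + "_sn_flag") |
                              containsUdfCall(f.collect_list(column).over(windowSpec2)), "Y")
                        .otherwise(df[column])) \
           .drop(column + "_sn_flag")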