Question

Suppose I have a DataFrame and want to apply some operations to its data inside a function, depending on the values in each row.

def my_udf(row):
    threshold = 10
    if row.val_x > threshold:
        row.val_x = another_function(row.val_x)
        row.val_y = another_function(row.val_y)
        return row
    else:
        return row

Does anyone know how to apply my UDF to the DataFrame?

Answer 1

If you can use PySpark functions, it is best not to use a UDF. If you cannot translate another_function into PySpark functions, you can do this:

from pyspark.sql.types import *
import pyspark.sql.functions as psf

def another_function(val):
    ...

another_function_udf = psf.udf(another_function, [outputType()])

where outputType() is the PySpark type corresponding to the output of another_function (IntegerType(), StringType(), ...).

threshold = 10

def apply_another_function(val):
    # Apply the UDF only on rows where val_x exceeds the threshold
    return psf.when(df.val_x > threshold, another_function_udf(val)).otherwise(val)

df = df.withColumn('val_y', apply_another_function(df.val_y))\
       .withColumn('val_x', apply_another_function(df.val_x))
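If another_function can be written with built-in column operations, the UDF can be dropped entirely. Here is a minimal sketch of that case. It assumes, purely for illustration, that another_function doubles its input (the question does not say what it really does); THRESHOLD and conditional_update are hypothetical names as well.

```python
THRESHOLD = 10


def another_function(val):
    # Assumption for illustration only: the real another_function is not shown.
    # Plain arithmetic like this works on numbers and on pyspark Column objects.
    return val * 2


def conditional_update(df, threshold=THRESHOLD):
    """Update val_x and val_y on rows where val_x > threshold, without a UDF."""
    # Imported here so the module can be loaded on a machine without Spark
    from pyspark.sql import functions as F

    cond = F.col("val_x") > threshold
    # val_y is updated first so that the condition still sees the original val_x
    return (
        df.withColumn(
            "val_y",
            F.when(cond, another_function(F.col("val_y"))).otherwise(F.col("val_y")),
        ).withColumn(
            "val_x",
            F.when(cond, another_function(F.col("val_x"))).otherwise(F.col("val_x")),
        )
    )
```

You would call it as df = conditional_update(df). Built-in expressions stay inside the JVM, so they usually run faster than a Python UDF.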

Answer 2

As I understand it, the arguments to a UDF are column names. Your example could be rewritten as follows:

from pyspark.sql.functions import udf, array
from pyspark.sql.types import IntegerType

def change_val_x(val_x):
    threshold = 10
    if val_x > threshold:
        return another_function(val_x)
    else:
        return val_x

def change_val_y(arr):
    threshold = 10
    # arr[0] -> val_x, arr[1] -> val_y
    if arr[0] > threshold:
        return another_function(arr[1])
    else:
        return arr[1]

change_val_x_udf = udf(change_val_x, IntegerType())
change_val_y_udf = udf(change_val_y, IntegerType())

# apply these functions to your dataframe
df = df.withColumn('val_y', change_val_y_udf(array('val_x', 'val_y')))\
       .withColumn('val_x', change_val_x_udf('val_x'))

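To modify the val_x column a simple UDF is enough, but for val_y you need the values of both the val_y and val_x columns; the workaround is to use array. Note that this code has not been tested...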

See this question for applying a UDF to multiple columns.
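A UDF can also take several columns as separate arguments, which avoids packing them into an array (array needs all of its elements to share one type). The following is a rough sketch under the same assumptions as the answer above: integer columns, plus a hypothetical another_function that doubles its input. THRESHOLD and apply_change_val_y are illustrative names, not part of the original code.

```python
THRESHOLD = 10


def another_function(val):
    # Hypothetical stand-in: the real another_function is not shown in the question
    return val * 2


def change_val_y(val_x, val_y):
    """Return the new val_y; the decision is based on val_x."""
    # Spark passes SQL NULL as None, and None > 10 raises TypeError in Python 3
    if val_x is not None and val_y is not None and val_x > THRESHOLD:
        return another_function(val_y)
    return val_y


def apply_change_val_y(df):
    """Wrap change_val_y in a two-argument UDF and apply it to df."""
    # Imported here so the plain function above can be tested without Spark
    from pyspark.sql.functions import udf
    from pyspark.sql.types import IntegerType

    change_val_y_udf = udf(change_val_y, IntegerType())
    # Each column name becomes one positional argument of change_val_y
    return df.withColumn("val_y", change_val_y_udf("val_x", "val_y"))
```

Keeping the row logic in an ordinary Python function means it can be checked without a Spark session, for example change_val_y(11, 3) returns 6 under the doubling assumption.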
