Given the following Python function:
def f(col):
    return col
If I convert it to a UDF and apply it to a column object, it works...
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType
df = spark.range(10)
udf = F.udf(f, returnType=DoubleType()).asNondeterministic()
df.withColumn('new', udf(F.lit(0))).show()
...unless the column is generated by rand:
df.withColumn('new', udf(F.rand())).show() # fails
However, both of the following work:
df.withColumn('new', F.rand()).show()
df.withColumn('new', F.rand()).withColumn('new2', udf(F.col('new'))).show()
The error:
Py4JJavaError: An error occurred while calling o469.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 20.0 failed 1 times, most recent failure: Lost task 0.0 in stage 20.0 (TID 34, localhost, executor driver): java.lang.NullPointerException
Why does this happen, and how can I use a rand column expression inside a UDF?
Answer (score: 12):
The core problem is that on the JVM side the rand() expression depends on a transient rng field that does not survive serialization/deserialization, combined with an eval implementation that is not null-safe (defined in the RDG class, of which Rand is a subclass). As far as I can tell, rand() and randn() are the only functions in Spark with these specific properties.
When you write udf(F.rand()), Spark evaluates it as a single PythonUDF expression, serializing the rand() call in the command pickle, so the initialized transient state is lost. This can be observed in the execution plan:
df.withColumn('new', udf(F.rand())).explain()
== Physical Plan ==
*(2) Project [id#0L, pythonUDF0#95 AS new#92]
+- BatchEvalPython [f(rand(-6878806567622466209))], [id#0L, pythonUDF0#95]
+- *(1) Range (0, 10, step=1, splits=8)
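The mechanism can be illustrated with a small pure-Python analogy (this is illustrative only, not Spark's actual implementation): an expression object with a lazily-initialized transient field that is dropped when the object is pickled, so evaluating the deserialized copy fails the same way the JVM hits a NullPointerException on the uninitialized rng.

```python
import pickle
import random

class RandExpr:
    """Toy analogue of Spark's Rand expression with a transient RNG field."""

    def __init__(self, seed):
        self.seed = seed
        self.rng = random.Random(seed)  # transient state, initialized eagerly here

    def __getstate__(self):
        # Mimic a Java @transient field: the rng is dropped on serialization.
        state = self.__dict__.copy()
        state['rng'] = None
        return state

    def eval(self):
        # A non-null-safe eval: blindly assumes rng is initialized.
        return self.rng.random()

expr = RandExpr(42)
print(expr.eval())  # works: rng was initialized in the constructor

# Serialize/deserialize, as happens when the expression is shipped in a pickle.
restored = pickle.loads(pickle.dumps(expr))
try:
    restored.eval()
except AttributeError as e:
    # Fails: rng is None after the round trip, analogous to the JVM NPE.
    print('eval failed after round-trip:', e)
```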
Unfortunately, this cannot be worked around without a fix in Spark that makes the Rand class null-safe. However, if you only need to generate random numbers, you can easily build your own rand() UDF around Python's random generator:
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType
from random import random
def f(col):
    return col
df = spark.range(10)
udf = F.udf(f, returnType=DoubleType()).asNondeterministic()
rand = F.udf(random, returnType=DoubleType()).asNondeterministic()
df.withColumn('new', udf(rand())).show()
+---+-------------------+
| id| new|
+---+-------------------+
| 0| 0.4384090392727712|
| 1| 0.5827392568376621|
| 2| 0.4249312702725516|
| 3| 0.8423409231783007|
| 4|0.39533981334524604|
| 5| 0.7073194901736066|
| 6|0.19176164335919255|
| 7| 0.7296698171715453|
| 8| 0.799510901886918|
| 9|0.12662129139761658|
+---+-------------------+
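If reproducible values are needed rather than fresh randomness, one option (an assumption on my part, not from the answer above) is to derive a pseudo-random value deterministically from each row's id. A minimal sketch of such a helper, where `seeded_rand` and its `seed` parameter are hypothetical names:

```python
import random

def seeded_rand(row_id, seed=42):
    # A fresh Random instance seeded per row gives reproducible output:
    # the same (seed, row_id) pair always yields the same value.
    return random.Random((seed << 32) | row_id).random()

print(seeded_rand(0))
print(seeded_rand(0))  # same value as above: deterministic per row id
print(seeded_rand(1))  # different row id, different value
```

In Spark this could be wrapped as a UDF and applied to the id column, e.g. F.udf(seeded_rand, DoubleType()) over F.col('id'), without marking it nondeterministic.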