Why does a PySpark UDF fail when run on a column generated by rand()?

Asked: 2019-04-24 05:59:57

Tags: python apache-spark pyspark

Given the following Python function:

def f(col):
    return col

If I convert it to a UDF and apply it to a column object, it works...

from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

df = spark.range(10)
udf = F.udf(f, returnType=DoubleType()).asNondeterministic()

df.withColumn('new', udf(F.lit(0))).show()

...unless the column was generated by rand:

df.withColumn('new', udf(F.rand())).show()  # fails

However, both of the following work:

df.withColumn('new', F.rand()).show()
df.withColumn('new', F.rand()).withColumn('new2', udf(F.col('new'))).show()

The error:

Py4JJavaError: An error occurred while calling o469.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 20.0 failed 1 times, most recent failure: Lost task 0.0 in stage 20.0 (TID 34, localhost, executor driver): java.lang.NullPointerException

Why does this happen, and how can I use a column expression created by rand inside a UDF?

1 answer:

Answer 0 (score: 12)

The core problem is that on the JVM side the rand() function depends on a transient rng variable, which does not survive serialization/deserialization, leaving the eval implementation broken (it is defined in the RDG class, of which Rand is a subclass, here). As far as I can tell, rand() and randn() are the only functions in Spark with these particular properties.
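To make the mechanism concrete, here is a small pure-Python analogy (PseudoRand is a hypothetical class, not Spark code): an object whose random generator is excluded from serialization, mirroring a Java transient field. After a pickle round trip the generator is gone, and evaluating the object fails, much like the NullPointerException thrown by Rand's eval on the executor.

```python
import pickle
import random

class PseudoRand:
    """Toy analogy of Spark's Rand expression (hypothetical, not Spark code)."""

    def __init__(self, seed):
        self.seed = seed
        self.rng = random.Random(seed)

    def __getstate__(self):
        # Drop the generator on serialization, as a Java transient
        # field is dropped when the expression is shipped to executors.
        state = self.__dict__.copy()
        state['rng'] = None
        return state

    def eval(self):
        # Fails after deserialization because rng was not restored.
        return self.rng.random()

r = PseudoRand(42)
r.eval()                               # works on the "driver" copy
r2 = pickle.loads(pickle.dumps(r))     # simulate shipping to an executor
# r2.eval() would now raise AttributeError, since r2.rng is None
```

The Spark fix would be to reinitialize the field after deserialization (or make eval null-safe); the toy class above simply never does.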

When you write udf(F.rand()), Spark evaluates it as a single PythonUDF expression, serializing the rand() call inside command_pickle and losing the initialized transient along the way. This can be observed in the execution plan:

df.withColumn('new', udf(F.rand())).explain()

== Physical Plan ==
*(2) Project [id#0L, pythonUDF0#95 AS new#92]
+- BatchEvalPython [f(rand(-6878806567622466209))], [id#0L, pythonUDF0#95]
   +- *(1) Range (0, 10, step=1, splits=8)

Unfortunately, there is no way around this without a fix in Spark itself that makes the Rand class null-safe. However, if you only need to generate random numbers, you can easily build your own rand() UDF around Python's random generator:

from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType
from random import random

def f(col):
    return col

df = spark.range(10)
udf = F.udf(f, returnType=DoubleType()).asNondeterministic()
rand = F.udf(random, returnType=DoubleType()).asNondeterministic()

df.withColumn('new', udf(rand())).show()

+---+-------------------+
| id|                new|
+---+-------------------+
|  0| 0.4384090392727712|
|  1| 0.5827392568376621|
|  2| 0.4249312702725516|
|  3| 0.8423409231783007|
|  4|0.39533981334524604|
|  5| 0.7073194901736066|
|  6|0.19176164335919255|
|  7| 0.7296698171715453|
|  8|  0.799510901886918|
|  9|0.12662129139761658|
+---+-------------------+
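A side note on the workaround above: asNondeterministic() only tells the optimizer not to deduplicate or cache the UDF's results; the randomness itself comes from the wrapped callable. If you want a reproducible stream, one option is to wrap a dedicated, seeded Random instance instead of the module-level random(). This is a pure-Python sketch (make_rand_fn is a hypothetical helper, not part of any API); bear in mind that in Spark each executor process deserializes its own copy of the UDF, so a fixed seed would repeat the same sequence in every partition.

```python
from random import Random

def make_rand_fn(seed):
    """Return a zero-argument callable backed by its own seeded generator."""
    rng = Random(seed)
    return rng.random  # same shape as the module-level random used above

fn = make_rand_fn(42)
sample = [fn() for _ in range(3)]  # same seed reproduces the same sequence
```

In Spark you would register it as F.udf(make_rand_fn(42), DoubleType()).asNondeterministic(); whether a fixed seed is acceptable depends on whether repeated per-partition sequences are a problem for your use case.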