My dataframe undergoes two consecutive filtering passes, each using a boolean-valued UDF. The first pass removes all rows whose values are not present as keys in a broadcast dictionary. The second pass imposes a threshold on the values that this dictionary associates with those keys.
If I display the result after just the first filtering, the row with 'c' is gone, as expected. However, attempting to display the result of the second filtering raises a KeyError for u'c':
from pyspark import SparkContext
from pyspark.sql import SparkSession
import pyspark.sql.functions as func
from pyspark.sql.types import BooleanType, StringType

sc = SparkContext()
ss = SparkSession(sc)
mydict = {"a": 4, "b": 6}
mydict_bc = sc.broadcast(mydict)
udf_indict = func.udf(lambda x: x in mydict_bc.value, BooleanType())
udf_bigenough = func.udf(lambda x: mydict_bc.value[x] > 5, BooleanType())
df = ss.createDataFrame(["a", "b", "c"], StringType()).toDF("name")
df1 = df.where(udf_indict('name'))
df1.show()
+----+
|name|
+----+
| a|
| b|
+----+
df1.where(udf_bigenough('name')).show()
KeyError: u'c'
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:144)
at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:87)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
...
I guess this has something to do with lazy evaluation and internal optimization, but is this really expected behavior?
Thanks
Answer (score: 1)
This:

"My dataframe undergoes two consecutive filtering passes"

is an incorrect assumption. Unlike the RDD API, where all transformations are applied exactly as written (WYSIWYG), the SQL API is purely declarative. It describes what has to be done, not how to do it. The optimizer is free to rearrange all the elements as it sees fit.
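A plain-Python analogy (not Spark code itself) of what this reordering can do: a declarative optimizer may fuse both filters into a single pass and evaluate the predicates in either order, so the threshold check can see rows that the membership check was supposed to have removed.

```python
mydict = {"a": 4, "b": 6}
rows = ["a", "b", "c"]

# What the question assumes: two sequential passes.
step1 = [x for x in rows if x in mydict]     # 'c' removed here
step2 = [x for x in step1 if mydict[x] > 5]  # safe: only 'a' and 'b' remain

# What a fused plan may effectively do: both predicates in one pass,
# with the threshold evaluated first.
caught = None
try:
    fused = [x for x in rows if mydict[x] > 5 and x in mydict]
except KeyError as exc:                      # raised for 'c'
    caught = exc

print(step2, caught)
```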
Using nondeterministic variants of the UDFs disables this optimization:

df1 = df.where(udf_indict.asNondeterministic()('name'))
df1.where(udf_bigenough.asNondeterministic()('name')).show()
but you should really handle the exception:

@func.udf(BooleanType())
def udf_bigenough(x):
    try:
        return mydict_bc.value.get(x) > 5
    except TypeError:
        return False
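The except TypeError clause is doing real work here: dict.get returns None for a missing key instead of raising KeyError, and on Python 3 comparing None > 5 raises TypeError (on Python 2 it quietly evaluated to False). A minimal plain-Python sketch of the same pattern, outside Spark:

```python
mydict = {"a": 4, "b": 6}

def bigenough(x):
    # dict.get returns None for a missing key instead of raising KeyError
    try:
        return mydict.get(x) > 5
    except TypeError:  # None > 5 raises TypeError on Python 3
        return False

print([x for x in ["a", "b", "c"] if bigenough(x)])
```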
or, better yet, don't use a udf at all.