pyspark dataframe filtering doesn't really remove rows?

Date: 2018-01-27 04:58:44

Tags: apache-spark dataframe pyspark apache-spark-sql user-defined-functions

My dataframe goes through two consecutive filtering passes, each using a boolean-valued UDF. The first pass removes all rows whose 'name' value is not present as a key in a broadcast dictionary. The second pass imposes a threshold on the values that the dictionary associates with the surviving keys.

If I display the result after just the first filtering, the row with 'c' is gone, as expected. However, attempting to display the result of the second filtering raises a KeyError for u'c':

from pyspark import SparkContext
from pyspark.sql import SparkSession
from pyspark.sql import functions as func
from pyspark.sql.types import BooleanType, StringType

sc = SparkContext()
ss = SparkSession(sc)

mydict = {"a": 4, "b": 6}
mydict_bc = sc.broadcast(mydict)

udf_indict = func.udf(lambda x: x in mydict_bc.value, BooleanType())
udf_bigenough = func.udf(lambda x: mydict_bc.value[x] > 5, BooleanType())

df = ss.createDataFrame(["a", "b", "c"], StringType()).toDF("name")
df1 = df.where(udf_indict('name'))
df1.show()

    +----+                                                                          
    |name|
    +----+
    |   a|
    |   b|
    +----+

df1.where(udf_bigenough('name')).show()

KeyError: u'c'
at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:144)
at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:87)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
    ... 

I guess this has something to do with lazy execution and internal optimization, but is this really expected behavior?

Thanks

1 answer:

Answer 0 (score: 1)

This

    My dataframe goes through two consecutive filtering passes

is an incorrect assumption. Unlike the RDD API, where transformations are WYSIWYG, the SQL API is purely declarative. It describes what has to be done, not how to do it, and the optimizer is free to rearrange the plan as it sees fit.
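You can confirm this by inspecting the physical plan (a minimal sketch; the exact plan text varies by Spark version):

df.where(udf_indict('name')).where(udf_bigenough('name')).explain()

Both UDFs typically end up in a single BatchEvalPython stage (the same operator visible in the stack trace above), so udf_bigenough may be evaluated on rows that udf_indict has not yet excluded.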

Using the nondeterministic variants disables that optimization, because Spark will not reorder nondeterministic expressions (asNondeterministic is available since Spark 2.3):

df1 = df.where(udf_indict.asNondeterministic()('name'))
df1.where(udf_bigenough.asNondeterministic()('name')).show()

But you should really handle the exception instead:

@func.udf(BooleanType())
def udf_bigenough(x):
    try:
        # .get returns None for missing keys; None > 5 raises a
        # TypeError on Python 3, so the row yields null and is filtered out
        return mydict_bc.value.get(x) > 5
    except TypeError:
        pass

Or better, don't use a UDF at all.
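For example, here is a sketch of one UDF-free approach, assuming the dictionary is small enough to inspect on the driver: precompute the keys that pass the threshold and filter with the built-in isin, so the whole condition stays inside the SQL engine:

from pyspark.sql import functions as func

# keys that are both present in the dict and above the threshold,
# computed once on the driver
big_enough = [k for k, v in mydict.items() if v > 5]

# a single native filter; no Python UDF, so evaluation order no longer matters
df.where(func.col("name").isin(big_enough)).show()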