PySpark DataFrame UDF exception handling

Date: 2018-05-06 17:21:24

Tags: apache-spark exception-handling pyspark spark-dataframe user-defined-functions

I have written a UDF to be used in Spark with Python. The function takes a date (as a string, e.g. '2017-01-06') and an array of date strings (e.g. [2017-01-26, 2017-02-26, 2017-04-17]) and returns the number of days since the most recent past date. The UDF is

import datetime


def findClosestPreviousDate(currdate, date_list):
    date_format = "%Y-%m-%d"
    currdate = datetime.datetime.strptime(currdate, date_format)
    result = currdate
    # parse the incoming array of date strings, skipping nulls
    date_list = [datetime.datetime.strptime(x, date_format) for x in date_list if x != None]
    lowestdiff = 10000
    for dt in date_list:
        if(dt >= currdate):
            continue
        delta = currdate-dt
        diff = delta.days
        if(diff < lowestdiff):
            lowestdiff = diff
            result = dt
    dlt = currdate-result
    return dlt.days


findClosestPreviousDateUdf = udf(findClosestPreviousDate,StringType())

and I call it like this:

findClosestPreviousDateUdf = udf(findClosestPreviousDate,StringType())
grouped_extend_df2 = grouped_extend_email_rec.withColumn('recency_eng', func.when(size(col("activity_arr")) > 0, findClosestPreviousDateUdf("expanded_datestr", "activity_arr")).otherwise(0))

Even when I remove all the null values in the column "activity_arr" I keep getting the NoneType error. I tried handling the exception inside the function as well (still the same).

Do we have a better way of catching erroneous records from a UDF at runtime (maybe using accumulators or something similar; I have seen a few people try the same thing with Scala)?
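To make the idea concrete, here is a rough sketch of the kind of thing I have in mind. It is purely illustrative: findClosestPreviousDateSafe and bad_records are made-up names, and it assumes an existing SparkSession bound to spark.

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# Count how many rows blow up inside the UDF instead of failing the whole job.
bad_records = spark.sparkContext.accumulator(0)

def findClosestPreviousDateSafe(currdate, date_list):
    try:
        return findClosestPreviousDate(currdate, date_list)
    except Exception:
        bad_records.add(1)   # tally the bad record on the executor
        return None          # emit null instead of killing the stage

# IntegerType here because the function returns a day count (an int).
findClosestPreviousDateSafeUdf = udf(findClosestPreviousDateSafe, IntegerType())

After the action runs, bad_records.value on the driver would then tell me how many records were skipped.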

ERROR:

  

---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
in <module>()
----> 1 grouped_extend_df2.show()

     

/usr/lib/spark/python/pyspark/sql/dataframe.pyc in show(self, n, truncate)
    334         """
    335         if isinstance(truncate, bool) and truncate:
--> 336             print(self._jdf.showString(n, 20))
    337         else:
    338             print(self._jdf.showString(n, int(truncate)))

     

/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1131         answer = self.gateway_client.send_command(command)
   1132         return_value = get_return_value(
-> 1133             answer, self.gateway_client, self.target_id, self.name)
   1134
   1135         for temp_arg in temp_args:

/usr/lib/spark/python/pyspark/sql/utils.pyc in deco(*a, **kw)
     61     def deco(*a, **kw):
     62         try:
---> 63             return f(*a, **kw)
     64         except py4j.protocol.Py4JJavaError as e:
     65             s = e.java_exception.toString()

     

/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    317                 raise Py4JJavaError(
    318                     "An error occurred while calling {0}{1}{2}.\n".
--> 319                     format(target_id, ".", name), value)
    320             else:
    321                 raise Py4JError(

     

Py4JJavaError: An error occurred while calling o1111.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 315.0 failed 1 times, most recent failure: Lost task 0.0 in stage 315.0 (TID 18390, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 177, in main
    process()
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 172, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 104, in <lambda>
    func = lambda _, it: map(mapper, it)
  File "", line 1, in <lambda>
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 71, in <lambda>
    return lambda *a: f(*a)
  File "", line 5, in findClosestPreviousDate
TypeError: 'NoneType' object is not iterable

     

    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
    at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:144)
    at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:87)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

     

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1517)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1505)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1504)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1504)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1732)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1687)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1676)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2029)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2050)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2069)
    at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:336)
    at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
    at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:2861)
    at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2150)
    at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2150)
    at org.apache.spark.sql.Dataset$$anonfun$55.apply(Dataset.scala:2842)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
    at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2841)
    at org.apache.spark.sql.Dataset.head(Dataset.scala:2150)
    at org.apache.spark.sql.Dataset.take(Dataset.scala:2363)
    at org.apache.spark.sql.Dataset.showString(Dataset.scala:241)
    at sun.reflect.GeneratedMethodAccessor237.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 177, in main
    process()
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 172, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 104, in <lambda>
    func = lambda _, it: map(mapper, it)
  File "", line 1, in <lambda>
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 71, in <lambda>
    return lambda *a: f(*a)
  File "", line 5, in findClosestPreviousDate
TypeError: 'NoneType' object is not iterable

     

    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
    at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:144)
    at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:87)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    ... 1 more

2 Answers:

Answer 0 (score: 0):

I tried your UDF, but it constantly returns 0 (int).

dlt = currdate-result # result and currdate are same
return dlt.days # days is int type

But while creating the UDF you have specified StringType.

findClosestPreviousDateUdf = udf(findClosestPreviousDate,StringType())

Hence I have modified the findClosestPreviousDate function; please make changes if necessary.

>>> import datetime
>>> in_dates = ['2017-01-26', '2017-02-26', '2017-04-17']
>>>
>>> def findClosestPreviousDate(currdate, date_list=in_dates):
...     date_format = "%Y-%m-%d"
...     currdate = datetime.datetime.strptime(currdate, date_format)
...     date_list = [datetime.datetime.strptime(x, date_format) for x in date_list if x != None]
...     diff = map(lambda dt: (currdate - dt).days, date_list)
...     closestDate = min(filter(lambda days_diff: days_diff <= 0, diff))
...     return closestDate if closestDate else 0
...
>>> findClosestPreviousDate('2017-01-06')
-101

Also make the return type of the UDF IntegerType. With these modifications the code works, but please verify whether the changes are correct. Note that PySpark UDFs can accept only a single argument; there is a workaround, refer to PySpark - Pass list as parameter to UDF (a rough sketch of that workaround also follows the example below).

>>> df.show()
+----------+
|      date|
+----------+
|2017-01-06|
|2017-01-08|
+----------+

>>>
>>> import datetime
>>> in_dates = ['2017-01-26', '2017-02-26', '2017-04-17']
>>> def findClosestPreviousDate(currdate, date_list=in_dates):
...     date_format = "%Y-%m-%d"
...     currdate = datetime.datetime.strptime(currdate, date_format)
...     date_list = [datetime.datetime.strptime(x, date_format) for x in date_list if x != None]
...     diff = map(lambda dt: (currdate - dt).days, date_list)
...     closestDate = min(filter(lambda days_diff: days_diff <= 0, diff))
...     return closestDate if closestDate else 0
...
>>> findClosestPreviousDate('2017-01-06')
-101
>>>
>>> from pyspark.sql.types import IntegerType
>>> from pyspark.sql.functions import udf
>>> findClosestPreviousDateUDF = udf(findClosestPreviousDate, IntegerType())
>>> df.withColumn('closest_date', findClosestPreviousDateUDF(df['date'])).show()
+----------+------------+
|      date|closest_date|
+----------+------------+
|2017-01-06|        -101|
|2017-01-08|         -99|
+----------+------------+
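If relying on a default argument feels fragile, a rough sketch of the closure-style workaround from that linked question is below; closestDateUDF is just an illustrative name, and the rest reuses the objects defined above.

from functools import partial
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

# Freeze the date list into the UDF with functools.partial, so the DataFrame
# column only has to supply the current date.
closestDateUDF = udf(partial(findClosestPreviousDate, date_list=in_dates), IntegerType())
df.withColumn('closest_date', closestDateUDF(df['date'])).show()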

Hope this helps!

Answer 1 (score: 0):

I think I figured out what the problem was. This is my modified UDF.

import datetime


def findClosestPreviousDate(currdate, date_str):
    date_format = "%Y-%m-%d"
    currdate = datetime.datetime.strptime(currdate, date_format)
    date_list = ''
    result = currdate
    if date_str is None:
        # a null activity string goes straight back out instead of reaching strptime
        return date_str
    else:
        date_list = date_str.split('|')
    date_list = [datetime.datetime.strptime(x, date_format) for x in date_list if x != None]
    lowestdiff = 10000
    for dt in date_list:
        if(dt >= currdate):
            continue
        delta = currdate-dt
        diff = delta.days
        if(diff < lowestdiff):
            lowestdiff = diff
            result = dt
    dlt = currdate-result
    return dlt.days

The NoneType error was due to null values going into the UDF as parameters, which I already knew. What I was wondering is why the null values were not filtered out even when I used the isNotNull() function. (My best guess is that Spark does not guarantee short-circuit evaluation of when()/otherwise(), so the UDF can still be invoked on rows where the guarding condition is false.)

Tried both of these:

findClosestPreviousDateUdf = udf(findClosestPreviousDate,StringType())
grouped_extend_df2 = grouped_extend_email_rec.withColumn('recency_eng', func.when(size(col("activity_arr")) > 0, findClosestPreviousDateUdf("expanded_datestr", "activity_arr")).otherwise(0))

findClosestPreviousDateUdf = udf(findClosestPreviousDate,StringType())
grouped_extend_df2 = grouped_extend_email_rec.withColumn('recency_eng', func.when(col("activity_arr").isNotNull(), findClosestPreviousDateUdf("expanded_datestr", "activity_arr")).otherwise(0))

But when I handle the NoneType inside the Python function findClosestPreviousDate(), like below,

if date_str is None:
    return date_str
else:
    date_list = date_str.split('|')

it works.
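For what it is worth, the same early-return guard also works if you keep the original array argument instead of a '|'-joined string. Below is a rough sketch of that variant; findClosestPreviousDateArr is just an illustrative name, I switched the declared return type to IntegerType since the function returns a day count, and the when()/otherwise() guard is dropped because the function now handles null itself.

import datetime

from pyspark.sql.functions import udf, col
from pyspark.sql.types import IntegerType

def findClosestPreviousDateArr(currdate, date_list):
    # Same logic as the original UDF, but bail out early on a null array.
    if date_list is None:
        return None
    date_format = "%Y-%m-%d"
    currdate = datetime.datetime.strptime(currdate, date_format)
    result = currdate
    lowestdiff = 10000
    for dt in (datetime.datetime.strptime(x, date_format) for x in date_list if x is not None):
        if dt >= currdate:
            continue
        diff = (currdate - dt).days
        if diff < lowestdiff:
            lowestdiff = diff
            result = dt
    return (currdate - result).days

findClosestPreviousDateArrUdf = udf(findClosestPreviousDateArr, IntegerType())
grouped_extend_df2 = grouped_extend_email_rec.withColumn(
    'recency_eng',
    findClosestPreviousDateArrUdf(col("expanded_datestr"), col("activity_arr")),
)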