I have written a UDF (in Python) to use in Spark. The function takes a date (as a string, e.g. '2017-01-06') and an array of date strings (e.g. ['2017-01-26', '2017-02-26', '2017-04-17']) and returns the number of days since the most recent previous date. The UDF is
def findClosestPreviousDate(currdate, date_list):
    date_format = "%Y-%m-%d"
    currdate = datetime.datetime.strptime(currdate, date_format)
    result = currdate
    date_list = [datetime.datetime.strptime(x, date_format) for x in date_list if x != None]
    lowestdiff = 10000
    for dt in date_list:
        if(dt >= currdate):
            continue
        delta = currdate-dt
        diff = delta.days
        if(diff < lowestdiff):
            lowestdiff = diff
            result = dt
    dlt = currdate-result
    return dlt.days
findClosestPreviousDateUdf = udf(findClosestPreviousDate,StringType())
and I call it like this
findClosestPreviousDateUdf = udf(findClosestPreviousDate,StringType())
grouped_extend_df2 = grouped_extend_email_rec.withColumn('recency_eng', func.when(size(col("activity_arr")) > 0, findClosestPreviousDateUdf("expanded_datestr", "activity_arr")).otherwise(0))
Even after removing all null values from the column "activity_arr" I keep getting the NoneType error. I also tried handling it inside the function (still the same result).
Is there a better way to catch the offending records from a UDF at runtime (perhaps using an accumulator or something similar; I have seen a few people attempt this in Scala)?
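For what it's worth, this is the kind of thing I have in mind, only a rough, untested sketch (the `spark` session variable, the wrapper function and the IntegerType return type are my own additions, not code I am actually running):

from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

bad_records = spark.sparkContext.accumulator(0)  # assumes an existing SparkSession named `spark`

def findClosestPreviousDateSafe(currdate, date_list):
    # Wrap the real function so a failing row is counted instead of killing the job.
    try:
        return findClosestPreviousDate(currdate, date_list)
    except Exception:
        bad_records.add(1)  # accumulator values are only reliable after an action and may double-count on task retries
        return None         # a null result marks the bad row

findClosestPreviousDateSafeUdf = udf(findClosestPreviousDateSafe, IntegerType())
# bad_records.value can be inspected after an action such as grouped_extend_df2.count()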
ERROR:
---------------------------------------------------------------------------
Py4JJavaError                             Traceback (most recent call last)
<ipython-input> in <module>()
----> 1 grouped_extend_df2.show()

/usr/lib/spark/python/pyspark/sql/dataframe.pyc in show(self, n, truncate)
    334         """
    335         if isinstance(truncate, bool) and truncate:
--> 336             print(self._jdf.showString(n, 20))
    337         else:
    338             print(self._jdf.showString(n, int(truncate)))

/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py in __call__(self, *args)
   1131         answer = self.gateway_client.send_command(command)
   1132         return_value = get_return_value(
-> 1133             answer, self.gateway_client, self.target_id, self.name)
   1134
   1135         for temp_arg in temp_args:

/usr/lib/spark/python/pyspark/sql/utils.pyc in deco(*a, **kw)
     61     def deco(*a, **kw):
     62         try:
---> 63             return f(*a, **kw)
     64         except py4j.protocol.Py4JJavaError as e:
     65             s = e.java_exception.toString()

/usr/lib/spark/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
    317             raise Py4JJavaError(
    318                 "An error occurred while calling {0}{1}{2}.\n".
--> 319                 format(target_id, ".", name), value)
    320         else:
    321             raise Py4JError(

Py4JJavaError: An error occurred while calling o1111.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 315.0 failed 1 times, most recent failure: Lost task 0.0 in stage 315.0 (TID 18390, localhost, executor driver): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 177, in main
    process()
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 172, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 104, in <lambda>
    func = lambda _, it: map(mapper, it)
  File "<string>", line 1, in <lambda>
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 71, in <lambda>
    return lambda *a: f(*a)
  File "<string>", line 5, in findClosestPreviousDate
TypeError: 'NoneType' object is not iterable

	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
	at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
	at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
	at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:144)
	at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:87)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:108)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
	at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1517)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1505)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1504)
	at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
	at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1504)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
	at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
	at scala.Option.foreach(Option.scala:257)
	at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1732)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1687)
	at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1676)
	at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
	at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2029)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2050)
	at org.apache.spark.SparkContext.runJob(SparkContext.scala:2069)
	at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:336)
	at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
	at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:2861)
	at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2150)
	at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2150)
	at org.apache.spark.sql.Dataset$$anonfun$55.apply(Dataset.scala:2842)
	at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
	at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2841)
	at org.apache.spark.sql.Dataset.head(Dataset.scala:2150)
	at org.apache.spark.sql.Dataset.take(Dataset.scala:2363)
	at org.apache.spark.sql.Dataset.showString(Dataset.scala:241)
	at sun.reflect.GeneratedMethodAccessor237.invoke(Unknown Source)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:498)
	at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
	at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
	at py4j.Gateway.invoke(Gateway.java:280)
	at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
	at py4j.commands.CallCommand.execute(CallCommand.java:79)
	at py4j.GatewayConnection.run(GatewayConnection.java:214)
	at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 177, in main
    process()
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 172, in process
    serializer.dump_stream(func(split_index, iterator), outfile)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 104, in <lambda>
    func = lambda _, it: map(mapper, it)
  File "<string>", line 1, in <lambda>
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/worker.py", line 71, in <lambda>
    return lambda *a: f(*a)
  File "<string>", line 5, in findClosestPreviousDate
TypeError: 'NoneType' object is not iterable

	at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
	at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
	at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
	at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:144)
	at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1.apply(BatchEvalPythonExec.scala:87)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:797)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
	at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
	at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
	at org.apache.spark.scheduler.Task.run(Task.scala:108)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	... 1 more
Answer 0 (score: 0)
I tried your UDF, but it kept returning 0 (an int).
dlt = currdate-result # result and currdate are same
return dlt.days # days is int type
However, when creating the UDF you specified StringType.
findClosestPreviousDateUdf = udf(findClosestPreviousDate,StringType())
So I modified the findClosestPreviousDate function; make changes to it if necessary.
>>> in_dates = ['2017-01-26', '2017-02-26', '2017-04-17']
>>>
>>> def findClosestPreviousDate(currdate, date_list=in_dates):
... date_format = "%Y-%m-%d"
... currdate = datetime.datetime.strptime(currdate, date_format)
... date_list = [datetime.datetime.strptime(x, date_format) for x in date_list if x != None]
... diff = map(lambda dt: (currdate - dt).days, date_list)
... closestDate = min(filter(lambda days_diff: days_diff <= 0, diff))
... return closestDate if closestDate else 0
...
>>> findClosestPreviousDate('2017-01-06')
-101
Also change the UDF's return type to IntegerType. With these modifications the code works, but please verify whether the changes are correct. Note that a PySpark UDF can only be called with column arguments, so you cannot pass a plain Python list directly; there is a workaround for that, see PySpark - Pass list as parameter to UDF.
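For reference, a minimal sketch of that workaround as I understand it (the `dates_col` name and the two-argument call below are only an illustration, not tested against your data): turn the Python list into a literal array column so it can be passed like any other column.

from pyspark.sql.functions import array, lit, udf
from pyspark.sql.types import IntegerType

closest_udf = udf(findClosestPreviousDate, IntegerType())  # any two-argument findClosestPreviousDate works here
dates_col = array(*[lit(d) for d in in_dates])             # constant array<string> column built from the Python list
# df.withColumn('closest_date', closest_udf(df['date'], dates_col)).show()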
>>> df.show()
+----------+
| date|
+----------+
|2017-01-06|
|2017-01-08|
+----------+
>>>
>>> in_dates = ['2017-01-26', '2017-02-26', '2017-04-17']
>>> def findClosestPreviousDate(currdate, date_list=in_dates):
... date_format = "%Y-%m-%d"
... currdate = datetime.datetime.strptime(currdate, date_format)
... date_list = [datetime.datetime.strptime(x, date_format) for x in date_list if x != None]
... diff = map(lambda dt: (currdate - dt).days, date_list)
... closestDate = min(filter(lambda days_diff: days_diff <= 0, diff))
... return closestDate if closestDate else 0
...
>>> findClosestPreviousDate('2017-01-06')
-101
>>>
>>> from pyspark.sql.types import IntegerType
>>> findClosestPreviousDateUDF = udf(findClosestPreviousDate, IntegerType())
>>> df.withColumn('closest_date', findClosestPreviousDateUDF(df['date'])).show()
+----------+------------+
| date|closest_date|
+----------+------------+
|2017-01-06| -101|
|2017-01-08| -99|
+----------+------------+
Hope this helps!
Answer 1 (score: 0)
I think I figured out the problem. Here is my modified UDF.
def findClosestPreviousDate(currdate, date_str):
    date_format = "%Y-%m-%d"
    currdate = datetime.datetime.strptime(currdate, date_format)
    date_list = ''
    result = currdate
    if date_str is None:
        return date_str
    else:
        date_list = date_str.split('|')
    date_list = [datetime.datetime.strptime(x, date_format) for x in date_list if x != None]
    lowestdiff = 10000
    for dt in date_list:
        if(dt >= currdate):
            continue
        delta = currdate-dt
        diff = delta.days
        if(diff < lowestdiff):
            lowestdiff = diff
            result = dt
    dlt = currdate-result
    return dlt.days
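A quick sanity check of the modified function in plain Python (my own example values, assuming `import datetime` is in scope):

>>> findClosestPreviousDate('2017-05-01', '2017-01-26|2017-02-26|2017-04-17')
14
>>> findClosestPreviousDate('2017-05-01', None)   # a null activity string now returns None instead of raising
>>>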
The NoneType error was caused by null values reaching the UDF as an argument, as far as I can tell. What I don't understand is why the nulls were not filtered out when I used the isNotNull() function.
I tried both
findClosestPreviousDateUdf = udf(findClosestPreviousDate,StringType())
grouped_extend_df2 = grouped_extend_email_rec.withColumn('recency_eng', func.when(size(col("activity_arr")) > 0, findClosestPreviousDateUdf("expanded_datestr", "activity_arr")).otherwise(0))
and
findClosestPreviousDateUdf = udf(findClosestPreviousDate,StringType())
grouped_extend_df2 = grouped_extend_email_rec.withColumn('recency_eng', func.when(col("activity_arr").isNotNull(), findClosestPreviousDateUdf("expanded_datestr", "activity_arr")).otherwise(0))
But when I handle the NoneType inside the Python function findClosestPreviousDate() itself, like this,
if date_str is None:
    return date_str
else:
    date_list = date_str.split('|')
it works.