Why does pandas_udf in PySpark produce the following error

Time: 2019-11-29 08:03:51

Tags: apache-spark machine-learning pyspark pyspark-sql pyspark-dataframes

I get a SparkException when I execute this code:

from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf('double', PandasUDFType.SCALAR)
def pandas_plus_one(a):
    # a is a pandas.Series holding one batch of column values
    return a + 1

df.withColumn('v2', pandas_plus_one(df.a)).show()
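
For context, a SCALAR pandas UDF receives each batch of the input column as a pandas.Series and must return a Series of the same length. As a sanity check of the function body itself, this is what it does on a plain pandas Series outside Spark (the sample values below are hypothetical):

import pandas as pd

# One hypothetical Arrow batch of column 'a'
batch = pd.Series([1.0, 2.0, 3.0])

# Element-wise addition, exactly what pandas_plus_one returns per batch
print(batch + 1)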
This is the schema of my dataset:
 |-- a: double (nullable = true)
 |-- b: double (nullable = true)
 |-- c: double (nullable = true)
 |-- d: double (nullable = true)
 |-- TARGET: string (nullable = true)
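
For reference, a minimal sketch of how a DataFrame with this schema could be built to reproduce the call (the real data and SparkSession setup are not shown in the question; the rows below are hypothetical):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("pandas-udf-repro").getOrCreate()

# Hypothetical rows: four double columns and a string label, matching the schema
df = spark.createDataFrame(
    [(1.0, 2.0, 3.0, 4.0, "yes"), (5.0, 6.0, 7.0, 8.0, "no")],
    ["a", "b", "c", "d", "TARGET"],
)
df.printSchema()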
This is the error I get when I execute the code snippet above:

Py4JJavaError: An error occurred while calling o2933.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 475.0 failed 4 times, most recent failure: Lost task 0.3 in stage 475.0 (TID 7841, s6.congolop.com, executor 2): java.lang.IllegalArgumentException
    at java.nio.ByteBuffer.allocate(ByteBuffer.java:334)
    at org.apache.arrow.vector.ipc.message.MessageChannelReader.readNextMessage(MessageChannelReader.java:64)
    at org.apache.arrow.vector.ipc.message.MessageSerializer.deserializeSchema(MessageSerializer.java:104)
    at org.apache.arrow.vector.ipc.ArrowStreamReader.readSchema(ArrowStreamReader.java:128)
    at org.apache.arrow.vector.ipc.ArrowReader.initialize(ArrowReader.java:181)
    at org.apache.arrow.vector.ipc.ArrowReader.ensureInitialized(ArrowReader.java:172)
    at org.apache.arrow.vector.ipc.ArrowReader.getVectorSchemaRoot(ArrowReader.java:65)
    at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:161)
    at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:121)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:284)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at org.apache.spark.sql.execution.python.ArrowEvalPythonExec$$anon$2.<init>(ArrowEvalPythonExec.scala:90)
    at org.apache.spark.sql.execution.python.ArrowEvalPythonExec.evaluate(ArrowEvalPythonExec.scala:88)
    at org.apache.spark.sql.execution.python.EvalPythonExec$$anonfun$doExecute$1.apply(EvalPythonExec.scala:131)
    at org.apache.spark.sql.execution.python.EvalPythonExec$$anonfun$doExecute$1.apply(EvalPythonExec.scala:93)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1651)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1639)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1638)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1638)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1872)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1821)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1810)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074)
    at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:363)
    at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
    at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3278)
    at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2489)
    at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2489)
    at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3259)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
    at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3258)
    at org.apache.spark.sql.Dataset.head(Dataset.scala:2489)
    at org.apache.spark.sql.Dataset.take(Dataset.scala:2703)
    at org.apache.spark.sql.Dataset.showString(Dataset.scala:254)
    at sun.reflect.GeneratedMethodAccessor84.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IllegalArgumentException
    at java.nio.ByteBuffer.allocate(ByteBuffer.java:334)
    at org.apache.arrow.vector.ipc.message.MessageChannelReader.readNextMessage(MessageChannelReader.java:64)
    at org.apache.arrow.vector.ipc.message.MessageSerializer.deserializeSchema(MessageSerializer.java:104)
    at org.apache.arrow.vector.ipc.ArrowStreamReader.readSchema(ArrowStreamReader.java:128)
    at org.apache.arrow.vector.ipc.ArrowReader.initialize(ArrowReader.java:181)
    at org.apache.arrow.vector.ipc.ArrowReader.ensureInitialized(ArrowReader.java:172)
    at org.apache.arrow.vector.ipc.ArrowReader.getVectorSchemaRoot(ArrowReader.java:65)
    at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:161)
    at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:121)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:284)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at org.apache.spark.sql.execution.python.ArrowEvalPythonExec$$anon$2.<init>(ArrowEvalPythonExec.scala:90)
    at org.apache.spark.sql.execution.python.ArrowEvalPythonExec.evaluate(ArrowEvalPythonExec.scala:88)
    at org.apache.spark.sql.execution.python.EvalPythonExec$$anonfun$doExecute$1.apply(EvalPythonExec.scala:131)
    at org.apache.spark.sql.execution.python.EvalPythonExec$$anonfun$doExecute$1.apply(EvalPythonExec.scala:93)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    ... 1 more

(, Py4JJavaError(u'An error occurred while calling o2933.showString.\n', JavaObject id=o2934), )
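
To check whether the failure is specific to the Arrow path, the same logic can be written as a plain row-at-a-time Python UDF, which does not use Arrow at all (a sketch; slower, but useful for isolating the problem):

from pyspark.sql.functions import udf
from pyspark.sql.types import DoubleType

# Row-at-a-time equivalent of pandas_plus_one; bypasses the Arrow serializer
plus_one = udf(lambda x: x + 1 if x is not None else None, DoubleType())
df.withColumn('v2', plus_one(df.a)).show()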

Any ideas how I can fix this?
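
In case it matters, the failure happens while the JVM-side Arrow stream reader deserializes the schema coming back from the Python workers, so the PySpark and pyarrow versions seem relevant. They can be printed on the driver like this (worker-side versions may differ and are not shown here):

import pyspark
import pyarrow

# The JVM Arrow reader must understand the IPC stream produced by the
# pyarrow installed on the Python workers
print("pyspark:", pyspark.__version__)
print("pyarrow:", pyarrow.__version__)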

0 Answers:

There are no answers yet.