I'm getting a SparkException when I run this code:
from pyspark.sql.functions import pandas_udf, PandasUDFType

@pandas_udf('double', PandasUDFType.SCALAR)
def pandas_plus_one(a):
    return a + 1

df.withColumn('v2', pandas_plus_one(df.a)).show()
This is the schema of my dataset:
|-- a: double (nullable = true)
|-- b: double (nullable = true)
|-- c: double (nullable = true)
|-- d: double (nullable = true)
|-- TARGET: string (nullable = true)
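For reference, what I run is equivalent to this minimal sketch against a toy DataFrame with the same schema (the row values and the SparkSession setup are placeholders, not my real data):

from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf, PandasUDFType

spark = SparkSession.builder.getOrCreate()

# Made-up rows matching the schema above (a, b, c, d: double; TARGET: string)
df = spark.createDataFrame(
    [(1.0, 2.0, 3.0, 4.0, 'yes'), (5.0, 6.0, 7.0, 8.0, 'no')],
    'a double, b double, c double, d double, TARGET string')

@pandas_udf('double', PandasUDFType.SCALAR)
def pandas_plus_one(a):
    # a arrives as a pandas.Series, so the addition is vectorized
    return a + 1

df.withColumn('v2', pandas_plus_one(df.a)).show()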
This is the error I get when I execute the snippet above:
Py4JJavaError: An error occurred while calling o2933.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 475.0 failed 4 times, most recent failure: Lost task 0.3 in stage 475.0 (TID 7841, s6.congolop.com, executor 2): java.lang.IllegalArgumentException
    at java.nio.ByteBuffer.allocate(ByteBuffer.java:334)
    at org.apache.arrow.vector.ipc.message.MessageChannelReader.readNextMessage(MessageChannelReader.java:64)
    at org.apache.arrow.vector.ipc.message.MessageSerializer.deserializeSchema(MessageSerializer.java:104)
    at org.apache.arrow.vector.ipc.ArrowStreamReader.readSchema(ArrowStreamReader.java:128)
    at org.apache.arrow.vector.ipc.ArrowReader.initialize(ArrowReader.java:181)
    at org.apache.arrow.vector.ipc.ArrowReader.ensureInitialized(ArrowReader.java:172)
    at org.apache.arrow.vector.ipc.ArrowReader.getVectorSchemaRoot(ArrowReader.java:65)
    at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:161)
    at org.apache.spark.sql.execution.python.ArrowPythonRunner$$anon$1.read(ArrowPythonRunner.scala:121)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:284)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at org.apache.spark.sql.execution.python.ArrowEvalPythonExec$$anon$2.<init>(ArrowEvalPythonExec.scala:90)
    at org.apache.spark.sql.execution.python.ArrowEvalPythonExec.evaluate(ArrowEvalPythonExec.scala:88)
    at org.apache.spark.sql.execution.python.EvalPythonExec$$anonfun$doExecute$1.apply(EvalPythonExec.scala:131)
    at org.apache.spark.sql.execution.python.EvalPythonExec$$anonfun$doExecute$1.apply(EvalPythonExec.scala:93)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$23.apply(RDD.scala:801)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:49)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:324)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:288)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1651)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1639)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1638)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1638)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:831)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1872)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1821)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1810)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2034)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2055)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2074)
    at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:363)
    at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
    at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:3278)
    at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2489)
    at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2489)
    at org.apache.spark.sql.Dataset$$anonfun$52.apply(Dataset.scala:3259)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:77)
    at org.apache.spark.sql.Dataset.withAction(Dataset.scala:3258)
    at org.apache.spark.sql.Dataset.head(Dataset.scala:2489)
    at org.apache.spark.sql.Dataset.take(Dataset.scala:2703)
    at org.apache.spark.sql.Dataset.showString(Dataset.scala:254)
    at sun.reflect.GeneratedMethodAccessor84.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:745)
Caused by: java.lang.IllegalArgumentException
    at java.nio.ByteBuffer.allocate(ByteBuffer.java:334)
    at org.apache.arrow.vector.ipc.message.MessageChannelReader.readNextMessage(MessageChannelReader.java:64)
    at org.apache.arrow.vector.ipc.message.MessageSerializer.deserializeSchema(MessageSerializer.java:104)
    ... (the remaining frames repeat the executor-side trace shown above)
    ... 1 more
(, Py4JJavaError(u'An error occurred while calling o2933.showString.\n', JavaObject id=o2934), )
Any ideas how I can fix this?
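In case it's relevant, here is a minimal sketch of how I checked the Arrow-related setup on the driver (this assumes the SparkSession is bound to the name spark; I have not yet confirmed whether the executors' Python environment carries the same pyarrow version as the driver, which I suspect may matter):

import pyarrow
import pyspark

print(pyspark.__version__)  # Spark version seen by the driver
print(pyarrow.__version__)  # pyarrow version seen by the driver

# Whether Arrow-based columnar transfer is enabled for this session
print(spark.conf.get('spark.sql.execution.arrow.enabled', 'not set'))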