I have a DataFrame whose rows contain DenseVectors:
df = spark.createDataFrame([(Vectors.dense([1,2,3]),),(Vectors.dense([3,4,5]),),(Vectors.dense([6,2,5]),)], ["a"])
I want to find the maximum value of each row with a UDF. This is what I did:
findmax = F.udf(lambda x: max(x), DoubleType())
df_out = df.select('*', findmax(df['a']).alias('MAX'))
df_out.show()
After running the code, this is the message I get:
Traceback (most recent call last):
  File "", line 1, in
    df.select('*', findmax(df['a'])).show()
  File "C:\ProgramData\Anaconda3\envs\python2\lib\site-packages\pyspark\sql\dataframe.py", line 336, in show
    print(self._jdf.showString(n, 20))
  File "C:\ProgramData\Anaconda3\envs\python2\lib\site-packages\py4j\java_gateway.py", line 1133, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "C:\ProgramData\Anaconda3\envs\python2\lib\site-packages\pyspark\sql\utils.py", line 63, in deco
    return f(*a, **kw)
  File "C:\ProgramData\Anaconda3\envs\python2\lib\site-packages\py4j\protocol.py", line 319, in get_return_value
    format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling o785.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 67.0 failed 1 times, most recent failure: Lost task 2.0 in stage 67.0 (TID 890, localhost, executor driver): net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.dtype)
    at net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
    at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:707)
    at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:175)
    at net.razorvine.pickle.Unpickler.load(Unpickler.java:99)
    at net.razorvine.pickle.Unpickler.loads(Unpickler.java:112)
    at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1$$anonfun$apply$7.apply(BatchEvalPythonExec.scala:156)
    at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1$$anonfun$apply$7.apply(BatchEvalPythonExec.scala:155)
    at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    at java.lang.Thread.run(Unknown Source)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1517)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1505)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1504)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1504)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1732)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1687)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1676)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2029)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2050)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2069)
    at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:336)
    at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
    at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:2861)
    at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2150)
    at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2150)
    at org.apache.spark.sql.Dataset$$anonfun$55.apply(Dataset.scala:2842)
    at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
    at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2841)
    at org.apache.spark.sql.Dataset.head(Dataset.scala:2150)
    at org.apache.spark.sql.Dataset.take(Dataset.scala:2363)
    at org.apache.spark.sql.Dataset.showString(Dataset.scala:241)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Unknown Source)
Caused by: net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.dtype)
    at net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
    at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:707)
    at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:175)
    at net.razorvine.pickle.Unpickler.load(Unpickler.java:99)
    at net.razorvine.pickle.Unpickler.loads(Unpickler.java:112)
    at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1$$anonfun$apply$7.apply(BatchEvalPythonExec.scala:156)
    at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1$$anonfun$apply$7.apply(BatchEvalPythonExec.scala:155)
    at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
    at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
    at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
    at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
    at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
    at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
    ... 1 more
I don't understand why this doesn't work: I found that it works if the rows contain plain floats instead of DenseVectors, and the Python max function does accept a DenseVector as input.
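For example, calling max on a single vector directly in the driver runs without any error (a quick check, assuming the DenseVectors come from pyspark.ml.linalg):

from pyspark.ml.linalg import Vectors  # assuming the ml (not mllib) linalg module

v = Vectors.dense([1, 2, 3])
print(max(v))  # prints 3.0 -- max itself handles a DenseVector just fine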
Answer (score: 3)
You are getting this error because you declared the udf's return type as float (DoubleType), while the udf actually returns a numpy.float64. pyspark treats float and numpy.float64 as different types.
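You can see the mismatch by checking the type of the value the lambda produces (a quick check, reusing one of the example vectors from the question):

from pyspark.ml.linalg import Vectors  # assuming pyspark.ml.linalg, as in the question

v = Vectors.dense([1, 2, 3])
result = max(v)
print(type(result))         # <class 'numpy.float64'> -- what the udf actually returns
print(type(float(result)))  # <class 'float'> -- what DoubleType expects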
Convert the value to a float before returning it, like this:
findmax = F.udf(lambda x: float(max(x)), DoubleType())
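With that change the example from the question should run end to end. A minimal sketch (same column name "a" as in the question; the output shown in the comments is roughly what show() prints):

from pyspark.ml.linalg import Vectors  # assuming pyspark.ml.linalg, as in the question
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

spark = SparkSession.builder.getOrCreate()

df = spark.createDataFrame(
    [(Vectors.dense([1, 2, 3]),), (Vectors.dense([3, 4, 5]),), (Vectors.dense([6, 2, 5]),)],
    ["a"],
)

# Cast the numpy.float64 returned by max() to a plain Python float
findmax = F.udf(lambda x: float(max(x)), DoubleType())

df.select('*', findmax(df['a']).alias('MAX')).show()
# +-------------+---+
# |            a|MAX|
# +-------------+---+
# |[1.0,2.0,3.0]|3.0|
# |[3.0,4.0,5.0]|5.0|
# |[6.0,2.0,5.0]|6.0|
# +-------------+---+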