Using DenseVector rows

Time: 2018-03-12 08:06:37

Tags: python pyspark user-defined-functions

I have a dataframe whose rows are DenseVectors:

from pyspark.ml.linalg import Vectors  # or pyspark.mllib.linalg, depending on your setup
df = spark.createDataFrame([(Vectors.dense([1, 2, 3]),), (Vectors.dense([3, 4, 5]),), (Vectors.dense([6, 2, 5]),)], ["a"])

I want to use a UDF to find the maximum value in each row. This is what I did:

from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType
findmax = F.udf(lambda x: max(x), DoubleType())
df_out = df.select('*', findmax(df['a']).alias('MAX'))

After running the code, this is the message I get:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
    df.select('*', findmax(df['a'])).show()
  File "C:\ProgramData\Anaconda3\envs\python2\lib\site-packages\pyspark\sql\dataframe.py", line 336, in show
    print(self._jdf.showString(n, 20))
  File "C:\ProgramData\Anaconda3\envs\python2\lib\site-packages\py4j\java_gateway.py", line 1133, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "C:\ProgramData\Anaconda3\envs\python2\lib\site-packages\pyspark\sql\utils.py", line 63, in deco
    return f(*a, **kw)
  File "C:\ProgramData\Anaconda3\envs\python2\lib\site-packages\py4j\protocol.py", line 319, in get_return_value
    format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling o785.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 2 in stage 67.0 failed 1 times, most recent failure: Lost task 2.0 in stage 67.0 (TID 890, localhost, executor driver): net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.dtype)
  at net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
  at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:707)
  at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:175)
  at net.razorvine.pickle.Unpickler.load(Unpickler.java:99)
  at net.razorvine.pickle.Unpickler.loads(Unpickler.java:112)
  at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1$$anonfun$apply$7.apply(BatchEvalPythonExec.scala:156)
  at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1$$anonfun$apply$7.apply(BatchEvalPythonExec.scala:155)
  at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
  at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
  at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
  at org.apache.spark.scheduler.Task.run(Task.scala:108)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
  at java.lang.Thread.run(Unknown Source)

Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1517)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1505)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1504)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1504)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
  at scala.Option.foreach(Option.scala:257)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1732)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1687)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1676)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2029)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2050)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:2069)
  at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:336)
  at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
  at org.apache.spark.sql.Dataset.org$apache$spark$sql$Dataset$$collectFromPlan(Dataset.scala:2861)
  at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2150)
  at org.apache.spark.sql.Dataset$$anonfun$head$1.apply(Dataset.scala:2150)
  at org.apache.spark.sql.Dataset$$anonfun$55.apply(Dataset.scala:2842)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:65)
  at org.apache.spark.sql.Dataset.withAction(Dataset.scala:2841)
  at org.apache.spark.sql.Dataset.head(Dataset.scala:2150)
  at org.apache.spark.sql.Dataset.take(Dataset.scala:2363)
  at org.apache.spark.sql.Dataset.showString(Dataset.scala:241)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
  at java.lang.reflect.Method.invoke(Unknown Source)
  at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
  at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
  at py4j.Gateway.invoke(Gateway.java:280)
  at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
  at py4j.commands.CallCommand.execute(CallCommand.java:79)
  at py4j.GatewayConnection.run(GatewayConnection.java:214)
  at java.lang.Thread.run(Unknown Source)
Caused by: net.razorvine.pickle.PickleException: expected zero arguments for construction of ClassDict (for numpy.dtype)
  at net.razorvine.pickle.objects.ClassDictConstructor.construct(ClassDictConstructor.java:23)
  at net.razorvine.pickle.Unpickler.load_reduce(Unpickler.java:707)
  at net.razorvine.pickle.Unpickler.dispatch(Unpickler.java:175)
  at net.razorvine.pickle.Unpickler.load(Unpickler.java:99)
  at net.razorvine.pickle.Unpickler.loads(Unpickler.java:112)
  at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1$$anonfun$apply$7.apply(BatchEvalPythonExec.scala:156)
  at org.apache.spark.sql.execution.python.BatchEvalPythonExec$$anonfun$doExecute$1$$anonfun$apply$7.apply(BatchEvalPythonExec.scala:155)
  at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
  at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
  at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
  at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:234)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:228)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
  at org.apache.spark.scheduler.Task.run(Task.scala:108)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
  ... 1 more

I don't know why this doesn't work: I found that it works if the rows are plain floats instead of DenseVectors, and the Python function max does accept a DenseVector as input.
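(For reference, a quick check of that last point, run in a plain Python shell with no Spark job involved — a minimal sketch assuming pyspark.ml.linalg:)

from pyspark.ml.linalg import Vectors

v = Vectors.dense([1, 2, 3])
m = max(v)        # a DenseVector is iterable, so max() succeeds
print(m)          # 3.0
print(type(m))    # numpy.float64 -- a NumPy scalar, not a built-in float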

1 answer:

Answer 0 (score: 3)

You are getting this error because you declared the udf's return type as DoubleType (a Python float), while max(x) on a DenseVector actually returns a numpy.float64, and pyspark treats float and numpy.float64 as different types. Cast the return value to float, like this:

findmax = F.udf(lambda x: float(max(x)), DoubleType())
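For completeness, here is the fix applied end to end to the question's example — a sketch only, in which the pyspark.ml.linalg import and the exact show() formatting are assumptions:

from pyspark.ml.linalg import Vectors
from pyspark.sql import functions as F
from pyspark.sql.types import DoubleType

df = spark.createDataFrame(
    [(Vectors.dense([1, 2, 3]),), (Vectors.dense([3, 4, 5]),), (Vectors.dense([6, 2, 5]),)],
    ["a"])

# float(max(x)) converts the numpy.float64 that max() returns into a plain
# Python float, which is what the declared DoubleType return type expects.
findmax = F.udf(lambda x: float(max(x)), DoubleType())

df.select('*', findmax(df['a']).alias('MAX')).show()
# Expected output, roughly:
# +-------------+---+
# |            a|MAX|
# +-------------+---+
# |[1.0,2.0,3.0]|3.0|
# |[3.0,4.0,5.0]|5.0|
# |[6.0,2.0,5.0]|6.0|
# +-------------+---+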