Spark VectorAssembler error - PySpark 2.3 - Python

Posted: 2018-03-22 02:42:27

Tags: python apache-spark pyspark spark-dataframe

I am using PySpark 2.3.0 and have created a very simple Spark DataFrame to test what VectorAssembler does. It is a subset of a larger DataFrame, from which I selected only a few numeric (double data type) columns:

>>> cols = ['index', 'host_listings_count', 'neighbourhood_group_cleansed',
...         'bathrooms', 'bedrooms', 'beds', 'square_feet', 'guests_included',
...         'review_scores_rating']
>>> test = df[cols]
>>> test.take(3)
  

[Row(index=0, host_listings_count=1, neighbourhood_group_cleansed=None, bathrooms=1.5, bedrooms=2.0, beds=3.0, square_feet=None, guests_included=1, review_scores_rating=100.0),
 Row(index=1, host_listings_count=1, neighbourhood_group_cleansed=None, bathrooms=1.5, bedrooms=2.0, beds=3.0, square_feet=None, guests_included=1, review_scores_rating=100.0),
 Row(index=2, host_listings_count=1, neighbourhood_group_cleansed=None, bathrooms=1.5, bedrooms=2.0, beds=3.0, square_feet=None, guests_included=1, review_scores_rating=100.0)]

As you can see above, there is nothing obviously wrong with this Spark DataFrame. I then create the assembler as shown below and get the error that follows. What could be going wrong?

>>> from pyspark.ml.feature import VectorAssembler
>>> assembler = VectorAssembler(inputCols=cols, outputCol="features")
>>> output = assembler.transform(test)
>>> output.take(3)
  

Py4JJavaError: An error occurred while calling o279.collectToPython.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 5.0 failed 1 times, most recent failure: Lost task 0.0 in stage 5.0 (TID 10, localhost, executor driver): org.apache.spark.SparkException: Failed to execute user defined function($anonfun$3: (struct<...>) => ...)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
  at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
  at org.apache.spark.scheduler.Task.run(Task.scala:99)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Values to assemble cannot be null.
  at org.apache.spark.ml.feature.VectorAssembler$$anonfun$assemble$1.apply(VectorAssembler.scala:160)
  at org.apache.spark.ml.feature.VectorAssembler$$anonfun$assemble$1.apply(VectorAssembler.scala:143)
  at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
  at org.apache.spark.ml.feature.VectorAssembler$.assemble(VectorAssembler.scala:143)
  at org.apache.spark.ml.feature.VectorAssembler$$anonfun$3.apply(VectorAssembler.scala:99)
  at org.apache.spark.ml.feature.VectorAssembler$$anonfun$3.apply(VectorAssembler.scala:98)
  ... 16 more

Driver stacktrace:
  at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1435)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1423)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1422)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1422)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
  at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:802)
  at scala.Option.foreach(Option.scala:257)
  at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:802)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1650)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1605)
  at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1594)
  at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
  at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:628)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1925)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1938)
  at org.apache.spark.SparkContext.runJob(SparkContext.scala:1951)
  at org.apache.spark.sql.execution.SparkPlan.executeTake(SparkPlan.scala:333)
  at org.apache.spark.sql.execution.CollectLimitExec.executeCollect(limit.scala:38)
  at org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply$mcI$sp(Dataset.scala:2768)
  at org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:2765)
  at org.apache.spark.sql.Dataset$$anonfun$collectToPython$1.apply(Dataset.scala:2765)
  at org.apache.spark.sql.execution.SQLExecution$.withNewExecutionId(SQLExecution.scala:57)
  at org.apache.spark.sql.Dataset.withNewExecutionId(Dataset.scala:2788)
  at org.apache.spark.sql.Dataset.collectToPython(Dataset.scala:2765)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
  at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
  at py4j.Gateway.invoke(Gateway.java:280)
  at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
  at py4j.commands.CallCommand.execute(CallCommand.java:79)
  at py4j.GatewayConnection.run(GatewayConnection.java:214)
  at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Failed to execute user defined function($anonfun$3: (struct<...>) => ...)
  at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
  at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
  at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:377)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:231)
  at org.apache.spark.sql.execution.SparkPlan$$anonfun$2.apply(SparkPlan.scala:225)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
  at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsInternal$1$$anonfun$apply$25.apply(RDD.scala:827)
  at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
  at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
  at org.apache.spark.scheduler.Task.run(Task.scala:99)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
  ... 1 more
Caused by: org.apache.spark.SparkException: Values to assemble cannot be null.
  at org.apache.spark.ml.feature.VectorAssembler$$anonfun$assemble$1.apply(VectorAssembler.scala:160)
  at org.apache.spark.ml.feature.VectorAssembler$$anonfun$assemble$1.apply(VectorAssembler.scala:143)
  at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
  at scala.collection.mutable.WrappedArray.foreach(WrappedArray.scala:35)
  at org.apache.spark.ml.feature.VectorAssembler$.assemble(VectorAssembler.scala:143)
  at org.apache.spark.ml.feature.VectorAssembler$$anonfun$3.apply(VectorAssembler.scala:99)
  at org.apache.spark.ml.feature.VectorAssembler$$anonfun$3.apply(VectorAssembler.scala:98)
  ... 16 more

1 Answer:

Answer 0 (score: 3)

The stack trace you posted states the cause directly: "Values to assemble cannot be null", i.e. the problem is caused by null values in the columns being assembled.
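As a quick sanity check (a minimal sketch, assuming the test DataFrame and the cols list from your question are in scope), you can count the nulls per column to see which columns are responsible:

from pyspark.sql import functions as F

# Count the null entries in each column that VectorAssembler will consume.
null_counts = test.select(
    [F.count(F.when(F.col(c).isNull(), c)).alias(c) for c in cols]
)
null_counts.show()

In the rows you printed, neighbourhood_group_cleansed and square_feet are already None, so at least those two columns will show up here.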

You need to handle the null values in your cols columns. Try test.fillna(0, subset=cols) before calling transform, or filter out the rows that have null values in those columns.
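For example (a minimal sketch under the same assumptions, i.e. test, cols and assembler as defined in your question):

# Option 1: replace nulls in the assembled columns with 0, then transform.
filled = test.fillna(0, subset=cols)
output = assembler.transform(filled)

# Option 2: drop any row that has a null in one of the assembled columns.
cleaned = test.dropna(subset=cols)
output = assembler.transform(cleaned)

Whether 0 is a sensible stand-in is a modelling decision (a missing review_scores_rating is not really 0); imputing with pyspark.ml.feature.Imputer or leaving mostly-null columns such as square_feet out of the feature vector may be more appropriate.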