rdd_data = sc.parallelize([list(r)[2:-1] for r in data.itertuples()])
rdd_data.count()
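For reference, here is a minimal version of the surrounding setup. The sample values and app name are placeholders, and `data` is assumed to be a pandas DataFrame (which is where `itertuples()` comes from); the real run points `setMaster` at the standalone cluster's URL rather than `local[*]`:

    import pandas as pd
    from pyspark import SparkConf, SparkContext

    # Placeholder context; the actual job uses the standalone cluster's master URL.
    conf = SparkConf().setAppName("repro").setMaster("local[*]")
    sc = SparkContext(conf=conf)

    # Placeholder frame: itertuples() yields (Index, col0, col1, ...) tuples,
    # so list(r)[2:-1] drops the index, the first column, and the last column.
    data = pd.DataFrame({"a": [1, 2], "b": [3, 4], "c": [5, 6], "d": [7, 8]})

    rdd_data = sc.parallelize([list(r)[2:-1] for r in data.itertuples()])
    rdd_data.count()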
I am running this against a standalone cluster on Windows 7 with Python 3.6, and it fails with the error below:
~\Anaconda2\envs\py36\lib\site-packages\py4j\protocol.py in get_return_value(answer, gateway_client, target_id, name)
    318             raise Py4JJavaError(
    319                 "An error occurred while calling {0}{1}{2}.\n".
--> 320                 format(target_id, ".", name), value)
    321         else:
    322             raise Py4JError(

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): org.apache.spark.SparkException: Python worker did not connect back in time
    at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:138)
    at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:67)
    at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:128)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.SocketTimeoutException: Accept timed out
    at java.net.DualStackPlainSocketImpl.waitForNewConnection(Native Method)
    at java.net.DualStackPlainSocketImpl.socketAccept(DualStackPlainSocketImpl.java:135)
    at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:409)
    at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:199)
    at java.net.ServerSocket.implAccept(ServerSocket.java:545)
    at java.net.ServerSocket.accept(ServerSocket.java:513)
    at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:133)
    ... 12 more

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1517)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1505)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1504)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1504)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1732)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1687)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1676)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2029)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2050)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2069)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2094)
    at org.apache.spark.rdd.RDD$$anonfun$collect$1.apply(RDD.scala:936)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
    at org.apache.spark.rdd.RDD.collect(RDD.scala:935)
    at org.apache.spark.api.python.PythonRDD$.collectAndServe(PythonRDD.scala:467)
    at org.apache.spark.api.python.PythonRDD.collectAndServe(PythonRDD.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:280)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:214)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.SparkException: Python worker did not connect back in time
    at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:138)
    at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:67)
    at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:117)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:128)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    ... 1 more
Caused by: java.net.SocketTimeoutException: Accept timed out
    at java.net.DualStackPlainSocketImpl.waitForNewConnection(Native Method)
    at java.net.DualStackPlainSocketImpl.socketAccept(DualStackPlainSocketImpl.java:135)
    at java.net.AbstractPlainSocketImpl.accept(AbstractPlainSocketImpl.java:409)
    at java.net.PlainSocketImpl.accept(PlainSocketImpl.java:199)
    at java.net.ServerSocket.implAccept(ServerSocket.java:545)
    at java.net.ServerSocket.accept(ServerSocket.java:513)
    at org.apache.spark.api.python.PythonWorkerFactory.createSimpleWorker(PythonWorkerFactory.scala:133)
    ... 12 more
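Both "Caused by" chains point at the same thing: the JVM launches a Python worker and then times out in ServerSocket.accept while waiting for that worker to connect back (java.net.SocketTimeoutException: Accept timed out inside PythonWorkerFactory.createSimpleWorker). In case it is relevant, this is how one would pin the worker interpreter to the py36 env before creating the SparkContext; PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are the standard PySpark environment variables, but the path below is a placeholder for my machine and I have not confirmed that the interpreter choice is what causes the timeout:

    import os

    # Point both the driver and the workers at the same interpreter
    # (placeholder path; adjust to wherever the py36 env actually lives).
    os.environ["PYSPARK_PYTHON"] = r"C:\Anaconda2\envs\py36\python.exe"
    os.environ["PYSPARK_DRIVER_PYTHON"] = r"C:\Anaconda2\envs\py36\python.exe"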