RDD cannot be saved as a text file in pyspark

Time: 2019-06-12 05:20:28

Tags: python python-3.x apache-spark hadoop pyspark

I am running the following command on an AWS EC2 instance that runs PySpark.

final_rdd.coalesce(1).saveAsTextFile('<Location for saving file>')

The command fails with the following log.

[Stage 1:>                                                          (0 + 1) / 1]
19/06/12 05:08:41 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 7, ip-10-145-62-182.ec2.internal, executor 2): org.apache.spark.SparkException: Task failed while writing rows
    at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:155)
    at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83)
    at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:121)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/mnt/yarn/usercache/hadoop/appcache/application_1556865500911_0446/container_1556865500911_0446_01_000003/pyspark.zip/pyspark/worker.py", line 262, in main
    ("%d.%d" % sys.version_info[:2], version))
Exception: Python in worker has different version 2.7 than that in driver 3.5, PySpark cannot run with different minor versions. Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.handlePythonException(PythonRunner.scala:452)
    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:588)
    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRunner.scala:571)
    at org.apache.spark.api.python.BasePythonRunner$ReaderIterator.hasNext(PythonRunner.scala:406)
    at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:37)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:409)
    at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:128)
    at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$4.apply(SparkHadoopWriter.scala:127)
    at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1394)
    at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:139)
    ... 10 more

19/06/12 05:08:41 ERROR TaskSetManager: Task 0 in stage 1.0 failed 4 times; aborting job

19/06/12 05:08:41 ERROR SparkHadoopWriter: Aborting job job_20190612050833_0014.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 10, ip-10-145-62-182.ec2.internal, executor 2): org.apache.spark.SparkException: Task failed while writing rows
    at org.apache.spark.internal.io.SparkHadoopWriter$.org$apache$spark$internal$io$SparkHadoopWriter$$executeTask(SparkHadoopWriter.scala:155)
    at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:83)
    at org.apache.spark.internal.io.SparkHadoopWriter$$anonfun$3.apply(SparkHadoopWriter.scala:78)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
    at org.apache.spark.scheduler.Task.run(Task.scala:121)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$10.apply(Executor.scala:402)
    at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1360)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:408)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/mnt/yarn/usercache/hadoop/appcache/application_1556865500911_0446/container_1556865500911_0446_01_000003/pyspark.zip/pyspark/worker.py", line 262, in main
    ("%d.%d" % sys.version_info[:2], version))
Exception: Python in worker has different version 2.7 than that in driver 3.5, PySpark cannot run with different minor versions. Please check environment variables PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON are correctly set.

1 Answer:

Answer 0 (score: 0)

You have a Python version mismatch: your worker nodes are running Python 2.7 while the driver node is running Python 3.5, and PySpark requires the driver and workers to use the same minor version. Install a matching Python on every node, or point PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON at the same interpreter everywhere, as sketched below.
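Here is a minimal sketch of one way to do that from the driver script. The interpreter path /usr/bin/python3 is an assumption; substitute wherever a Python matching the driver's 3.5 is installed on every node. PYSPARK_PYTHON has to be set before the SparkContext is created, because PySpark reads it at context creation to choose the executable the executors launch for Python workers.

import os
import sys

# Assumption: a matching Python 3 interpreter lives at /usr/bin/python3
# on every worker node; adjust the path for your cluster.
# This must happen before the SparkContext is created.
os.environ["PYSPARK_PYTHON"] = "/usr/bin/python3"

from pyspark import SparkContext

sc = SparkContext(appName="version-check")

# Sanity check: compare the driver's Python version with the version
# the workers actually run.
driver_version = sys.version_info[:2]
worker_version = (
    sc.parallelize([0], numSlices=1)
      .map(lambda _: __import__("sys").version_info[:2])
      .collect()[0]
)
print("driver:", driver_version, "worker:", worker_version)

Equivalently, export PYSPARK_PYTHON and PYSPARK_DRIVER_PYTHON in the shell (or in conf/spark-env.sh) before launching pyspark or spark-submit, so the driver and every worker resolve to the same interpreter.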