I entered these commands in the pyspark shell:
In [1]: myrdd = sc.textFile("Cloudera-cdh5.repo")
In [2]: myrdd.map(lambda x:x.upper()).collect()
When I run `myrdd.map(lambda x:x.upper()).collect()`, I get an error.
Here is the error message:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 4 times, most recent failure: Lost task 0.3 in stage 0.0 (TID 6, tiger): java.io.IOException: Cannot run program "/usr/local/bin/python3": error=2, No such file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1047)
at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:160)
at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:86)
at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:62)
at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:135)
at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:73)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:88)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.IOException: error=2, No such file or directory
at java.lang.UNIXProcess.forkAndExec(Native Method)
at java.lang.UNIXProcess.<init>(UNIXProcess.java:186)
at java.lang.ProcessImpl.start(ProcessImpl.java:130)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1028)
... 13 more
The file /usr/local/bin/python3 exists on disk.
How can I fix the error above?
Answer 0: (score: 4)
For Windows users: create a spark-env.cmd file in your conf directory and put the following line in it.
set PYSPARK_PYTHON=C:\Python39\python.exe
This Stack Overflow answer explains how to set environment variables for PySpark on Windows.
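Answer 1: (score: 0)
You need to grant access permissions on the path /usr/local/bin/python3. You can do that with the command sudo chmod 777 /usr/local/bin/python3/*.
I think this problem is caused by the PYSPARK_PYTHON variable, which tells each node where to find Python. You can set it with the command below: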
export PYSPARK_PYTHON=/usr/local/bin/python3
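If you start Spark from your own Python script rather than from the pyspark shell, the same variable can be set from Python. The following is only a minimal sketch: the helper name `configure_pyspark_python` is hypothetical, and it assumes the chosen interpreter path exists on every worker node, not only on the driver. The variable has to be set before the SparkContext is created.

```python
import os
import sys


def configure_pyspark_python(interpreter=None):
    """Set PYSPARK_PYTHON for this process (hypothetical helper).

    Defaults to the interpreter running this script. This is an assumption:
    the same path must also exist on every worker node.
    """
    path = interpreter or sys.executable
    os.environ["PYSPARK_PYTHON"] = path
    return path


# Call this before creating the SparkContext.
chosen = configure_pyspark_python()
print("PYSPARK_PYTHON =", chosen)
```

In the interactive pyspark shell `sc` already exists when the prompt appears, so there the variable should be exported in the terminal before launching the shell, as in the command above.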
Answer 2: (score: 0)
This may be a more "silly" problem than a permissions issue: python3 may simply not be installed, or the path variable may be wrong.
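One way to tell these cases apart is to check the path directly. The sketch below uses a hypothetical helper, `check_interpreter`, and only inspects the machine it runs on. The stack trace names the host "tiger" as the one that lost the task, so it is probably worth running the check on that worker as well as on the driver.

```python
import os
import shutil


def check_interpreter(path):
    """Return a short diagnosis for an interpreter path on the local machine."""
    if not os.path.exists(path):
        return "missing"
    if os.path.isdir(path):
        return "is a directory"
    if not os.access(path, os.X_OK):
        return "not executable"
    return "ok"


# The path taken from the error message in the question.
print(check_interpreter("/usr/local/bin/python3"))
# Where (if anywhere) python3 is found on this machine's PATH.
print(shutil.which("python3"))
```

If the result is "missing" on a worker but "ok" on the driver, Python 3 is likely installed in a different location on that node, or not installed there at all.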
Answer 3: (score: 0)
You can also make python point to python3:
sudo alternatives --set python /usr/bin/python3
python --version
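Answer 4: (score: 0)
I am using Windows 10 and faced the same problem. I fixed it simply by copying python.exe, renaming the copy to python3.exe, and adding the folder that contains python.exe to the PATH environment variable.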