I just read about findspark and found it quite interesting, since so far I have only used spark-submit, which is not well suited for interactive development in an IDE. I tried running this file on Windows 10, Anaconda 4.4.0, Python 3.6.1, IPython 5.3.0, Spyder 3.1.4, Spark 2.1.1:
def inc(i):
    return i + 1

import findspark
findspark.init()

import pyspark
sc = pyspark.SparkContext(master='local',
                          appName='test1')
print(repr(sc.parallelize(tuple(range(10))).map(inc).collect()))
Spyder generates the command runfile('C:/tests/temp1.py', wdir='C:/tests'), and it prints [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] as expected. However, if I try to use a Spark cluster running on Ubuntu instead, I get an error:
def inc(i):
    return i + 1

import findspark
findspark.init()

import pyspark
sc = pyspark.SparkContext(master='spark://192.168.1.57:7077',
                          appName='test1')
print(repr(sc.parallelize(tuple(range(10))).map(inc).collect()))
IPython error:
Traceback (most recent call last):
  File "<ipython-input-1-820bd4275b8c>", line 1, in <module>
    runfile('C:/tests/temp.py', wdir='C:/tests')
  File "C:\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 880, in runfile
    execfile(filename, namespace)
  File "C:\Anaconda3\lib\site-packages\spyder\utils\site\sitecustomize.py", line 102, in execfile
    exec(compile(f.read(), filename, 'exec'), namespace)
  File "C:/tests/temp.py", line 11, in <module>
    print(repr(sc.parallelize(tuple(range(10))).map(inc).collect()))
  File "C:\projects\spark-2.1.1-bin-hadoop2.7\python\pyspark\rdd.py", line 808, in collect
    port = self.ctx._jvm.PythonRDD.collectAndServe(self._jrdd.rdd())
  File "C:\projects\spark-2.1.1-bin-hadoop2.7\python\lib\py4j-0.10.4-src.zip\py4j\java_gateway.py", line 1133, in __call__
    answer, self.gateway_client, self.target_id, self.name)
  File "C:\projects\spark-2.1.1-bin-hadoop2.7\python\lib\py4j-0.10.4-src.zip\py4j\protocol.py", line 319, in get_return_value
    format(target_id, ".", name), value)
Py4JJavaError: An error occurred while calling
Worker stderr:
ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.io.IOException: Cannot run program "C:\Anaconda3\pythonw.exe": error=2, No such file or directory
    at java.lang.ProcessBuilder.start(ProcessBuilder.java:1048)
    at org.apache.spark.api.python.PythonWorkerFactory.startDaemon(PythonWorkerFactory.scala:163)
    at org.apache.spark.api.python.PythonWorkerFactory.createThroughDaemon(PythonWorkerFactory.scala:89)
    at org.apache.spark.api.python.PythonWorkerFactory.create(PythonWorkerFactory.scala:65)
    at org.apache.spark.SparkEnv.createPythonWorker(SparkEnv.scala:116)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:128)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:99)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
For some reason this is trying to use a Windows binary path on the Linux slave. Any ideas how to overcome this? I get the same result with the Python console in Spyder, except that the error there is Cannot run program "C:\Anaconda3\python.exe": error=2, No such file or directory. It also happens from the command line when running python temp.py.
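A quick check on the driver side shows where the offending path comes from: it matches the driver's own interpreter. A stdlib-only sketch (no cluster needed):

```python
import sys

# The path the Linux executor tried to run ("C:\Anaconda3\pythonw.exe" from
# runfile, "C:\Anaconda3\python.exe" from the console) is simply this driver's
# own interpreter path, which only exists on the Windows machine:
print(sys.executable)
```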
This version runs fine, even when submitted from Windows to the Linux cluster:
def inc(i):
    return i + 1

import pyspark
sc = pyspark.SparkContext(appName='test2')
print(repr(sc.parallelize(tuple(range(10))).map(inc).collect()))
spark-submit --master spark://192.168.1.57:7077 temp2.py
Answer 0 (score: 0)
I found the solution, and it turned out to be very simple. pyspark/context.py uses the environment variable PYSPARK_PYTHON to determine the path of the Python executable, and it defaults to plain python. However, findspark.init() by default overrides this environment variable with sys.executable, which obviously cannot work across platforms.

Anyway, here is the working code for future reference:
def inc(i):
    return i + 1

import findspark
findspark.init(python_path='python')  # <-- so simple!

import pyspark
sc = pyspark.SparkContext(master='spark://192.168.1.57:7077',
                          appName='test1')
print(repr(sc.parallelize(tuple(range(10))).map(inc).collect()))
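For reference, the same effect can also be achieved without findspark's python_path argument, by setting PYSPARK_PYTHON yourself before the SparkContext is created (a minimal sketch; the commented-out pyspark lines are the same as above and need a running cluster):

```python
import os

# Must be set before the SparkContext is created. The bare name "python" is
# resolved on each worker's own PATH, so it works across platforms, unlike
# the absolute Windows path that findspark exports by default.
os.environ["PYSPARK_PYTHON"] = "python"

# import pyspark
# sc = pyspark.SparkContext(master='spark://192.168.1.57:7077',
#                           appName='test1')
```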