I have a Docker container running Apache Spark with a master and a slave worker. I'm trying to submit jobs from a Jupyter notebook on the host machine. See below:
# Init
!pip install findspark
import findspark
findspark.init()
# Context setup
from pyspark import SparkConf, SparkContext
# Docker container is exposing port 7077
conf = SparkConf().setAppName('test').setMaster('spark://localhost:7077')
sc = SparkContext(conf=conf)
sc
# Execute step
import random
num_samples = 1000
def inside(p):
    x, y = random.random(), random.random()
    return x*x + y*y < 1
count = sc.parallelize(range(0, num_samples)).filter(inside).count()
pi = 4 * count / num_samples
print(pi)
The execute step fails with the following error:
Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException:
Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 6, 172.17.0.2, executor 0):
java.io.IOException: Cannot run program "/Users/omar/anaconda3/bin/python": error=2, No such file or directory
It looks to me as though the command is trying to run the Spark job locally, when it should be sending it to the Spark master specified in the step above. Isn't that possible from a Jupyter notebook?
My container is based on https://hub.docker.com/r/p7hb/docker-spark/, but I installed Python 3.6 under /usr/bin/python3.6.
Answer 0 (score: 4):
I had to do the following before creating the SparkContext:
import os
# Path on master/worker where Python is installed
os.environ['PYSPARK_PYTHON'] = '/usr/bin/python3.6'
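The ordering matters: the environment variable has to be set in the driver process before the SparkContext is constructed, because the executor configuration is fixed at context startup. A minimal sketch of the fix, assuming Python lives at /usr/bin/python3.6 on the workers (as in the container above):

```python
import os

# PYSPARK_PYTHON tells the Spark executors which Python interpreter to
# launch for Python tasks. Without it, the driver's own interpreter path
# (here, the host's Anaconda path) is shipped to the workers, where that
# path does not exist -- hence the "No such file or directory" error.
# This must run BEFORE SparkContext(conf=conf) is called.
os.environ['PYSPARK_PYTHON'] = '/usr/bin/python3.6'

print(os.environ['PYSPARK_PYTHON'])
```

Only after this assignment is it safe to create the context with `SparkContext(conf=conf)`.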
Some research suggested I should instead add it to /usr/local/spark/conf/spark-env.sh via:
export PYSPARK_PYTHON='/usr/bin/python3.6'
But this did not work.
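For completeness: on Spark 2.1 and later, the executor interpreter can also be chosen through the `spark.pyspark.python` configuration property, which (per the Spark configuration docs) takes precedence over the `PYSPARK_PYTHON` environment variable. A sketch of the equivalent entry, assuming the same install path inside the container, in /usr/local/spark/conf/spark-defaults.conf:

```
spark.pyspark.python    /usr/bin/python3.6
```

Whichever mechanism is used, the interpreter path must be valid on every worker, not just on the driver.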