How do I run a PySpark job from a local Jupyter notebook against a Spark master in a Docker container?

Date: 2017-06-27 19:50:40

Tags: python apache-spark pyspark

I have a Docker container running Apache Spark with a master and slave workers. I am trying to submit a job from a Jupyter notebook on the host machine. See below:

# Init
!pip install findspark
import findspark
findspark.init()


# Context setup
from pyspark import SparkConf, SparkContext
# Docker container is exposing port 7077
conf = SparkConf().setAppName('test').setMaster('spark://localhost:7077')
sc = SparkContext(conf=conf)
sc

# Execute step: Monte Carlo estimate of pi
import random
num_samples = 1000

def inside(p):
    x, y = random.random(), random.random()
    return x*x + y*y < 1

count = sc.parallelize(range(0, num_samples)).filter(inside).count()
pi = 4 * count / num_samples
print(pi)

The execute step produces the following error:

Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: 
    Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 6, 172.17.0.2, executor 0): 

    java.io.IOException: Cannot run program "/Users/omar/anaconda3/bin/python": error=2, No such file or directory

It looks to me like the command is trying to run the Spark job locally rather than sending it to the Spark master specified in the earlier step. Isn't this possible from a Jupyter notebook?

My container is based on https://hub.docker.com/r/p7hb/docker-spark/, but with Python 3.6 installed at /usr/bin/python3.6.
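For reference, the error shows the executor trying to launch the driver's local interpreter (/Users/omar/anaconda3/bin/python), which does not exist inside the container. A minimal diagnostic sketch, assuming findspark may have propagated the driver's own interpreter path, is to print the relevant settings on the driver before creating the context:

import os, sys

# Hypothetical check: see which interpreter the driver would advertise to the
# executors. If PYSPARK_PYTHON is unset or points at a host-only path, the
# workers inside the container cannot launch it.
print("driver python:        ", sys.executable)
print("PYSPARK_PYTHON:       ", os.environ.get('PYSPARK_PYTHON'))
print("PYSPARK_DRIVER_PYTHON:", os.environ.get('PYSPARK_DRIVER_PYTHON'))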

1 Answer:

Answer 0 (score: 4)

I had to run the following before creating the SparkContext:

import os
# Path on master/worker where Python is installed
os.environ['PYSPARK_PYTHON'] = '/usr/bin/python3.6'
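Putting it together, a minimal sketch of the working order of operations (assuming the same image, port mapping, and container Python path as in the question) looks like this:

import os
import findspark
findspark.init()

# Point the executors at the interpreter that exists inside the container,
# overriding any host-side default, before the context is created.
os.environ['PYSPARK_PYTHON'] = '/usr/bin/python3.6'

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName('test').setMaster('spark://localhost:7077')
sc = SparkContext(conf=conf)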

Some research suggested that instead I needed to add the following to /usr/local/spark/conf/spark-env.sh:

export PYSPARK_PYTHON='/usr/bin/python3.6'

but that did not work for me.
