Can a PySpark kernel (JupyterHub) run in yarn-client mode?

Asked: 2016-12-12 16:55:37

Tags: pyspark yarn jupyterhub spark-ec2

My current setup:

  • Spark EC2 cluster with HDFS and YARN
  • JupyterHub (0.7.0)
  • PySpark kernel using python27

For this question I am using very simple code:

rdd = sc.parallelize([1, 2])
rdd.collect()

The PySpark kernel, which works as expected in Spark Standalone, has the following environment variable in its kernel json file:

"PYSPARK_SUBMIT_ARGS": "--master spark://<spark_master>:7077 pyspark-shell"

However, when I try to run in yarn-client mode it hangs forever, and the output in the JupyterHub logs is:

16/12/12 16:45:21 WARN YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
16/12/12 16:45:36 WARN YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
16/12/12 16:45:51 WARN YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
16/12/12 16:46:06 WARN YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources

As described here, I added the HADOOP_CONF_DIR environment variable pointing to the directory where the Hadoop configuration lives, and changed the --master property in PYSPARK_SUBMIT_ARGS to "yarn-client". Also, I can confirm that no other jobs are running during this time and that the workers are correctly registered.
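Concretely, after those two changes the relevant entries in the kernel json env section look roughly like this (the Hadoop configuration path below is a placeholder, not my actual directory):

"HADOOP_CONF_DIR": "<hadoop conf directory>",
"PYSPARK_SUBMIT_ARGS": "--master yarn-client pyspark-shell"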

My impression is that it is possible to configure a JupyterHub notebook with a PySpark kernel to run on YARN, as other people have done it. If that is indeed the case, what am I doing wrong?

2 answers:

Answer 0: (score: 1)

To get your pyspark working in yarn mode, you have to do some additional configuration:

  1. Configure yarn for the remote yarn connection by copying the hadoop-yarn-server-web-proxy-<version>.jar of your yarn cluster into the <local hadoop directory>/hadoop-<version>/share/hadoop/yarn/ of your jupyter instance (you need a local hadoop).

  2. Copy the hive-site.xml of your cluster into <local spark directory>/spark-<version>/conf/.

  3. Copy the yarn-site.xml of your cluster into <local hadoop directory>/hadoop-<version>/etc/hadoop/.

  4. Set the environment variables:

    • export HADOOP_HOME=<local hadoop directory>/hadoop-<version>
    • export SPARK_HOME=<local spark directory>/spark-<version>
    • export HADOOP_CONF_DIR=<local hadoop directory>/hadoop-<version>/etc/hadoop
    • export YARN_CONF_DIR=<local hadoop directory>/hadoop-<version>/etc/hadoop

  5. Now you can create your kernel with vim /usr/local/share/jupyter/kernels/pyspark/kernel.json:

     {
       "display_name": "pySpark (Spark 2.1.0)",
       "language": "python",
       "argv": [
         "/opt/conda/envs/python35/bin/python",
         "-m",
         "ipykernel",
         "-f",
         "{connection_file}"
       ],
       "env": {
         "PYSPARK_PYTHON": "/opt/conda/envs/python35/bin/python",
         "SPARK_HOME": "/opt/mapr/spark/spark-2.1.0",
         "PYTHONPATH": "/opt/mapr/spark/spark-2.1.0/python/lib/py4j-0.10.4-src.zip:/opt/mapr/spark/spark-2.1.0/python/",
         "PYTHONSTARTUP": "/opt/mapr/spark/spark-2.1.0/python/pyspark/shell.py",
         "PYSPARK_SUBMIT_ARGS": "--master yarn pyspark-shell"
       }
     }

  6. Restart your jupyterhub and you should see pyspark (a quick verification sketch follows this list). Because of uid=1, the root user usually does not have yarn permission, so you should connect to jupyterhub as another user.
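As a quick sanity check after restarting, you can run something like this in a notebook cell using the new kernel (a minimal sketch; it assumes the kernel's PYTHONSTARTUP has already created the SparkContext as sc, as pyspark/shell.py does):

# `sc` is created by pyspark/shell.py via PYTHONSTARTUP
print(sc.master)                          # should report "yarn"
print(sc.parallelize(range(10)).sum())    # should print 45 once executors register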

Answer 1: (score: 0)

I hope my case can help you.

I just need to pass one argument to configure the master url:

import findspark
findspark.init()

from pyspark import SparkContext

# "yarn-client" runs the driver locally against the YARN cluster
sc = SparkContext("yarn-client", "First App")