My current setup:
I am using this very simple piece of code to reproduce the issue:
rdd = sc.parallelize([1, 2])
rdd.collect()
The PySpark kernel, which works as expected with Spark Standalone, has the following environment variable in its kernel JSON file:
"PYSPARK_SUBMIT_ARGS": "--master spark://<spark_master>:7077 pyspark-shell"
However, when I try to run in yarn-client mode it stalls forever, and the JupyterHub log output is:
16/12/12 16:45:21 WARN YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
16/12/12 16:45:36 WARN YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
16/12/12 16:45:51 WARN YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
16/12/12 16:46:06 WARN YarnScheduler: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
As described here, I added the HADOOP_CONF_DIR environment variable pointing to the directory where the Hadoop configuration lives, and changed the --master property in PYSPARK_SUBMIT_ARGS to "yarn-client". Also, I can confirm that no other jobs are running during this time and that the workers are registered correctly.
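In other words, the kernel's env section now looks roughly like this (the Hadoop configuration path is an assumption, not from the question):
"env": {
  "HADOOP_CONF_DIR": "/etc/hadoop/conf",
  "PYSPARK_SUBMIT_ARGS": "--master yarn-client pyspark-shell"
}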
My impression is that it is possible to configure a JupyterHub notebook with a PySpark kernel to run against YARN, as other people have done it. If that is indeed the case, what am I doing wrong?
Answer 0 (score: 1)
To get your PySpark working in yarn mode you have to do some additional configuration (example copy commands are sketched after this list):
Configure YARN for the remote YARN connection by copying the hadoop-yarn-server-web-proxy-<version>.jar of your YARN cluster into <local hadoop directory>/hadoop-<version>/share/hadoop/yarn/ on your Jupyter instance (you need a local Hadoop installation).
Copy your cluster's hive-site.xml into <local spark directory>/spark-<version>/conf/.
Copy your cluster's yarn-site.xml into <local hadoop directory>/hadoop-<version>/etc/hadoop/.
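For example, the copies could look roughly like the following (the hostname, versions, and remote paths are placeholders, not from the original answer):
# Run on the machine hosting Jupyter; adjust hosts, versions, and paths to your cluster.
scp user@yarn-node:/opt/hadoop/share/hadoop/yarn/hadoop-yarn-server-web-proxy-2.7.3.jar <local hadoop directory>/hadoop-2.7.3/share/hadoop/yarn/
scp user@yarn-node:/etc/hive/conf/hive-site.xml <local spark directory>/spark-2.1.0/conf/
scp user@yarn-node:/etc/hadoop/conf/yarn-site.xml <local hadoop directory>/hadoop-2.7.3/etc/hadoop/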
Set the environment variables:
export HADOOP_HOME=<local hadoop directory>/hadoop-<version>
export SPARK_HOME=<local spark directory>/spark-<version>
export HADOOP_CONF_DIR=<local hadoop directory>/hadoop-<version>/etc/hadoop
export YARN_CONF_DIR=<local hadoop directory>/hadoop-<version>/etc/hadoop
Now you can create the kernel: vim /usr/local/share/jupyter/kernels/pyspark/kernel.json
{
"display_name": "pySpark (Spark 2.1.0)",
"language": "python",
"argv": [
"/opt/conda/envs/python35/bin/python",
"-m",
"ipykernel",
"-f",
"{connection_file}"
],
"env": {
"PYSPARK_PYTHON": "/opt/conda/envs/python35/bin/python",
"SPARK_HOME": "/opt/mapr/spark/spark-2.1.0",
"PYTHONPATH": "/opt/mapr/spark/spark-2.1.0/python/lib/py4j-0.10.4-src.zip:/opt/mapr/spark/spark-2.1.0/python/",
"PYTHONSTARTUP": "/opt/mapr/spark/spark-2.1.0/python/pyspark/shell.py",
"PYSPARK_SUBMIT_ARGS": "--master yarn pyspark-shell"
}
}
Restart your JupyterHub and you should see the pySpark kernel. Note that the root user usually does not have YARN permission because of uid = 1; you should connect to JupyterHub as a different user.
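Once the kernel starts, a quick way to confirm that YARN actually accepts the job is to rerun the two-line example from the question (sc is the SparkContext created automatically by pyspark/shell.py via PYTHONSTARTUP):
# Minimal sanity check inside the new pySpark kernel
rdd = sc.parallelize([1, 2])
print(rdd.collect())  # expected output: [1, 2]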
Answer 1 (score: 0)
I hope my case can help you.
I simply configure it by passing the master URL as an argument:
import findspark
findspark.init()
from pyspark import SparkContext
sc = SparkContext("yarn-client", "First App")
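Note that the "yarn-client" master string is the Spark 1.x convention; on Spark 2.x the master is simply "yarn", and client mode is the default deploy mode. A minimal sketch of the same idea for Spark 2.x, assuming HADOOP_CONF_DIR or YARN_CONF_DIR points at the cluster configuration:
import findspark
findspark.init()

from pyspark import SparkConf, SparkContext

# "yarn" with the default client deploy mode replaces the old "yarn-client" master string
conf = SparkConf().setMaster("yarn").setAppName("First App")
sc = SparkContext(conf=conf)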