I need to set up pySpark on my machine to access and read data on a remote Hadoop cluster, but I am running into some problems.
Here are the steps I followed:
1) brew install apache-spark
2) export SPARK_HOME=/usr/local/Cellar/apache-spark/1.6.1
export PYTHONPATH=$SPARK_HOME/libexec/python:$SPARK_HOME/libexec/python/build:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python/lib/py4j-0.8.2.1-src.zip:$PYTHONPATH
export PYTHONPATH=$SPARK_HOME/python:$SPARK_HOME/python/build:$PYTHONPATH
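(Aside: instead of exporting these paths by hand, I have also seen the findspark package recommended. A minimal sketch, assuming findspark is pip-installed and that with the Homebrew layout above the real Spark home is the libexec directory:

import findspark

# Assumption: under the Homebrew package, the actual Spark distribution
# lives in libexec, so that is what findspark should treat as SPARK_HOME.
findspark.init("/usr/local/Cellar/apache-spark/1.6.1/libexec")

from pyspark import SparkContext, SparkConf

I have not confirmed this is equivalent to the exports above.)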
3)
export HADOOP_USER_NAME=hdfs
export HADOOP_CONF_DIR=yarnconfig
In the yarnconfig directory, I have the following in yarn-site.xml:
<configuration>
  <property>
    <name>yarn.resourcemanager.hostname</name>
    <value>{Hadoop_Cluster_IP}</value>
  </property>
  <property>
    <name>yarn.resourcemanager.address</name>
    <value>${yarn.resourcemanager.hostname}:8050</value>
  </property>
</configuration>
Here {Hadoop_Cluster_IP} is a placeholder for the IP address of the Hadoop cluster I am trying to connect to, which I am not showing for security reasons.
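One thing I am unsure about: to actually read data off HDFS, I believe the same yarnconfig directory also needs a core-site.xml with fs.defaultFS pointing at the cluster's NameNode. A sketch under that assumption (the 8020 port is a guess; the actual port depends on the cluster):

<configuration>
  <property>
    <!-- Assumption: NameNode RPC endpoint of the remote cluster; port may differ -->
    <name>fs.defaultFS</name>
    <value>hdfs://{Hadoop_Cluster_IP}:8020</value>
  </property>
</configuration>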
Then, in the Python shell, I run:
from pyspark import SparkContext, SparkConf
conf = SparkConf().setMaster("local").setAppName("LogParser")
sc = SparkContext(conf=conf)
but I get the following error message:
/usr/local/Cellar/apache-spark/1.6.1/bin/load-spark-env.sh: line 2: /usr/local/Cellar/apache-spark/1.6.1/libexec/bin/load-spark-env.sh: Permission denied
/usr/local/Cellar/apache-spark/1.6.1/bin/load-spark-env.sh: line 2: exec: /usr/local/Cellar/apache-spark/1.6.1/libexec/bin/load-spark-env.sh: cannot execute: Undefined error: 0
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/Cellar/apache-spark/1.6.1/libexec/python/pyspark/conf.py", line 104, in __init__
SparkContext._ensure_initialized()
File "/usr/local/Cellar/apache-spark/1.6.1/libexec/python/pyspark/context.py", line 245, in _ensure_initialized
SparkContext._gateway = gateway or launch_gateway()
File "/usr/local/Cellar/apache-spark/1.6.1/libexec/python/pyspark/java_gateway.py", line 94, in launch_gateway
raise Exception("Java gateway process exited before sending the driver its port number")
Exception: Java gateway process exited before sending the driver its port number
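The first two lines suggest that load-spark-env.sh under libexec/bin is not executable for my user. I have not verified this, but I am guessing something along these lines might be part of the fix:

# Guess: restore the execute bit on the brew-installed Spark scripts
chmod +x /usr/local/Cellar/apache-spark/1.6.1/libexec/bin/*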
Do you have any idea what might be going wrong?
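As a side note: my understanding is that even once this error is fixed, setMaster("local") will only ever run on my own machine, and to actually reach the remote cluster I would need YARN client mode (this is my reading of the Spark 1.6 docs, not something I have working yet):

from pyspark import SparkContext, SparkConf

# Assumption: Spark 1.6 YARN client mode, resolving the cluster from the
# yarn-site.xml in HADOOP_CONF_DIR above; "local" never leaves this machine.
conf = SparkConf().setMaster("yarn-client").setAppName("LogParser")
sc = SparkContext(conf=conf)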