I'm using Livy to execute a script stored in S3 via a POST request launched from EMR. The script runs, but it times out very quickly. I tried editing the livy.conf configuration, but none of my changes seem to persist. This is the error that gets returned:
java.lang.Exception: No YARN application is found with tag livy-batch-10-hg3po7kp in 120 seconds. Please check your cluster status, it is may be very busy.
org.apache.livy.utils.SparkYarnApp.org$apache$livy$utils$SparkYarnApp$$getAppIdFromTag(SparkYarnApp.scala:182)
org.apache.livy.utils.SparkYarnApp$$anonfun$1$$anonfun$4.apply(SparkYarnApp.scala:239)
org.apache.livy.utils.SparkYarnApp$$anonfun$1$$anonfun$4.apply(SparkYarnApp.scala:236)
scala.Option.getOrElse(Option.scala:121)
org.apache.livy.utils.SparkYarnApp$$anonfun$1.apply$mcV$sp(SparkYarnApp.scala:236)
org.apache.livy.Utils$$anon$1.run(Utils.scala:94)
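As an aside on the livy.conf edits not sticking: on EMR, the supported way to persist Livy settings is usually a configuration classification supplied when the cluster is created, rather than hand-editing livy.conf on the master node. A sketch (the property name livy.server.yarn.app-lookup-timeout and the 600s value are assumptions to verify against your Livy version; it governs the 120-second lookup window in the error above):

```json
[
  {
    "Classification": "livy-conf",
    "Properties": {
      "livy.server.yarn.app-lookup-timeout": "600s"
    }
  }
]
```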
Answer 0 (score: 0)
This was a tricky one, but I was able to get it working with the following command:
curl -X POST --data '{"proxyUser": "hadoop","file": "s3://MYBUCKETLOCATION/recurring_job_automation/sample-pyspark-app/hello.py", "jars": ["s3://MYBUCKETLOCATION/recurring_job_automation/sample-pyspark-app/NQjc.jar"], "pyFiles": ["s3://MYBUCKETLOCATION/recurring_job_automation/sample-pyspark-app/application.zip"], "archives": ["s3://MYBUCKETLOCATION/recurring_job_automation/sample-pyspark-app/venv.zip#venv"], "driverMemory": "10g", "executorMemory": "10g", "name": "Name of Import Job here", "conf":{
"spark.yarn.appMasterEnv.SPARK_HOME": "/usr/lib/spark",
"spark.yarn.appMasterEnv.PYSPARK_PYTHON": "./venv/bin/python",
"livy.spark.yarn.appMasterEnv.PYSPARK_PYTHON": "./venv/bin/python",
"spark.yarn.executorEnv.PYSPARK_PYTHON": "./venv/bin/python",
"spark.pyspark.virtualenv.enabled": "true",
"spark.pyspark.virtualenv.type": "native",
"spark.pyspark.virtualenv.requirements":"requirements.pip",
"spark.pyspark.virtualenv.bin.path": "virtualenv",
"spark.master": "yarn",
"spark.submit.deployMode": "cluster"}}' -H "Content-Type: application/json" http://MY-PATH--TO-MY--EMRCLUSTER:8998/batches
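If you'd rather script the submission than maintain a raw curl command, the same request body can be built and posted from Python. This is a minimal sketch that mirrors the call above; the bucket name and paths are the same placeholders, and only the Livy /batches endpoint and JSON field names come from the request itself:

```python
import json

def build_batch_payload(bucket: str,
                        prefix: str = "recurring_job_automation/sample-pyspark-app") -> dict:
    """Build the JSON body for Livy's POST /batches, mirroring the curl call above."""
    base = f"s3://{bucket}/{prefix}"
    return {
        "proxyUser": "hadoop",
        "file": f"{base}/hello.py",
        "pyFiles": [f"{base}/application.zip"],
        "archives": [f"{base}/venv.zip#venv"],  # '#venv' names the unpack dir on YARN
        "driverMemory": "10g",
        "executorMemory": "10g",
        "conf": {
            "spark.yarn.appMasterEnv.PYSPARK_PYTHON": "./venv/bin/python",
            "spark.yarn.executorEnv.PYSPARK_PYTHON": "./venv/bin/python",
            "spark.master": "yarn",
            "spark.submit.deployMode": "cluster",
        },
    }

payload = build_batch_payload("MYBUCKETLOCATION")
body = json.dumps(payload)
# Submit with e.g.:
#   requests.post("http://MY-PATH--TO-MY--EMRCLUSTER:8998/batches", data=body,
#                 headers={"Content-Type": "application/json"})
```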
That worked after I ran this script on the master node of the EMR cluster to set up the dependencies, having first cloned the repository containing the application files:
set -e
set -x
export HADOOP_CONF_DIR="/etc/hadoop/conf"
export PYTHON="/usr/bin/python3"
export SPARK_HOME="/usr/lib/spark"
export PATH="$SPARK_HOME/bin:$PATH"
# Set $PYTHON to the Python executable you want to create
# your virtual environment with. It could just be something
# like `python3`, if that's already on your $PATH, or it could
# be a /fully/qualified/path/to/python.
test -n "$PYTHON"
# Make sure $SPARK_HOME is on your $PATH so that `spark-submit`
# runs from the correct location.
test -n "$SPARK_HOME"
"$PYTHON" -m venv venv --copies
source venv/bin/activate
pip install -U pip
pip install -r requirements.pip
deactivate
# Here we package up an isolated environment that we'll ship to YARN.
# The awkward zip invocation for venv just creates nicer relative
# paths.
pushd venv/
zip -rq ../venv.zip *
popd
# Here it's important that application/ be zipped in this way so that
# Python knows how to load the module inside.
zip -rq application.zip application/
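The zip layout matters: application.zip must keep the application/ directory as its top-level entry so Python can import the module, while venv.zip must contain only relative paths (bin/, lib/, ...) because it was zipped from inside venv/. A quick sanity check, sketched here against tiny stand-in archives laid out the way the script above produces them:

```python
import os
import tempfile
import zipfile

def top_level_entries(zip_path: str) -> set:
    """Return the top-level names inside a zip archive; this is what determines
    whether YARN unpacks it usefully and whether Python can import from it."""
    with zipfile.ZipFile(zip_path) as zf:
        return {name.split("/", 1)[0] for name in zf.namelist()}

# Build stand-ins mirroring the layout the script creates.
with tempfile.TemporaryDirectory() as tmp:
    app_zip = os.path.join(tmp, "application.zip")
    with zipfile.ZipFile(app_zip, "w") as zf:
        zf.writestr("application/__init__.py", "")  # zipped as application/ -> importable
    venv_zip = os.path.join(tmp, "venv.zip")
    with zipfile.ZipFile(venv_zip, "w") as zf:
        zf.writestr("bin/python", "")               # zipped from inside venv/ -> relative paths
    app_top = top_level_entries(app_zip)
    venv_top = top_level_entries(venv_zip)
```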
following the instructions I provided here: Bundling Python3 packages for PySpark results in missing imports
If you run into any problems, it helps to check the Livy logs here:
/var/log/livy/livy-livy-server.out
as well as the logs that show up in the Hadoop Resource Manager UI, which you can reach from the link in the EMR console once you've SSH'd into the EMR master node and set up a web browser proxy.
A key aspect of this solution is that, due to the issue described in https://issues.apache.org/jira/browse/LIVY-222, Livy had trouble with the files, so I was able to work around it by referencing the files I had uploaded to S3 via EMRFS. Also, for the virtualenv (if you're using PySpark), it's very important to use the --copies flag; otherwise you'll end up with symlinks that don't work on HDFS.
Issues with using virtualenv have also been reported here: https://issues.apache.org/jira/browse/SPARK-13587 in connection with PySpark (which may not apply to you), so I needed to work around them by adding extra parameters. Some of them are also mentioned here: https://community.hortonworks.com/articles/104947/using-virtualenv-with-pyspark.html
In any case, because Livy had problems uploading local files, it kept failing, unable to copy the files to the staging directory, until I worked around the issue by referencing the files from S3 via EMRFS. Also, when I tried supplying absolute paths in HDFS instead of using S3, the HDFS resources were owned by the hadoop user rather than the livy user, so livy couldn't access them or copy them into the staging directory to run the job. Referencing the files in S3 via EMRFS was therefore essential.
Answer 1 (score: 0)