I'm learning how to submit Spark jobs to AWS EMR for the first time. The script I'm submitting is very short (restaurant.py):
from pyspark import SparkContext
from pyspark.sql import SparkSession, SQLContext
class SparkRawConsumer:

    def __init__(self):
        self.sparkContext = SparkContext.getOrCreate()
        self.sparkContext.setLogLevel("ERROR")
        self.sqlContext = SQLContext(self.sparkContext)
        self.df = self.sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('zomato.csv')

if __name__ == "__main__":
    sparkConsumer = SparkRawConsumer()
    print(sparkConsumer.df.count())
    sparkConsumer.df.groupBy("City").agg({"Average Cost for two": "avg", "Aggregate rating": "avg"})
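For reference, my understanding is that on Spark 2.x the com.databricks.spark.csv format is just an alias for the built-in CSV reader, so the same read could be written with a plain SparkSession (same file and options as above; this is only a sketch, not what I'm actually submitting):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("restaurant").getOrCreate()

# Built-in CSV reader; header/inferSchema correspond to the options used above
df = spark.read.csv("zomato.csv", header=True, inferSchema=True)
print(df.count())
df.groupBy("City").agg({"Average Cost for two": "avg", "Aggregate rating": "avg"}).show()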
I submitted my step through the AWS console (GUI); the CLI export of that step is:
spark-submit --deploy-mode cluster s3://data-pipeline-testing-yu-chen/dependencies/restaurant.py --files s3://data-pipeline-testing-yu-chen/dependencies/zomato.csv
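I believe the equivalent step could also be added from the CLI with aws emr add-steps, roughly like this (the cluster ID below is a placeholder; the argument order mirrors the export above):

aws emr add-steps \
  --cluster-id j-XXXXXXXXXXXXX \
  --steps Type=Spark,Name=restaurant,ActionOnFailure=CONTINUE,Args=[--deploy-mode,cluster,s3://data-pipeline-testing-yu-chen/dependencies/restaurant.py,--files,s3://data-pipeline-testing-yu-chen/dependencies/zomato.csv]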
However, the step runs for a few minutes and then returns exit code 1. I'm quite confused about what is actually going on, and I'm finding the stderr output hard to interpret:
18/07/28 06:40:10 INFO Client: Application report for application_1532756827478_0012 (state: RUNNING)
18/07/28 06:40:11 INFO Client: Application report for application_1532756827478_0012 (state: RUNNING)
18/07/28 06:40:12 INFO Client: Application report for application_1532756827478_0012 (state: RUNNING)
18/07/28 06:40:13 INFO Client: Application report for application_1532756827478_0012 (state: FINISHED)
18/07/28 06:40:13 INFO Client:
     client token: N/A
     diagnostics: User application exited with status 1
     ApplicationMaster host: myip
     ApplicationMaster RPC port: 0
     queue: default
     start time: 1532759825922
     final status: FAILED
     tracking URL: http://myip.compute.internal:20888/proxy/application_1532756827478_0012/
     user: hadoop
18/07/28 06:40:13 INFO Client: Deleted staging directory hdfs://myip.compute.internal:8020/user/hadoop/.sparkStaging/application_1532756827478_0012
Exception in thread "main" org.apache.spark.SparkException: Application application_1532756827478_0012 finished with failed status
        at org.apache.spark.deploy.yarn.Client.run(Client.scala:1165)
        at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1520)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
18/07/28 06:40:13 INFO ShutdownHookManager: Shutdown hook called
18/07/28 06:40:13 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-dedwd323x
18/07/28 06:40:13 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-dedwd323x
Command exiting with ret '1'
I can SSH into my master instance and run the script there by issuing spark-submit restaurant.py. I loaded the CSV file onto my master instance with:
[hadoop@my-ip ~]$ aws s3 sync s3://data-pipeline-testing-yu-chen/dependencies/ .
Then I loaded the zomato.csv file into HDFS:
hadoop fs -put zomato.csv ./zomato.csv
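As a sanity check on where that relative path ends up, the file should then be visible under the hadoop user's HDFS home directory (assuming EMR's default layout):

hadoop fs -ls /user/hadoop/zomato.csv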
My guess is that the --files option I'm passing isn't being used the way I expect, but I'm at a loss as to how to interpret the console output and where to start debugging.
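My mental model of --files is that the CSV gets shipped to the driver and executor containers and should then be resolvable by name from inside the script, for example via SparkFiles (just a sketch of what I expected to work, not something I've verified on the cluster):

from pyspark import SparkContext, SparkFiles

sc = SparkContext.getOrCreate()
# Files shipped with --files should be resolvable by their file name
local_path = SparkFiles.get("zomato.csv")
print(local_path)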