I'm learning how to submit Spark jobs to AWS EMR for the first time. The script I'm submitting is very short (restaurant.py):
from pyspark import SparkContext
from pyspark.sql import SparkSession, SQLContext
class SparkRawConsumer:

    def __init__(self):
        self.sparkContext = SparkContext.getOrCreate()
        self.sparkContext.setLogLevel("ERROR")
        self.sqlContext = SQLContext(self.sparkContext)
        self.df = self.sqlContext.read.format('com.databricks.spark.csv').options(header='true', inferschema='true').load('zomato.csv')

if __name__ == "__main__":
    sparkConsumer = SparkRawConsumer()
    print(sparkConsumer.df.count())
    sparkConsumer.df.groupBy("City").agg({"Average Cost for two": "avg", "Aggregate rating": "avg"})
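For reference, my understanding is that on Spark 2.x the com.databricks.spark.csv format is just an alias for the built-in CSV reader, so the same read could be written with a plain SparkSession (same file and options as above; this is only a sketch, not what I'm actually submitting):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("restaurant").getOrCreate()

# Built-in CSV reader; header/inferSchema correspond to the options used above
df = spark.read.csv("zomato.csv", header=True, inferSchema=True)
print(df.count())
df.groupBy("City").agg({"Average Cost for two": "avg", "Aggregate rating": "avg"}).show()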
I submitted my step through the AWS console (GUI); the CLI export of that step is:
spark-submit --deploy-mode cluster s3://data-pipeline-testing-yu-chen/dependencies/restaurant.py --files s3://data-pipeline-testing-yu-chen/dependencies/zomato.csv
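I believe the equivalent step could also be added from the CLI with aws emr add-steps, roughly like this (the cluster ID below is a placeholder; the argument order mirrors the export above):

aws emr add-steps \
  --cluster-id j-XXXXXXXXXXXXX \
  --steps Type=Spark,Name=restaurant,ActionOnFailure=CONTINUE,Args=[--deploy-mode,cluster,s3://data-pipeline-testing-yu-chen/dependencies/restaurant.py,--files,s3://data-pipeline-testing-yu-chen/dependencies/zomato.csv]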
However, the step runs for a few minutes and then returns exit code 1. I'm quite confused about what is actually going on, and I'm finding the stderr output hard to interpret:
18/07/28 06:40:10 INFO Client: Application report for application_1532756827478_0012 (state: RUNNING)
18/07/28 06:40:11 INFO Client: Application report for application_1532756827478_0012 (state: RUNNING)
18/07/28 06:40:12 INFO Client: Application report for application_1532756827478_0012 (state: RUNNING)
18/07/28 06:40:13 INFO Client: Application report for application_1532756827478_0012 (state: FINISHED)
18/07/28 06:40:13 INFO Client:
     client token: N/A
     diagnostics: User application exited with status 1
     ApplicationMaster host: myip
     ApplicationMaster RPC port: 0
     queue: default
     start time: 1532759825922
     final status: FAILED
     tracking URL: http://myip.compute.internal:20888/proxy/application_1532756827478_0012/
     user: hadoop
18/07/28 06:40:13 INFO Client: Deleted staging directory hdfs://myip.compute.internal:8020/user/hadoop/.sparkStaging/application_1532756827478_0012
Exception in thread "main" org.apache.spark.SparkException: Application application_1532756827478_0012 finished with failed status
        at org.apache.spark.deploy.yarn.Client.run(Client.scala:1165)
        at org.apache.spark.deploy.yarn.YarnClusterApplication.start(Client.scala:1520)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:894)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:198)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:228)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:137)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
18/07/28 06:40:13 INFO ShutdownHookManager: Shutdown hook called
18/07/28 06:40:13 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-dedwd323x
18/07/28 06:40:13 INFO ShutdownHookManager: Deleting directory /mnt/tmp/spark-dedwd323x
Command exiting with ret '1'
I can SSH into my master instance and run the script there by issuing spark-submit restaurant.py. I loaded the CSV file onto my master instance with:
[hadoop@my-ip ~]$ aws s3 sync s3://data-pipeline-testing-yu-chen/dependencies/ .
Then I loaded the zomato.csv file into HDFS:
hadoop fs -put zomato.csv ./zomato.csv
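As a sanity check on where that relative path ends up, the file should then be visible under the hadoop user's HDFS home directory (assuming EMR's default layout):

hadoop fs -ls /user/hadoop/zomato.csv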
My guess is that the --files option I'm passing isn't being used the way I expect, but I'm at a loss as to how to interpret the console output and where to start debugging.
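My mental model of --files is that the CSV gets shipped to the driver and executor containers and should then be resolvable by name from inside the script, for example via SparkFiles (just a sketch of what I expected to work, not something I've verified on the cluster):

from pyspark import SparkContext, SparkFiles

sc = SparkContext.getOrCreate()
# Files shipped with --files should be resolvable by their file name
local_path = SparkFiles.get("zomato.csv")
print(local_path)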