Spark driver runs out of memory when reading multiple S3 files

Asked: 2017-10-05 23:31:02

Tags: java hadoop apache-spark amazon-s3

Situation

I am new to Spark. I am running a Spark job on EMR that reads a bunch of S3 files and performs a map/reduce job over them. There are 200 S3 locations in total, containing an average of 400 files each.

In the snippet below, the textFile(...) API is invoked with comma-separated S3 paths using wildcards (*):

sc.textFile("S3://FilePath1/\*","S3://FilePath2/\*"....."S3://FilePath200/\*")

The job spends a large amount of time in the driver and eventually runs out of memory with the following error:

Container [pid=66583,containerID=container_1507231957101_0001_02_000001] is running beyond physical memory limits. 
Current usage: 1.5 GB of 1.4 GB physical memory used; 3.3 GB of 6.9 GB virtual memory used. Killing container.
Dump of the process-tree for container_1507231957101_0001_02_000001

Questions

  1. I set the driver memory to 32g with the code below, but the driver still runs with 1.4g. Am I missing something? I am submitting the job with spark-submit --verbose --deploy-mode cluster.
    private void initializeSparkContext() {
        final SparkConf conf = new SparkConf().setAppName(comparisonJobArgument.getAppName());
        conf.set("spark.driver.memory", "32g");
        conf.set("spark.files.maxPartitionBytes", "134217728");
        context = new JavaSparkContext(conf);
    }
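
    (Note: in cluster deploy mode the driver JVM, the YARN ApplicationMaster container here, is sized at submission time, before this code ever runs, so spark.driver.memory set on the SparkConf arrives too late; the Spark docs recommend the --driver-memory flag or spark-defaults.conf instead. A small sketch to verify what the driver JVM actually got:)

    // Sketch: print the driver JVM's real max heap at runtime.
    // Runtime.maxMemory() reflects -Xmx; with the 1g default this is
    // roughly 1024 MB, matching the -Xmx1024m in the process dump below.
    long maxHeapMb = Runtime.getRuntime().maxMemory() / (1024 * 1024);
    System.out.println("Driver max heap (MB): " + maxHeapMb);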
    

    Adding more code

    RDD1

    context
        .textFile(commaSeperatedS3Locations) // 200 folders like s3://path/*, with 400 items in each
        .mapPartitions(StringToObjectTransformer())
        .filter(filter)  
    

    RDD2

    context
      .textFile(commaSeperatedS3Locations) // 1280 s3 files
      .mapPartitions(StringToObjectTransformer())
      .filter(filter)
      .map(Object1ToObject2Transformer())
      .flatMap(k -> k.iterator())
    

    RDD3

    context.union(RDD1)
      .union(RDD2)
      .map(Object1ToObject2Transformer())
      .mapToPair(mapToPairObject)
      .reduceByKey()
      .coalesce(320, false)
      .cache(); // I have a total of 1 TB of executor memory.
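
    One note on the trailing .cache(): cache() is MEMORY_ONLY, so partitions that do not fit in executor memory are dropped and recomputed by each of the five saveAsTextFile actions below. A hedged variant, if that recomputation is a concern, ends the chain with persist instead (same pipeline as above; only the final call changes):

    import org.apache.spark.storage.StorageLevel;

    context.union(RDD1)
      .union(RDD2)
      .map(Object1ToObject2Transformer())
      .mapToPair(mapToPairObject)
      .reduceByKey()                 // elided in the question; kept as-is
      .coalesce(320, false)
      .persist(StorageLevel.MEMORY_AND_DISK()); // spill to local disk instead of recomputing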
    

    saveAsTextFile statements:

    RDD3.filter(filter1).saveAsTextFile("s3://OutputPath1");
    RDD3.filter(filter2).saveAsTextFile("s3://OutputPath2");
    RDD3.filter(filter3).saveAsTextFile("s3://OutputPath3");
    RDD3.filter(filter4).saveAsTextFile("s3://OutputPath4");
    RDD3.filter(filter5).saveAsTextFile("s3://OutputPath5");
    

    Any help with this is much appreciated.

    Thanks in advance.

    Full error message

    Application application_1507231957101_0001 failed 2 times due to AM Container for appattempt_1507231957101_0001_000002 exited with exitCode: -104
    For more detailed output, check application tracking page:http://ip-172-16-0-98.us-west-2.compute.internal:8088/cluster/app/application_1507231957101_0001Then, click on links to logs of each attempt.
    Diagnostics: Container [pid=66583,containerID=container_1507231957101_0001_02_000001] is running beyond physical memory limits. Current usage: 1.5 GB of 1.4 GB physical memory used; 3.3 GB of 6.9 GB virtual memory used. Killing container.
    Dump of the process-tree for container_1507231957101_0001_02_000001 :
    |- PID PPID PGRPID SESSID CMD_NAME USER_MODE_TIME(MILLIS) SYSTEM_TIME(MILLIS) VMEM_USAGE(BYTES) RSSMEM_USAGE(PAGES) FULL_CMD_LINE
    |- 66583 66581 66583 66583 (bash) 0 0 115814400 688 /bin/bash -c LD_LIBRARY_PATH=/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native::/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native:/usr/lib/hadoop/lib/native /usr/lib/jvm/java-openjdk/bin/java -server -Xmx1024m -Djava.io.tmpdir=/mnt/yarn/usercache/hadoop/appcache/application_1507231957101_0001/container_1507231957101_0001_02_000001/tmp '-XX:+UseConcMarkSweepGC' '-XX:CMSInitiatingOccupancyFraction=70' '-XX:MaxHeapFreeRatio=70' '-XX:+CMSClassUnloadingEnabled' '-XX:OnOutOfMemoryError=kill -9 %p' -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1507231957101_0001/container_1507231957101_0001_02_000001 org.apache.spark.deploy.yarn.ApplicationMaster --class 'com.amazon.reconcentral.comparisonengine.jobs.main.ComparisonJob' --jar s3://recon-central-test-usamazon/lib/comparison-engine/ReconCentralComparisonEngine-1.0-super.jar --arg '-s3B' --arg 'recon-central-test-usamazon' --arg '-s3L' --arg 'var/args/comparison-engine/ComparisonEngine:RC_ACETOUSL_ALLREGION.750.bWcQFMA.301-d25518a5-459e-49f0-8d6b-71ad695bbb7f.json' --arg '-s3E' --arg '3ebfb91d-faf0-4295-a5d9-408080e71841' --properties-file /mnt/yarn/usercache/hadoop/appcache/application_1507231957101_0001/container_1507231957101_0001_02_000001/__spark_conf__/__spark_conf__.properties 1> /var/log/hadoop-yarn/containers/application_1507231957101_0001/container_1507231957101_0001_02_000001/stdout 2> /var/log/hadoop-yarn/containers/application_1507231957101_0001/container_1507231957101_0001_02_000001/stderr
    |- 66588 66583 66583 66583 (java) 27893 936 3445600256 385188 /usr/lib/jvm/java-openjdk/bin/java -server -Xmx1024m -Djava.io.tmpdir=/mnt/yarn/usercache/hadoop/appcache/application_1507231957101_0001/container_1507231957101_0001_02_000001/tmp -XX:+UseConcMarkSweepGC -XX:CMSInitiatingOccupancyFraction=70 -XX:MaxHeapFreeRatio=70 -XX:+CMSClassUnloadingEnabled -XX:OnOutOfMemoryError=kill -9 %p -Dspark.yarn.app.container.log.dir=/var/log/hadoop-yarn/containers/application_1507231957101_0001/container_1507231957101_0001_02_000001 org.apache.spark.deploy.yarn.ApplicationMaster --class com.amazon.reconcentral.comparisonengine.jobs.main.ComparisonJob --jar s3://recon-central-test-usamazon/lib/comparison-engine/ReconCentralComparisonEngine-1.0-super.jar --arg -s3B --arg recon-central-test-usamazon --arg -s3L --arg var/args/comparison-engine/ComparisonEngine:RC_ACETOUSL_ALLREGION.750.bWcQFMA.301-d25518a5-459e-49f0-8d6b-71ad695bbb7f.json --arg -s3E --arg 3ebfb91d-faf0-4295-a5d9-408080e71841 --properties-file /mnt/yarn/usercache/hadoop/appcache/application_1507231957101_0001/container_1507231957101_0001_02_000001/__spark_conf__/__spark_conf__.properties
    Container killed on request. Exit code is 143
    Container exited with a non-zero exit code 143
    Failing this attempt. Failing the application.
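
    For what it's worth, the process dump above shows the driver JVM was launched with -Xmx1024m, i.e. the spark.driver.memory default of 1g was still in force when YARN sized the container. Assuming Spark's default YARN memory overhead of max(384 MB, 10% of driver memory), the limit works out as:

    1024 MB (driver heap, spark.driver.memory default of 1g)
    + 384 MB (YARN memory overhead: max(384 MB, 10% of 1024 MB))
    = 1408 MB ≈ the "1.4 GB physical memory" limit in the error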
    

1 Answer:

Answer 0 (score: 2)

  

SPARK_WORKER_MEMORY is only used in standalone deploy mode.

     

SPARK_EXECUTOR_MEMORY is used in YARN deploy mode.

You can either launch your spark-shell with:

./bin/spark-shell --driver-memory 40g

Or you can set it in spark-defaults.conf:

spark.driver.memory 40g

If you launch your application with spark-submit, you must specify the driver memory as a flag:

./bin/spark-submit --driver-memory 40g --class main.class yourApp.jar
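
Applied to the submission described in the question (cluster deploy mode on YARN; the class and jar below are copied from the container dump in the question, with the application arguments elided), the fix would look roughly like:

./bin/spark-submit --verbose \
  --deploy-mode cluster \
  --driver-memory 32g \
  --class com.amazon.reconcentral.comparisonengine.jobs.main.ComparisonJob \
  s3://recon-central-test-usamazon/lib/comparison-engine/ReconCentralComparisonEngine-1.0-super.jar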
  

Properties set directly on the SparkConf take the highest precedence, then flags passed to spark-submit or spark-shell, then options in the spark-defaults.conf file.

This is the order of precedence (from highest to lowest):

  1. Properties set on the SparkConf (in the program).
  2. Flags passed to spark-submit or spark-shell.
  3. Options set in the spark-defaults.conf file.

Run on a Spark standalone cluster in client deploy mode:

    ./bin/spark-submit \
      --class org.apache.spark.examples.SparkPi \
      --master spark://207.184.161.138:7077 \
      --executor-memory 20G \
      --total-executor-cores 100 \
      /path/to/examples.jar \
      1000
    

Run on a Spark standalone cluster in cluster deploy mode with supervise:

    ./bin/spark-submit \
      --class org.apache.spark.examples.SparkPi \
      --master spark://207.184.161.138:7077 \
      --deploy-mode cluster \
      --supervise \
      --executor-memory 20G \
      --total-executor-cores 100 \
      /path/to/examples.jar \
      1000
    

Run on a YARN cluster:

    export HADOOP_CONF_DIR=XXX
    ./bin/spark-submit \
      --class org.apache.spark.examples.SparkPi \
      --master yarn \
      --deploy-mode cluster \  # can be client for client mode
      --executor-memory 20G \
      --num-executors 50 \
      /path/to/examples.jar \
      1000
    

Run a Python application on a Spark standalone cluster:

    ./bin/spark-submit \
      --master spark://207.184.161.138:7077 \
      examples/src/main/python/pi.py \
      1000
    

Run on a Mesos cluster in cluster deploy mode with supervise:

    ./bin/spark-submit \
      --class org.apache.spark.examples.SparkPi \
      --master mesos://207.184.161.138:7077 \
      --deploy-mode cluster \
      --supervise \
      --executor-memory 20G \
      --total-executor-cores 100 \
      http://path/to/examples.jar \
      1000
    

    * http://docs.aws.amazon.com/emr/latest/ReleaseGuide/emr-spark-configure.html
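
Since the job runs on EMR, the same setting can also be applied when the cluster is created, through the spark-defaults configuration classification described at the link above. A sketch, showing only the one relevant property:

[
  {
    "Classification": "spark-defaults",
    "Properties": {
      "spark.driver.memory": "32g"
    }
  }
]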