I have an EMR cluster with one master node and four worker nodes. Each node has 4 cores and 16 GB of RAM. I am trying to fit a logistic regression to my data with the following code:
from pyspark.ml.evaluation import BinaryClassificationEvaluator, MulticlassClassificationEvaluator
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator
lr = LogisticRegression(labelCol="label", featuresCol="features")
# Parameter grid: 2 x 2 x 2 = 8 combinations
paramGrid = (ParamGridBuilder()
             .addGrid(lr.regParam, [0.01, 2.0])
             .addGrid(lr.elasticNetParam, [0.0, 1.0])
             .addGrid(lr.maxIter, [5, 10])
             .build())
# Create 2-fold CrossValidator
cv = CrossValidator(estimator=lr, estimatorParamMaps=paramGrid, evaluator=MulticlassClassificationEvaluator(metricName="f1"), numFolds=2)
# Run cross validations
cvModel = cv.fit(age_training_data)
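With numFolds=2, cv.fit trains the 8 grid combinations on each of the 2 folds (16 runs), plus a final refit of the best model. In case the input shape matters, age_training_data is built roughly like this (a minimal sketch with toy placeholder columns "f1"/"f2"; the real feature columns are omitted here):

from pyspark.sql import SparkSession
from pyspark.ml.feature import VectorAssembler

spark = SparkSession.builder.getOrCreate()
# Toy stand-in for the real data; "f1"/"f2" are hypothetical feature columns.
raw_df = spark.createDataFrame([(1.0, 0.5, 1.2), (0.0, 0.1, 0.4)], ["label", "f1", "f2"])
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
age_training_data = assembler.transform(raw_df).select("label", "features").cache()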
My spark-defaults.conf is:
spark.driver.extraLibraryPath /usr/lib/hadoop-current/lib/native
spark.executor.extraLibraryPath /usr/lib/hadoop-current/lib/native
spark.driver.extraJavaOptions -Dlog4j.ignoreTCL=true
spark.executor.extraJavaOptions -Dlog4j.ignoreTCL=true
spark.hadoop.yarn.timeline-service.enabled false
spark.driver.memory 10g
spark.yarn.driver.memoryOverhead 5g
spark.driver.cores 3
spark.executor.memory 10g
spark.yarn.executor.memoryOverhead 2048m
spark.executor.instances 4
spark.executor.cores 2
spark.default.parallelism 48
spark.yarn.max.executor.failures 32
spark.network.timeout 10000000s
spark.rpc.askTimeout 10000000s
spark.executor.heartbeatInterval 10000000s
spark.yarn.historyServer.address emr-header-1.cluster-60683:18080
spark.ui.view.acls *
#spark.serializer org.apache.spark.serializer.KryoSerializer
spark.executor.extraJavaOptions -XX:+UseG1GC
#spark.kryoserializer.buffer.max 128m
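For what it's worth, my back-of-the-envelope memory arithmetic for this config (assuming YARN can hand out most of each worker's 16 GB; I have not checked the exact yarn.nodemanager.resource.memory-mb that EMR sets):

# Per-executor container requested from YARN by the settings above.
executor_container_gb = 10 + 2048 / 1024   # spark.executor.memory + spark.yarn.executor.memoryOverhead
print(executor_container_gb)               # 12.0 GB, one executor per 16 GB worker

# Driver-side budget (the shell runs in client mode, so this sits on the master).
driver_gb = 10 + 5                         # spark.driver.memory + spark.yarn.driver.memoryOverhead
print(driver_gb)                           # 15 GB on the 16 GB master

So on paper each executor container fits on a worker, which is part of why the failures puzzle me.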
Many executors are killed over the course of the run. Here is the stderr from one of the failed containers:
#
# A fatal error has been detected by the Java Runtime Environment:
#
# SIGSEGV (0xb) at pc=0x00007fe1926cb033, pid=21382, tid=0x00007fe1908db700
#
# JRE version: OpenJDK Runtime Environment (8.0_151-b12) (build 1.8.0_151-b12)
# Java VM: OpenJDK 64-Bit Server VM (25.151-b12 mixed mode linux-amd64 compressed oops)
# Problematic frame:
# V [libjvm.so+0x5aa033]
#
# Failed to write core dump. Core dumps have been disabled. To enable core dumping, try "ulimit -c unlimited" before starting Java again
#
# An error report file with more information is saved as:
# /mnt/disk3/yarn/usercache/hadoop/appcache/application_1521314237048_0001/container_1521314237048_0001_01_000005/hs_err_pid21382.log
[thread 140606771681024 also had an error]
[thread 140606768523008 also had an error]
#
# If you would like to submit a bug report, please visit:
# http://bugreport.java.com/bugreport/crash.jsp
#
The message shown in the pyspark shell while the code is running is:
WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: Container marked as failed: container_1521314237048_0001_01_000005 on host: emr-worker-1.cluster-60683. Exit status: 134. Diagnostics: Exception from container-launch.
Container id: container_1521314237048_0001_01_000005
Exit code: 134
Exception message: /bin/bash: line 1: 21382 Aborted LD_LIBRARY_PATH=/usr/lib/hadoop-current/lib/native::/usr/lib/hadoop-current/lib/native::/opt/apps/ecm/service/hadoop/2.7.2-1.2.11/package/hadoop-2.7.2-1.2.11/lib/native:/usr/lib/hadoop-current/lib/native::/opt/apps/ecm/service/hadoop/2.7.2-1.2.11/package/hadoop-2.7.2-1.2.11/lib/native:/opt/apps/ecm/service/hadoop/2.7.2-1.2.11/package/hadoop-2.7.2-1.2.11/lib/native /usr/lib/jvm/java/bin/java -server -Xmx10240m '-XX:+UseG1GC' -Djava.io.tmpdir=/mnt/disk3/yarn/usercache/hadoop/appcache/application_1521314237048_0001/container_1521314237048_0001_01_000005/tmp '-Dspark.driver.port=33390' '-Dspark.rpc.askTimeout=10000000s' -Dspark.yarn.app.container.log.dir=/mnt/disk3/log/hadoop-yarn/containers/application_1521314237048_0001/container_1521314237048_0001_01_000005 -XX:OnOutOfMemoryError='kill %p' org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@172.16.9.204:33390 --executor-id 4 --hostname emr-worker-1.cluster-60683 --cores 2 --app-id application_1521314237048_0001 --user-class-path file:/mnt/disk3/yarn/usercache/hadoop/appcache/application_1521314237048_0001/container_1521314237048_0001_01_000005/__app__.jar > /mnt/disk3/log/hadoop-yarn/containers/application_1521314237048_0001/container_1521314237048_0001_01_000005/stdout 2> /mnt/disk3/log/hadoop-yarn/containers/application_1521314237048_0001/container_1521314237048_0001_01_000005/stderr
Stack trace: ExitCodeException exitCode=134: /bin/bash: line 1: 21382 Aborted LD_LIBRARY_PATH=/usr/lib/hadoop-current/lib/native::/usr/lib/hadoop-current/lib/native::/opt/apps/ecm/service/hadoop/2.7.2-1.2.11/package/hadoop-2.7.2-1.2.11/lib/native:/usr/lib/hadoop-current/lib/native::/opt/apps/ecm/service/hadoop/2.7.2-1.2.11/package/hadoop-2.7.2-1.2.11/lib/native:/opt/apps/ecm/service/hadoop/2.7.2-1.2.11/package/hadoop-2.7.2-1.2.11/lib/native /usr/lib/jvm/java/bin/java -server -Xmx10240m '-XX:+UseG1GC' -Djava.io.tmpdir=/mnt/disk3/yarn/usercache/hadoop/appcache/application_1521314237048_0001/container_1521314237048_0001_01_000005/tmp '-Dspark.driver.port=33390' '-Dspark.rpc.askTimeout=10000000s' -Dspark.yarn.app.container.log.dir=/mnt/disk3/log/hadoop-yarn/containers/application_1521314237048_0001/container_1521314237048_0001_01_000005 -XX:OnOutOfMemoryError='kill %p' org.apache.spark.executor.CoarseGrainedExecutorBackend --driver-url spark://CoarseGrainedScheduler@172.16.9.204:33390 --executor-id 4 --hostname emr-worker-1.cluster-60683 --cores 2 --app-id application_1521314237048_0001 --user-class-path file:/mnt/disk3/yarn/usercache/hadoop/appcache/application_1521314237048_0001/container_1521314237048_0001_01_000005/__app__.jar > /mnt/disk3/log/hadoop-yarn/containers/application_1521314237048_0001/container_1521314237048_0001_01_000005/stdout 2> /mnt/disk3/log/hadoop-yarn/containers/application_1521314237048_0001/container_1521314237048_0001_01_000005/stderr
at org.apache.hadoop.util.Shell.runCommand(Shell.java:545)
at org.apache.hadoop.util.Shell.run(Shell.java:456)
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:722)
at org.apache.hadoop.yarn.server.nodemanager.DefaultContainerExecutor.launchContainer(DefaultContainerExecutor.java:212)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:302)
at org.apache.hadoop.yarn.server.nodemanager.containermanager.launcher.ContainerLaunch.call(ContainerLaunch.java:82)
at java.util.concurrent.FutureTask.run(FutureTask.java:266)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Container exited with a non-zero exit code 134
I would really appreciate it if someone could tell me why the executors keep failing throughout training. (As far as I can tell, exit code 134 is 128 + 6, i.e. SIGABRT, which matches the JVM fatal error above, but I don't know what triggers it.) Thanks a lot!