I am using Spark deployed on an Alicloud EMR cluster with 1 master node (4 cores, 16 GB RAM) and 4 worker nodes (4 cores, 16 GB RAM each). The application runs in yarn-client mode, because I intend to run it from the Spark shell.
Below is the relevant part of my application code:
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

lr = LogisticRegression(labelCol="label", featuresCol="features")
evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction")

paramGrid = (ParamGridBuilder()
             .addGrid(lr.regParam, [0.01, 2.0])
             .addGrid(lr.elasticNetParam, [0.0, 1.0])
             .addGrid(lr.maxIter, [10, 20])
             .build())

# Create 2-fold CrossValidator
cv = CrossValidator(estimator=lr, estimatorParamMaps=paramGrid,
                    evaluator=evaluator, numFolds=2)

# Run cross-validation
cvModel = cv.fit(gender_subset_train)
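For context on the size of this job: the grid above has 2 × 2 × 2 = 8 parameter combinations, so with numFolds=2 the cross-validator trains 16 models in total (plus one final refit of the best model on the full training set):

```python
# Number of models CrossValidator trains for the grid above
grid_sizes = [2, 2, 2]   # candidates for regParam, elasticNetParam, maxIter
num_folds = 2

combinations = 1
for n in grid_sizes:
    combinations *= n    # 8 parameter combinations

total_fits = combinations * num_folds
print(total_fits)        # 16
```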
My spark-conf.conf is:
spark.history.fs.cleaner.enabled true
spark.eventLog.enabled true
spark.eventLog.dir hdfs://emr-header-1.cluster-61487:9000/spark-history
spark.driver.extraLibraryPath /usr/lib/hadoop-current/lib/native
spark.executor.extraLibraryPath /usr/lib/hadoop-current/lib/native
spark.driver.extraJavaOptions -Dlog4j.ignoreTCL=true
spark.executor.extraJavaOptions -Dlog4j.ignoreTCL=true
spark.hadoop.yarn.timeline-service.enabled false
spark.driver.memory 5g
spark.yarn.driver.memoryOverhead 10g
spark.driver.cores 3
spark.executor.memory 9g
spark.yarn.executor.memoryOverhead 2048m
spark.executor.instances 4
spark.executor.cores 2
spark.default.parallelism 48
spark.yarn.max.executor.failures 32
spark.network.timeout 100000000s
spark.rpc.askTimeout 10000000s
spark.executor.heartbeatInterval 100000000s
spark.yarn.historyServer.address emr-header-1.cluster-61487:18080
spark.ui.view.acls *
#spark.serializer org.apache.spark.serializer.KryoSerializer
spark.executor.extraJavaOptions -XX:+UseG1GC
#spark.kryoserializer.buffer.max 128m
spark.executor.extraJavaOptions -XX:ErrorFile=/tmp/hs_err_pid.log
#spark.local.dir /mnt/disk1, /mnt/disk2, /mnt/disk3, /mnt/disk4
spark.driver.maxResultSize 3g
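To make the memory pressure concrete, here is a quick back-of-the-envelope check of the settings above against the node sizes (16 GB per node, from the cluster description; I am assuming that in yarn-client mode the driver lives on the master node where the shell runs):

```python
GB = 1024  # work in MB, to compare easily with free -m

node_ram = 16 * GB           # 16384 MB per node

# Driver side (values taken from spark-conf.conf above)
driver_memory   = 5 * GB     # spark.driver.memory 5g
driver_overhead = 10 * GB    # spark.yarn.driver.memoryOverhead 10g
driver_total = driver_memory + driver_overhead

# One executor (4 instances spread over 4 workers)
executor_memory   = 9 * GB   # spark.executor.memory 9g
executor_overhead = 2048     # spark.yarn.executor.memoryOverhead 2048m
executor_total = executor_memory + executor_overhead

print(driver_total, node_ram)            # 15360 vs 16384
print(node_ram - driver_total)           # only 1024 MB left for OS + daemons
print(executor_total)                    # 11264 MB per executor
```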
When I run the code above, the ML training part appears to start, but it fails partway through with the following error message:
cvModel = cv.fit(gender_subset_train)
18/04/10 20:35:58 WARN netlib.BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
18/04/10 20:35:58 WARN netlib.BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x0000000712900000, 553648128, 0) failed; error='Cannot allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 553648128 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /tmp/hsperfdata_hadoop/hs_err_pid21011.log
ERROR:root:Exception while sending command.
Traceback (most recent call last):
File "/usr/lib/spark-current/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 883, in send_command
response = connection.send_command(command)
File "/usr/lib/spark-current/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1040, in send_command
"Error while receiving", e, proto.ERROR_ON_RECEIVE)
Py4JNetworkError: Error while receiving
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "/usr/lib/spark-current/python/pyspark/ml/base.py", line 64, in fit
return self._fit(dataset)
File "/usr/lib/spark-current/python/pyspark/ml/tuning.py", line 238, in _fit
metric = eva.evaluate(model.transform(validation, epm[j]))
File "/usr/lib/spark-current/python/pyspark/ml/evaluation.py", line 69, in evaluate
return self._evaluate(dataset)
File "/usr/lib/spark-current/python/pyspark/ml/evaluation.py", line 99, in _evaluate
return self._java_obj.evaluate(dataset._jdf)
File "/usr/lib/spark-current/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/usr/lib/spark-current/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/usr/lib/spark-current/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 327, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling o67.evaluate
The output of free -m is:
total used free shared buff/cache available
Mem: 15886 10256 3723 227 1906 4799
Swap: 2047 2036 11
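For scale, the allocation the JVM failed to commit (553648128 bytes in the os::commit_memory warning above) works out to roughly half a gigabyte, while swap is almost exhausted:

```python
# Size of the failed native allocation, from the JVM warning above
failed_bytes = 553648128
failed_mb = failed_bytes / 1024**2
print(failed_mb)  # 528.0 MB
```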
Please also see the full JVM error log /tmp/hsperfdata_hadoop/hs_err_pid21011.log here: https://textsaver.flap.tv/lists/1wse
Is this caused by insufficient driver memory, and should I increase the instance RAM? Any help would be greatly appreciated, as I have been struggling with this for a long time. Thank you very much!