I am using Spark deployed on an Alicloud EMR cluster with 1 master node (4 cores, 16 GB RAM) and 4 worker nodes (4 cores, 16 GB RAM each). The application runs in yarn-client mode, because I intend to run it from the Spark shell.
Below is the relevant part of my application code:
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.classification import GBTClassifier
from pyspark.ml.classification import MultilayerPerceptronClassifier
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

lr = LogisticRegression(labelCol="label", featuresCol="features")
evaluator = BinaryClassificationEvaluator(rawPredictionCol="rawPrediction")

paramGrid = (ParamGridBuilder()
             .addGrid(lr.regParam, [0.01, 2.0])
             .addGrid(lr.elasticNetParam, [0.0, 1.0])
             .addGrid(lr.maxIter, [10, 20])
             .build())

# Create 2-fold CrossValidator
cv = CrossValidator(estimator=lr, estimatorParamMaps=paramGrid,
                    evaluator=evaluator, numFolds=2)

# Run cross-validation
cvModel = cv.fit(gender_subset_train)
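For context on the size of this job: the grid above has 2 × 2 × 2 = 8 parameter combinations, so with numFolds=2 the cross-validator trains 16 models in total (plus one final refit of the best model on the full training set):

```python
# Number of models CrossValidator trains for the grid above
grid_sizes = [2, 2, 2]   # candidates for regParam, elasticNetParam, maxIter
num_folds = 2

combinations = 1
for n in grid_sizes:
    combinations *= n    # 8 parameter combinations

total_fits = combinations * num_folds
print(total_fits)        # 16
```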
My spark-conf.conf is:
spark.history.fs.cleaner.enabled true
spark.eventLog.enabled true
spark.eventLog.dir hdfs://emr-header-1.cluster-61487:9000/spark-history
spark.driver.extraLibraryPath /usr/lib/hadoop-current/lib/native
spark.executor.extraLibraryPath /usr/lib/hadoop-current/lib/native
spark.driver.extraJavaOptions -Dlog4j.ignoreTCL=true
spark.executor.extraJavaOptions -Dlog4j.ignoreTCL=true
spark.hadoop.yarn.timeline-service.enabled false
spark.driver.memory 5g
spark.yarn.driver.memoryOverhead 10g
spark.driver.cores 3
spark.executor.memory 9g
spark.yarn.executor.memoryOverhead 2048m
spark.executor.instances 4
spark.executor.cores 2
spark.default.parallelism 48
spark.yarn.max.executor.failures 32
spark.network.timeout 100000000s
spark.rpc.askTimeout 10000000s
spark.executor.heartbeatInterval 100000000s
spark.yarn.historyServer.address emr-header-1.cluster-61487:18080
spark.ui.view.acls *
#spark.serializer org.apache.spark.serializer.KryoSerializer
spark.executor.extraJavaOptions -XX:+UseG1GC
#spark.kryoserializer.buffer.max 128m
spark.executor.extraJavaOptions -XX:ErrorFile=/tmp/hs_err_pid.log
#spark.local.dir /mnt/disk1, /mnt/disk2, /mnt/disk3, /mnt/disk4
spark.driver.maxResultSize 3g
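To make the memory pressure concrete, here is a quick back-of-the-envelope check of the settings above against the node sizes (16 GB per node, from the cluster description; I am assuming that in yarn-client mode the driver lives on the master node where the shell runs):

```python
GB = 1024  # work in MB, to compare easily with free -m

node_ram = 16 * GB           # 16384 MB per node

# Driver side (values taken from spark-conf.conf above)
driver_memory   = 5 * GB     # spark.driver.memory 5g
driver_overhead = 10 * GB    # spark.yarn.driver.memoryOverhead 10g
driver_total = driver_memory + driver_overhead

# One executor (4 instances spread over 4 workers)
executor_memory   = 9 * GB   # spark.executor.memory 9g
executor_overhead = 2048     # spark.yarn.executor.memoryOverhead 2048m
executor_total = executor_memory + executor_overhead

print(driver_total, node_ram)            # 15360 vs 16384
print(node_ram - driver_total)           # only 1024 MB left for OS + daemons
print(executor_total)                    # 11264 MB per executor
```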
When I run the code above, the ML training part appears to start, but it fails partway through with the following error message:
cvModel = cv.fit(gender_subset_train)
18/04/10 20:35:58 WARN netlib.BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
18/04/10 20:35:58 WARN netlib.BLAS: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
OpenJDK 64-Bit Server VM warning: INFO: os::commit_memory(0x0000000712900000, 553648128, 0) failed; error='Cannot allocate memory' (errno=12)
#
# There is insufficient memory for the Java Runtime Environment to continue.
# Native memory allocation (mmap) failed to map 553648128 bytes for committing reserved memory.
# An error report file with more information is saved as:
# /tmp/hsperfdata_hadoop/hs_err_pid21011.log
ERROR:root:Exception while sending command.
Traceback (most recent call last):
File "/usr/lib/spark-current/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 883, in send_command
response = connection.send_command(command)
File "/usr/lib/spark-current/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1040, in send_command
"Error while receiving", e, proto.ERROR_ON_RECEIVE)
Py4JNetworkError: Error while receiving
Traceback (most recent call last):
File "<stdin>", line 2, in <module>
File "/usr/lib/spark-current/python/pyspark/ml/base.py", line 64, in fit
return self._fit(dataset)
File "/usr/lib/spark-current/python/pyspark/ml/tuning.py", line 238, in _fit
metric = eva.evaluate(model.transform(validation, epm[j]))
File "/usr/lib/spark-current/python/pyspark/ml/evaluation.py", line 69, in evaluate
return self._evaluate(dataset)
File "/usr/lib/spark-current/python/pyspark/ml/evaluation.py", line 99, in _evaluate
return self._java_obj.evaluate(dataset._jdf)
File "/usr/lib/spark-current/python/lib/py4j-0.10.4-src.zip/py4j/java_gateway.py", line 1133, in __call__
File "/usr/lib/spark-current/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/usr/lib/spark-current/python/lib/py4j-0.10.4-src.zip/py4j/protocol.py", line 327, in get_return_value
py4j.protocol.Py4JError: An error occurred while calling o67.evaluate
The output of free -m is:
total used free shared buff/cache available
Mem: 15886 10256 3723 227 1906 4799
Swap: 2047 2036 11
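For scale, the allocation the JVM failed to commit (553648128 bytes in the os::commit_memory warning above) works out to roughly half a gigabyte, while swap is almost exhausted:

```python
# Size of the failed native allocation, from the JVM warning above
failed_bytes = 553648128
failed_mb = failed_bytes / 1024**2
print(failed_mb)  # 528.0 MB
```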
Please also see the full JVM error log /tmp/hsperfdata_hadoop/hs_err_pid21011.log here: https://textsaver.flap.tv/lists/1wse
Is this caused by insufficient driver memory, and should I increase the instance RAM? Any help would be greatly appreciated, as I have been struggling with this for a long time. Thank you very much!