Java out-of-memory error in PySpark

Asked: 2016-08-15 02:35:34

Tags: apache-spark pyspark

My question is simple: when I run RandomForest.trainRegressor in PySpark, the JVM runs out of memory. I am training on roughly 3 GB of data with 77 features, with numTrees set to 15. Whenever numTrees is 15 or higher, training fails with an out-of-memory error (shown below):

Error info:
model = RandomForest.trainRegressor(trainingData, categoricalFeaturesInfo={},
                                    numTrees=15, featureSubsetStrategy="sqrt",
                                    impurity='variance', maxDepth=20, maxBins=14)  # variance
File "/opt/cloudera/parcels/CDH-5.6.0-1.cdh5.6.0.p0.45/lib/spark/python/lib/pyspark.zip/pyspark/mllib/tree.py", line 412, in trainRegressor
File "/opt/cloudera/parcels/CDH-5.6.0-1.cdh5.6.0.p0.45/lib/spark/python/lib/pyspark.zip/pyspark/mllib/tree.py", line 270, in _train
File "/opt/cloudera/parcels/CDH-5.6.0-1.cdh5.6.0.p0.45/lib/spark/python/lib/pyspark.zip/pyspark/mllib/common.py", line 130, in callMLlibFunc
File "/opt/cloudera/parcels/CDH-5.6.0-1.cdh5.6.0.p0.45/lib/spark/python/lib/pyspark.zip/pyspark/mllib/common.py", line 123, in callJavaFunc
File "/opt/cloudera/parcels/CDH-5.6.0-1.cdh5.6.0.p0.45/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__
File "/opt/cloudera/parcels/CDH-5.6.0-1.cdh5.6.0.p0.45/lib/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o81.trainRandomForestModel.
: java.lang.OutOfMemoryError: Java heap space
    at java.util.Arrays.copyOf(Arrays.java:2271)
    at java.io.ByteArrayOutputStream.grow(ByteArrayOutputStream.java:118)
    at java.io.ByteArrayOutputStream.ensureCapacity(ByteArrayOutputStream.java:93)
    at java.io.ByteArrayOutputStream.write(ByteArrayOutputStream.java:153)
    at java.io.ObjectOutputStream$BlockDataOutputStream.drain(ObjectOutputStream.java:1876)
    at java.io.ObjectOutputStream$BlockDataOutputStream.setBlockDataMode(ObjectOutputStream.java:1785)
    at java.io.ObjectOutputStream.writeObject0(ObjectOutputStream.java:1188)
    at java.io.ObjectOutputStream.writeObject(ObjectOutputStream.java:347)
    at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:44)
    at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:84)
    at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301)
    at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
    at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
    at org.apache.spark.SparkContext.clean(SparkContext.scala:2032)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:703)
    at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1.apply(RDD.scala:702)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
    at org.apache.spark.rdd.RDD.mapPartitions(RDD.scala:702)
    at org.apache.spark.mllib.tree.DecisionTree$.findBestSplits(DecisionTree.scala:625)
    at org.apache.spark.mllib.tree.RandomForest.run(RandomForest.scala:235)
    at org.apache.spark.mllib.tree.RandomForest$.trainRegressor(RandomForest.scala:380)
    at org.apache.spark.mllib.api.python.PythonMLLibAPI.trainRandomForestModel(PythonMLLibAPI.scala:744)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:606)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
    at py4j.Gateway.invoke(Gateway.java:259)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)

I am using Spark 1.5, and my spark-submit parameters are:

spark-submit --master yarn-client --conf spark.cassandra.connection.host=x.x.x.x \
    --jars /home/retail/packages/spark_cassandra_tool-1.0.jar,/opt/cloudera/parcels/CDH-5.6.0-1.cdh5.6.0.p0.45/lib/hive/lib/HiveAuthHook.jar,/opt/cloudera/parcels/CDH-5.6.0-1.cdh5.6.0.p0.45/lib/hbase/hbase-common.jar,/opt/cloudera/parcels/CDH-5.6.0-1.cdh5.6.0.p0.45/lib/hbase/hbase-client.jar,/opt/cloudera/parcels/CDH-5.6.0-1.cdh5.6.0.p0.45/lib/hbase/hbase-server.jar,/opt/cloudera/parcels/CDH-5.6.0-1.cdh5.6.0.p0.45/lib/spark/lib/spark-examples.jar \
    --num-executors 30 --executor-cores 4 \
    --executor-memory 20000M --driver-memory 200g \
    --conf spark.yarn.executor.memoryOverhead=5000 \
    --conf spark.kryoserializer.buffer.max=2000m \
    --conf spark.kryoserializer.buffer=40m \
    --conf spark.driver.extraJavaOptions=\"-Xms2048m -Xmx2048m -XX:+DisableExplicitGC -Dcom.sun.management.jmxremote -XX:PermSize=512m -XX:MaxPermSize=2048m -XX:MaxDirectMemorySize=5g\" \
    --conf spark.driver.maxResultSize=10g \
    --conf spark.port.maxRetries=100
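
One thing worth checking in the submit command above (an observation, not a confirmed fix): Spark's configuration documentation says it is illegal to set heap-size flags such as `-Xms`/`-Xmx` through `spark.driver.extraJavaOptions`; the driver heap should be controlled only via `--driver-memory` (i.e. `spark.driver.memory`). As written, `-Xmx2048m` in `extraJavaOptions` conflicts with `--driver-memory 200g`. A cleaned-up sketch, with illustrative (not tuned) sizes and a hypothetical script name:

```shell
# Sketch only: sizes are illustrative assumptions, not tuned values.
# -Xms/-Xmx are removed from extraJavaOptions; the driver heap is governed
# solely by --driver-memory, as the Spark configuration docs require.
spark-submit --master yarn-client \
    --num-executors 30 --executor-cores 4 \
    --executor-memory 20g --driver-memory 20g \
    --conf spark.yarn.executor.memoryOverhead=5000 \
    --conf "spark.driver.extraJavaOptions=-XX:+DisableExplicitGC -XX:MaxPermSize=2048m" \
    --conf spark.driver.maxResultSize=10g \
    my_training_job.py  # hypothetical script name
```

The non-heap JVM flags (GC and PermGen settings) can stay in `extraJavaOptions`; only the heap flags move out.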
  1. As I understand it, the trees should be trained sequentially, so why does a larger number of trees produce a higher memory load?

  2. If I want to train 300 trees successfully, how should I set the Spark parameters so that RandomForest.trainRegressor works correctly?
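
On question 1, a hedged note: in MLlib's implementation the trees are not trained strictly one after another. RandomForest trains groups of nodes drawn from all trees in each iteration, and the stack trace above fails inside ClosureCleaner while serializing a task closure, so the partially built forest itself is part of what must fit in memory. A back-of-envelope sketch of why depth and tree count multiply the node count (plain binary-tree arithmetic, not Spark's actual accounting):

```python
# Back-of-envelope sketch (assumption: binary trees, fully grown to maxDepth).
# A binary tree of depth d has at most 2**(d + 1) - 1 nodes.

def max_nodes(max_depth):
    """Upper bound on nodes in one fully grown binary tree."""
    return 2 ** (max_depth + 1) - 1

def forest_max_nodes(num_trees, max_depth):
    """Upper bound on nodes across the whole forest."""
    return num_trees * max_nodes(max_depth)

print(max_nodes(20))             # 2_097_151 nodes per tree at maxDepth=20
print(forest_max_nodes(15, 20))  # 31_457_265 nodes across 15 trees
print(forest_max_nodes(15, 10))  # 30_705 nodes at maxDepth=10
```

With maxDepth=20 the bound is dominated by depth, so before pushing numTrees toward 300 it may be more effective to lower maxDepth than to keep raising executor memory alone; the Scala training `Strategy` also exposes a `maxMemoryInMB` knob, though the Python `trainRegressor` wrapper in 1.5 does not appear to surface it.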

0 Answers:

No answers yet