Spark driver memory issue on an ML pipeline

时间:2018-06-07 14:01:35

标签: apache-spark pyspark apache-spark-mllib

I am running a LogisticRegression pipeline, and at this line:

model = pipeline.fit(train_data)

I repeatedly get the following error during the RDDLossFunction stage:

  

文件" /usr/spark-2.3.0/python/lib/pyspark.zip/pyspark/ml/base.py" ;,第132行,适合     文件" /usr/spark-2.3.0/python/lib/pyspark.zip/pyspark/ml/pipeline.py" ;,第109行,在_fit     文件" /usr/spark-2.3.0/python/lib/pyspark.zip/pyspark/ml/base.py" ;,第132行,适合     文件" /usr/spark-2.3.0/python/lib/pyspark.zip/pyspark/ml/wrapper.py" ;,第288行,在_fit     文件" /usr/spark-2.3.0/python/lib/pyspark.zip/pyspark/ml/wrapper.py",第285行,在_fit_java中     文件" /usr/spark-2.3.0/python/lib/py4j-0.10.6-src.zip/py4j/java_gateway.py" ;,第1160行,致电     文件" /usr/spark-2.3.0/python/lib/pyspark.zip/pyspark/sql/utils.py" ;,第63行,装饰     文件" /usr/spark-2.3.0/python/lib/py4j-0.10.6-src.zip/py4j/protocol.py" ;,第320行,在get_return_value中   py4j.protocol.Py4JJavaError:调用o23199.fit时发生错误。   :org.apache.spark.SparkException:作业因阶段失败而中止:9个任务(3.4 GB)的序列化结果总大小大于spark.driver.maxResultSize(3.0 GB)           在org.apache.spark.scheduler.DAGScheduler.org $ apache $ spark $ scheduler $ DAGScheduler $$ failJobAndIndependentStages(DAGScheduler.scala:1599)           在org.apache.spark.scheduler.DAGScheduler $$ anonfun $ abortStage $ 1.apply(DAGScheduler.scala:1587)           在org.apache.spark.scheduler.DAGScheduler $$ anonfun $ abortStage $ 1.apply(DAGScheduler.scala:1586)           在scala.collection.mutable.ResizableArray $ class.foreach(ResizableArray.scala:59)           在scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)           在org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1586)           在org.apache.spark.scheduler.DAGScheduler $$ anonfun $ handleTaskSetFailed $ 1.apply(DAGScheduler.scala:831)           在org.apache.spark.scheduler.DAGScheduler $$ anonfun $ handleTaskSetFailed $ 1.apply(DAGScheduler.scala:831)           在scala.Option.foreach(Option.scala:257)           在org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:831)           在org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1820)           在org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1769)           在org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1758)           在org.apache.spark.util.EventLoop $$ anon $ 1.run(EventLoop.scala:48)           在org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:642)           在org.apache.spark.SparkContext.runJob(SparkContext.scala:2027)           在org.apache.spark.SparkContext.runJob(SparkContext.scala:2124)           在org.apache.spark.rdd.RDD $$ anonfun $ fold $ 1.apply(RDD.scala:1092)           在org.apache.spark.rdd.RDDOperationScope $ .withScope(RDDOperationScope.scala:151)           在org.apache.spark.rdd.RDDOperationScope $ .withScope(RDDOperationScope.scala:112)           在org.apache.spark.rdd.RDD.withScope(RDD.scala:363)           在org.apache.spark.rdd.RDD.fold(RDD.scala:1086)           在org.apache.spark.rdd.RDD $$ anonfun $ treeAggregate $ 1.apply(RDD.scala:1155)           在org.apache.spark.rdd.RDDOperationScope $ .withScope(RDDOperationScope.scala:151)           在org.apache.spark.rdd.RDDOperationScope $ .withScope(RDDOperationScope.scala:112)           在org.apache.spark.rdd.RDD.withScope(RDD.scala:363)           在org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1131)           在org.apache.spark.ml.optim.loss.RDDLossFunction.calculate(RDDLossFunction.scala:61)           在org.apache.spark.ml.optim.loss.RDDLossFunction.calculate(RDDLossFunction.scala:47)           在breeze.optimize.CachedDiffFunction.calculate(CachedDiffFunction.scala:23)           at 
breeze.optimize.FirstOrderMinimizer.calculateObjective(FirstOrderMinimizer.scala:55)           在breeze.optimize.FirstOrderMinimizer.initialState(FirstOrderMinimizer.scala:48)           在breeze.optimize.FirstOrderMinimizer.iterations(FirstOrderMinimizer.scala:89)           在org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:798)           在org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:488)           在org.apache.spark.ml.classification.LogisticRegression.train(LogisticRegression.scala:278)           在org.apache.spark.ml.Predictor.fit(Predictor.scala:118)           at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)           at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)           at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)           at java.lang.reflect.Method.invoke(Method.java:498)           at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)           在py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)           在py4j.Gateway.invoke(Gateway.java:282)           at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)           在py4j.commands.CallCommand.execute(CallCommand.java:79)           在py4j.GatewayConnection.run(GatewayConnection.java:214)           在java.lang.Thread.run(Thread.java:748)

I have already tried lowering the number of partitions from 2001 to 400, as suggested in http://bourneli.github.io/scala/spark/2016/09/21/spark-driver-maxResultSize-puzzle.html, but got the same error. I also tried increasing spark.driver.maxResultSize to 3g, with no luck either.
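For context, the changes I tried look roughly like this (a minimal sketch; the way the session is actually built in my job, the app name, and the variable names are assumptions):

    from pyspark.sql import SparkSession

    # Sketch of the settings that were tried; values match the description above.
    spark = (
        SparkSession.builder
        .appName("lr-pipeline")                      # illustrative app name
        .config("spark.driver.maxResultSize", "3g")  # raised the limit, still failed
        .getOrCreate()
    )

    # Lowering the number of partitions from 2001 to 400 before fitting;
    # train_data is the training DataFrame passed to pipeline.fit()
    train_data = train_data.repartition(400)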

I have 2 pipelines: one that prepares the data and is run on the whole dataset, and a second one that contains the LogisticRegression and a labelConverter (IndexToString). The second pipeline is the one that fails.
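Roughly, the failing (second) pipeline is assembled like this (a sketch only; the column names, the label list and the exact parameters are placeholders, not my real code):

    from pyspark.ml import Pipeline
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import IndexToString

    # LogisticRegression stage; "features"/"label" column names are placeholders
    lr = LogisticRegression(featuresCol="features", labelCol="label")

    # labelConverter stage: maps numeric predictions back to string labels
    label_converter = IndexToString(
        inputCol="prediction",
        outputCol="predictedLabel",
        labels=["no", "yes"],   # placeholder label list
    )

    pipeline = Pipeline(stages=[lr, label_converter])
    model = pipeline.fit(train_data)   # this is the call that fails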

I am running on a standalone cluster with 3 workers (140 GB combined) and 1 master (15 GB).

1 Answer:

Answer 0 (score: 0)

The error log clearly shows: Total size of serialized results of 9 tasks (3.4 GB) is bigger than spark.driver.maxResultSize (3.0 GB).

Have you tried setting spark.driver.maxResultSize to something larger than 3.4 GB?
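For example, something along these lines (a sketch; 4g is just a value comfortably above the 3.4 GB reported in the log, and you may be building your session differently):

    from pyspark.sql import SparkSession

    # Raise the limit above the 3.4 GB reported in the stack trace.
    # Setting it to "0" removes the limit entirely, at the risk of
    # running the driver out of memory.
    spark = (
        SparkSession.builder
        .config("spark.driver.maxResultSize", "4g")
        .getOrCreate()
    )

The same property can also be passed when submitting the job, e.g. --conf spark.driver.maxResultSize=4g with spark-submit.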