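Context:
We are training a GBTRegressor on Spark 1.6.3 with Scala, on a dataset of roughly 2,000 features and about 1 million rows.
We use 10 EC2 instances with 32 cores each, and end up with 30 executors with 10 cores each.
Hyperparameters / parameters: depth 3, iterations 100, StepSize 0.1, CacheNodeIds true, CheckpointInterval 10
Partition size is 5000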
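Problem:
About 50% of the time we hit a stack overflow error somewhere around stage 1000; the rest of the time the trainer completes fine. Our GBTRegressor is the first PipelineStage in a Pipeline.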
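For reference, the setup described above can be sketched roughly as follows. This is a minimal illustration, not our exact code: the column names "label" and "features" are assumptions, and the snippet is meant to be pasted into a spark-shell for Spark 1.6.x.

```scala
import org.apache.spark.ml.{Pipeline, PipelineStage}
import org.apache.spark.ml.regression.GBTRegressor

// Hyperparameters as listed in the context section.
val gbt = new GBTRegressor()
  .setLabelCol("label")        // assumed column name
  .setFeaturesCol("features")  // assumed column name
  .setMaxDepth(3)
  .setMaxIter(100)
  .setStepSize(0.1)
  .setCacheNodeIds(true)
  .setCheckpointInterval(10)

// The GBTRegressor is the first stage of the Pipeline.
val pipeline = new Pipeline().setStages(Array[PipelineStage](gbt))

// Per the Spark docs, checkpointInterval only applies when cacheNodeIds
// is true AND a checkpoint directory has been set on the SparkContext
// (for example via sc.setCheckpointDir with some path of your choosing).
```

Calling `pipeline.fit(trainingDF)` on a DataFrame with those two columns (here `trainingDF` is a hypothetical name) is the step during which the error appears.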
Sample exceptions (the error occurs at different depths within TreeNode, though):
[Stage 1014:=======================> (2514 + 2486) / 5000]
[Stage 1014:=============================> (3088 + 1912) / 5000]
[Stage 1014:==================================> (3666 + 1334) / 5000]
[Stage 1014:========================================> (4242 + 758) / 5000]
[Stage 1014:==============================================> (4819 + 181) / 5000]
Exception in thread "main" java.lang.StackOverflowError
at org.apache.spark.sql.catalyst.trees.TreeNode.containsChild$lzycompute(TreeNode.scala:86)
at org.apache.spark.sql.catalyst.trees.TreeNode.containsChild(TreeNode.scala:86)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$4.apply(TreeNode.scala:280)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
at scala.collection.Iterator$class.foreach(Iterator.scala:742)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1194)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:59)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:104)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:48)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:308)
at scala.collection.AbstractIterator.to(Iterator.scala:1194)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:300)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1194)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:287)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1194)
at org.apache.spark.sql.catalyst.trees.TreeNode.transformChildren(TreeNode.scala:321)
Another:
[Stage 1044:=========================> (2692 + 2308) / 5000]
[Stage 1044:==============================> (3275 + 1725) / 5000]
[Stage 1044:====================================> (3851 + 1149) / 5000]
[Stage 1044:==========================================> (4430 + 570) / 5000]
[Stage 1044:=================================================>(4996 + 4) / 5000]
Exception in thread "main" java.lang.StackOverflowError
at java.lang.ClassLoader.defineClass1(Native Method)
at java.lang.ClassLoader.defineClass(ClassLoader.java:760)
at java.security.SecureClassLoader.defineClass(SecureClassLoader.java:142)
at java.net.URLClassLoader.defineClass(URLClassLoader.java:467)
at java.net.URLClassLoader.access$100(URLClassLoader.java:73)
at java.net.URLClassLoader$1.run(URLClassLoader.java:368)
at java.net.URLClassLoader$1.run(URLClassLoader.java:362)
at java.security.AccessController.doPrivileged(Native Method)
at java.net.URLClassLoader.findClass(URLClassLoader.java:361)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:86)
at org.apache.spark.sql.catalyst.expressions.BindReferences$$anonfun$bindReference$1.applyOrElse(BoundAttribute.scala:85)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:259)
at org.apache.spark.sql.catalyst.trees.TreeNode$$anonfun$3.apply(TreeNode.scala:259)
We submit with the following (we also tried raising the thread stack size with -Xss, to no avail):
spark-submit --conf "spark.executor.extraJavaOptions=-XX:+UseConcMarkSweepGC -XX:+CMSIncrementalMode"