Apache Spark accuracy varies on every run, and sometimes a runtime exception is thrown

Asked: 2017-04-27 15:21:00

Tags: scala apache-spark apache-spark-mllib random-forest logistic-regression

Here is the Github link, in case anyone wants to reproduce it. For some reason, the program returns different values every time I run it. It is actually a Github project I wanted to look into. Part of the dataset is missing, but I managed to run it without that part. The problem is that it sometimes runs fine, yet returns a different accuracy each time. Sometimes I get an assertion exception or an unsupported-operation exception. Does anyone know why this happens?

I am running bagging with logistic regression and with random forest on separate Spark MLlib pipelines. It runs fine, but returns a different accuracy and confusion matrix on every run. Sometimes it throws an exception; the stack trace is given below.

    Dataset size: 2
    B- Sample size: 5                                                                                     
    17/04/28 13:19:43 INFO LBFGS: Step Size: 0.7559                                                       
    17/04/28 13:19:43 INFO LBFGS: Val and Grad Norm: 0.143776 (rel: 0.793) 0.160762
    17/04/28 13:19:43 INFO LBFGS: Step Size: 1.000
    17/04/28 13:19:43 INFO LBFGS: Val and Grad Norm: 0.127285 (rel: 0.115) 0.0815899
    17/04/28 13:19:44 INFO LBFGS: Step Size: 1.000
    17/04/28 13:19:44 INFO LBFGS: Val and Grad Norm: 0.120321 (rel: 0.0547) 0.0207179
    17/04/28 13:19:45 INFO LBFGS: Step Size: 1.000                                                        
    17/04/28 13:19:45 INFO LBFGS: Val and Grad Norm: 0.119759 (rel: 0.00467) 0.00553480
    17/04/28 13:19:46 INFO LBFGS: Step Size: 1.000                                                        
    17/04/28 13:19:46 INFO LBFGS: Val and Grad Norm: 0.119721 (rel: 0.000315) 0.00214368
    17/04/28 13:19:46 INFO LBFGS: Step Size: 1.000
    17/04/28 13:19:46 INFO LBFGS: Val and Grad Norm: 0.119716 (rel: 3.85e-05) 0.000959314
    17/04/28 13:19:47 INFO LBFGS: Step Size: 1.000
    17/04/28 13:19:47 INFO LBFGS: Val and Grad Norm: 0.119715 (rel: 9.22e-06) 0.000185495
    17/04/28 13:19:47 INFO LBFGS: Step Size: 1.000
    17/04/28 13:19:47 INFO LBFGS: Val and Grad Norm: 0.119715 (rel: 4.08e-07) 2.80789e-05
    17/04/28 13:19:48 INFO LBFGS: Step Size: 1.000
    17/04/28 13:19:48 INFO LBFGS: Val and Grad Norm: 0.119715 (rel: 1.01e-08) 1.58237e-06
    Dataset size: 2
    B- Sample size: 0                                                                                     
    Exception in thread "main" java.lang.UnsupportedOperationException: empty collection
        at org.apache.spark.rdd.RDD.first(RDD.scala:1191)
        at org.apache.spark.mllib.regression.GeneralizedLinearAlgorithm.run(GeneralizedLinearAlgorithm.scala:167)
        at org.apache.spark.ml.classification.BaggedLogisticRegression$$anonfun$train$1.apply(BaggedLogisticRegression.scala:123)
        at org.apache.spark.ml.classification.BaggedLogisticRegression$$anonfun$train$1.apply(BaggedLogisticRegression.scala:99)
        at scala.collection.immutable.Range.foreach(Range.scala:141)
        at org.apache.spark.ml.classification.BaggedLogisticRegression.train(BaggedLogisticRegression.scala:99)
        at org.apache.spark.ml.classification.BaggedLogisticRegression.train(BaggedLogisticRegression.scala:63)
        at org.apache.spark.ml.impl.estimator.Predictor.fit(Predictor.scala:102)
        at org.apache.spark.ml.impl.estimator.Predictor.fit(Predictor.scala:82)
        at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:118)
        at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:114)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        at scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.scala:42)
        at scala.collection.SeqViewLike$AbstractTransformed.foreach(SeqViewLike.scala:43)
        at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:114)
        at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:79)
        at org.apache.spark.ml.Estimator$$anonfun$fit$1.apply(Estimator.scala:68)
        at org.apache.spark.ml.Estimator$$anonfun$fit$1.apply(Estimator.scala:68)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
        at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
        at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
        at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
        at org.apache.spark.ml.Estimator.fit(Estimator.scala:68)
        at org.apache.spark.ml.tuning.CrossValidator$$anonfun$fit$1.apply(CrossValidator.scala:110)
        at org.apache.spark.ml.tuning.CrossValidator$$anonfun$fit$1.apply(CrossValidator.scala:105)
        at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
        at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
        at org.apache.spark.ml.tuning.CrossValidator.fit(CrossValidator.scala:105)
        at org.apache.spark.ml.tuning.CrossValidator.fit(CrossValidator.scala:78)
        at org.apache.spark.ml.Estimator.fit(Estimator.scala:44)
        at com.arvind.majorproject.CrossValidation.CrossValidation$.crossValidate(CrossValidation.scala:124)
        at com.arvind.majorproject.main.Main$.main(Main.scala:140)
        at com.arvind.majorproject.main.Main.main(Main.scala)
        at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
        at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.lang.reflect.Method.invoke(Method.java:498)
        at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:569)
        at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:166)
        at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:189)
        at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:110)
        at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
    Run ERROR: Aborting.
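For context on the log above: the line `B- Sample size: 0` followed by `UnsupportedOperationException: empty collection` on `RDD.first` suggests the bagging step drew a bootstrap sample that came out empty, and an unseeded random sampler would also explain the accuracy varying between runs. Sampling with replacement (as `RDD.sample(withReplacement = true, fraction)` does) draws each row a Poisson-distributed number of times, so on a tiny dataset the whole sample can be zero rows. Below is a minimal plain-Scala sketch (no Spark; `bootstrapSample` and `poisson` are hypothetical names I introduce here) of that sampling scheme, with a fixed seed for reproducibility and a guard against empty samples:

```scala
import scala.util.Random

// Poisson draw via Knuth's algorithm; fine for small lambda.
def poisson(lambda: Double, rng: Random): Int = {
  val limit = math.exp(-lambda)
  var k = 0
  var p = rng.nextDouble()
  while (p > limit) { k += 1; p *= rng.nextDouble() }
  k
}

// Sampling with replacement: each row appears k times, k ~ Poisson(fraction).
// On a tiny dataset the resulting sample can easily be empty.
def bootstrapSample[T](data: Seq[T], fraction: Double, rng: Random): Seq[T] =
  data.flatMap(row => Seq.fill(poisson(fraction, rng))(row))

val rng = new Random(42L)     // fixed seed => the runs become reproducible
val data = Seq("a", "b")      // mirrors "Dataset size: 2" from the log
val samples = (1 to 10).map(_ => bootstrapSample(data, 1.0, rng))

// Guard: only train on non-empty bootstrap samples.
val nonEmpty = samples.filter(_.nonEmpty)
println(samples.map(_.size))
```

If the project's `BaggedLogisticRegression` skipped empty samples the same way (or re-drew them), the `empty collection` crash would go away; fixing the sampler's seed would make the accuracy and confusion matrix stable across runs.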

0 Answers:

No answers yet.