Debugging an ArrayIndexOutOfBoundsException in PySpark mllib

Date: 2015-10-27 13:57:34

Tags: python apache-spark pyspark apache-spark-mllib

I'm trying to get started with mllib in PySpark, and after building my dataset I'm trying to run a basic logistic regression.

>>> train.take(4)

[LabeledPoint(0.0, (4,[485,909,1715,2023],[1.0,1.0,1.0,1.0])), LabeledPoint(0.0, (8,[39,147,344,1040,1489,1561,1776,1784],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])), LabeledPoint(0.0, (4,[485,994,1489,1715],[1.0,1.0,1.0,1.0])), LabeledPoint(1.0, (16,[154,162,165,344,455,500,594,706,774,803,819,988,1177,1438,1573,2023],[1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0]))]   
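
For reference, the feature vectors above are SparseVectors, printed as (size, [indices], [values]); the first number is the declared vector size. The construction code isn't shown in the question, so the following is only an illustrative sketch of how a point with the structure printed above could be built:

    from pyspark.mllib.linalg import SparseVector
    from pyspark.mllib.regression import LabeledPoint

    # SparseVector(size, indices, values): the first argument is the declared
    # dimension of the whole vector. The first point above declares size 4,
    # while its indices run up to 2023.
    p = LabeledPoint(0.0, SparseVector(4, [485, 909, 1715, 2023], [1.0, 1.0, 1.0, 1.0]))
    # (Recent Spark versions may reject out-of-range indices at construction
    # time; the Spark 1.4-era setup shown in this question evidently did not.)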

When I try the following:

>>> model = LogisticRegressionWithSGD.train(data=train, iterations=10)

I get this output:

<a bunch of error-free INFO lines>

15/10/27 09:40:02 INFO TaskSchedulerImpl: Adding task set 174.0 with 2 tasks
15/10/27 09:40:02 INFO TaskSetManager: Starting task 0.0 in stage 174.0 (TID 131, localhost, PROCESS_LOCAL, 1274 bytes)                              
15/10/27 09:40:02 INFO TaskSetManager: Starting task 1.0 in stage 174.0 (TID 132, localhost, PROCESS_LOCAL, 1274 bytes)                              
15/10/27 09:40:02 INFO Executor: Running task 0.0 in stage 174.0 (TID 131)                                                                           
15/10/27 09:40:02 INFO Executor: Running task 1.0 in stage 174.0 (TID 132)                                                                           
15/10/27 09:40:02 INFO BlockManager: Found block rdd_149_1 locally                                                                                   
15/10/27 09:40:02 INFO BlockManager: Found block rdd_149_0 locally                                                                                   
15/10/27 09:40:02 ERROR Executor: Exception in task 1.0 in stage 174.0 (TID 132)                                                                     
java.lang.IllegalArgumentException: requirement failed                                                                                               
        at scala.Predef$.require(Predef.scala:221)                                                                                                   
        at org.apache.spark.mllib.optimization.LogisticGradient.compute(Gradient.scala:163)                                                          
        at org.apache.spark.mllib.optimization.GradientDescent$$anonfun$runMiniBatchSGD$1$$anonfun$1.apply(GradientDescent.scala:192)                
        at org.apache.spark.mllib.optimization.GradientDescent$$anonfun$runMiniBatchSGD$1$$anonfun$1.apply(GradientDescent.scala:190)                
        at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)                                                     
        at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)                                                     
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)                                                                               
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)                                                                            
        at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)                                                                
        at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157)                                                                           
        at scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201)                                                               
        at scala.collection.AbstractIterator.aggregate(Iterator.scala:1157)                                                                          
        at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$23.apply(RDD.scala:1075)                                        
        at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$23.apply(RDD.scala:1075)                                        
        at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$24.apply(RDD.scala:1076)                                        
        at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$24.apply(RDD.scala:1076)                                        
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)                                   
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)                                   
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)                                                   
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)                                                            
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)                                                                           
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)                                                         
        at org.apache.spark.scheduler.Task.run(Task.scala:70)                                                                         
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)                                                      
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)                                            
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)                                            
        at java.lang.Thread.run(Thread.java:745)                                                                                      
15/10/27 09:40:02 ERROR Executor: Exception in task 0.0 in stage 174.0 (TID 131)                                                      
java.lang.ArrayIndexOutOfBoundsException: 485                                                                                         
        at org.apache.spark.mllib.linalg.BLAS$.dot(BLAS.scala:136)                                                                    
        at org.apache.spark.mllib.linalg.BLAS$.dot(BLAS.scala:106)                                                                    
        at org.apache.spark.mllib.optimization.LogisticGradient.compute(Gradient.scala:173)                                           
        at org.apache.spark.mllib.optimization.GradientDescent$$anonfun$runMiniBatchSGD$1$$anonfun$1.apply(GradientDescent.scala:192) 
        at org.apache.spark.mllib.optimization.GradientDescent$$anonfun$runMiniBatchSGD$1$$anonfun$1.apply(GradientDescent.scala:190) 
        at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)                                      
        at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)                                      
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)                                                                
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)                                                             
        at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)                                                 
        at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157)                                                            
        at scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201)                                                
        at scala.collection.AbstractIterator.aggregate(Iterator.scala:1157)                                                           
        at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$23.apply(RDD.scala:1075)                                        
        at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$23.apply(RDD.scala:1075)                                        
        at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$24.apply(RDD.scala:1076)                                        
        at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$24.apply(RDD.scala:1076)                                        
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)                                   
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)                                   
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)                                                                      
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)                                                                                     
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)                                                                   
        at org.apache.spark.scheduler.Task.run(Task.scala:70)                                                                                   
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)                                                                
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)                                                      
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)                                                      
        at java.lang.Thread.run(Thread.java:745)                                                                                                
15/10/27 09:40:02 WARN TaskSetManager: Lost task 1.0 in stage 174.0 (TID 132, localhost): java.lang.IllegalArgumentException: requirement failed
        at scala.Predef$.require(Predef.scala:221)                                                                                              
        at org.apache.spark.mllib.optimization.LogisticGradient.compute(Gradient.scala:163)                                                     
        at org.apache.spark.mllib.optimization.GradientDescent$$anonfun$runMiniBatchSGD$1$$anonfun$1.apply(GradientDescent.scala:192)           
        at org.apache.spark.mllib.optimization.GradientDescent$$anonfun$runMiniBatchSGD$1$$anonfun$1.apply(GradientDescent.scala:190)           
        at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)                                                
        at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)                                                
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)                                                                          
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)                                                                       
        at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)                                                           
        at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157)                                                                      
        at scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201)                                                          
        at scala.collection.AbstractIterator.aggregate(Iterator.scala:1157)                                                                     
        at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$23.apply(RDD.scala:1075)                                                  
        at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$23.apply(RDD.scala:1075)                                                  
        at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$24.apply(RDD.scala:1076)                                                  
        at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$24.apply(RDD.scala:1076)                                                  
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)                                             
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)                                             
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)                                                             
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)                                                                      
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)                                                                                     
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)                                                                   
        at org.apache.spark.scheduler.Task.run(Task.scala:70)                                                                                   
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)                                                                
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)                                                      
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)                                                      
        at java.lang.Thread.run(Thread.java:745)                                                                                           

15/10/27 09:40:02 ERROR TaskSetManager: Task 1 in stage 174.0 failed 1 times; aborting job                                                 
15/10/27 09:40:02 INFO TaskSchedulerImpl: Cancelling stage 174                                                                             
15/10/27 09:40:02 INFO TaskSchedulerImpl: Removed TaskSet 174.0, whose tasks have all completed, from pool                                 
15/10/27 09:40:02 INFO TaskSchedulerImpl: Stage 174 was cancelled                                                                          
15/10/27 09:40:02 INFO DAGScheduler: ResultStage 174 (treeAggregate at GradientDescent.scala:189) failed in 0.010 s                        
15/10/27 09:40:02 INFO DAGScheduler: Job 87 failed: treeAggregate at GradientDescent.scala:189, took 0.018160 s                            
15/10/27 09:40:02 INFO MapPartitionsRDD: Removing RDD 149 from persistence list                                                            
15/10/27 09:40:02 WARN TaskSetManager: Lost task 0.0 in stage 174.0 (TID 131, localhost): java.lang.ArrayIndexOutOfBoundsException: 485    
        at org.apache.spark.mllib.linalg.BLAS$.dot(BLAS.scala:136)                                                                         
        at org.apache.spark.mllib.linalg.BLAS$.dot(BLAS.scala:106)                                                                         
        at org.apache.spark.mllib.optimization.LogisticGradient.compute(Gradient.scala:173)                                                
        at org.apache.spark.mllib.optimization.GradientDescent$$anonfun$runMiniBatchSGD$1$$anonfun$1.apply(GradientDescent.scala:192)      
        at org.apache.spark.mllib.optimization.GradientDescent$$anonfun$runMiniBatchSGD$1$$anonfun$1.apply(GradientDescent.scala:190)      
        at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)                                           
        at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)                                           
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)                                                                     
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)                                                                  
        at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)                                                      
        at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157)                                                                 
        at scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201)                                                     
        at scala.collection.AbstractIterator.aggregate(Iterator.scala:1157)                                                                
        at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$23.apply(RDD.scala:1075)                                             
        at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$23.apply(RDD.scala:1075)                                             
        at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$24.apply(RDD.scala:1076)                                             
        at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$24.apply(RDD.scala:1076)                                             
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)                                        
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)                                        
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)                                                        
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)                                                                 
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)                                                                                
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)                                                              
        at org.apache.spark.scheduler.Task.run(Task.scala:70)                                                                              
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)                                                           
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)                                                 
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
        at java.lang.Thread.run(Thread.java:745)                                                                                                     

15/10/27 09:40:02 INFO TaskSchedulerImpl: Removed TaskSet 174.0, whose tasks have all completed, from pool                                           
15/10/27 09:40:02 INFO BlockManager: Removing RDD 149                                                                                                
Traceback (most recent call last):                                                                                                                   
  File "<stdin>", line 1, in <module>                                                                                                                
  File "/usr/bin/spark/python/pyspark/mllib/classification.py", line 259, in train                                                                   
    return _regression_train_wrapper(train, LogisticRegressionModel, data, initialWeights)                                                           
  File "/usr/bin/spark/python/pyspark/mllib/regression.py", line 182, in _regression_train_wrapper                                                   
    data, _convert_to_vector(initial_weights))                                                                                                       
  File "/usr/bin/spark/python/pyspark/mllib/classification.py", line 257, in train                                                                   
    bool(intercept), bool(validateData))                                                                                                             
  File "/usr/bin/spark/python/pyspark/mllib/common.py", line 128, in callMLlibFunc                                                                   
    return callJavaFunc(sc, api, *args)                                                                                                              
  File "/usr/bin/spark/python/pyspark/mllib/common.py", line 121, in callJavaFunc                                                                    
    return _java2py(sc, func(*args))                                                                                                                 
  File "/usr/bin/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py", line 538, in __call__                                                  
  File "/usr/bin/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py", line 300, in get_return_value                                              
py4j.protocol.Py4JJavaError: An error occurred while calling o779.trainLogisticRegressionModelWithSGD.                                               
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 174.0 failed 1 times, most recent failure: Lost task 1.0 in stage 174.0 (TID 132, localhost): java.lang.IllegalArgumentException: requirement failed
        at scala.Predef$.require(Predef.scala:221)                                                                                                   
        at org.apache.spark.mllib.optimization.LogisticGradient.compute(Gradient.scala:163)                                                          
        at org.apache.spark.mllib.optimization.GradientDescent$$anonfun$runMiniBatchSGD$1$$anonfun$1.apply(GradientDescent.scala:192)                
        at org.apache.spark.mllib.optimization.GradientDescent$$anonfun$runMiniBatchSGD$1$$anonfun$1.apply(GradientDescent.scala:190)                
        at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)                                                     
        at scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)                                                     
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)                                                                               
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)                                                                            
        at scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)                                                                
        at scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157)                                                                           
        at scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201)                                                               
        at scala.collection.AbstractIterator.aggregate(Iterator.scala:1157)                                                                          
        at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$23.apply(RDD.scala:1075)                                                       
        at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$23.apply(RDD.scala:1075)                                                       
        at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$24.apply(RDD.scala:1076)                                                  
        at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1$$anonfun$24.apply(RDD.scala:1076)                                                  
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)                                             
        at org.apache.spark.rdd.RDD$$anonfun$mapPartitions$1$$anonfun$apply$17.apply(RDD.scala:686)                                             
        at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)                                                             
        at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)                                                                      
        at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)                                                                                     
        at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:63)                                                                   
        at org.apache.spark.scheduler.Task.run(Task.scala:70)                                                                                   
        at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:213)                                                                
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)                                                      
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)                                                      
        at java.lang.Thread.run(Thread.java:745)                                                                                                

Driver stacktrace:                                                                                                                              
        at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1266)
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1257)                                         
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1256)                                         
        at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)                                                       
        at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)                                                                   
        at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1256)                                                          
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)                                 
        at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:730)                                 
        at scala.Option.foreach(Option.scala:236)                                                                                               
        at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:730)                                                  
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1450)                                           
        at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1411)                                           
        at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)

It looks like one of my LabeledPoints is malformed, but with indices going above 2000 I don't see how 485 could be out of bounds. How do I go about debugging this?
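
One way to hunt for a malformed point is to compare each SparseVector's declared size against its largest non-zero index; the following is a minimal sketch, assuming train is the RDD of LabeledPoints shown above:

    # Flag any LabeledPoint whose declared vector size is not larger than
    # its largest non-zero index.
    def looks_malformed(lp):
        v = lp.features  # a SparseVector
        return len(v.indices) > 0 and max(v.indices) >= v.size

    suspects = train.filter(looks_malformed)
    print(suspects.count())
    print(suspects.take(5))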

0 Answers:

No answers yet.