Why can't my Spark job put an RDD into memory?

Date: 2018-08-22 07:24:32

Tags: apache-spark

The problem does not occur when computing a small amount of data. When the job runs on a larger data set, the RDD cannot be put into memory, and subsequent processing fails with an IndexOutOfBoundsException.

This problem has bothered me for several days. Please help me, thank you.

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.feature.{HashingTF, IDF}
    import org.apache.spark.storage.StorageLevel
    import org.ansj.recognition.impl.StopRecognition
    import org.ansj.splitWord.analysis.ToAnalysis

    val conf = new SparkConf().setAppName("tdidf")
      .setMaster("local[2]")
    val sc = new SparkContext(conf)

    val dim = math.pow(2, 20).toInt
    val hashingTF = new HashingTF(dim)

    val lines = sc.textFile("/home/lb/bigtxt1/*").filter(!_.trim.equals(""))

    // Note: this pulls every line to the driver and is never used afterwards
    val pairs = lines.collect()

    // Define the stop-word filter
    val filter = new StopRecognition()
    filter.insertStopNatures("w") // filter out punctuation

    // Segment each line with ansj and hash the terms into a sparse TF vector
    val tf_num_pairs = lines.map { l =>
      hashingTF.transform(
        ToAnalysis.parse(l).recognition(filter).toStringWithOutNature(" ").split(" ").toSeq)
    }
    tf_num_pairs.persist(StorageLevel.MEMORY_AND_DISK_SER)
    // tf_num_pairs.foreach(println)

    // Compute IDF
    val idf2 = new IDF().fit(tf_num_pairs)
    // Compute the TF-IDF vectors
    val tfidf2 = idf2.transform(tf_num_pairs)
    // Collect the feature vectors of all lines
    val vectorArr = tfidf2.collect()
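The IndexOutOfBoundsException in the log below appears to originate in org.ansj.domain.Result.toStringWithOutNature ("Index: 0, Size: 0"), which suggests that some line segments to an empty term list, rather than a Spark memory problem as such. A minimal defensive sketch of the join step follows; tokensOrEmpty and its wiring are my assumptions for illustration, not code from the original post:

```scala
// Hedged sketch: guard the empty-segmentation case before joining terms.
// `terms` is a hypothetical stand-in for the term strings of an ansj Result.
def tokensOrEmpty(terms: List[String]): Seq[String] =
  if (terms.isEmpty) Seq.empty[String] // avoids "Index: 0, Size: 0"
  else terms.mkString(" ").split(" ").toSeq // mirrors toStringWithOutNature(" ").split(" ")
```

In the pipeline above, the map step could then, under the same assumption, transform tokensOrEmpty(...) instead of calling toStringWithOutNature directly, optionally followed by a filter that drops the resulting empty vectors.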
18/08/22 14:32:09 INFO TaskSetManager: Starting task 0.0 in stage 1.0 (TID 103, localhost, executor driver, partition 0, PROCESS_LOCAL, 4879 bytes)
18/08/22 14:32:09 INFO TaskSetManager: Starting task 1.0 in stage 1.0 (TID 104, localhost, executor driver, partition 1, PROCESS_LOCAL, 4876 bytes)
18/08/22 14:32:09 INFO Executor: Running task 0.0 in stage 1.0 (TID 103)
18/08/22 14:32:09 INFO Executor: Running task 1.0 in stage 1.0 (TID 104)
18/08/22 14:32:09 INFO HadoopRDD: Input split: file:/home/lb/bigtxt1/1888317156269327_新建 Microsoft Word 文档.txt:0+37098
18/08/22 14:32:09 INFO HadoopRDD: Input split: file:/home/lb/bigtxt1/1888177631454919_yantao试修改0 - 副本.txt:0+30651
18/08/22 14:32:09 INFO MyStaticValue: init version.number to env value is : 2.11.11
18/08/22 14:32:09 INFO MyStaticValue: init copyright.string to env value is : Copyright 2002-2017, LAMP/EPFL
18/08/22 14:32:09 INFO MyStaticValue: init osgi.version.number to env value is : 2.11.11.v20170413-090219-8a413ba7cc
18/08/22 14:32:09 INFO MyStaticValue: init maven.version.number to env value is : 2.11.11
18/08/22 14:32:09 INFO File2Stream: path to stream library/ambiguity.dic
18/08/22 14:32:09 ERROR AmbiguityLibrary: Init ambiguity library error :org.ansj.exception.LibraryException:  path :library/ambiguity.dic file:/home/lb/IdeaProjects/SimpleA/library/ambiguity.dic not found or can not to read, path: library/ambiguity.dic
18/08/22 14:32:09 INFO File2Stream: path to stream library/ambiguity.dic
18/08/22 14:32:09 ERROR AmbiguityLibrary: Init ambiguity library error :org.ansj.exception.LibraryException:  path :library/ambiguity.dic file:/home/lb/IdeaProjects/SimpleA/library/ambiguity.dic not found or can not to read, path: library/ambiguity.dic
18/08/22 14:32:09 INFO File2Stream: path to stream library/default.dic
18/08/22 14:32:09 ERROR DicLibrary: Init ambiguity library error :org.ansj.exception.LibraryException:  path :library/default.dic file:/home/lb/IdeaProjects/SimpleA/library/default.dic not found or can not to read, path: library/default.dic
18/08/22 14:32:10 INFO DATDictionary: init core library ok use time : 479
18/08/22 14:32:10 INFO NgramLibrary: init ngram ok use time :763
18/08/22 14:32:11 INFO MemoryStore: Block rdd_3_0 stored as bytes in memory (estimated size 61.5 KB, free 310.5 MB)
18/08/22 14:32:11 INFO BlockManagerInfo: Added rdd_3_0 in memory on 192.168.2.28:41065 (size: 61.5 KB, free: 310.7 MB)
18/08/22 14:32:11 INFO MemoryStore: Block rdd_3_1 stored as bytes in memory (estimated size 59.4 KB, free 310.4 MB)
18/08/22 14:32:11 INFO BlockManagerInfo: Added rdd_3_1 in memory on 192.168.2.28:41065 (size: 59.4 KB, free: 310.7 MB)
18/08/22 14:32:12 INFO Executor: Finished task 1.0 in stage 1.0 (TID 104). 1901 bytes result sent to driver
18/08/22 14:32:12 INFO Executor: Finished task 0.0 in stage 1.0 (TID 103). 1901 bytes result sent to driver
18/08/22 14:32:12 INFO TaskSetManager: Starting task 2.0 in stage 1.0 (TID 105, localhost, executor driver, partition 2, PROCESS_LOCAL, 4860 bytes)
18/08/22 14:32:12 INFO Executor: Running task 2.0 in stage 1.0 (TID 105)
18/08/22 14:32:12 INFO TaskSetManager: Starting task 3.0 in stage 1.0 (TID 106, localhost, executor driver, partition 3, PROCESS_LOCAL, 4858 bytes)
18/08/22 14:32:12 INFO Executor: Running task 3.0 in stage 1.0 (TID 106)
18/08/22 14:32:12 INFO HadoopRDD: Input split: file:/home/lb/bigtxt1/1888070113070312_查重3.txt:0+36682
18/08/22 14:32:12 INFO HadoopRDD: Input split: file:/home/lb/bigtxt1/1888057954518326_最终版.txt:0+37267
18/08/22 14:32:12 INFO TaskSetManager: Finished task 0.0 in stage 1.0 (TID 103) in 2806 ms on localhost (executor driver) (1/103)
18/08/22 14:32:12 INFO TaskSetManager: Finished task 1.0 in stage 1.0 (TID 104) in 2805 ms on localhost (executor driver) (2/103)
18/08/22 14:32:12 INFO MemoryStore: Block rdd_3_2 stored as bytes in memory (estimated size 58.8 KB, free 310.4 MB)
18/08/22 14:32:12 INFO BlockManagerInfo: Added rdd_3_2 in memory on 192.168.2.28:41065 (size: 58.8 KB, free: 310.6 MB)
18/08/22 14:32:12 INFO MemoryStore: Block rdd_3_3 stored as bytes in memory (estimated size 65.0 KB, free 310.3 MB)
18/08/22 14:32:12 INFO BlockManagerInfo: Added rdd_3_3 in memory on 192.168.2.28:41065 (size: 65.0 KB, free: 310.5 MB)
18/08/22 14:32:12 INFO Executor: Finished task 3.0 in stage 1.0 (TID 106). 1815 bytes result sent to driver
18/08/22 14:32:12 INFO Executor: Finished task 2.0 in stage 1.0 (TID 105). 1815 bytes result sent to driver
18/08/22 14:32:12 INFO TaskSetManager: Starting task 4.0 in stage 1.0 (TID 107, localhost, executor driver, partition 4, PROCESS_LOCAL, 4892 bytes)
18/08/22 14:32:12 INFO Executor: Running task 4.0 in stage 1.0 (TID 107)
18/08/22 14:32:12 INFO TaskSetManager: Starting task 5.0 in stage 1.0 (TID 108, localhost, executor driver, partition 5, PROCESS_LOCAL, 4877 bytes)
18/08/22 14:32:12 INFO Executor: Running task 5.0 in stage 1.0 (TID 108)
18/08/22 14:32:12 INFO TaskSetManager: Finished task 3.0 in stage 1.0 (TID 106) in 117 ms on localhost (executor driver) (3/103)
18/08/22 14:32:12 INFO TaskSetManager: Finished task 2.0 in stage 1.0 (TID 105) in 119 ms on localhost (executor driver) (4/103)
18/08/22 14:32:12 INFO HadoopRDD: Input split: file:/home/lb/bigtxt1/1888053638429143_2016届开题报告 20120402117 程虹霖.txt:0+47679
18/08/22 14:32:12 INFO HadoopRDD: Input split: file:/home/lb/bigtxt1/1888063397110050_Revised2012070836赵晶晶.txt:0+33154
18/08/22 14:32:12 INFO MemoryStore: Block rdd_3_5 stored as bytes in memory (estimated size 70.3 KB, free 310.2 MB)
18/08/22 14:32:12 INFO BlockManagerInfo: Added rdd_3_5 in memory on 192.168.2.28:41065 (size: 70.3 KB, free: 310.5 MB)
18/08/22 14:32:12 INFO MemoryStore: Block rdd_3_4 stored as bytes in memory (estimated size 59.1 KB, free 310.2 MB)
18/08/22 14:32:12 INFO BlockManagerInfo: Added rdd_3_4 in memory on 192.168.2.28:41065 (size: 59.1 KB, free: 310.4 MB)
18/08/22 14:32:12 INFO Executor: Finished task 5.0 in stage 1.0 (TID 108). 1815 bytes result sent to driver
18/08/22 14:32:12 INFO TaskSetManager: Starting task 6.0 in stage 1.0 (TID 109, localhost, executor driver, partition 6, PROCESS_LOCAL, 4869 bytes)
18/08/22 14:32:12 INFO Executor: Running task 6.0 in stage 1.0 (TID 109)
18/08/22 14:32:12 INFO TaskSetManager: Finished task 5.0 in stage 1.0 (TID 108) in 84 ms on localhost (executor driver) (5/103)
18/08/22 14:32:12 INFO HadoopRDD: Input split: file:/home/lb/bigtxt1/1888095555698219_论文检测文件.txt:0+41786
18/08/22 14:32:12 INFO Executor: Finished task 4.0 in stage 1.0 (TID 107). 1815 bytes result sent to driver
18/08/22 14:32:12 INFO TaskSetManager: Starting task 7.0 in stage 1.0 (TID 110, localhost, executor driver, partition 7, PROCESS_LOCAL, 4863 bytes)
18/08/22 14:32:12 INFO Executor: Running task 7.0 in stage 1.0 (TID 110)
18/08/22 14:32:12 INFO TaskSetManager: Finished task 4.0 in stage 1.0 (TID 107) in 91 ms on localhost (executor driver) (6/103)
18/08/22 14:32:12 INFO HadoopRDD: Input split: file:/home/lb/bigtxt1/1888114911814909_毕业论文.txt:0+19597
18/08/22 14:32:12 INFO MemoryStore: Block rdd_3_7 stored as bytes in memory (estimated size 34.8 KB, free 310.1 MB)
18/08/22 14:32:12 INFO BlockManagerInfo: Added rdd_3_7 in memory on 192.168.2.28:41065 (size: 34.8 KB, free: 310.4 MB)
18/08/22 14:32:12 INFO MemoryStore: Block rdd_3_6 stored as bytes in memory (estimated size 64.0 KB, free 310.1 MB)
18/08/22 14:32:12 INFO BlockManagerInfo: Added rdd_3_6 in memory on 192.168.2.28:41065 (size: 64.0 KB, free: 310.3 MB)
18/08/22 14:32:12 INFO Executor: Finished task 7.0 in stage 1.0 (TID 110). 1815 bytes result sent to driver
18/08/22 14:32:12 INFO TaskSetManager: Starting task 8.0 in stage 1.0 (TID 111, localhost, executor driver, partition 8, PROCESS_LOCAL, 4875 bytes)
18/08/22 14:32:12 INFO Executor: Running task 8.0 in stage 1.0 (TID 111)
18/08/22 14:32:12 INFO TaskSetManager: Finished task 7.0 in stage 1.0 (TID 110) in 58 ms on localhost (executor driver) (7/103)
18/08/22 14:32:12 INFO HadoopRDD: Input split: file:/home/lb/bigtxt1/1888334917283358_于亚男双学位论文.txt:0+41390
18/08/22 14:32:12 INFO Executor: Finished task 6.0 in stage 1.0 (TID 109). 1815 bytes result sent to driver
18/08/22 14:32:12 INFO TaskSetManager: Starting task 9.0 in stage 1.0 (TID 112, localhost, executor driver, partition 9, PROCESS_LOCAL, 4911 bytes)
18/08/22 14:32:12 INFO Executor: Running task 9.0 in stage 1.0 (TID 112)
18/08/22 14:32:12 INFO TaskSetManager: Finished task 6.0 in stage 1.0 (TID 109) in 76 ms on localhost (executor driver) (8/103)
18/08/22 14:32:12 INFO HadoopRDD: Input split: file:/home/lb/bigtxt1/1888242730087353_周文杰120403021013某货车升降尾板液压系统设计.txt:0+52077
18/08/22 14:32:12 INFO MemoryStore: Block rdd_3_8 stored as bytes in memory (estimated size 64.7 KB, free 310.0 MB)
18/08/22 14:32:12 INFO BlockManagerInfo: Added rdd_3_8 in memory on 192.168.2.28:41065 (size: 64.7 KB, free: 310.2 MB)
18/08/22 14:32:12 WARN BlockManager: Putting block rdd_3_9 failed due to an exception
18/08/22 14:32:12 WARN BlockManager: Block rdd_3_9 could not be removed as it was not found on disk or in memory
18/08/22 14:32:12 INFO Executor: Finished task 8.0 in stage 1.0 (TID 111). 1815 bytes result sent to driver
18/08/22 14:32:12 INFO TaskSetManager: Starting task 10.0 in stage 1.0 (TID 113, localhost, executor driver, partition 10, PROCESS_LOCAL, 4942 bytes)
18/08/22 14:32:12 INFO Executor: Running task 10.0 in stage 1.0 (TID 113)
18/08/22 14:32:12 INFO TaskSetManager: Finished task 8.0 in stage 1.0 (TID 111) in 85 ms on localhost (executor driver) (9/103)
18/08/22 14:32:12 INFO HadoopRDD: Input split: file:/home/lb/bigtxt1/1888131719508548_20121308041_基于iOS端通讯协作平台Vmoso产品-地点模块的功能测试 - 副本.txt:0+43333
18/08/22 14:32:12 ERROR Executor: Exception in task 9.0 in stage 1.0 (TID 112)
java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
    at java.util.ArrayList.rangeCheck(ArrayList.java:657)
    at java.util.ArrayList.get(ArrayList.java:433)
    at org.ansj.domain.Result.toStringWithOutNature(Result.java:78)
    at main.scala.lb.spark.SimpleA$$anonfun$2.apply(SimpleA.scala:36)
    at main.scala.lb.spark.SimpleA$$anonfun$2.apply(SimpleA.scala:35)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
    at org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:372)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1055)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1029)
    at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:969)
    at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1029)
    at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:760)
    at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
18/08/22 14:32:12 INFO MemoryStore: Block rdd_3_10 stored as bytes in memory (estimated size 113.1 KB, free 309.9 MB)
18/08/22 14:32:12 INFO BlockManagerInfo: Added rdd_3_10 in memory on 192.168.2.28:41065 (size: 113.1 KB, free: 310.1 MB)
18/08/22 14:32:12 INFO Executor: Finished task 10.0 in stage 1.0 (TID 113). 1815 bytes result sent to driver
18/08/22 14:32:12 INFO TaskSetManager: Starting task 11.0 in stage 1.0 (TID 114, localhost, executor driver, partition 11, PROCESS_LOCAL, 4873 bytes)
18/08/22 14:32:12 INFO Executor: Running task 11.0 in stage 1.0 (TID 114)
18/08/22 14:32:12 INFO TaskSetManager: Finished task 10.0 in stage 1.0 (TID 113) in 126 ms on localhost (executor driver) (10/103)
18/08/22 14:32:12 INFO HadoopRDD: Input split: file:/home/lb/bigtxt1/1888162242261727_安亚楠 论文二稿.txt:0+27499
18/08/22 14:32:12 INFO MemoryStore: Block rdd_3_11 stored as bytes in memory (estimated size 41.6 KB, free 309.9 MB)
18/08/22 14:32:12 INFO BlockManagerInfo: Added rdd_3_11 in memory on 192.168.2.28:41065 (size: 41.6 KB, free: 310.1 MB)
18/08/22 14:32:12 INFO TaskSetManager: Starting task 12.0 in stage 1.0 (TID 115, localhost, executor driver, partition 12, PROCESS_LOCAL, 4879 bytes)
18/08/22 14:32:12 INFO Executor: Running task 12.0 in stage 1.0 (TID 115)
18/08/22 14:32:12 INFO HadoopRDD: Input split: file:/home/lb/bigtxt1/1888116512963837_20120510630427谢睿臻4。2.txt:0+45352
18/08/22 14:32:12 INFO Executor: Finished task 11.0 in stage 1.0 (TID 114). 1815 bytes result sent to driver
18/08/22 14:32:12 INFO TaskSetManager: Starting task 13.0 in stage 1.0 (TID 116, localhost, executor driver, partition 13, PROCESS_LOCAL, 4873 bytes)
18/08/22 14:32:12 INFO Executor: Running task 13.0 in stage 1.0 (TID 116)
18/08/22 14:32:12 INFO TaskSetManager: Finished task 11.0 in stage 1.0 (TID 114) in 57 ms on localhost (executor driver) (11/103)
18/08/22 14:32:12 INFO HadoopRDD: Input split: file:/home/lb/bigtxt1/1888246595789207_张磊论文完整版1.txt:0+40760
18/08/22 14:32:12 WARN TaskSetManager: Lost task 9.0 in stage 1.0 (TID 112, localhost, executor driver): java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
    at java.util.ArrayList.rangeCheck(ArrayList.java:657)
    at java.util.ArrayList.get(ArrayList.java:433)
    at org.ansj.domain.Result.toStringWithOutNature(Result.java:78)
    at main.scala.lb.spark.SimpleA$$anonfun$2.apply(SimpleA.scala:36)
    at main.scala.lb.spark.SimpleA$$anonfun$2.apply(SimpleA.scala:35)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
    at org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:372)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1055)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1029)
    at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:969)
    at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1029)
    at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:760)
    at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

18/08/22 14:32:12 ERROR TaskSetManager: Task 9 in stage 1.0 failed 1 times; aborting job
18/08/22 14:32:13 INFO TaskSchedulerImpl: Cancelling stage 1
18/08/22 14:32:13 INFO MemoryStore: Block rdd_3_13 stored as bytes in memory (estimated size 60.1 KB, free 309.8 MB)
18/08/22 14:32:13 INFO BlockManagerInfo: Added rdd_3_13 in memory on 192.168.2.28:41065 (size: 60.1 KB, free: 310.0 MB)
18/08/22 14:32:13 INFO TaskSchedulerImpl: Stage 1 was cancelled
18/08/22 14:32:13 INFO Executor: Executor is trying to kill task 12.0 in stage 1.0 (TID 115), reason: stage cancelled
18/08/22 14:32:13 INFO Executor: Executor is trying to kill task 13.0 in stage 1.0 (TID 116), reason: stage cancelled
18/08/22 14:32:13 INFO MemoryStore: Block rdd_3_12 stored as bytes in memory (estimated size 68.8 KB, free 309.7 MB)
18/08/22 14:32:13 INFO DAGScheduler: ShuffleMapStage 1 (treeAggregate at IDF.scala:54) failed in 3.387 s due to Job aborted due to stage failure: Task 9 in stage 1.0 failed 1 times, most recent failure: Lost task 9.0 in stage 1.0 (TID 112, localhost, executor driver): java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
    at java.util.ArrayList.rangeCheck(ArrayList.java:657)
    at java.util.ArrayList.get(ArrayList.java:433)
    at org.ansj.domain.Result.toStringWithOutNature(Result.java:78)
    at main.scala.lb.spark.SimpleA$$anonfun$2.apply(SimpleA.scala:36)
    at main.scala.lb.spark.SimpleA$$anonfun$2.apply(SimpleA.scala:35)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
    at org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:372)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1055)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1029)
    at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:969)
    at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1029)
    at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:760)
    at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
18/08/22 14:32:13 INFO BlockManagerInfo: Added rdd_3_12 in memory on 192.168.2.28:41065 (size: 68.8 KB, free: 310.0 MB)
18/08/22 14:32:13 INFO DAGScheduler: Job 1 failed: treeAggregate at IDF.scala:54, took 3.462292 s
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 9 in stage 1.0 failed 1 times, most recent failure: Lost task 9.0 in stage 1.0 (TID 112, localhost, executor driver): java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
    at java.util.ArrayList.rangeCheck(ArrayList.java:657)
    at java.util.ArrayList.get(ArrayList.java:433)
    at org.ansj.domain.Result.toStringWithOutNature(Result.java:78)
    at main.scala.lb.spark.SimpleA$$anonfun$2.apply(SimpleA.scala:36)
    at main.scala.lb.spark.SimpleA$$anonfun$2.apply(SimpleA.scala:35)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
    at org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:372)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1055)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1029)
    at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:969)
    at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1029)
    at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:760)
    at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)

Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1533)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1521)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1520)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1520)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
    at scala.Option.foreach(Option.scala:257)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1748)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1703)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1692)
    at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
    at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2029)
    at org.apache.spark.SparkContext.runJob(SparkContext.scala:2126)
    at org.apache.spark.rdd.RDD$$anonfun$reduce$1.apply(RDD.scala:1026)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
    at org.apache.spark.rdd.RDD.reduce(RDD.scala:1008)
    at org.apache.spark.rdd.RDD$$anonfun$treeAggregate$1.apply(RDD.scala:1151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
    at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
    at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
    at org.apache.spark.rdd.RDD.treeAggregate(RDD.scala:1128)
    at org.apache.spark.mllib.feature.IDF.fit(IDF.scala:54)
    at main.scala.lb.spark.SimpleA$.main(SimpleA.scala:43)
    at main.scala.lb.spark.SimpleA.main(SimpleA.scala)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
Caused by: java.lang.IndexOutOfBoundsException: Index: 0, Size: 0
    at java.util.ArrayList.rangeCheck(ArrayList.java:657)
    at java.util.ArrayList.get(ArrayList.java:433)
    at org.ansj.domain.Result.toStringWithOutNature(Result.java:78)
    at main.scala.lb.spark.SimpleA$$anonfun$2.apply(SimpleA.scala:36)
    at main.scala.lb.spark.SimpleA$$anonfun$2.apply(SimpleA.scala:35)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
    at org.apache.spark.storage.memory.MemoryStore.putIteratorAsBytes(MemoryStore.scala:372)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1055)
    at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1029)
    at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:969)
    at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1029)
    at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:760)
    at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
    at org.apache.spark.scheduler.Task.run(Task.scala:109)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
    at java.lang.Thread.run(Thread.java:748)
18/08/22 14:32:13 INFO SparkContext: Invoking stop() from shutdown hook
18/08/22 14:32:13 INFO SparkUI: Stopped Spark web UI at http://192.168.2.28:4040
18/08/22 14:32:13 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
18/08/22 14:32:13 INFO Executor: Executor killed task 13.0 in stage 1.0 (TID 116), reason: stage cancelled
18/08/22 14:32:13 INFO Executor: Executor killed task 12.0 in stage 1.0 (TID 115), reason: stage cancelled
18/08/22 14:32:13 INFO MemoryStore: MemoryStore cleared
18/08/22 14:32:13 INFO BlockManager: BlockManager stopped
18/08/22 14:32:13 INFO BlockManagerMaster: BlockManagerMaster stopped
18/08/22 14:32:13 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
18/08/22 14:32:13 INFO SparkContext: Successfully stopped SparkContext
18/08/22 14:32:13 INFO ShutdownHookManager: Shutdown hook called
18/08/22 14:32:13 INFO ShutdownHookManager: Deleting directory /tmp/spark-3eb19991-aca3-42e9-aa2a-885fb1429225

0 answers:

No answers yet