Spark StackOverflowError - org.apache.spark.sql.catalyst.plans.QueryPlan

Date: 2018-07-03 02:56:54

Tags: java scala apache-spark hadoop apache-spark-sql

I am running a Spark (v2.1.1) job that joins two datasets (one a .txt file on S3, the other Parquet on S3) and then merges them: for each PK, the job keeps the latest row, i.e. if the PK exists in the .txt it takes that row, otherwise it keeps the Parquet row. It then writes a new Parquet file to S3. The job hit the error below, but ran fine after a re-run. Both the .txt and the Parquet have 302 columns; the .txt has 191 rows and the Parquet has 156,300 rows. Does anyone know the cause?
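For reference, the merge semantics described above can be sketched with plain Scala collections instead of DataFrames (all identifiers here are illustrative, not from the job itself): rows are keyed by primary key, and a source row wins over the previous Parquet row with the same key.

```scala
// Minimal sketch of the upsert semantics, assuming single-column string PKs.
object UpsertSketch {
  type Row = Map[String, String]

  // Keep the source row for any PK present in the source; otherwise keep
  // the previous row. This mirrors the inner-join / except / union flow
  // in the real job, without Spark.
  def upsert(prev: Seq[Row], src: Seq[Row], pk: String): Seq[Row] = {
    val srcByPk = src.map(r => r(pk) -> r).toMap
    val kept    = prev.filterNot(r => srcByPk.contains(r(pk)))
    kept ++ src
  }
}
```

This is only a model of the intended result, not of the DataFrame plan that actually overflows the stack.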

   18/07/02 13:44:23 INFO TaskSchedulerImpl: Removed TaskSet 9.0, whose tasks have all completed, from pool
18/07/02 13:44:23 INFO DAGScheduler: ShuffleMapStage 9 (count at BaseJob.scala:93) finished in 0.055 s
18/07/02 13:44:23 INFO DAGScheduler: looking for newly runnable stages
18/07/02 13:44:23 INFO DAGScheduler: running: Set()
18/07/02 13:44:23 INFO DAGScheduler: waiting: Set(ResultStage 10)
18/07/02 13:44:23 INFO DAGScheduler: failed: Set()
18/07/02 13:44:23 INFO DAGScheduler: Submitting ResultStage 10 (MapPartitionsRDD[38] at count at BaseJob.scala:93), which has no missing parents
18/07/02 13:44:23 INFO MemoryStore: Block broadcast_13 stored as values in memory (estimated size 7.0 KB, free 911.2 MB)
18/07/02 13:44:23 INFO MemoryStore: Block broadcast_13_piece0 stored as bytes in memory (estimated size 3.7 KB, free 911.2 MB)
18/07/02 13:44:23 INFO BlockManagerInfo: Added broadcast_13_piece0 in memory on 10.160.123.242:38105 (size: 3.7 KB, free: 912.1 MB)
18/07/02 13:44:23 INFO SparkContext: Created broadcast 13 from broadcast at DAGScheduler.scala:996
18/07/02 13:44:23 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 10 (MapPartitionsRDD[38] at count at BaseJob.scala:93)
18/07/02 13:44:23 INFO TaskSchedulerImpl: Adding task set 10.0 with 1 tasks
18/07/02 13:44:23 INFO TaskSetManager: Starting task 0.0 in stage 10.0 (TID 127, 10.160.122.226, executor 0, partition 0, NODE_LOCAL, 5948 bytes)
18/07/02 13:44:23 INFO BlockManagerInfo: Added broadcast_13_piece0 in memory on 10.160.122.226:38011 (size: 3.7 KB, free: 4.6 GB)
18/07/02 13:44:23 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 5 to 10.160.122.226:45952
18/07/02 13:44:23 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 5 is 163 bytes
18/07/02 13:44:23 INFO TaskSetManager: Finished task 0.0 in stage 10.0 (TID 127) in 18 ms on 10.160.122.226 (executor 0) (1/1)
18/07/02 13:44:23 INFO TaskSchedulerImpl: Removed TaskSet 10.0, whose tasks have all completed, from pool
18/07/02 13:44:23 INFO DAGScheduler: ResultStage 10 (count at BaseJob.scala:93) finished in 0.018 s
18/07/02 13:44:23 INFO DAGScheduler: Job 4 finished: count at BaseJob.scala:93, took 0.146750 s
18/07/02 13:44:23 INFO Delta3: Count of the clean and error records are 191,0
18/07/02 13:44:24 INFO InMemoryTableScanExec: Predicate (silfstatus#190246 = 1) generates partition filter: ((silfstatus.lowerBound#199645 <= 1) && (1 <= silfstatus.upperBound#199644))
18/07/02 13:44:24 INFO BlockManagerInfo: Removed broadcast_11_piece0 on 10.160.123.242:38105 in memory (size: 161.7 KB, free: 912.3 MB)
18/07/02 13:44:24 INFO BlockManagerInfo: Removed broadcast_11_piece0 on 10.160.122.226:38011 in memory (size: 161.7 KB, free: 4.6 GB)
18/07/02 13:44:24 INFO BlockManagerInfo: Removed broadcast_13_piece0 on 10.160.123.242:38105 in memory (size: 3.7 KB, free: 912.3 MB)
18/07/02 13:44:24 INFO BlockManagerInfo: Removed broadcast_13_piece0 on 10.160.122.226:38011 in memory (size: 3.7 KB, free: 4.6 GB)
18/07/02 13:44:24 INFO BlockManagerInfo: Removed broadcast_12_piece0 on 10.160.123.242:38105 in memory (size: 4.4 KB, free: 912.3 MB)
18/07/02 13:44:24 INFO BlockManagerInfo: Removed broadcast_12_piece0 on 10.160.122.226:38011 in memory (size: 4.4 KB, free: 4.6 GB)
18/07/02 13:44:26 INFO SparkContext: Starting job: take at Utils.scala:28
18/07/02 13:44:26 INFO DAGScheduler: Got job 5 (take at Utils.scala:28) with 1 output partitions
18/07/02 13:44:26 INFO DAGScheduler: Final stage: ResultStage 11 (take at Utils.scala:28)
18/07/02 13:44:26 INFO DAGScheduler: Parents of final stage: List()
18/07/02 13:44:26 INFO DAGScheduler: Missing parents: List()
18/07/02 13:44:26 INFO DAGScheduler: Submitting ResultStage 11 (MapPartitionsRDD[42] at take at Utils.scala:28), which has no missing parents
18/07/02 13:44:26 INFO MemoryStore: Block broadcast_14 stored as values in memory (estimated size 611.5 KB, free 911.4 MB)
18/07/02 13:44:26 INFO MemoryStore: Block broadcast_14_piece0 stored as bytes in memory (estimated size 164.7 KB, free 911.2 MB)
18/07/02 13:44:26 INFO BlockManagerInfo: Added broadcast_14_piece0 in memory on 10.160.123.242:38105 (size: 164.7 KB, free: 912.1 MB)
18/07/02 13:44:26 INFO SparkContext: Created broadcast 14 from broadcast at DAGScheduler.scala:996
18/07/02 13:44:26 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 11 (MapPartitionsRDD[42] at take at Utils.scala:28)
18/07/02 13:44:26 INFO TaskSchedulerImpl: Adding task set 11.0 with 1 tasks
18/07/02 13:44:26 INFO TaskSetManager: Starting task 0.0 in stage 11.0 (TID 128, 10.160.122.226, executor 0, partition 0, PROCESS_LOCAL, 6578 bytes)
18/07/02 13:44:26 INFO BlockManagerInfo: Added broadcast_14_piece0 in memory on 10.160.122.226:38011 (size: 164.7 KB, free: 4.6 GB)
18/07/02 13:44:26 INFO TaskSetManager: Finished task 0.0 in stage 11.0 (TID 128) in 127 ms on 10.160.122.226 (executor 0) (1/1)
18/07/02 13:44:26 INFO TaskSchedulerImpl: Removed TaskSet 11.0, whose tasks have all completed, from pool
18/07/02 13:44:26 INFO DAGScheduler: ResultStage 11 (take at Utils.scala:28) finished in 0.127 s
18/07/02 13:44:26 INFO DAGScheduler: Job 5 finished: take at Utils.scala:28, took 0.152945 s
18/07/02 13:44:27 INFO BlockManagerInfo: Removed broadcast_14_piece0 on 10.160.123.242:38105 in memory (size: 164.7 KB, free: 912.3 MB)
18/07/02 13:44:27 INFO BlockManagerInfo: Removed broadcast_14_piece0 on 10.160.122.226:38011 in memory (size: 164.7 KB, free: 4.6 GB)
18/07/02 13:44:27 INFO ContextCleaner: Cleaned accumulator 3394
18/07/02 13:44:27 INFO ContextCleaner: Cleaned accumulator 3395
18/07/02 13:44:28 INFO InMemoryTableScanExec: Predicate (silfstatus#190246 = 0) generates partition filter: ((silfstatus.lowerBound#201463 <= 0) && (0 <= silfstatus.upperBound#201462))
18/07/02 13:51:53 INFO Upsert$: =====Joining the source file and previous hive partition=====
18/07/02 13:51:53 INFO FileSourceStrategy: Pruning directories with:
18/07/02 13:51:53 INFO FileSourceStrategy: Post-Scan Filters:
18/07/02 13:51:53 INFO FileSourceStrategy: Output Data Schema: struct<>
18/07/02 13:51:53 INFO FileSourceStrategy: Pushed Filters:
18/07/02 13:51:53 INFO CodeGenerator: Code generated in 11.072558 ms
18/07/02 13:51:53 INFO MemoryStore: Block broadcast_15 stored as values in memory (estimated size 301.6 KB, free 911.7 MB)
18/07/02 13:51:53 INFO MemoryStore: Block broadcast_15_piece0 stored as bytes in memory (estimated size 25.6 KB, free 911.7 MB)
18/07/02 13:51:53 INFO BlockManagerInfo: Added broadcast_15_piece0 in memory on 10.160.123.242:38105 (size: 25.6 KB, free: 912.2 MB)
18/07/02 13:51:53 INFO SparkContext: Created broadcast 15 from count at Data.scala:23
18/07/02 13:51:53 INFO FileSourceScanExec: Planning scan with bin packing, max size: 48443541 bytes, open cost is considered as scanning 4194304 bytes.
18/07/02 13:51:53 INFO SparkContext: Starting job: count at Data.scala:23
18/07/02 13:51:53 INFO DAGScheduler: Registering RDD 52 (count at Data.scala:23)
18/07/02 13:51:53 INFO DAGScheduler: Got job 6 (count at Data.scala:23) with 1 output partitions
18/07/02 13:51:53 INFO DAGScheduler: Final stage: ResultStage 13 (count at Data.scala:23)
18/07/02 13:51:53 INFO DAGScheduler: Parents of final stage: List(ShuffleMapStage 12)
18/07/02 13:51:53 INFO DAGScheduler: Missing parents: List(ShuffleMapStage 12)
18/07/02 13:51:53 INFO DAGScheduler: Submitting ShuffleMapStage 12 (MapPartitionsRDD[52] at count at Data.scala:23), which has no missing parents
18/07/02 13:51:53 INFO MemoryStore: Block broadcast_16 stored as values in memory (estimated size 9.4 KB, free 911.6 MB)
18/07/02 13:51:53 INFO MemoryStore: Block broadcast_16_piece0 stored as bytes in memory (estimated size 4.9 KB, free 911.6 MB)
18/07/02 13:51:54 INFO BlockManagerInfo: Added broadcast_16_piece0 in memory on 10.160.123.242:38105 (size: 4.9 KB, free: 912.2 MB)
18/07/02 13:51:54 INFO SparkContext: Created broadcast 16 from broadcast at DAGScheduler.scala:996
18/07/02 13:51:54 INFO DAGScheduler: Submitting 4 missing tasks from ShuffleMapStage 12 (MapPartitionsRDD[52] at count at Data.scala:23)
18/07/02 13:51:54 INFO TaskSchedulerImpl: Adding task set 12.0 with 4 tasks
18/07/02 13:51:54 INFO TaskSetManager: Starting task 0.0 in stage 12.0 (TID 129, 10.160.122.226, executor 0, partition 0, PROCESS_LOCAL, 9110 bytes)
18/07/02 13:51:54 INFO TaskSetManager: Starting task 1.0 in stage 12.0 (TID 130, 10.160.122.226, executor 0, partition 1, PROCESS_LOCAL, 9110 bytes)
18/07/02 13:51:54 INFO TaskSetManager: Starting task 2.0 in stage 12.0 (TID 131, 10.160.122.226, executor 0, partition 2, PROCESS_LOCAL, 9110 bytes)
18/07/02 13:51:54 INFO TaskSetManager: Starting task 3.0 in stage 12.0 (TID 132, 10.160.122.226, executor 0, partition 3, PROCESS_LOCAL, 9110 bytes)
18/07/02 13:51:54 INFO BlockManagerInfo: Added broadcast_16_piece0 in memory on 10.160.122.226:38011 (size: 4.9 KB, free: 4.6 GB)
18/07/02 13:51:54 INFO BlockManagerInfo: Added broadcast_15_piece0 in memory on 10.160.122.226:38011 (size: 25.6 KB, free: 4.6 GB)
18/07/02 13:51:55 INFO TaskSetManager: Finished task 0.0 in stage 12.0 (TID 129) in 1169 ms on 10.160.122.226 (executor 0) (1/4)
18/07/02 13:51:55 INFO TaskSetManager: Finished task 3.0 in stage 12.0 (TID 132) in 1174 ms on 10.160.122.226 (executor 0) (2/4)
18/07/02 13:51:55 INFO TaskSetManager: Finished task 1.0 in stage 12.0 (TID 130) in 1394 ms on 10.160.122.226 (executor 0) (3/4)
18/07/02 13:51:56 INFO TaskSetManager: Finished task 2.0 in stage 12.0 (TID 131) in 2292 ms on 10.160.122.226 (executor 0) (4/4)
18/07/02 13:51:56 INFO TaskSchedulerImpl: Removed TaskSet 12.0, whose tasks have all completed, from pool
18/07/02 13:51:56 INFO DAGScheduler: ShuffleMapStage 12 (count at Data.scala:23) finished in 2.293 s
18/07/02 13:51:56 INFO DAGScheduler: looking for newly runnable stages
18/07/02 13:51:56 INFO DAGScheduler: running: Set()
18/07/02 13:51:56 INFO DAGScheduler: waiting: Set(ResultStage 13)
18/07/02 13:51:56 INFO DAGScheduler: failed: Set()
18/07/02 13:51:56 INFO DAGScheduler: Submitting ResultStage 13 (MapPartitionsRDD[55] at count at Data.scala:23), which has no missing parents
18/07/02 13:51:56 INFO MemoryStore: Block broadcast_17 stored as values in memory (estimated size 7.0 KB, free 911.6 MB)
18/07/02 13:51:56 INFO MemoryStore: Block broadcast_17_piece0 stored as bytes in memory (estimated size 3.7 KB, free 911.6 MB)
18/07/02 13:51:56 INFO BlockManagerInfo: Added broadcast_17_piece0 in memory on 10.160.123.242:38105 (size: 3.7 KB, free: 912.2 MB)
18/07/02 13:51:56 INFO SparkContext: Created broadcast 17 from broadcast at DAGScheduler.scala:996
18/07/02 13:51:56 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 13 (MapPartitionsRDD[55] at count at Data.scala:23)
18/07/02 13:51:56 INFO TaskSchedulerImpl: Adding task set 13.0 with 1 tasks
18/07/02 13:51:56 INFO TaskSetManager: Starting task 0.0 in stage 13.0 (TID 133, 10.160.122.226, executor 0, partition 0, NODE_LOCAL, 5949 bytes)
18/07/02 13:51:56 INFO BlockManagerInfo: Added broadcast_17_piece0 in memory on 10.160.122.226:38011 (size: 3.7 KB, free: 4.6 GB)
18/07/02 13:51:56 INFO MapOutputTrackerMasterEndpoint: Asked to send map output locations for shuffle 6 to 10.160.122.226:45952
18/07/02 13:51:56 INFO MapOutputTrackerMaster: Size of output statuses for shuffle 6 is 153 bytes
18/07/02 13:51:56 INFO TaskSetManager: Finished task 0.0 in stage 13.0 (TID 133) in 12 ms on 10.160.122.226 (executor 0) (1/1)
18/07/02 13:51:56 INFO TaskSchedulerImpl: Removed TaskSet 13.0, whose tasks have all completed, from pool
18/07/02 13:51:56 INFO DAGScheduler: ResultStage 13 (count at Data.scala:23) finished in 0.012 s
18/07/02 13:51:56 INFO DAGScheduler: Job 6 finished: count at Data.scala:23, took 2.315589 s
18/07/02 13:51:56 INFO SparkContext: Starting job: load at Data.scala:25
18/07/02 13:51:56 INFO DAGScheduler: Got job 7 (load at Data.scala:25) with 1 output partitions
18/07/02 13:51:56 INFO DAGScheduler: Final stage: ResultStage 14 (load at Data.scala:25)
18/07/02 13:51:56 INFO DAGScheduler: Parents of final stage: List()
18/07/02 13:51:56 INFO DAGScheduler: Missing parents: List()
18/07/02 13:51:56 INFO DAGScheduler: Submitting ResultStage 14 (MapPartitionsRDD[57] at load at Data.scala:25), which has no missing parents
18/07/02 13:51:56 INFO MemoryStore: Block broadcast_18 stored as values in memory (estimated size 74.6 KB, free 911.6 MB)
18/07/02 13:51:56 INFO MemoryStore: Block broadcast_18_piece0 stored as bytes in memory (estimated size 27.2 KB, free 911.5 MB)
18/07/02 13:51:56 INFO BlockManagerInfo: Added broadcast_18_piece0 in memory on 10.160.123.242:38105 (size: 27.2 KB, free: 912.2 MB)
18/07/02 13:51:56 INFO SparkContext: Created broadcast 18 from broadcast at DAGScheduler.scala:996
18/07/02 13:51:56 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 14 (MapPartitionsRDD[57] at load at Data.scala:25)
18/07/02 13:51:56 INFO TaskSchedulerImpl: Adding task set 14.0 with 1 tasks
18/07/02 13:51:56 INFO TaskSetManager: Starting task 0.0 in stage 14.0 (TID 134, 10.160.122.226, executor 0, partition 0, PROCESS_LOCAL, 6337 bytes)
18/07/02 13:51:56 INFO BlockManagerInfo: Added broadcast_18_piece0 in memory on 10.160.122.226:38011 (size: 27.2 KB, free: 4.6 GB)
18/07/02 13:51:56 INFO TaskSetManager: Finished task 0.0 in stage 14.0 (TID 134) in 295 ms on 10.160.122.226 (executor 0) (1/1)
18/07/02 13:51:56 INFO TaskSchedulerImpl: Removed TaskSet 14.0, whose tasks have all completed, from pool
18/07/02 13:51:56 INFO DAGScheduler: ResultStage 14 (load at Data.scala:25) finished in 0.295 s
18/07/02 13:51:56 INFO DAGScheduler: Job 7 finished: load at Data.scala:25, took 0.310932 s
18/07/02 13:51:57 INFO FileSourceStrategy: Pruning directories with:
18/07/02 13:51:57 INFO FileSourceStrategy: Post-Scan Filters:
18/07/02 13:51:57 INFO FileSourceStrategy: Output Data Schema: struct<row_id: string, created: timestamp, created_by: string, last_upd: timestamp, last_upd_by: string ... 300 more fields>
18/07/02 13:51:57 INFO FileSourceStrategy: Pushed Filters:
18/07/02 13:51:57 INFO MemoryStore: Block broadcast_19 stored as values in memory (estimated size 387.2 KB, free 911.2 MB)
18/07/02 13:51:57 INFO MemoryStore: Block broadcast_19_piece0 stored as bytes in memory (estimated size 33.7 KB, free 911.1 MB)
18/07/02 13:51:57 INFO BlockManagerInfo: Added broadcast_19_piece0 in memory on 10.160.123.242:38105 (size: 33.7 KB, free: 912.2 MB)
18/07/02 13:51:57 INFO SparkContext: Created broadcast 19 from cache at Upsert.scala:25
18/07/02 13:51:57 INFO FileSourceScanExec: Planning scan with bin packing, max size: 48443541 bytes, open cost is considered as scanning 4194304 bytes.
18/07/02 13:51:57 INFO SparkContext: Starting job: take at Utils.scala:28
18/07/02 13:51:57 INFO DAGScheduler: Got job 8 (take at Utils.scala:28) with 1 output partitions
18/07/02 13:51:57 INFO DAGScheduler: Final stage: ResultStage 15 (take at Utils.scala:28)
18/07/02 13:51:57 INFO DAGScheduler: Parents of final stage: List()
18/07/02 13:51:57 INFO DAGScheduler: Missing parents: List()
18/07/02 13:51:57 INFO DAGScheduler: Submitting ResultStage 15 (MapPartitionsRDD[65] at take at Utils.scala:28), which has no missing parents
18/07/02 13:51:57 INFO MemoryStore: Block broadcast_20 stored as values in memory (estimated size 321.5 KB, free 910.8 MB)
18/07/02 13:51:57 INFO MemoryStore: Block broadcast_20_piece0 stored as bytes in memory (estimated size 93.0 KB, free 910.7 MB)
18/07/02 13:51:57 INFO BlockManagerInfo: Added broadcast_20_piece0 in memory on 10.160.123.242:38105 (size: 93.0 KB, free: 912.1 MB)
18/07/02 13:51:57 INFO SparkContext: Created broadcast 20 from broadcast at DAGScheduler.scala:996
18/07/02 13:51:57 INFO DAGScheduler: Submitting 1 missing tasks from ResultStage 15 (MapPartitionsRDD[65] at take at Utils.scala:28)
18/07/02 13:51:57 INFO TaskSchedulerImpl: Adding task set 15.0 with 1 tasks
18/07/02 13:51:57 INFO TaskSetManager: Starting task 0.0 in stage 15.0 (TID 135, 10.160.122.226, executor 0, partition 0, PROCESS_LOCAL, 9035 bytes)
18/07/02 13:51:57 INFO BlockManagerInfo: Added broadcast_20_piece0 in memory on 10.160.122.226:38011 (size: 93.0 KB, free: 4.6 GB)
18/07/02 13:51:57 INFO BlockManagerInfo: Added broadcast_19_piece0 in memory on 10.160.122.226:38011 (size: 33.7 KB, free: 4.6 GB)
18/07/02 13:52:05 INFO BlockManagerInfo: Added rdd_61_0 in memory on 10.160.122.226:38011 (size: 38.9 MB, free: 4.5 GB)
18/07/02 13:52:09 INFO BlockManagerInfo: Added rdd_63_0 in memory on 10.160.122.226:38011 (size: 38.9 MB, free: 4.5 GB)
18/07/02 13:52:09 INFO TaskSetManager: Finished task 0.0 in stage 15.0 (TID 135) in 11751 ms on 10.160.122.226 (executor 0) (1/1)
18/07/02 13:52:09 INFO TaskSchedulerImpl: Removed TaskSet 15.0, whose tasks have all completed, from pool
18/07/02 13:52:09 INFO DAGScheduler: ResultStage 15 (take at Utils.scala:28) finished in 11.751 s
18/07/02 13:52:09 INFO DAGScheduler: Job 8 finished: take at Utils.scala:28, took 11.772561 s
18/07/02 13:52:09 INFO CodeGenerator: Code generated in 185.277258 ms
18/07/02 13:52:09 INFO ContextCleaner: Cleaned accumulator 3459
18/07/02 13:52:09 INFO ContextCleaner: Cleaned accumulator 3452
18/07/02 13:52:09 INFO ContextCleaner: Cleaned accumulator 3456
18/07/02 13:52:09 INFO ContextCleaner: Cleaned accumulator 3455
18/07/02 13:52:09 INFO ContextCleaner: Cleaned accumulator 3458
18/07/02 13:52:09 INFO ContextCleaner: Cleaned accumulator 3450
18/07/02 13:52:09 INFO ContextCleaner: Cleaned accumulator 3460
18/07/02 13:52:09 INFO ContextCleaner: Cleaned accumulator 3449
18/07/02 13:52:09 INFO BlockManagerInfo: Removed broadcast_18_piece0 on 10.160.123.242:38105 in memory (size: 27.2 KB, free: 912.1 MB)
18/07/02 13:52:09 INFO BlockManagerInfo: Removed broadcast_18_piece0 on 10.160.122.226:38011 in memory (size: 27.2 KB, free: 4.5 GB)
18/07/02 13:52:09 INFO ContextCleaner: Cleaned accumulator 3462
18/07/02 13:52:09 INFO BlockManagerInfo: Removed broadcast_17_piece0 on 10.160.123.242:38105 in memory (size: 3.7 KB, free: 912.1 MB)
18/07/02 13:52:09 INFO BlockManagerInfo: Removed broadcast_17_piece0 on 10.160.122.226:38011 in memory (size: 3.7 KB, free: 4.5 GB)
18/07/02 13:52:09 INFO ContextCleaner: Cleaned accumulator 3451
18/07/02 13:52:09 INFO ContextCleaner: Cleaned accumulator 3684
18/07/02 13:52:09 INFO BlockManagerInfo: Removed broadcast_15_piece0 on 10.160.123.242:38105 in memory (size: 25.6 KB, free: 912.1 MB)
18/07/02 13:52:09 INFO BlockManagerInfo: Removed broadcast_15_piece0 on 10.160.122.226:38011 in memory (size: 25.6 KB, free: 4.5 GB)
18/07/02 13:52:09 INFO ContextCleaner: Cleaned accumulator 3453
18/07/02 13:52:09 INFO ContextCleaner: Cleaned accumulator 3457
18/07/02 13:52:09 INFO BlockManagerInfo: Removed broadcast_16_piece0 on 10.160.123.242:38105 in memory (size: 4.9 KB, free: 912.2 MB)
18/07/02 13:52:09 INFO BlockManagerInfo: Removed broadcast_16_piece0 on 10.160.122.226:38011 in memory (size: 4.9 KB, free: 4.5 GB)
18/07/02 13:52:09 INFO ContextCleaner: Cleaned shuffle 6
18/07/02 13:52:09 INFO ContextCleaner: Cleaned accumulator 3461
18/07/02 13:52:09 INFO BlockManagerInfo: Removed broadcast_20_piece0 on 10.160.123.242:38105 in memory (size: 93.0 KB, free: 912.2 MB)
18/07/02 13:52:09 INFO BlockManagerInfo: Removed broadcast_20_piece0 on 10.160.122.226:38011 in memory (size: 93.0 KB, free: 4.5 GB)
18/07/02 13:52:09 INFO ContextCleaner: Cleaned accumulator 3454
18/07/02 13:52:39 INFO SparkContext: Starting job: run at ThreadPoolExecutor.java:1149
18/07/02 13:52:39 INFO DAGScheduler: Got job 9 (run at ThreadPoolExecutor.java:1149) with 4 output partitions
18/07/02 13:52:39 INFO DAGScheduler: Final stage: ResultStage 16 (run at ThreadPoolExecutor.java:1149)
18/07/02 13:52:39 INFO DAGScheduler: Parents of final stage: List()
18/07/02 13:52:39 INFO DAGScheduler: Missing parents: List()
18/07/02 13:52:39 INFO DAGScheduler: Submitting ResultStage 16 (MapPartitionsRDD[67] at run at ThreadPoolExecutor.java:1149), which has no missing parents
18/07/02 13:52:39 INFO MemoryStore: Block broadcast_21 stored as values in memory (estimated size 321.7 KB, free 911.3 MB)
18/07/02 13:52:39 INFO MemoryStore: Block broadcast_21_piece0 stored as bytes in memory (estimated size 93.0 KB, free 911.2 MB)
18/07/02 13:52:39 INFO BlockManagerInfo: Added broadcast_21_piece0 in memory on 10.160.123.242:38105 (size: 93.0 KB, free: 912.2 MB)
18/07/02 13:52:39 INFO SparkContext: Created broadcast 21 from broadcast at DAGScheduler.scala:996
18/07/02 13:52:39 INFO DAGScheduler: Submitting 4 missing tasks from ResultStage 16 (MapPartitionsRDD[67] at run at ThreadPoolExecutor.java:1149)
18/07/02 13:52:39 INFO TaskSchedulerImpl: Adding task set 16.0 with 4 tasks
18/07/02 13:52:39 INFO TaskSetManager: Starting task 0.0 in stage 16.0 (TID 136, 10.160.122.226, executor 0, partition 0, PROCESS_LOCAL, 9098 bytes)
18/07/02 13:52:39 INFO TaskSetManager: Starting task 1.0 in stage 16.0 (TID 137, 10.160.122.226, executor 0, partition 1, PROCESS_LOCAL, 9098 bytes)
18/07/02 13:52:39 INFO TaskSetManager: Starting task 2.0 in stage 16.0 (TID 138, 10.160.122.226, executor 0, partition 2, PROCESS_LOCAL, 9098 bytes)
18/07/02 13:52:39 INFO TaskSetManager: Starting task 3.0 in stage 16.0 (TID 139, 10.160.122.226, executor 0, partition 3, PROCESS_LOCAL, 9098 bytes)
18/07/02 13:52:39 INFO BlockManagerInfo: Added broadcast_21_piece0 in memory on 10.160.122.226:38011 (size: 93.0 KB, free: 4.5 GB)
18/07/02 13:52:39 INFO TaskSetManager: Finished task 0.0 in stage 16.0 (TID 136) in 47 ms on 10.160.122.226 (executor 0) (1/4)
18/07/02 13:52:46 INFO BlockManagerInfo: Added rdd_61_2 in memory on 10.160.122.226:38011 (size: 38.8 MB, free: 4.5 GB)
18/07/02 13:52:46 INFO BlockManagerInfo: Added rdd_61_3 in memory on 10.160.122.226:38011 (size: 38.7 MB, free: 4.4 GB)
18/07/02 13:52:46 INFO BlockManagerInfo: Added rdd_61_1 in memory on 10.160.122.226:38011 (size: 38.8 MB, free: 4.4 GB)
18/07/02 13:52:49 INFO BlockManagerInfo: Added rdd_63_2 in memory on 10.160.122.226:38011 (size: 38.8 MB, free: 4.3 GB)
18/07/02 13:52:49 INFO TaskSetManager: Finished task 2.0 in stage 16.0 (TID 138) in 10368 ms on 10.160.122.226 (executor 0) (2/4)
18/07/02 13:52:50 INFO BlockManagerInfo: Added rdd_63_3 in memory on 10.160.122.226:38011 (size: 38.7 MB, free: 4.3 GB)
18/07/02 13:52:50 INFO TaskSetManager: Finished task 3.0 in stage 16.0 (TID 139) in 10617 ms on 10.160.122.226 (executor 0) (3/4)
18/07/02 13:52:50 INFO BlockManagerInfo: Added rdd_63_1 in memory on 10.160.122.226:38011 (size: 38.8 MB, free: 4.3 GB)
18/07/02 13:52:50 INFO TaskSetManager: Finished task 1.0 in stage 16.0 (TID 137) in 10668 ms on 10.160.122.226 (executor 0) (4/4)
18/07/02 13:52:50 INFO TaskSchedulerImpl: Removed TaskSet 16.0, whose tasks have all completed, from pool
18/07/02 13:52:50 INFO DAGScheduler: ResultStage 16 (run at ThreadPoolExecutor.java:1149) finished in 10.669 s
18/07/02 13:52:50 INFO DAGScheduler: Job 9 finished: run at ThreadPoolExecutor.java:1149, took 10.684407 s
18/07/02 13:52:50 INFO CodeGenerator: Code generated in 7.746892 ms
18/07/02 13:52:50 INFO MemoryStore: Block broadcast_22 stored as values in memory (estimated size 19.0 MB, free 892.2 MB)
18/07/02 13:52:50 INFO MemoryStore: Block broadcast_22_piece0 stored as bytes in memory (estimated size 3.2 MB, free 889.0 MB)
18/07/02 13:52:50 INFO BlockManagerInfo: Added broadcast_22_piece0 in memory on 10.160.123.242:38105 (size: 3.2 MB, free: 909.0 MB)
18/07/02 13:52:50 INFO SparkContext: Created broadcast 22 from run at ThreadPoolExecutor.java:1149
18/07/02 13:52:50 INFO BlockManagerInfo: Removed broadcast_21_piece0 on 10.160.123.242:38105 in memory (size: 93.0 KB, free: 909.0 MB)
18/07/02 13:52:50 INFO BlockManagerInfo: Removed broadcast_21_piece0 on 10.160.122.226:38011 in memory (size: 93.0 KB, free: 4.3 GB)  
        Exception in thread "main" java.lang.StackOverflowError
                at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$sameResult$1.apply(QueryPlan.scala:373)
                at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$sameResult$1.apply(QueryPlan.scala:373)
                at scala.runtime.Tuple2Zipped$$anonfun$forall$extension$1.apply(Tuple2Zipped.scala:101)
                at scala.runtime.Tuple2Zipped$$anonfun$forall$extension$1.apply(Tuple2Zipped.scala:101)
                at scala.runtime.Tuple2Zipped$$anonfun$exists$extension$1.apply(Tuple2Zipped.scala:92)
                at scala.runtime.Tuple2Zipped$$anonfun$exists$extension$1.apply(Tuple2Zipped.scala:90)
                at scala.collection.immutable.List.foreach(List.scala:381)
                at scala.runtime.Tuple2Zipped$.exists$extension(Tuple2Zipped.scala:90)
                at scala.runtime.Tuple2Zipped$.forall$extension(Tuple2Zipped.scala:101)
                at org.apache.spark.sql.catalyst.plans.QueryPlan.sameResult(QueryPlan.scala:373)
                at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$sameResult$1.apply(QueryPlan.scala:373)
                at org.apache.spark.sql.catalyst.plans.QueryPlan$$anonfun$sameResult$1.apply(QueryPlan.scala:373)
                at scala.runtime.Tuple2Zipped$$anonfun$forall$extension$1.apply(Tuple2Zipped.scala:101)
                at scala.runtime.Tuple2Zipped$$anonfun$forall$extension$1.apply(Tuple2Zipped.scala:101)
                at scala.runtime.Tuple2Zipped$$anonfun$exists$extension$1.apply(Tuple2Zipped.scala:92)
                at scala.runtime.Tuple2Zipped$$anonfun$exists$extension$1.apply(Tuple2Zipped.scala:90)
                at scala.collection.immutable.List.foreach(List.scala:381)
                at scala.runtime.Tuple2Zipped$.exists$extension(Tuple2Zipped.scala:90)
                at scala.runtime.Tuple2Zipped$.forall$extension(Tuple2Zipped.scala:101)
                at org.apache.spark.sql.catalyst.plans.QueryPlan.sameResult(QueryPlan.scala:373)

Code that was run:

object Upsert {
  val logger = Logger.getLogger(getClass.getName)

  def finalDf(srcDf: DataFrame, partitionPath: Option[String], hiveSchema: StructType,
              pkList: List[String], srcSchema: StructType) = {
    logger.info(s"""=====Joining the source file and previous hive partition=====""")
    //val hiveCols = srcSchema.map(f => col(f.name))
    val srcCols = srcSchema.map(f => col("_" + f.name))
    val finalColsType = srcSchema.map(f =>
      if (f.dataType.simpleString.contains("decimal")) (f.name, DecimalType(31, 8))
      else (f.name, f.dataType)
    )
    val finalCols = finalColsType.map(_._1)
    val srcPkList = pkList.map("_" + _)
    val hivedf = extract.Data.readHivePartition(sparkSession, partitionPath, hiveSchema).cache()
    val hiveCols = hivedf.dtypes.toList.map(n => (n._1, stringToStructTypeMapping(n._2)))
    val addedCols = finalColsType.toList.diff(hiveCols)
    val hivedfNew = addMultipleColToDF(hivedf, addedCols).select(finalCols.map(col(_)): _*).cache()
    // Null-safe equality on each primary-key column pair
    val commonDataFilterCond = srcPkList
      .zip(pkList)
      .map { case (c1, c2) => coalesce(col(c1), lit("null")) === coalesce(col(c2), lit("null")) }
      .reduce(_ && _)

    isDfEmpty(hivedfNew) match {
      case true => srcDf
      case false =>
        val srcRename = srcDf.toDF(srcSchema.map("_" + _.name): _*)
        val joinData = srcRename.join(hivedfNew, commonDataFilterCond, "inner")
        val commonData = joinData.select(srcCols: _*)
        val currentData = srcRename.except(commonData).cache
        val prevData = hivedfNew.except(joinData.select(finalCols.map(col(_)): _*))
        currentData.unionAll(prevData).unionAll(commonData).toDF(finalCols: _*)
    }
  }
}

0 Answers:

There are no answers.