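We have 82k+ gzip-compressed text files in AWS S3. I am trying to count the occurrences of a specific field in that data. Below is what I tried, based on the documentation, but it takes forever to process. Most likely I am missing something. How can I speed this up?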
spark-shell --master yarn --driver-memory 10g --executor-memory 10g
15/11/26 10:29:14 INFO MemoryStore: MemoryStore started with capacity 5.2 GB
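// Read every gzip-compressed file matching the glob
val rdd = sc.textFile("s3:path_forfiles*/*.gz")
// Split each line on '|', keep rows with more than 3 fields,
// then count occurrences of the third field (index 2)
val count = rdd.map(x => x.split("\\|"))
  .filter(arr => arr.length > 3)
  .map(x => (x(2), 1))
  .reduceByKey((a, b) => a + b)
// Bring the per-key totals back to the driver
val TotCount = count.collect()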
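To make the intent concrete, here is a minimal sketch of the same counting logic run on a few made-up lines with plain Scala collections (no Spark involved). The sample lines and the name `localCounts` are hypothetical and only for illustration; the real input is the pipe-delimited text inside the .gz files.

```scala
// Hypothetical sample lines standing in for the real pipe-delimited data.
val sample = Seq(
  "a|b|foo|d",
  "a|b|bar|d",
  "a|b|foo|d|e",
  "too|short"
)

// Same steps as the Spark job: split on '|', keep rows with more than
// 3 fields, then sum a 1 for every occurrence of the third field (index 2).
val localCounts: Map[String, Int] = sample
  .map(_.split("\\|"))
  .filter(_.length > 3)
  .map(arr => (arr(2), 1))
  .groupBy(_._1)
  .map { case (key, pairs) => (key, pairs.map(_._2).sum) }
```

For this sample, `localCounts` maps `foo` to 2 and `bar` to 1, and the short line is dropped by the length filter. The Spark job is expected to produce the same kind of (field, count) pairs over the full data set.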
Cloudera cluster with 10 nodes and 500 GB of memory.
Partial stack trace (log output):
15/11/26 10:47:36 INFO SparkContext: Starting job: collect at <console>:29
15/11/26 10:47:36 INFO DAGScheduler: Registering RDD 4 (map at <console>:25)
15/11/26 10:47:36 INFO DAGScheduler: Got job 0 (collect at <console>:29) with 84787 output partitions (allowLocal=false)
15/11/26 10:47:36 INFO DAGScheduler: Final stage: Stage 1(collect at <console>:29)
15/11/26 10:47:36 INFO DAGScheduler: Parents of final stage: List(Stage 0)
15/11/26 10:47:36 INFO DAGScheduler: Missing parents: List(Stage 0)
15/11/26 10:47:37 INFO DAGScheduler: Submitting Stage 0 (MapPartitionsRDD[4] at map at <console>:25), which has no missing parents
15/11/26 10:47:37 INFO MemoryStore: ensureFreeSpace(3920) called with curMem=296213, maxMem=5556708311
15/11/26 10:47:37 INFO MemoryStore: Block broadcast_1 stored as values in memory (estimated size 3.8 KB, free 5.2 GB)
15/11/26 10:47:37 INFO MemoryStore: ensureFreeSpace(2226) called with curMem=300133, maxMem=5556708311
15/11/26 10:47:37 INFO MemoryStore: Block broadcast_1_piece0 stored as bytes in memory (estimated size 2.2 KB, free 5.2 GB)
15/11/26 10:47:37 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on <IP_ADDRESS> (size: 2.2 KB, free: 5.2 GB)
15/11/26 10:47:37 INFO BlockManagerMaster: Updated info of block broadcast_1_piece0
15/11/26 10:47:37 INFO SparkContext: Created broadcast 1 from broadcast at DAGScheduler.scala:834
15/11/26 10:47:37 INFO DAGScheduler: Submitting 84787 missing tasks from Stage 0 (MapPartitionsRDD[4] at map at <console>:25)
15/11/26 10:47:38 INFO YarnScheduler: Adding task set 0.0 with 84787 tasks
15/11/26 10:47:38 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, <IP_ADDRESS>
15/11/26 10:47:38 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, <IP_ADDRESS>
15/11/26 10:47:38 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on <IP_ADDRESS>
15/11/26 10:47:38 INFO BlockManagerInfo: Added broadcast_1_piece0 in memory on <IP_ADDRESS>
15/11/26 10:47:38 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on <IP_ADDRESS>
15/11/26 10:47:38 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on <IP_ADDRESS>
15/11/26 10:47:41 INFO TaskSetManager: Starting task 2.0 in stage 0.0 (TID 2, <IP_ADDRESS>
15/11/26 10:47:41 INFO TaskSetManager: Finished task 1.0 in stage 0.0 (TID 1) in 3523 ms on <IP_ADDRESS>
15/11/26 10:47:41 INFO TaskSetManager: Starting task 3.0 in stage 0.0 (TID 3, <IP_ADDRESS>
15/11/26 10:47:41 INFO TaskSetManager: Finished task 2.0 in stage 0.0 (TID 2) in 126 ms on <IP_ADDRESS>
15/11/26 10:47:41 INFO TaskSetManager: Starting task 4.0 in stage 0.0 (TID 4, <IP_ADDRESS>
15/11/26 10:47:41 INFO TaskSetManager: Finished task 3.0 in stage 0.0 (TID 3) in 141 ms on <IP_ADDRESS>
15/11/26 10:47:42 INFO TaskSetManager: Starting task 5.0 in stage 0.0 (TID 5, <IP_ADDRESS>
15/11/26 10:47:42 INFO TaskSetManager: Finished task 0.0 in stage 0.0 (TID 0) in 4179 ms on <IP_ADDRESS>
15/11/26 10:47:42 INFO TaskSetManager: Starting task 6.0 in stage 0.0 (TID 6, <IP_ADDRESS>
15/11/26 10:47:42 INFO TaskSetManager: Finished task 5.0 in stage 0.0 (TID 5) in 530 ms on <IP_ADDRESS>
15/11/26 10:47:43 INFO TaskSetManager: Starting task 7.0 in stage 0.0 (TID 7, <IP_ADDRESS>
15/11/26 10:47:43 INFO TaskSetManager: Finished task 6.0 in stage 0.0 (TID 6) in 1135 ms on <IP_ADDRESS>
15/11/26 10:47:44 INFO TaskSetManager: Starting task 8.0 in stage 0.0 (TID 8, <IP_ADDRESS>
15/11/26 10:29:18 INFO YarnClientSchedulerBackend: ApplicationMaster registered as Actor[akka.tcp://sparkYarnAM@<IP_ADDRESS>
15/11/26 10:29:18 INFO YarnClientSchedulerBackend: Add WebUI Filter. org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter, Map(PROXY_HOSTS -> <IP_ADDRESS>
15/11/26 10:29:18 INFO JettyUtils: Adding filter: org.apache.hadoop.yarn.server.webproxy.amfilter.AmIpFilter
15/11/26 10:29:18 INFO NettyBlockTransferService: Server created on 37858
15/11/26 10:29:18 INFO BlockManagerMaster: Trying to register BlockManager
15/11/26 10:29:18 INFO BlockManagerMasterActor: Registering block manager <IP_ADDRESS> with 5.2 GB RAM, BlockManagerId(<driver>, <IP_ADDRESS>
15/11/26 10:29:18 INFO BlockManagerMaster: Registered BlockManager
15/11/26 10:29:19 INFO EventLoggingListener: Logging events to hdfs://<IP_ADDRESS>
15/11/26 10:29:20 INFO YarnClientSchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@<IP_ADDRESS>
15/11/26 10:29:20 INFO YarnClientSchedulerBackend: Registered executor: Actor[akka.tcp://sparkExecutor@<IP_ADDRESS>
15/11/26 10:29:20 INFO YarnClientSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
15/11/26 10:29:20 INFO SparkILoop: Created spark context..
Spark context available as sc.
15/11/26 10:46:59 INFO MemoryStore: ensureFreeSpace(273447) called with curMem=0, maxMem=5556708311
15/11/26 10:46:59 INFO MemoryStore: Block broadcast_0 stored as values in memory (estimated size 267.0 KB, free 5.2 GB)
15/11/26 10:47:00 INFO MemoryStore: ensureFreeSpace(22766) called with curMem=273447, maxMem=5556708311
15/11/26 10:47:00 INFO MemoryStore: Block broadcast_0_piece0 stored as bytes in memory (estimated size 22.2 KB, free 5.2 GB)
15/11/26 10:47:00 INFO BlockManagerInfo: Added broadcast_0_piece0 in memory on <IP_ADDRESS> (size: 22.2 KB, free: 5.2 GB)
15/11/26 10:47:00 INFO BlockManagerMaster: Updated info of block broadcast_0_piece0
15/11/26 10:47:00 INFO SparkContext: Created broadcast 0 from textFile at <console>:21
rdd: org.apache.spark.rdd.RDD[String] = s3n://tivo-arm-logs/201408*/* MapPartitionsRDD[1] at textFile at <console>:21
scala> val spl = rdd.map(x => x.split("\\|"))
spl: org.apache.spark.rdd.RDD[Array[String]] = MapPartitionsRDD[2] at map at <console>:23
scala> val fin = spl.filter(arr => (arr.length > 3)).map(x => (x(2),1))
fin: org.apache.spark.rdd.RDD[(String, Int)] = MapPartitionsRDD[4] at map at <console>:25
scala> val count = fin.reduceByKey((a, b) => a + b)
15/11/26 10:47:18 INFO FileInputFormat: Total input paths to process : 84787
count: org.apache.spark.rdd.RDD[(String, Int)] = ShuffledRDD[5] at reduceByKey at <console>:27