StackOverflowError due to long RDD lineage

Date: 2015-12-25 09:49:10

Tags: scala apache-spark rdd

I have thousands of small files in HDFS and need to process a somewhat smaller subset of them (also numbering in the thousands); fileList contains the paths of the files that need to be processed.


// Once outside the loop, performing any action on masterRDD results in a StackOverflowError caused by the long RDD lineage

// fileList == list of filepaths in HDFS

var masterRDD: org.apache.spark.rdd.RDD[(String, String)] = sparkContext.emptyRDD

for (i <- 0 to fileList.size() - 1) {
  val filePath = fileList.get(i)
  val fileRDD = sparkContext.textFile(filePath)
  // keep only the marker lines, tagged with their source file path
  val sampleRDD = fileRDD.filter(line => line.startsWith("#####")).map(line => (filePath, line))

  // each iteration nests another UnionRDD on top of the previous lineage
  masterRDD = masterRDD.union(sampleRDD)
}

masterRDD.first()

1 Answer:

Answer 0 (score: 31)

In general, you can use checkpoints to break long lineages. Something more or less like this should work:

import org.apache.spark.rdd.RDD
import scala.reflect.ClassTag

val checkpointInterval: Int = ???

def loadAndFilter(path: String) = sc.textFile(path)
  .filter(_.startsWith("#####"))
  .map((path, _))

def mergeWithLocalCheckpoint[T: ClassTag](interval: Int)
  (acc: RDD[T], xi: (RDD[T], Int)) = {
    // localCheckpoint every `interval` merges to truncate the growing lineage
    if (xi._2 % interval == 0 && xi._2 > 0) xi._1.union(acc).localCheckpoint
    else xi._1.union(acc)
  }

val zero: RDD[(String, String)] = sc.emptyRDD[(String, String)]
fileList.map(loadAndFilter).zipWithIndex
  .foldLeft(zero)(mergeWithLocalCheckpoint(checkpointInterval))
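
One caveat: localCheckpoint keeps the checkpointed blocks on the executors, so the data is lost if an executor fails. If you need reliable checkpointing instead, a sketch of the same merge function (reusing the imports above, and assuming a hypothetical, writable HDFS directory):

sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints") // hypothetical path; must be reachable by all executors

def mergeWithReliableCheckpoint[T: ClassTag](interval: Int)
  (acc: RDD[T], xi: (RDD[T], Int)) = {
    val merged = xi._1.union(acc)
    if (xi._2 % interval == 0 && xi._2 > 0) {
      merged.checkpoint() // marks the RDD; data is written on the next action
      merged.count()      // force materialization so the lineage is truncated here
    }
    merged
  }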

In this particular case, a much simpler solution is to use the SparkContext.union method:

val masterRDD = sc.union(
  fileList.map(path => sc.textFile(path)
    .filter(_.startsWith("#####"))
    .map((path, _))) 
)
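
A quick way to confirm the difference is to compare lineages with toDebugString (exact output varies between Spark versions):

// the loop version prints a deeply nested chain of UnionRDDs,
// while sc.union produces a single flat UnionRDD
println(masterRDD.toDebugString)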

The difference between these approaches becomes obvious when you look at the DAG generated by the loop / reduce version:

[image: DAG generated by the loop / reduce]

and by a single union:

[image: DAG generated by a single union]

Of course, if the files are small, you can combine wholeTextFiles with flatMap and read all the files at once:

sc.wholeTextFiles(fileList.mkString(","))
  .flatMap{case (path, text) =>  
    text.split("\n").filter(_.startsWith("#####")).map((path, _))}
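
Keep in mind that wholeTextFiles materializes each file as a single (path, content) record, so this is only practical for small files. If the default number of partitions is too low for thousands of paths, you can pass a minPartitions hint (a sketch; 100 is an arbitrary value to tune for your cluster):

sc.wholeTextFiles(fileList.mkString(","), minPartitions = 100) // hint only; each file still stays in one record
  .flatMap { case (path, text) =>
    text.split("\n").filter(_.startsWith("#####")).map((path, _)) }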