Spark: How to save the output (saveAsTextFile) to files with equal size?

Date: 2015-09-01 21:36:52

Tags: scala apache-spark rdd

I have the following code, which tries to write the RDD to 1000 output files of equal size. However, I still get only 70 output files, and their sizes vary widely (from 50M to 2G). Is there an additional step I need to take to make the output files equal in size? Thank you!

val myRDD = input.flatMap { t => ??? }
                 .reduceByKey { (t1, t2) => ??? ; t3 }
                 .sortBy(-_._2.size)
                 .repartition(1000)
                 .map(t => (t._1 + "_" + t._2.size, t._2.toString))

myRDD.saveAsTextFile("myOutput", classOf[GzipCodec])

3 answers:

Answer 0 (score: 0)

You can use RangePartitioner to create partitions of equal size and then save them as text files.

The example below is taken from there:
import org.apache.spark.RangePartitioner

// Key each input line so that it can be range-partitioned
val file = sc.textFile("<my local path>")
val partitionedFile = file.map(x => (x, 1))
val data = partitionedFile.partitionBy(new RangePartitioner(3, partitionedFile))

// Count the records in each of the 3 partitions with a single collect
val sizes = data.glom().map(_.length).collect()
sizes.foreach(println)

In your case, calling saveAsTextFile() should be enough, instead of collecting and checking the lengths.

Answer 1 (score: 0)
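As a rough sketch of that suggestion, the snippet below range-partitions a keyed RDD and writes one gzip file per partition. The object name `RangePartitionSketch`, the helper `rangePartition`, the generated sample data, and the local master are illustrative assumptions, not part of the original answer. Note that RangePartitioner balances the number of records per partition, not the number of bytes, so files can still differ in size when individual records vary a lot.

```scala
import org.apache.hadoop.io.compress.GzipCodec
import org.apache.spark.{RangePartitioner, SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

object RangePartitionSketch {
  // Hypothetical helper: spread a keyed RDD over at most `n` key ranges.
  def rangePartition(pairs: RDD[(String, String)], n: Int): RDD[(String, String)] =
    pairs.partitionBy(new RangePartitioner(n, pairs))

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("range-partition-sketch").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      // Made-up stand-in for the asker's (key, value) records.
      val pairs = sc.parallelize(1 to 100000).map(i => (f"key$i%06d", s"value-$i"))
      val partitioned = rangePartition(pairs, 1000)
      // One output file per partition; this fails if "myOutput" already exists.
      partitioned
        .map { case (k, v) => s"$k\t$v" }
        .saveAsTextFile("myOutput", classOf[GzipCodec])
    } finally sc.stop()
  }
}
```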

This is fairly straightforward: all you need to do is use repartition(1000), and you will get exactly 1000 files of equal size.

Your code, modified:

val myRDD = input.flatMap { t => ??? }
                 .reduceByKey { (t1, t2) => ??? ; t3 }
                 .sortBy(-_._2.size)
                 .repartition(1000)
                 .map(t => (t._1 + "_" + t._2.size, t._2.toString)).repartition(1000)

myRDD.saveAsTextFile("myOutput", classOf[GzipCodec])

Answer 2 (score: 0)

The following should serve your purpose:

val myRDD = input.flatMap { t => ??? }
                 .reduceByKey { (t1, t2) => ??? ; t3 }
                 .sortBy(-_._2.size)
                 .repartition(1000)
                 .map(t => (t._1 + "_" + t._2.size, t._2.toString))

myRDD.repartition(1000).saveAsTextFile("myOutput", classOf[GzipCodec])

One thing to note: the original RDD still exists after this call, because RDDs are immutable and repartition returns a new RDD.

Alternatively, you can use coalesce(1000) if you only want to cap the number of partitions.
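The difference between the two calls can be checked with a small sketch like the one below. By default coalesce does not shuffle, so it can merge partitions but not create more of them, while repartition always shuffles and can go either way. The object name `CoalesceVsRepartition`, the helper `partitionCounts`, and the sample data are illustrative assumptions; the starting count of 70 is borrowed from the question only as an example.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object CoalesceVsRepartition {
  // Hypothetical helper: partition counts after coalesce(target) and
  // repartition(target) on an RDD that starts with `initial` partitions.
  def partitionCounts(sc: SparkContext, initial: Int, target: Int): (Int, Int) = {
    val rdd = sc.parallelize(1 to 10000, initial)
    val coalesced = rdd.coalesce(target)        // no shuffle: can only merge partitions
    val repartitioned = rdd.repartition(target) // full shuffle: can grow or shrink
    (coalesced.partitions.length, repartitioned.partitions.length)
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("coalesce-vs-repartition").setMaster("local[*]")
    val sc = new SparkContext(conf)
    try {
      val (c, r) = partitionCounts(sc, 70, 1000)
      println(s"coalesce(1000): $c partitions, repartition(1000): $r partitions")
    } finally sc.stop()
  }
}
```

Starting from 70 partitions, coalesce(1000) leaves the count at 70, whereas repartition(1000) yields 1000 partitions and therefore 1000 output files.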