I have the following code and am trying to write the RDD out as 1000 files of roughly equal size. However, I still get only 70 output files, and the file sizes vary widely (from about 50M to 2G). Is there an additional step I need to take so that the output files end up equal in size? Thank you!
val myRDD = input.flatMap { t => ??? }
.reduceByKey { (t1, t2) => ??? ; t3 }
.sortBy(-_._2.size)
.repartition(1000)
.map(t => (t._1 + "_" + t._2.size, t._2.toString))
myRDD.saveAsTextFile("myOutput", classOf[GzipCodec])
Answer 0 (score: 0)
You can use RangePartitioner to create partitions of equal size and then save them as a text file.
An example taken from there:
import org.apache.spark.RangePartitioner
val file = sc.textFile("<my local path>")
val partitionedFile = file.map(x => (x, 1))
// RangePartitioner splits the keys into 3 ranges of roughly equal size
val data = partitionedFile.partitionBy(new RangePartitioner(3, partitionedFile))
// Inspect how many records ended up in each partition
data.glom().collect()(0).length
data.glom().collect()(1).length
data.glom().collect()(2).length
In your case, running saveAsTextFile() instead of collecting and checking the lengths should be enough.
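Applied to the question's pipeline, a minimal sketch could look like the following. Here keyedRDD is a placeholder for the pair RDD produced by the flatMap/reduceByKey steps in the question, and it assumes the key type has an Ordering, which RangePartitioner requires. Note that RangePartitioner balances the number of keys per partition, not the number of output bytes, so the files will only be roughly equal if the records are of comparable size.

import org.apache.hadoop.io.compress.GzipCodec
import org.apache.spark.RangePartitioner

// keyedRDD stands in for the (key, value) RDD produced by the
// flatMap/reduceByKey steps shown in the question
val ranged = keyedRDD.partitionBy(new RangePartitioner(1000, keyedRDD))

ranged
  .map(t => (t._1 + "_" + t._2.size, t._2.toString))
  .saveAsTextFile("myOutput", classOf[GzipCodec])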
Answer 1 (score: 0)
This is fairly straightforward: all you need to do is add repartition(1000), and you will get exactly 1000 output files of roughly equal size.
Your code, modified:
val myRDD = input.flatMap { t => ??? }
.reduceByKey { (t1, t2) => ??? ; t3 }
.sortBy(-_._2.size)
.repartition(1000)
.map(t => (t._1 + "_" + t._2.size, t._2.toString)).repartition(1000)
myRDD.saveAsTextFile("myOutput", classOf[GzipCodec])
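If you want to check how evenly the records are spread across partitions before writing, a quick sketch (counting records per partition rather than collecting the data itself) could look like this; roughly uniform counts here mean the output files should be similar in size, assuming records are of comparable length:

// Count the records in each partition without pulling the data to the driver
val partitionSizes = myRDD
  .mapPartitions(iter => Iterator(iter.size))
  .collect()

println(s"partitions: ${partitionSizes.length}, " +
  s"min: ${partitionSizes.min}, max: ${partitionSizes.max}")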
Answer 2 (score: 0)
The following should solve your problem:
val myRDD = input.flatMap { t => ??? }
.reduceByKey { (t1, t2) => ??? ; t3 }
.sortBy(-_._2.size)
.repartition(1000)
.map(t => (t._1 + "_" + t._2.size, t._2.toString))
myRDD.repartition(1000).saveAsTextFile("myOutput", classOf[GzipCodec])
One thing to note: the original RDD still exists after this call, because RDDs are immutable.
Alternatively, you can use coalesce(1000) if you only want to cap the maximum number of partitions.
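For reference, the difference between the two calls is roughly as follows (a sketch, not tied to the question's data):

// repartition(1000) always performs a full shuffle and redistributes the data
// across exactly 1000 partitions
val shuffled = myRDD.repartition(1000)

// coalesce(1000) only merges existing partitions (no shuffle by default), so it
// can reduce the partition count but will not increase it; if myRDD already has
// fewer than 1000 partitions, the count stays unchanged
val merged = myRDD.coalesce(1000)

// coalesce with shuffle = true behaves like repartition
val mergedWithShuffle = myRDD.coalesce(1000, shuffle = true)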