After a fair amount of filtering and cleaning of the unstructured input data, I now build an RDD of strings (RDD[String]) in which the individual parameters are comma-separated for further processing. One prerequisite is that I need to index groups of strings, as shown below:
val inputRDD1: RDD[(String, Long)] = myUtilities.paragraphFile(spark, path1)
  .coalesce(100 * spark.defaultParallelism)
  .zipWithIndex()           //RDD[(String, Long)]
  .filter(f => f._2 != 0)   //drop the record with index 0

val cleanedRDD1 = myUtilities.cleanSplitData(inputRDD1)
cleanedRDD1.saveAsTextFile("/home/path/cleanedrdd1")
1,parm1,parm2,parm3..
1,parm1,parm2,parm3..
1,parm1,parm2,parm3..
2,parm1,parm2,parm3..
2,parm1,parm2,parm3..
3,parm1,parm2,parm3..
3,parm1,parm2,parm3..
3,parm1,parm2,parm3..
3,parm1,parm2,parm3..
.
.
The above output format is required so that I can run group operations on the index. This works fine when the input file is small. However, when the file is large, the index restarts from 1 depending on the number of partitions. I was able to verify this by saving the intermediate RDD: the cleanedrdd1 directory contains several part files (when the input file is large), and the index in each part file starts again at 1. Instead, I need the indices to run contiguously across the part files. Is there a way to achieve this?
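For reference, the kind of group operation I intend to run on this index looks roughly like the sketch below (simplified; the real field names and aggregation logic differ):

//Simplified sketch of the downstream step (actual aggregation differs):
//split each "idx,parm1,parm2,..." line on the first comma and group by the index.
val grouped = cleanedRDD1
  .map { line =>
    val Array(idx, rest) = line.split(",", 2) //keep everything after the index together
    (idx.toLong, rest)
  }
  .groupByKey() //or aggregateByKey, depending on the operation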
EDIT1:
def cleanSplitData(inputRDD: RDD[(String, Long)]): org.apache.spark.rdd.RDD[String] = {
  var cntflag = false
  var cnt = 0L //group counter; NOTE: each Spark task gets its own serialized copy of this var, so it restarts in every partition
  val cleanedRDD = inputRDD
    .flatMap(line => line._1.split("\n")) //split at newline
    //Note: a bunch of filtering operations are run prior to this step.
    .map(row => {
      //This step makes sure that rows following a row starting with "D," or "#" get the same index.
      //The index is required to run groupBy/aggregate operations later. Hence the counter cnt.
      if (row.startsWith("D,") || row.startsWith("#")) cntflag = true else cntflag = false
      if (cntflag) cnt += 1
      cnt + "," + row
    })
  cleanedRDD
}
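One idea I am considering, shown as an untested sketch below (indexGroups is a name I made up): count the group-start rows in each partition first, then use the cumulative counts as per-partition offsets so the counter keeps increasing across partitions instead of restarting. Here rows stands for the already split and filtered lines, i.e. what cleanSplitData produces before the index is prepended.

import org.apache.spark.rdd.RDD

def indexGroups(rows: RDD[String]): RDD[String] = {
  val isGroupStart = (row: String) => row.startsWith("D,") || row.startsWith("#")

  //1) count the group-start rows in every partition and bring the counts to the driver
  val countsPerPartition = rows
    .mapPartitionsWithIndex { (pid, it) => Iterator((pid, it.count(isGroupStart))) }
    .collect()
    .sortBy(_._1)
    .map(_._2)

  //2) offset for partition i = number of group starts in partitions 0 .. i-1
  val offsets = countsPerPartition.scanLeft(0L)(_ + _)

  //3) second pass: start each partition's counter at its offset
  rows.mapPartitionsWithIndex { (pid, it) =>
    var cnt = offsets(pid)
    it.map { row =>
      if (isGroupStart(row)) cnt += 1
      cnt + "," + row
    }
  }
}

This needs two passes over the data and assumes rows keeps the same partitioning and row order between the passes (e.g. by caching it first), so I am not sure whether it is the right approach or whether there is something more idiomatic.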