Suppose this is my data:
'Maps' and 'Reduces' are two phases of solving a query in HDFS.
'Map' is responsible to read data from input location.
it will generate a key value pair.
that is, an intermediate output in local machine.
'Reducer' is responsible to process the intermediate.
output received from the mapper and generate the final output.
I want to add a number to every line, as in the output below:
1,'Maps' and 'Reduces' are two phases of solving a query in HDFS.
2,'Map' is responsible to read data from input location.
3,it will generate a key value pair.
4,that is, an intermediate output in local machine.
5,'Reducer' is responsible to process the intermediate.
6,output received from the mapper and generate the final output.
And then save them to a file.
I tried:
import org.apache.spark.{SparkConf, SparkContext}

object DS_E5 {
  def main(args: Array[String]): Unit = {
    var i = 0
    val conf = new SparkConf().setAppName("prep").setMaster("local")
    val sc = new SparkContext(conf)
    val sample1 = sc.textFile("data.txt")
    for (sample <- sample1) {
      i = i + 1
      val ss = sample.map(l => (i, sample))
      println(ss)
    }
  }
}
But its output looks like this:
Vector((1,'Maps' and 'Reduces' are two phases of solving a query in HDFS.))
...
How can I edit my code to produce my desired output?
Answer 0 (score: 5)
zipWithIndex is just what you need. It maps from RDD[T] to RDD[(T, Long)] by adding the index in the second position of the pair.
sample1
  .zipWithIndex()
  .map { case (line, i) => i.toString + ", " + line }
Or using string interpolation (see @DanielC.Sobral's comment):
sample1
  .zipWithIndex()
  .map { case (line, i) => s"$i, $line" }
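Note that zipWithIndex numbers elements starting from 0, so the snippets above would emit "0, …" for the first line. To match the desired 1-based output and save the result to a file, as the question asks, here is a minimal sketch, assuming "numbered_output" as a hypothetical output directory:

sample1
  .zipWithIndex()
  .map { case (line, i) => s"${i + 1},$line" }  // shift to 1-based numbering to match the desired output
  .saveAsTextFile("numbered_output")            // assumed output directory; Spark writes part files into it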
Answer 1 (score: 2)
Calling val sample1 = sc.textFile("data.txt") creates a new RDD. If you just need the output printed, you can try the following code:
sample1.zipWithIndex().foreach(f => println(f._2 + ", " + f._1))
Basically, by using this code you will do the following: .zipWithIndex() returns a new RDD[(T, Long)], where (T, Long) is a tuple, T is the data type of the elements of the previous RDD (java.lang.String, I believe), and Long is the index of the element in the RDD. foreach fits well in this case: it applies your statement to every element of the current RDD, so here we simply call a quick formatting println.
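Putting it together, here is a minimal self-contained sketch (assuming Spark is on the classpath and data.txt is in the working directory; the object and output-directory names are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

object AddLineNumbers {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("prep").setMaster("local")
    val sc = new SparkContext(conf)

    sc.textFile("data.txt")
      .zipWithIndex()                               // RDD[String] -> RDD[(String, Long)], index starts at 0
      .map { case (line, i) => s"${i + 1},$line" }  // 1-based, comma-separated, as in the desired output
      .saveAsTextFile("numbered_output")            // hypothetical output directory

    sc.stop()
  }
}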