How do I add a line number to each line?

Time: 2015-07-03 19:10:35

Tags: scala text apache-spark

Suppose this is my data:

‘Maps‘ and ‘Reduces‘ are two phases of solving a query in HDFS.
‘Map’ is responsible to read data from input location.
it will generate a key value pair.
that is, an intermediate output in local machine.
’Reducer’ is responsible to process the intermediate.
output received from the mapper and generate the final output.

I want to add a number to each line, like the output below:

1,‘Maps‘ and ‘Reduces‘ are two phases of solving a query in HDFS.
2,‘Map’ is responsible to read data from input location.
3,it will generate a key value pair.
4,that is, an intermediate output in local machine.
5,’Reducer’ is responsible to process the intermediate.
6,output received from the mapper and generate the final output.

And save them to a file.

Here is what I tried:

import org.apache.spark.{SparkConf, SparkContext}

object DS_E5 {
  def main(args: Array[String]): Unit = {

    var i = 0
    val conf = new SparkConf().setAppName("prep").setMaster("local")
    val sc = new SparkContext(conf)
    val sample1 = sc.textFile("data.txt")
    for (sample <- sample1) {
      i = i + 1
      val ss = sample.map(l => (i, sample))
      println(ss)
    }
  }
}

But its output looks like this:

Vector((1,‘Maps‘ and ‘Reduces‘ are two phases of solving a query in HDFS.))
...

How can I edit my code to generate my desired output?

2 answers:

Answer 0 (score: 5)

zipWithIndex is what you need. It maps from RDD[T] to RDD[(T, Long)] by adding the index in the second position of the pair:

sample1
   .zipWithIndex()
   .map { case (line, i) => i.toString + ", " + line }

Or using string interpolation (see @DanielC.Sobral's comment):

sample1
    .zipWithIndex()
    .map { case (line, i) => s"$i, $line" }
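
Note that zipWithIndex numbers from 0, so add 1 to the index if you want the numbering to start at 1 as in the desired output. Since the question also asks to save the lines to a file, here is a minimal end-to-end sketch (the output path "output" is my own placeholder; saveAsTextFile writes a directory of part files rather than a single file):

import org.apache.spark.{SparkConf, SparkContext}

object AddLineNumbers {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("prep").setMaster("local")
    val sc = new SparkContext(conf)

    sc.textFile("data.txt")
      .zipWithIndex()                               // RDD[(String, Long)], 0-based index
      .map { case (line, i) => s"${i + 1},$line" }  // +1 so numbering starts at 1
      .saveAsTextFile("output")                     // writes part files under "output/"

    sc.stop()
  }
}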

Answer 1 (score: 2)

Calling val sample1 = sc.textFile("data.txt") creates a new RDD.

If you only need the output, you can try the following code:

sample1.zipWithIndex().foreach(f => println(f._2 + ", " + f._1))

Basically, by using this code, you are doing the following:

  1. Using .zipWithIndex() returns a new RDD[(T, Long)], where (T, Long) is a Tuple: T is the element type of the previous RDD (java.lang.String, I believe) and Long is the index of the element within the RDD.
  2. You have performed a transformation; now you need an action, and foreach fits well here. What it basically does is apply your statement to every element of the current RDD, so we just call println for quick formatting.
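
One caveat worth adding beyond the original answer: foreach runs on the executors, so with setMaster("local") the println output shows up in your console, but on a real cluster it would land in the executor logs instead. If you want to print on the driver, one sketch is to collect first (only safe when the data fits in driver memory):

sample1
  .zipWithIndex()
  .collect()                                                 // pulls all elements to the driver; small data only
  .foreach { case (line, i) => println(s"${i + 1},$line") }  // 1-based numbering to match the desired output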