Suppose this is my data:
'Maps' and 'Reduces' are two phases of solving a query in HDFS.
'Map' is responsible to read data from input location.
it will generate a key value pair.
that is, an intermediate output in local machine.
'Reducer' is responsible to process the intermediate.
output received from the mapper and generate the final output.
I want to add a number to every line, as in the output below:
1,'Maps' and 'Reduces' are two phases of solving a query in HDFS.
2,'Map' is responsible to read data from input location.
3,it will generate a key value pair.
4,that is, an intermediate output in local machine.
5,'Reducer' is responsible to process the intermediate.
6,output received from the mapper and generate the final output.
And then save them to a file.
I tried:
import org.apache.spark.{SparkConf, SparkContext}

object DS_E5 {
  def main(args: Array[String]): Unit = {
    var i = 0
    val conf = new SparkConf().setAppName("prep").setMaster("local")
    val sc = new SparkContext(conf)
    val sample1 = sc.textFile("data.txt")
    for (sample <- sample1) {
      i = i + 1
      val ss = sample.map(l => (i, sample))
      println(ss)
    }
  }
}
But its output looks like this:
Vector((1,'Maps' and 'Reduces' are two phases of solving a query in HDFS.))
...
How can I edit my code to produce my desired output?
Answer 0 (score: 5)
zipWithIndex is just what you need. It maps from RDD[T] to RDD[(T, Long)] by adding the index in the second position of the pair.
sample1
  .zipWithIndex()
  .map { case (line, i) => i.toString + ", " + line }
Or using string interpolation (see @DanielC.Sobral's comment):
sample1
  .zipWithIndex()
  .map { case (line, i) => s"$i, $line" }
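Note that zipWithIndex numbers elements starting from 0, so the snippets above would emit "0, …" for the first line. To match the desired 1-based output and save the result to a file, as the question asks, here is a minimal sketch, assuming "numbered_output" as a hypothetical output directory:

sample1
  .zipWithIndex()
  .map { case (line, i) => s"${i + 1},$line" }  // shift to 1-based numbering to match the desired output
  .saveAsTextFile("numbered_output")            // assumed output directory; Spark writes part files into it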
Answer 1 (score: 2)
Calling val sample1 = sc.textFile("data.txt") creates a new RDD. If you just need the output printed, you can try the following code:
sample1.zipWithIndex().foreach(f => println(f._2 + ", " + f._1))
Basically, by using this code you will do the following: .zipWithIndex() returns a new RDD[(T, Long)], where (T, Long) is a tuple, T is the data type of the elements of the previous RDD (java.lang.String, I believe), and Long is the index of the element in the RDD. foreach fits well in this case: it applies your statement to every element of the current RDD, so here we simply call a quick formatting println.
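Putting it together, here is a minimal self-contained sketch (assuming Spark is on the classpath and data.txt is in the working directory; the object and output-directory names are illustrative):

import org.apache.spark.{SparkConf, SparkContext}

object AddLineNumbers {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("prep").setMaster("local")
    val sc = new SparkContext(conf)

    sc.textFile("data.txt")
      .zipWithIndex()                               // RDD[String] -> RDD[(String, Long)], index starts at 0
      .map { case (line, i) => s"${i + 1},$line" }  // 1-based, comma-separated, as in the desired output
      .saveAsTextFile("numbered_output")            // hypothetical output directory

    sc.stop()
  }
}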