Use zipWithIndex to key an RDD and get a new RDD

Date: 2019-05-18 20:47:25

Tags: scala apache-spark rdd

I am creating an RDD with wholeTextFiles, which gives me the file path and the file text. I want a new RDD pairing each file path with an index from zipWithIndex.

I tried map, but it did not work.
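A minimal sketch of the intended transformation on a plain Scala collection (the (path, text) pairs below are made-up placeholders standing in for wholeTextFiles output; Scala collections support the same zipWithIndex idiom as RDDs):

```scala
object ZipWithIndexSketch {
  def main(args: Array[String]): Unit = {
    // stand-in for wholeTextFiles output: (filePath, fileText) pairs
    val files = Seq(("file:/data/a.txt", "aaa"), ("file:/data/b.txt", "bbb"))

    // zipWithIndex pairs each element with its position; keep (index, path), drop the text
    val indexed = files.zipWithIndex.map { case ((path, _), idx) => (idx.toLong, path) }

    indexed.foreach { case (i, p) => println(s"Index $i file name $p") }
  }
}
```

The same `map { case ((path, _), idx) => (idx, path) }` shape applies to the RDD returned by `wholeTextFiles(...).zipWithIndex()`.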

1 answer:

Answer 0 (score: 1)


First of all: in theory we can do this, but is it really necessary? You could write a plain HDFS program to find the file names with an index; I mean, a Spark RDD is not required just to display file names with indexes.
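To illustrate the answer's point, here is a non-Spark sketch that lists files with an index using the plain Java file API (the directory path is a placeholder taken from the question's example; a real HDFS version would use the Hadoop FileSystem API instead):

```scala
import java.io.File

object ListFilesWithIndex {
  def main(args: Array[String]): Unit = {
    // placeholder path; point this at the directory holding the parquet files
    val dir = new File("C:\\Users\\Downloads\\codebase\\spark-general-examples")
    val files = Option(dir.listFiles()).getOrElse(Array.empty[File])

    // zipWithIndex on a plain array gives the same (element, index) pairing, no Spark needed
    files.sortBy(_.getName).zipWithIndex.foreach {
      case (f, i) => println(s"Index $i file name ${f.getName}")
    }
  }
}
```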


I have the following files.

(screenshot: the input files userdata1.parquet through userdata5.parquet)

Now I am doing the transformation shown below...

import org.apache.log4j.{Level, Logger}
import org.apache.spark.internal.Logging
import org.apache.spark.sql.SparkSession

/**
  * @author : Ram Ghadiyaram
  */
object WholeTextFiles extends Logging {
  Logger.getLogger("org").setLevel(Level.WARN)

  def main(args: Array[String]): Unit = {
    val appName = if (args.length > 0) args(0) else this.getClass.getName
    val spark: SparkSession = SparkSession.builder
      .config("spark.master", "local[*]") //.config("spark.eventLog.enabled", "true")
      .appName(appName)
      .getOrCreate()

    // map transformation to form the new RDD: zipWithIndex pairs each
    // (path, text) element with a Long index; keep (index, path) and drop the text
    val finalresult = spark.sparkContext
      .wholeTextFiles("C:\\Users\\Downloads\\codebase\\spark-general-examples\\userdata*.parquet")
      .zipWithIndex()
      .map { case ((filePath, _), index) => (index, filePath) }

    println("  print the small rdd this is your tranformed RDD ")

    finalresult.sortByKey(ascending = true).foreach {
      case (index, filePath) => println(s"\n Index $index file name  $filePath  ")
    }
    println("done")
  }
}

Result:

  print the small rdd this is your tranformed RDD 

 Index 0 file name  file:/C:/Users/Downloads/codebase/spark-general-examples/userdata1.parquet  

 Index 3 file name  file:/C:/Users/Downloads/codebase/spark-general-examples/userdata4.parquet  

 Index 1 file name  file:/C:/Users/Downloads/codebase/spark-general-examples/userdata2.parquet  

 Index 4 file name  file:/C:/Users/Downloads/codebase/spark-general-examples/userdata5.parquet  

 Index 2 file name  file:/C:/Users/Downloads/codebase/spark-general-examples/userdata3.parquet  
done
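Note that even with sortByKey, the printed lines above are not in index order: RDD.foreach runs on the executors partition by partition, so print order is not guaranteed. If ordered output matters, one option (assuming the RDD is small enough) is to collect() first and print at the driver; the driver-side sort-then-print logic looks like this on a plain collection:

```scala
object SortedPrintSketch {
  def main(args: Array[String]): Unit = {
    // stand-in for finalresult.collect(): (index, path) pairs in arbitrary order
    val collected = Seq((3L, "userdata4.parquet"), (0L, "userdata1.parquet"), (1L, "userdata2.parquet"))

    // sorting and printing at the driver guarantees output order, unlike RDD.foreach
    collected.sortBy(_._1).foreach { case (i, p) => println(s"Index $i file name $p") }
  }
}
```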