I am creating an RDD with wholeTextFiles, which gives me the file path and the file text. I want a new RDD containing the file path together with an index from zipWithIndex.
I tried map, but it did not work.
Answer (score: 1)
First, is this really necessary? In theory we can do it, but question whether it needs to be done this way: a plain HDFS program could list the file names with an index, i.e. a Spark RDD is not strictly required just to display file names with indexes.
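To illustrate that suggestion, here is a minimal non-Spark sketch for the local-filesystem case (for HDFS you would use the Hadoop FileSystem API instead); the object name `ListFilesWithIndex` and the helper `indexed` are hypothetical, not from the original answer:

```scala
import java.io.File

object ListFilesWithIndex {
  // Pair each file name with a running index.
  // Sorting first makes the assigned indexes deterministic.
  def indexed(names: Seq[String]): Seq[(Long, String)] =
    names.sorted.zipWithIndex.map { case (name, i) => (i.toLong, name) }

  def main(args: Array[String]): Unit = {
    // List the files in the given directory (default: current directory).
    val dir = new File(args.headOption.getOrElse("."))
    val files = dir.listFiles().toSeq.filter(_.isFile).map(_.getPath)
    indexed(files).foreach { case (i, name) =>
      println(s"Index $i file name $name")
    }
  }
}
```

That said, if the input really does need to flow through Spark, the transformation below does what the question asks.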
I have the following files.
Now I am doing the transformation shown below...
import org.apache.log4j.{Level, Logger}
import org.apache.spark.internal.Logging
import org.apache.spark.sql.SparkSession

/**
 * @author : Ram Ghadiyaram
 */
object WholeTextFiles extends Logging {
  Logger.getLogger("org").setLevel(Level.WARN)

  def main(args: Array[String]): Unit = {
    val appName = if (args.length > 0) args(0) else this.getClass.getName
    val spark: SparkSession = SparkSession.builder
      .config("spark.master", "local[*]") //.config("spark.eventLog.enabled", "true")
      .appName(appName)
      .getOrCreate()

    // map transformation to form the new RDD: (index, file path)
    val finalresult = spark.sparkContext
      .wholeTextFiles("C:\\Users\\Downloads\\codebase\\spark-general-examples\\userdata*.parquet")
      .zipWithIndex()
      .map { case ((path, _), index) => (index, path) }

    println(" print the small rdd this is your transformed RDD ")
    finalresult.sortByKey(true).foreach {
      case (index, path) => println(s"\n Index $index file name $path")
    }
    println("done")
  }
}
Result:
print the small rdd this is your transformed RDD
Index 0 file name file:/C:/Users/Downloads/codebase/spark-general-examples/userdata1.parquet
Index 3 file name file:/C:/Users/Downloads/codebase/spark-general-examples/userdata4.parquet
Index 1 file name file:/C:/Users/Downloads/codebase/spark-general-examples/userdata2.parquet
Index 4 file name file:/C:/Users/Downloads/codebase/spark-general-examples/userdata5.parquet
Index 2 file name file:/C:/Users/Downloads/codebase/spark-general-examples/userdata3.parquet
done