Question

I need to read a different file in each map() call. The files are stored in HDFS.

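  import java.io.{BufferedReader, InputStreamReader}
  import org.apache.hadoop.fs.Path

  val rdd = sc.parallelize(1 to 10000)
  val rdd2 = rdd.map { x =>
    // A FileSystem handle is requested on every call to map()
    val hdfs = org.apache.hadoop.fs.FileSystem.get(new java.net.URI("hdfs://ITS-Hadoop10:9000/"), new org.apache.hadoop.conf.Configuration())
    val path = new Path("/user/zhc/" + x + "/")
    val t = hdfs.listStatus(path)
    // Open the first file in the directory and read its first line
    val in = hdfs.open(t(0).getPath)
    val reader = new BufferedReader(new InputStreamReader(in))
    val l = reader.readLine()
    reader.close()
    l
  }
  rdd2.count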

My problem is this line:

val hdfs = org.apache.hadoop.fs.FileSystem.get(new java.net.URI("hdfs://ITS-Hadoop10:9000/"), new org.apache.hadoop.conf.Configuration())

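It takes too long to run, because every map() call has to create a new FileSystem value. Can I move this line outside the map() function so that hdfs is not created every time? Or how can I read files quickly inside map()?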

My code runs on multiple machines. Thank you!

Answer 1

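In your case, I recommend the wholeTextFiles method. It returns a pair RDD in which the key is the full path of a file and the value is that file's content as a string.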

val filesPairRDD = sc.wholeTextFiles("hdfs://ITS-Hadoop10:9000/")
// Pairs of (file path, number of lines in that file). You could apply any other function to the file contents instead.
val filesLineCount = filesPairRDD.map(x => (x._1, x._2.split("\n").length))
filesLineCount.collect()
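
The code in the question only reads the first line of each file, so the same result can probably be obtained from the pair RDD directly. The following is a minimal sketch, not a tested solution: `FirstLines`, `firstLine` and `firstLines` are hypothetical names introduced here, and it assumes the files are small enough to be held in memory as single strings, since wholeTextFiles loads each file whole.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object FirstLines {
  // Hypothetical helper: everything before the first line break,
  // or "" when the content is empty.
  def firstLine(content: String): String =
    content.takeWhile(c => c != '\n' && c != '\r')

  // Pairs of (full file path, first line of that file).
  // `pattern` is whatever path or glob is passed to wholeTextFiles.
  def firstLines(sc: SparkContext, pattern: String): RDD[(String, String)] =
    sc.wholeTextFiles(pattern).mapValues(firstLine)
}
```

In spark-shell this could be used as `FirstLines.firstLines(sc, "hdfs://ITS-Hadoop10:9000/").count`. Spark opens the files itself here, so no FileSystem object appears in user code.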

Edit

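If your files are in subdirectories of the same directory (as mentioned in the comments), you can use a wildcard pattern (a glob rather than a true regular expression):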

val filesPairRDD = sc.wholeTextFiles("hdfs://ITS-Hadoop10:9000/*/")
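
Applied to the directory layout from the question (/user/zhc/1/, /user/zhc/2/, and so on), the wildcard could look like the sketch below. It assumes each numbered directory directly contains the files of interest. `DirIndex`, `dirIndex` and `contentsByDir` are hypothetical names used only for illustration. Unlike the original code, which opens only the first file listed in each directory, this returns every file in every matched directory.

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object DirIndex {
  // Hypothetical helper: for a path such as
  // hdfs://ITS-Hadoop10:9000/user/zhc/42/some-file, returns Some(42).
  // Returns None when the parent directory name is not an integer.
  def dirIndex(fullPath: String): Option[Int] = {
    val parts = fullPath.split('/')
    if (parts.length >= 2) scala.util.Try(parts(parts.length - 2).toInt).toOption
    else None
  }

  // Pairs of (directory number, file content) for all files matched by the glob.
  def contentsByDir(sc: SparkContext): RDD[(Int, String)] =
    sc.wholeTextFiles("hdfs://ITS-Hadoop10:9000/user/zhc/*/")
      .flatMap { case (path, content) => dirIndex(path).map(i => (i, content)) }
}
```

Keying the result by directory number keeps the link to the integers 1 to 10000 that the question started from.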

I hope this is clear and helpful.
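
If the per-record HDFS access from the question has to stay, a common Spark pattern is to use mapPartitions so that the setup runs once per partition rather than once per element. The sketch below is an assumption about how that could look, not a measured fix. `PerPartitionRead`, `dirPath` and `firstLines` are hypothetical names. A FileSystem cannot simply be created on the driver and captured by the closure, because it is not serializable. As far as I know, FileSystem.get usually returns a cached instance anyway, so much of the remaining cost may come from building a new Configuration and from the listStatus and open calls.

```scala
import java.io.{BufferedReader, InputStreamReader}
import java.net.URI
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object PerPartitionRead {
  // Directory layout taken from the question: /user/zhc/<x>/
  def dirPath(x: Int): String = "/user/zhc/" + x + "/"

  // Pairs of (x, first line of the first file listed in /user/zhc/<x>/).
  def firstLines(sc: SparkContext): RDD[(Int, String)] =
    sc.parallelize(1 to 10000).mapPartitions { ids =>
      // Runs on the executor, once per partition instead of once per element.
      val hdfs = FileSystem.get(new URI("hdfs://ITS-Hadoop10:9000/"), new Configuration())
      ids.map { x =>
        val files = hdfs.listStatus(new Path(dirPath(x)))
        val reader = new BufferedReader(new InputStreamReader(hdfs.open(files(0).getPath)))
        // readLine() returns null for an empty file; the reader is always closed.
        try { (x, reader.readLine()) } finally { reader.close() }
      }
    }
}
```

The hdfs handle is deliberately not closed inside the partition, since a cached FileSystem instance may be shared with other tasks in the same executor.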

How to quickly read files from HDFS inside map() using Spark
