Question

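I have a directory of directories on HDFS, and I want to iterate over those directories. Is there an easy way to do this with Spark using the SparkContext object?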

Answer 1

You can use org.apache.hadoop.fs.FileSystem. Specifically, FileSystem.listFiles([path], true)

And with Spark...

FileSystem.get(sc.hadoopConfiguration()).listFiles(..., true)

Edit

It's worth noting that good practice is to get the FileSystem associated with the Path's scheme, rather than the default file system.

path.getFileSystem(sc.hadoopConfiguration).listFiles(path, true)

Answer 2

Here is the PySpark version, if anyone is interested:

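    # Reach the Hadoop classes through the JVM gateway that PySpark exposes.
    hadoop = sc._jvm.org.apache.hadoop

    fs = hadoop.fs.FileSystem.get(sc._jsc.hadoopConfiguration())
    # Placeholder path: point this at the warehouse directory of the table.
    path = hadoop.fs.Path('/path/to/disc_mrt.db/unified_fact')

    for f in fs.listStatus(path):
        print(f.getPath(), f.getLen())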

In this particular case I get the list of all files that make up the disc_mrt.unified_fact Hive table.

Other methods of the FileStatus object, such as getLen() to get the file size, are described here:

Class FileStatus
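
As a rough illustration of what those FileStatus methods give you, the sketch below counts the files in a listing and adds up their sizes. `summarize` and `FakeStatus` are names made up for this example. `FakeStatus` only stands in for the real Java object so the snippet runs without a cluster, and it assumes the objects expose `isFile()` and `getLen()`, as Hadoop's FileStatus does.

```python
class FakeStatus:
    """Stand-in for org.apache.hadoop.fs.FileStatus (hypothetical, for local runs only)."""

    def __init__(self, path, length, is_file=True):
        self._path = path
        self._length = length
        self._is_file = is_file

    def getPath(self):
        return self._path

    def getLen(self):
        return self._length

    def isFile(self):
        return self._is_file


def summarize(statuses):
    """Return (file_count, total_bytes), skipping directories."""
    files = [s for s in statuses if s.isFile()]
    return len(files), sum(s.getLen() for s in files)


# Made-up listing: two data files and one directory.
listing = [
    FakeStatus("/warehouse/t/part-00000", 100),
    FakeStatus("/warehouse/t/part-00001", 250),
    FakeStatus("/warehouse/t/_temporary", 0, is_file=False),
]
file_count, total_bytes = summarize(listing)
print(file_count, total_bytes)
```

In PySpark, the array returned by `fs.listStatus(path)` should be usable in place of `listing`, since py4j lets you loop over Java arrays.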

Answer 3

import  org.apache.hadoop.fs.{FileSystem,Path}

FileSystem.get( sc.hadoopConfiguration ).listStatus( new Path("hdfs:///tmp")).foreach( x => println(x.getPath ))

This worked for me.

Spark version 1.5.0-cdh5.5.2

Answer 4

This did the job for me:

FileSystem.get(new URI("hdfs://HAservice:9000"), sc.hadoopConfiguration).listStatus( new Path("/tmp/")).foreach( x => println(x.getPath ))

Answer 5

@Tagar's answer doesn't say how to connect to a remote HDFS, but this answer does:

URI           = sc._gateway.jvm.java.net.URI
Path          = sc._gateway.jvm.org.apache.hadoop.fs.Path
FileSystem    = sc._gateway.jvm.org.apache.hadoop.fs.FileSystem
Configuration = sc._gateway.jvm.org.apache.hadoop.conf.Configuration


fs = FileSystem.get(URI("hdfs://somehost:8020"), Configuration())

status = fs.listStatus(Path('/some_dir/yet_another_one_dir/'))

for fileStatus in status:
    print(fileStatus.getPath())

Answer 6

You can also try globStatus:

val listStatus = org.apache.hadoop.fs.FileSystem.get(new URI(url), sc.hadoopConfiguration).globStatus(new org.apache.hadoop.fs.Path(url))

for (urlStatus <- listStatus) {
  println("urlStatus get Path:" + urlStatus.getPath())
}

Answer 7

Scala, using FileSystem (Apache Hadoop Main 3.2.1 API)

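import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}
import scala.collection.mutable.ListBuffer

val fileSystem: FileSystem = {
  val conf = new Configuration()
  conf.set("fs.defaultFS", "hdfs://to_file_path")
  FileSystem.get(conf)
}

// listFiles returns a RemoteIterator; false means do not recurse.
val files = fileSystem.listFiles(new Path(path), false)
val filenames = ListBuffer[String]()
// Append each file's full path to the buffer.
while (files.hasNext) filenames += files.next().getPath().toString()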

Answer 8

I had some problems with the other answers (for example, "'JavaObject' object is not iterable"), but this code works for me:

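# spark_contex is the active SparkContext.
hadoop_fs = spark_contex._jvm.org.apache.hadoop.fs
fs = hadoop_fs.FileSystem.get(spark_contex._jsc.hadoopConfiguration())
# listFiles returns a Java RemoteIterator, so step through it with hasNext()/next().
i = fs.listFiles(hadoop_fs.Path(path), False)
while i.hasNext():
    f = i.next()
    print(f.getPath())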
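
The "not iterable" error most likely arises because listFiles returns a Hadoop RemoteIterator, which py4j exposes with hasNext() and next() but not as a Python iterator. One possible way to keep ordinary for-loops is a small generator wrapper. This is only a sketch: `iterate_remote` and `FakeRemoteIterator` are hypothetical names, and the latter is a stand-in so the snippet runs without Spark.

```python
class FakeRemoteIterator:
    """Stand-in for Hadoop's RemoteIterator (hypothetical, for local runs only)."""

    def __init__(self, items):
        self._items = list(items)
        self._pos = 0

    def hasNext(self):
        return self._pos < len(self._items)

    def next(self):
        item = self._items[self._pos]
        self._pos += 1
        return item


def iterate_remote(remote_iter):
    """Yield each element of an object that offers hasNext()/next()."""
    while remote_iter.hasNext():
        yield remote_iter.next()


# The wrapper turns the Java-style iterator into a normal Python iterable.
paths = list(iterate_remote(FakeRemoteIterator(["/tmp/a", "/tmp/b"])))
print(paths)
```

With a real FileSystem object, the loop would read `for f in iterate_remote(fs.listFiles(some_path, False)): print(f.getPath())`.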

Answer 9

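You can use the following code to walk down from a parent HDFS directory level by level, keeping only the subdirectories at the third level. This is useful if you need to list all the directories created by data partitioning (in the code below, three columns were used for partitioning):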

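import org.apache.hadoop.fs.{FileSystem, LocatedFileStatus, Path}
import scala.annotation.tailrec
import scala.collection.mutable.ListBuffer

val fs = FileSystem.get(spark.sparkContext.hadoopConfiguration)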

def rememberDirectories(fs: FileSystem, path: List[Path]): List[Path] = {
  val buff = new ListBuffer[LocatedFileStatus]()

  path.foreach(p => {
    val iter = fs.listLocatedStatus(p)
    while (iter.hasNext()) buff += iter.next()
  })

  buff.toList.filter(p => p.isDirectory).map(_.getPath)
}

@tailrec
def getRelevantDirs(fs: FileSystem, p: List[Path], counter: Int = 1): List[Path] = {
  val levelList = rememberDirectories(fs, p)
  if(counter == 3) levelList
  else getRelevantDirs(fs, levelList, counter + 1)
}
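
The same level-by-level idea can be sketched in Python. This is a loose port rather than a translation of the Scala code: `dirs_at_depth` and the `list_subdirs` callback are hypothetical names, and a dictionary stands in for HDFS so the example runs locally. In PySpark, `list_subdirs` could wrap `fs.listStatus` and keep only the entries whose `isDirectory()` is true.

```python
def dirs_at_depth(list_subdirs, roots, depth):
    """Return the directories exactly `depth` levels below `roots`.

    list_subdirs(path) must return the immediate subdirectories of `path`.
    """
    level = list(roots)
    for _ in range(depth):
        # Replace the current level with all of its children.
        level = [child for parent in level for child in list_subdirs(parent)]
    return level


# Toy directory tree partitioned by three columns (stand-in for HDFS).
tree = {
    "/data": ["/data/y=2023", "/data/y=2024"],
    "/data/y=2023": ["/data/y=2023/m=01"],
    "/data/y=2024": ["/data/y=2024/m=02"],
    "/data/y=2023/m=01": ["/data/y=2023/m=01/d=05"],
    "/data/y=2024/m=02": ["/data/y=2024/m=02/d=09", "/data/y=2024/m=02/d=10"],
}

leaf_dirs = dirs_at_depth(lambda p: tree.get(p, []), ["/data"], 3)
print(leaf_dirs)
```

Passing a depth of 3 mirrors the three partition columns assumed in the answer.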

Spark: iterating over HDFS directories
