Question

I have the following directory structure in HDFS:

    /data/current/population/{p_1,p_2}
    /data/current/sport
    /data/current/weather/{w_1,w_2,w_3}
    /data/current/industry

The folders population, sport, weather and industry each correspond to a different dataset. The end folders, e.g. p_1 and p_2, correspond to different data sources (if any).

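I am working on PySpark code that operates on the end folders: p_1, p_2, sport, w_1, w_2, w_3 and industry. Given a path such as /data/current/, how can the code extract the absolute paths of the end folders?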

The command hdfs dfs -ls -R /data/current gives the following output:

    /data/current
    /data/current/population
    /data/current/population/p_1
    /data/current/population/p_2
    /data/current/sport
    /data/current/weather
    /data/current/weather/w_1
    /data/current/weather/w_2
    /data/current/weather/w_3
    /data/current/industry

But I want to end up with only the absolute paths of the terminal folders. My output should look like this:

    /data/current/population/p_1
    /data/current/population/p_2
    /data/current/sport
    /data/current/weather/w_1
    /data/current/weather/w_2
    /data/current/weather/w_3
    /data/current/industry

Thanks in advance.

Answer 1

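Why not write this with an HDFS client such as SnakeBite?

I am attaching a Scala function that does the same thing. It takes a root folder path and returns a list of every end path beneath it (files and empty directories). You can do the same in Python using SnakeBite.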

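    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.{FileSystem, Path}
    import scala.collection.mutable.ListBuffer

    // Assumes the default Hadoop configuration points at the target cluster.
    val fs: FileSystem = FileSystem.get(new Configuration())

    // Appends every file and every empty directory under `path` to `col`.
    def traverse(path: Path, col: ListBuffer[String]): ListBuffer[String] = {
      for (stat <- fs.listStatus(path)) {
        if (stat.isFile) {
          col += stat.getPath.toString
        } else if (fs.listStatus(stat.getPath).isEmpty) {
          // An empty directory is an end path in its own right.
          col += stat.getPath.toString
        } else {
          // Non-empty directory: descend and keep collecting into `col`.
          traverse(stat.getPath, col)
        }
      }

      col
    }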
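Since the question is about PySpark, here is a minimal Python sketch of the filtering step only. It assumes you already have the directory paths as a list of strings, for example from the `hdfs dfs -ls -R` output above or from an HDFS client. The function name `leaf_directories` is made up for illustration. A directory counts as terminal when no other listed directory has it as its parent.

```python
import posixpath


def leaf_directories(dir_paths):
    """Return the listed directories that are not the parent of another one."""
    # Normalise trailing slashes so "/data/current/" and "/data/current" match.
    dirs = [p.rstrip("/") or "/" for p in dir_paths]
    # Any path that appears as the parent of a listed directory is not a leaf.
    parents = {posixpath.dirname(p) for p in dirs}
    return [p for p in dirs if p not in parents]


# Directory listing copied from the question.
listing = [
    "/data/current",
    "/data/current/population",
    "/data/current/population/p_1",
    "/data/current/population/p_2",
    "/data/current/sport",
    "/data/current/weather",
    "/data/current/weather/w_1",
    "/data/current/weather/w_2",
    "/data/current/weather/w_3",
    "/data/current/industry",
]

for path in leaf_directories(listing):
    print(path)
```

The input must contain directories only. If the listing also includes files, drop them first, otherwise a file would be reported as a terminal folder.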
