Question

I know I can do this:

data = sc.textFile('/hadoop_foo/a')
data.count()
240
data = sc.textFile('/hadoop_foo/*')
data.count()
168129

However, I would like to count the size of the data in every subdirectory of "/hadoop_foo/". Can I do that?

In other words, what I want is something like this:

subdirectories = magicFunction()
for subdir in subdirectories:
  data = sc.textFile(subdir)
  data.count()

I tried:

In [9]: [x[0] for x in os.walk("/hadoop_foo/")]
Out[9]: []

but I think that fails because it searches the local filesystem of the driver (the gateway in this case), while "/hadoop_foo/" lives in HDFS. The same happens with "hdfs:///hadoop_foo/".

Is there a way to execute

hadoop dfs -lsr /hadoop_foo/

from code?

In [28]: os.getcwd()
Out[28]: '/homes/gsamaras'  <-- which is my local directory

Answer 1

With Python, use the hdfs module; its walk() method can get you the list of files.

The code should look something like this:

from hdfs import InsecureClient

client = InsecureClient('http://host:port', user='user')
for stuff in client.walk(dir, 0, True):
    ...
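
As a rough sketch of how walk() could stand in for the magicFunction() from the question: with depth=1 the walk should stop at the top level, and each yielded item is a (path, dirs, files) tuple. The helper name list_subdirectories is hypothetical, and the client is assumed to be an hdfs client such as the InsecureClient above.

```python
import posixpath


def list_subdirectories(client, root):
    """Return the full HDFS paths of the immediate subdirectories of root.

    `client` is assumed to be an hdfs client (e.g. InsecureClient) whose
    walk() yields (path, dirs, files) tuples.
    """
    subdirs = []
    # depth=1 restricts the walk to the top level of `root`.
    for path, dirs, _files in client.walk(root, depth=1):
        # HDFS paths always use forward slashes, hence posixpath.
        subdirs.extend(posixpath.join(path, name) for name in dirs)
    return subdirs


# Intended use against a real cluster (not run here; `client` and `sc`
# are the hdfs client and the SparkContext from the snippets above):
#
#   for subdir in list_subdirectories(client, '/hadoop_foo'):
#       print(subdir, sc.textFile(subdir).count())
```

Passing the client in as an argument keeps the helper independent of how the connection is created, so it can be tried against a stub before pointing it at a real namenode.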

With Scala, you can get the filesystem (val fs = FileSystem.get(new Configuration())) and run https://hadoop.apache.org/docs/r2.4.1/api/org/apache/hadoop/fs/FileSystem.html#listFiles(org.apache.hadoop.fs.Path,%20boolean)

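You can also execute a shell command from your script with the subprocess module, but this is never the recommended approach, since you then depend on the text output of a shell utility.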

Eventually, what worked for the OP was subprocess.check_output():

subdirectories = subprocess.check_output(["hadoop","fs","-ls", "/hadoop_foo/"])
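
Note that check_output() returns the raw listing as one string (bytes on Python 3), not a list of paths. A minimal sketch of turning it into subdirectories follows. It assumes the usual eight-column `hadoop fs -ls` layout (permissions, replication, owner, group, size, date, time, path), with directory entries starting with "d"; that layout may differ between Hadoop versions. The helper names parse_ls_directories and hdfs_ls_subdirectories are hypothetical.

```python
import subprocess


def parse_ls_directories(ls_output):
    """Extract directory paths from the text output of `hadoop fs -ls`."""
    if isinstance(ls_output, bytes):
        # check_output() returns bytes on Python 3.
        ls_output = ls_output.decode("utf-8")
    directories = []
    for line in ls_output.splitlines():
        # Split into at most 8 fields so a path containing spaces stays whole.
        fields = line.split(None, 7)
        # Skip the "Found N items" header and anything that is not a directory.
        if len(fields) == 8 and fields[0].startswith("d"):
            directories.append(fields[7])
    return directories


def hdfs_ls_subdirectories(root):
    """Run `hadoop fs -ls` on root and return the directories it lists."""
    output = subprocess.check_output(["hadoop", "fs", "-ls", root])
    return parse_ls_directories(output)
```

With this in place, the loop from the question becomes `for subdir in hdfs_ls_subdirectories("/hadoop_foo/"):` followed by `sc.textFile(subdir).count()`. Keeping the parsing in its own function makes the dependence on the shell tool's text format explicit and easy to adjust, which is the weakness of this approach mentioned above.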