How do I list the contents of an HDFS directory using webhdfs?

Time: 2016-06-23 10:51:32

Tags: python json hadoop hdfs webhdfs

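Is it possible to inspect the contents of a directory in HDFS using webhdfs?

Normally this would be done with hdfs dfs -ls, but I want to do it through webhdfs.

How can I list a webhdfs directory using Python 2.6?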

1 answer:

Answer 0: (score: 5)

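You can use the LISTSTATUS operation. It is documented under "List a Directory" in the WebHDFS REST API documentation, which is where the following snippets come from: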

With curl, the request looks like this:

curl -i  "http://<HOST>:<PORT>/webhdfs/v1/<PATH>?op=LISTSTATUS"
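
The same request can be sketched in Python with only the standard library. This is a minimal illustration written for Python 3 (on Python 2.6 the equivalent functions live in urllib and urllib2); the helper names liststatus_url and list_directory and the host name are hypothetical, not part of WebHDFS or any library.

```python
import json
from urllib.parse import quote
from urllib.request import urlopen


def liststatus_url(host, port, path):
    """Build the WebHDFS LISTSTATUS URL for an absolute HDFS path."""
    # Percent-encode the path but keep the "/" separators intact.
    return "http://{0}:{1}/webhdfs/v1{2}?op=LISTSTATUS".format(
        host, port, quote(path, safe="/"))


def list_directory(host, port, path):
    """Fetch a listing and return the list of FileStatus dicts.

    This needs a reachable NameNode, so it is defined here but not called.
    """
    with urlopen(liststatus_url(host, port, path)) as response:
        payload = json.loads(response.read().decode("utf-8"))
    return payload["FileStatuses"]["FileStatus"]


# Hypothetical host and path, used only to show the URL that gets built.
print(liststatus_url("namenode.example.com", 50070, "/user/hdfs/my dir"))
```

Building the URL separately from fetching it makes it easy to check the request against the curl form above before pointing it at a real cluster.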

The response is a FileStatuses JSON object, which has this schema:

{
  "name"      : "FileStatuses",
  "properties":
  {
    "FileStatuses":
    {
      "type"      : "object",
      "properties":
      {
        "FileStatus":
        {
          "description": "An array of FileStatus",
          "type"       : "array",
          "items"      : fileStatusProperties
        }
      }
    }
  }
}

fileStatusProperties (used for the items field) has this JSON schema:

var fileStatusProperties =
{
  "type"      : "object",
  "properties":
  {
    "accessTime":
    {
      "description": "The access time.",
      "type"       : "integer",
      "required"   : true
    },
    "blockSize":
    {
      "description": "The block size of a file.",
      "type"       : "integer",
      "required"   : true
    },
    "group":
    {
      "description": "The group owner.",
      "type"       : "string",
      "required"   : true
    },
    "length":
    {
      "description": "The number of bytes in a file.",
      "type"       : "integer",
      "required"   : true
    },
    "modificationTime":
    {
      "description": "The modification time.",
      "type"       : "integer",
      "required"   : true
    },
    "owner":
    {
      "description": "The user who is the owner.",
      "type"       : "string",
      "required"   : true
    },
    "pathSuffix":
    {
      "description": "The path suffix.",
      "type"       : "string",
      "required"   : true
    },
    "permission":
    {
      "description": "The permission represented as an octal string.",
      "type"       : "string",
      "required"   : true
    },
    "replication":
    {
      "description": "The number of replication of a file.",
      "type"       : "integer",
      "required"   : true
    },
    "type":
    {
      "description": "The type of the path object.",
      "enum"       : ["FILE", "DIRECTORY"],
      "required"   : true
    }
  }
};
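
To make the schema concrete, here is a small sketch that parses a response body and checks each entry for the required fields listed above. The sample body and all of its values are made up for illustration, and missing_fields is a hypothetical helper.

```python
import json

# Hypothetical response body; every value here is invented for illustration.
SAMPLE_RESPONSE = """
{
  "FileStatuses": {
    "FileStatus": [
      {"accessTime": 1320171722771, "blockSize": 33554432, "group": "supergroup",
       "length": 24930, "modificationTime": 1320171722771, "owner": "webuser",
       "pathSuffix": "a.patch", "permission": "644", "replication": 1,
       "type": "FILE"},
      {"accessTime": 0, "blockSize": 0, "group": "supergroup",
       "length": 0, "modificationTime": 1320895981256, "owner": "webuser",
       "pathSuffix": "bar", "permission": "711", "replication": 0,
       "type": "DIRECTORY"}
    ]
  }
}
"""

# The ten fields that the schema above marks as required.
REQUIRED_FIELDS = set([
    "accessTime", "blockSize", "group", "length", "modificationTime",
    "owner", "pathSuffix", "permission", "replication", "type",
])


def missing_fields(status):
    """Return the sorted names of required fields absent from one entry."""
    return sorted(REQUIRED_FIELDS - set(status))


statuses = json.loads(SAMPLE_RESPONSE)["FileStatuses"]["FileStatus"]
for status in statuses:
    problems = missing_fields(status)
    if problems:
        print("incomplete entry, missing: " + ", ".join(problems))
    else:
        # One line per entry: type, permission, size in bytes, name.
        print("{0:<9} {1:>4} {2:>8} {3}".format(
            status["type"], status["permission"],
            status["length"], status["pathSuffix"]))
```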

You can work with the file names in Python using pywebhdfs, like this:

import json
from pprint import pprint
from pywebhdfs.webhdfs import PyWebHdfsClient

hdfs = PyWebHdfsClient(host='host',port='50070', user_name='hdfs')  # Use your own host/port/user_name config

data = hdfs.list_dir("dir/dir")  # Use your preferred directory, without the leading "/"

file_statuses = data["FileStatuses"]
pprint(file_statuses)   # Display the dict

for item in file_statuses["FileStatus"]:
    print(item["pathSuffix"])   # Display the item filename

Instead of printing each item, you can do whatever you need with it. file_statuses is just a Python dict, so you can use it like any other dict, as long as you use the correct keys.
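
As one example of using the dict directly, the sketch below separates files from subdirectories by checking each entry's type key. The function name split_listing is hypothetical, and the dict here is a made-up stand-in for data["FileStatuses"] so that the snippet runs without a cluster.

```python
def split_listing(file_statuses):
    """Return (file_names, directory_names) from a FileStatuses dict."""
    files, directories = [], []
    for item in file_statuses["FileStatus"]:
        # The schema allows only two values for "type": FILE or DIRECTORY.
        if item["type"] == "DIRECTORY":
            directories.append(item["pathSuffix"])
        else:
            files.append(item["pathSuffix"])
    return files, directories


# Made-up stand-in for data["FileStatuses"] from a real listing.
file_statuses = {"FileStatus": [
    {"pathSuffix": "a.patch", "type": "FILE", "length": 24930},
    {"pathSuffix": "bar", "type": "DIRECTORY", "length": 0},
]}

files, directories = split_listing(file_statuses)
print(files)
print(directories)
```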