Question

On Dataproc, I set up a PySpark cluster with 1 master node and 2 workers. In a bucket, I have directories that contain sub-directories of files.

In a Datalab notebook I run

import subprocess
all_parent_direcotry = subprocess.Popen("gsutil ls gs://parent-directories ",shell=True,stdout=subprocess.PIPE).stdout.read()

This gives me all the sub-directories with no problem.

Then I want to gsutil ls all the files in the sub-directories, so on the master node I have:

def get_sub_dir(path):
    import subprocess
    p = subprocess.Popen("gsutil ls gs://parent-directories/" + path, shell=True,stdout=subprocess.PIPE, stderr=subprocess.PIPE)
    return p.stdout.read(), p.stderr.read()

and run get_sub_dir(sub-directory), which lists all the files with no problem.

However,

 sub_dir = sc.parallelize([sub-directory])
 sub_dir.map(get_sub_dir).collect()

gives me:

 Traceback (most recent call last):
  File "/usr/bin/../lib/google-cloud-sdk/bin/bootstrapping/gsutil.py", line 99, in <module>
    main()
  File "/usr/bin/../lib/google-cloud-sdk/bin/bootstrapping/gsutil.py", line 30, in main
    project, account = bootstrapping.GetActiveProjectAndAccount()
  File "/usr/lib/google-cloud-sdk/bin/bootstrapping/bootstrapping.py", line 205, in GetActiveProjectAndAccount
    project_name = properties.VALUES.core.project.Get(validate=False)
  File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/properties.py", line 1373, in Get
    required)
  File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/properties.py", line 1661, in _GetProperty
    value = _GetPropertyWithoutDefault(prop, properties_file)
  File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/properties.py", line 1699, in _GetPropertyWithoutDefault
    value = callback()
  File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/credentials/store.py", line 222, in GetProject
    return c_gce.Metadata().Project()
  File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/credentials/gce.py", line 203, in Metadata
    _metadata_lock.lock(function=_CreateMetadata, argument=None)
  File "/usr/lib/python2.7/mutex.py", line 44, in lock
    function(argument)
  File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/credentials/gce.py", line 202, in _CreateMetadata
    _metadata = _GCEMetadata()
  File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/credentials/gce.py", line 59, in __init__
    self.connected = gce_cache.GetOnGCE()
  File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/credentials/gce_cache.py", line 141, in GetOnGCE
    return _SINGLETON_ON_GCE_CACHE.GetOnGCE(check_age)
  File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/credentials/gce_cache.py", line 81, in GetOnGCE
    self._WriteDisk(on_gce)
  File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/credentials/gce_cache.py", line 113, in _WriteDisk
    with files.OpenForWritingPrivate(gce_cache_path) as gcecache_file:
  File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/util/files.py", line 715, in OpenForWritingPrivate
    MakeDir(full_parent_dir_path, mode=0700)
  File "/usr/lib/google-cloud-sdk/lib/googlecloudsdk/core/util/files.py", line 115, in MakeDir
    (u'Please verify that you have permissions to write to the parent '
googlecloudsdk.core.util.files.Error: Could not create directory [/home/.config/gcloud]: Permission denied.

Please verify that you have permissions to write to the parent directory.

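After checking, whoami on the worker nodes shows yarn.

So the question is: how do I authorize yarn to use gsutil, or is there another way to access the bucket from the Dataproc PySpark worker nodes?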

Answer 1

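The CLI looks at the current home directory to decide where to put a cached credential file when it fetches a token from the metadata service. The relevant code in googlecloudsdk/core/config.py looks like this: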

def _GetGlobalConfigDir():
  """Returns the path to the user's global config area.

  Returns:
    str: The path to the user's global config area.
  """
  # Name of the directory that roots a cloud SDK workspace.
  global_config_dir = encoding.GetEncodedValue(os.environ, CLOUDSDK_CONFIG)
  if global_config_dir:
    return global_config_dir
  if platforms.OperatingSystem.Current() != platforms.OperatingSystem.WINDOWS:
    return os.path.join(os.path.expanduser('~'), '.config',
                        _CLOUDSDK_GLOBAL_CONFIG_DIR_NAME)
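
As a rough illustration of that lookup, here is a simplified stand-in, not the SDK's actual code. The function name resolve_config_dir is hypothetical, the directory name gcloud is inferred from the path in the error message, the Windows branch is left out, and /tmp/gcloud-config is only a placeholder path. It shows how the value of HOME decides where the SDK tries to create its config directory:

```python
import posixpath  # the cluster nodes are Linux, so use POSIX path rules


def resolve_config_dir(environ):
    """Simplified sketch of the config-dir lookup (assumption, not SDK code)."""
    # An explicit CLOUDSDK_CONFIG override wins, mirroring the first check above.
    override = environ.get("CLOUDSDK_CONFIG")
    if override:
        return override
    # Simplification: treat HOME as what '~' expands to.
    home = environ.get("HOME", "")
    return posixpath.join(home, ".config", "gcloud")


print(resolve_config_dir({"HOME": "/home/"}))
# prints /home/.config/gcloud  (the path from the traceback)
print(resolve_config_dir({"HOME": "/var/lib/hadoop-yarn"}))
# prints /var/lib/hadoop-yarn/.config/gcloud
print(resolve_config_dir({"HOME": "/home/", "CLOUDSDK_CONFIG": "/tmp/gcloud-config"}))
# prints /tmp/gcloud-config
```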

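Code running in YARN containers runs as user yarn, and if you simply run sudo su yarn on a Dataproc node you will see ~ resolve to /var/lib/hadoop-yarn. However, YARN actually propagates yarn.nodemanager.user-home-dir as the container's home directory, and that defaults to /home/. So even though sudo -u yarn gsutil ... works, it does not behave the same way as gsutil inside a YARN container, and naturally only root can create directories under the base /home/ directory.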
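
One way to see this from inside a job is to report the user and home directory from the process that actually runs the task. This is a minimal sketch: probe_environment is a hypothetical helper name, and the commented-out line assumes a live SparkContext named sc, as in the question.

```python
import getpass
import os


def probe_environment(_):
    """Report the user and home directory seen by the calling process."""
    try:
        user = getpass.getuser()
    except Exception:
        # Some environments have no user entry for the current uid.
        user = None
    return {
        "user": user,
        "HOME": os.environ.get("HOME"),
        "expanduser": os.path.expanduser("~"),
    }


# On the cluster (assumes `sc` exists), run the probe inside a YARN container:
# print(sc.parallelize([0]).map(probe_environment).collect())

# Run locally on the driver for comparison:
print(probe_environment(0))
```

Comparing the driver's output with the output collected from the executors should show whether HOME differs between the two.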

Long story short, you have two options:

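In your code, add HOME=/var/lib/hadoop-yarn right before your gsutil command.

Example:

   p = subprocess.Popen("HOME=/var/lib/hadoop-yarn gsutil ls gs://parent-directories/" + path, shell=True,stdout=subprocess.PIPE, stderr=subprocess.PIPE)

When creating the cluster, specify the YARN property.

Example:

   gcloud dataproc clusters create --properties yarn:yarn.nodemanager.user-home-dir=/var/lib/hadoop-yarn ...

For an existing cluster, you could also manually add the setting to /etc/hadoop/conf/yarn-site.xml on all of your workers and then reboot the worker machines (or just restart the YARN NodeManager service on each one), but doing that by hand on every worker node can be a hassle.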
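
The HOME override can also be passed through the env argument of subprocess.Popen instead of being prefixed in the shell string. The following is a sketch under assumptions, not part of the original answer: run_with_home is a hypothetical helper, and /var/lib/hadoop-yarn is taken from the answer as the yarn user's home directory.

```python
import os
import subprocess

# Home directory of the yarn user, as stated in the answer above.
YARN_HOME = "/var/lib/hadoop-yarn"


def run_with_home(cmd, home=YARN_HOME):
    """Run cmd (a list of arguments) with HOME overridden.

    Returns a (stdout, stderr) tuple of bytes.
    """
    # Copy the current environment and replace only HOME.
    env = dict(os.environ, HOME=home)
    p = subprocess.Popen(cmd, stdout=subprocess.PIPE,
                         stderr=subprocess.PIPE, env=env)
    # communicate() drains both pipes, so a large stderr cannot block stdout.
    return p.communicate()


def get_sub_dir(path):
    """List one sub-directory of the bucket used in the question."""
    return run_with_home(["gsutil", "ls", "gs://parent-directories/" + path])
```

Passing the command as a list avoids shell=True, so a sub-directory name containing spaces or shell metacharacters is handed to gsutil unchanged. With a live SparkContext, get_sub_dir can be mapped over the RDD exactly as in the question.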

Dataproc PySpark workers do not have permission to use gsutil
