Accessing GCS from outside the cloud with a Hadoop client

Date: 2019-04-16 11:55:43

Tags: google-cloud-platform hdfs google-cloud-storage

I want to access Google Cloud Storage through the Hadoop client, from a machine outside of Google Cloud.

I followed the instructions here. I created a service account and generated a key file. I also created a core-site.xml file and downloaded the necessary libraries.

However, when I try to run a simple hdfs dfs -ls gs://bucket-name command, all I get is this:

Error getting access token from metadata server at: http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token

When I do this from inside Google Cloud it works, but trying to connect to GCS from the outside produces the error above.

How can I connect to GCS with the Hadoop client this way? Is it even possible? I have no route to the 169.254.169.254 address.
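(A quick check, assuming curl is available, confirms this; a request to the same endpoint from the error message never gets through:)

curl -H "Metadata-Flavor: Google" \
  "http://169.254.169.254/computeMetadata/v1/instance/service-accounts/default/token"
# from outside GCP this link-local address is unreachable,
# so the request times out instead of returning a token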

Here is my core-site.xml (I changed the key path and the email address for this example):

<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<!--
  Licensed under the Apache License, Version 2.0 (the "License");
  you may not use this file except in compliance with the License.
  You may obtain a copy of the License at

    http://www.apache.org/licenses/LICENSE-2.0

  Unless required by applicable law or agreed to in writing, software
  distributed under the License is distributed on an "AS IS" BASIS,
  WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
  See the License for the specific language governing permissions and
  limitations under the License. See accompanying LICENSE file.
-->

<!-- Put site-specific property overrides in this file. -->

<configuration>
  <property>
    <name>spark.hadoop.google.cloud.auth.service.account.enable</name>
    <value>true</value>
  </property>
  <property>
    <name>spark.hadoop.google.cloud.auth.service.account.json.keyfile</name>
    <value>path/to/key.json</value>
  </property>
  <property>
    <name>fs.gs.project.id</name>
    <value>ringgit-research</value>
    <description>
      Optional. Google Cloud Project ID with access to GCS buckets.
      Required only for list buckets and create bucket operations.
    </description>
  </property>
  <property>
    <name>fs.AbstractFileSystem.gs.impl</name>
    <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFS</value>
    <description>The AbstractFileSystem for gs: uris.</description>
  </property>
  <property>
    <name>fs.gs.auth.service.account.email</name>
    <value>myserviceaccountaddress@google</value>
    <description>
      The email address is associated with the service account used for GCS
      access when fs.gs.auth.service.account.enable is true. Required
      when authentication key specified in the Configuration file (Method 1)
      or a PKCS12 certificate (Method 3) is being used.
    </description>
  </property>
</configuration>

2 answers:

Answer 0 (score: 0)

It may be that the Hadoop service has not yet picked up the changes you made in your core-site.xml file, so my suggestion is to restart the Hadoop service. Another thing you can do is check the access control options [1].

If you still run into the same problem after taking the suggested actions, please post the full error message.

[1] https://cloud.google.com/storage/docs/access-control/
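For example, with a stock Apache Hadoop installation the daemons can be restarted with the bundled scripts (a sketch; the exact commands depend on your distribution and on which daemons you run):

# stop and start the HDFS daemons so they re-read core-site.xml
$HADOOP_HOME/sbin/stop-dfs.sh
$HADOOP_HOME/sbin/start-dfs.sh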

Answer 1 (score: 0)

The problem was that I tried the wrong authentication method. The method I used assumes it is running inside Google Cloud and tries to connect to the Google metadata server. When running outside of Google Cloud, that cannot work, for obvious reasons.

The answer is here: Migrating 50TB data from local Hadoop cluster to Google Cloud Storage, with the correct core-site.xml in the accepted answer.

The property fs.gs.auth.service.account.keyfile should be used instead of spark.hadoop.google.cloud.auth.service.account.json.keyfile. The only difference is that this property expects a p12 key file instead of a json one.
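In other words, the relevant part of core-site.xml would look roughly like this (a sketch; the email value and the p12 path are placeholders carried over from the question's example):

<property>
  <name>fs.gs.auth.service.account.enable</name>
  <value>true</value>
</property>
<property>
  <name>fs.gs.auth.service.account.email</name>
  <value>myserviceaccountaddress@google</value>
</property>
<property>
  <name>fs.gs.auth.service.account.keyfile</name>
  <!-- PKCS12 (.p12) key file generated for the service account; path is a placeholder -->
  <value>path/to/key.p12</value>
</property>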