Using distcp to copy data from a Cloudera cluster to a Google Cloud HDFS cluster

Time: 2016-04-27 21:19:40

Tags: hadoop google-cloud-storage google-cloud-platform cloudera-cdh cloudera-quickstart-vm

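I am using the Cloudera QuickStart VM, and I started experimenting with Google Cloud Platform yesterday. I am trying to copy data from Cloudera HDFS to (1) Google Cloud Storage (gs://bucket_name/) and (2) an HDFS cluster on Google Cloud (using hdfs://google_cluster_namenode:8020/).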

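  1. I set up service account authentication and configured my Cloudera core-site.xml following the instructions in this post. The following command works fine:

    hadoop fs -cp hdfs://quickstart.cloudera:8020/path_to_copy/ gs://bucket_name/

     However, I cannot copy to Google Cloud Storage using distcp; I get the error below. I know it is not a URI problem. Is there anything else I am missing?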

    Error: java.io.IOException: File copy failed: hdfs://quickstart.cloudera:8020/path_to_copy/file --> gs://bucket_name/file
    at org.apache.hadoop.tools.mapred.CopyMapper.copyFileWithRetry(CopyMapper.java:284)
    at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:252)
    at org.apache.hadoop.tools.mapred.CopyMapper.map(CopyMapper.java:50)
    at org.apache.hadoop.mapreduce.Mapper.run(Mapper.java:145)
    at org.apache.hadoop.mapred.MapTask.runNewMapper(MapTask.java:787)
    at org.apache.hadoop.mapred.MapTask.run(MapTask.java:341)
    at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:415)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
    at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158) 
    Caused by: java.io.IOException: Couldn't run retriable-command: Copying hdfs://quickstart.cloudera:8020/path_to_copy/file to gs://bucket_name/file
    at org.apache.hadoop.tools.util.RetriableCommand.execute(RetriableCommand.java:101)
    at org.apache.hadoop.tools.mapred.CopyMapper.copyFileWithRetry(CopyMapper.java:280)
    ... 10 more 
    Caused by: java.lang.IllegalArgumentException: java.net.URISyntaxException: Relative path in absolute URI: gs://bucket_name.distcp.tmp.attempt_1461777569169_0002_m_000001_2
    at org.apache.hadoop.fs.Path.initialize(Path.java:206)
    at org.apache.hadoop.fs.Path.<init>(Path.java:116)
    at org.apache.hadoop.fs.Path.<init>(Path.java:94)
    at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.getTmpFile(RetriableFileCopyCommand.java:233)
    at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.doCopy(RetriableFileCopyCommand.java:107)
    at org.apache.hadoop.tools.mapred.RetriableFileCopyCommand.doExecute(RetriableFileCopyCommand.java:100)
    at org.apache.hadoop.tools.util.RetriableCommand.execute(RetriableCommand.java:87)
    ... 11 more
    
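  2. I cannot get distcp to connect to the Google Cloud HDFS namenode; I keep getting "Retrying connect to server". I could not find any documentation on configuring a connection between a Cloudera HDFS cluster and a Google Cloud HDFS cluster. I assumed that service account authentication should also work with HDFS on Google Cloud. Is there any reference documentation I can use to set up copying between the clusters? Is there any other authentication setup I am missing?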
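
For reference, the two distcp invocations described above would look roughly like the sketch below. The commands are reconstructed from the paths that appear in the error output, not copied from a terminal session, and the script only prints them. The third variant, which targets a subdirectory instead of the bucket root, is an untested assumption: the trace shows the temporary file path being built directly at the bucket root (gs://bucket_name.distcp.tmp...), so a non-root target may be worth trying. The subdirectory name is hypothetical.

```shell
#!/bin/sh
# Paths are taken from the question and its error output.
SRC="hdfs://quickstart.cloudera:8020/path_to_copy/"
GCS_DST="gs://bucket_name/"
HDFS_DST="hdfs://google_cluster_namenode:8020/"

# Case 1: Cloudera HDFS -> Google Cloud Storage
DISTCP_TO_GCS="hadoop distcp $SRC $GCS_DST"

# Case 2: Cloudera HDFS -> HDFS on the Google Cloud cluster
DISTCP_TO_HDFS="hadoop distcp $SRC $HDFS_DST"

# Variation (assumption, untested): copy into a subdirectory of the bucket
# rather than the bucket root. "path_to_copy/" is a hypothetical target name.
DISTCP_TO_GCS_SUBDIR="hadoop distcp $SRC ${GCS_DST}path_to_copy/"

# Print the commands for review instead of running them.
printf '%s\n' "$DISTCP_TO_GCS" "$DISTCP_TO_HDFS" "$DISTCP_TO_GCS_SUBDIR"
```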

1 answer:

Answer 0: (score: 0)

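It turned out that I had to modify the firewall rules to allow TCP/HTTP traffic from the IP address I was running distcp from. Check the network firewall on your GCP compute instances.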
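
A firewall rule of the kind described could be sketched as follows. This is only an illustration: the rule name, network, and source IP are placeholders, and the port list is an assumption. 8020 is the namenode port used in the question, while 50010 is the usual Hadoop 2.x datanode transfer port and may be different on your cluster. The script prints the gcloud command so it can be reviewed before being run by hand.

```shell
#!/bin/sh
# All values below are placeholders (assumptions), not taken from the original post.
RULE_NAME="allow-distcp-from-cloudera"   # hypothetical rule name
NETWORK="default"                        # network the Google Cloud cluster is attached to
SOURCE_IP="203.0.113.10"                 # public IP of the machine running distcp (documentation range)

# 8020 is the namenode port from the question; 50010 is the usual Hadoop 2.x
# datanode transfer port and may differ on your cluster.
PORTS="tcp:8020,tcp:50010"

FIREWALL_CMD="gcloud compute firewall-rules create $RULE_NAME --network $NETWORK --allow $PORTS --source-ranges $SOURCE_IP/32"

# Print the command for review; run it manually once the values are correct.
echo "$FIREWALL_CMD"
```

Restricting the source range to a single /32 address keeps the Hadoop ports closed to everyone else. After the rule is in place, a quick reachability check from the Cloudera VM (for example `nc -zv google_cluster_namenode 8020`, if netcat is installed) can confirm the namenode port is open before retrying distcp.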