Hadoop - copying a dataset from an external source directly into HDFS

Posted: 2018-11-30 23:36:56

Tags: hadoop amazon-s3 hdfs distcp

I am trying to copy a ~500 MB compressed file into HDFS with distcp, but I get a connection timeout error:

hadoop distcp hftp://s3.amazonaws.com/path/to/file.gz hdfs://namenode/some/hdfs/dir

The full error:

java.net.SocketTimeoutException: connect timed out
    at java.net.PlainSocketImpl.socketConnect(Native Method)
    at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
    at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
    at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
    at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    at java.net.Socket.connect(Socket.java:589)
    at sun.net.NetworkClient.doConnect(NetworkClient.java:175)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:432)
    at sun.net.www.http.HttpClient.openServer(HttpClient.java:527)
    at sun.net.www.http.HttpClient.<init>(HttpClient.java:211)
    at sun.net.www.http.HttpClient.New(HttpClient.java:308)
    at sun.net.www.http.HttpClient.New(HttpClient.java:326)
    at sun.net.www.protocol.http.HttpURLConnection.getNewHttpClient(HttpURLConnection.java:1202)
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect0(HttpURLConnection.java:1138)
    at sun.net.www.protocol.http.HttpURLConnection.plainConnect(HttpURLConnection.java:1032)
    at sun.net.www.protocol.http.HttpURLConnection.connect(HttpURLConnection.java:966)
    at org.apache.hadoop.hdfs.web.HftpFileSystem.openConnection(HftpFileSystem.java:328)
    at org.apache.hadoop.hdfs.web.HftpFileSystem$LsParser.fetchList(HftpFileSystem.java:461)
    at org.apache.hadoop.hdfs.web.HftpFileSystem$LsParser.getFileStatus(HftpFileSystem.java:476)
    at org.apache.hadoop.hdfs.web.HftpFileSystem.getFileStatus(HftpFileSystem.java:505)
    at org.apache.hadoop.fs.Globber.getFileStatus(Globber.java:64)
    at org.apache.hadoop.fs.Globber.doGlob(Globber.java:272)
    at org.apache.hadoop.fs.Globber.glob(Globber.java:151)
    at org.apache.hadoop.fs.FileSystem.globStatus(FileSystem.java:1715)
    at org.apache.hadoop.tools.GlobbedCopyListing.doBuildListing(GlobbedCopyListing.java:77)
    at org.apache.hadoop.tools.CopyListing.buildListing(CopyListing.java:86)
    at org.apache.hadoop.tools.DistCp.createInputFileListing(DistCp.java:429)
    at org.apache.hadoop.tools.DistCp.prepareFileListing(DistCp.java:91)
    at org.apache.hadoop.tools.DistCp.execute(DistCp.java:181)
    at org.apache.hadoop.tools.DistCp.run(DistCp.java:143)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
    at org.apache.hadoop.tools.DistCp.main(DistCp.java:493)

What is the correct way to copy a file this large into HDFS? I am using CDH 5.14.

Thanks!

1 answer:

Answer 0: (score: 0)

Use the s3a:// scheme instead of hftp:// (hftp is a read-only HTTP protocol for talking to HDFS itself, not to S3, which is why the connection times out). For example:

hadoop distcp s3a://hwdev-examples-ireland/datasets /tmp/datasets2
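Applied to the original question, an s3a-based distcp invocation might look like the sketch below. The bucket name, paths, and credential placeholders are assumptions, not values from the post; it also assumes the hadoop-aws (s3a) connector is on the classpath, which ships with CDH 5.x:

```shell
# Sketch: copy a single object from S3 into HDFS via the s3a connector.
# YOUR_ACCESS_KEY / YOUR_SECRET_KEY and the bucket/paths are placeholders.
hadoop distcp \
  -Dfs.s3a.access.key=YOUR_ACCESS_KEY \
  -Dfs.s3a.secret.key=YOUR_SECRET_KEY \
  -m 4 \
  s3a://your-bucket/path/to/file.gz \
  hdfs://namenode/some/hdfs/dir
```

Rather than passing credentials on the command line, they can also be set in core-site.xml (`fs.s3a.access.key` / `fs.s3a.secret.key`), or omitted entirely when the cluster runs on EC2 with an IAM role. The `-m` flag caps the number of map tasks; for a single 500 MB file one mapper does the work anyway, so it mainly matters for directory copies.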