I am trying to run s3distcp to combine many small (200-600KB) files from S3 into HDFS.
I am running Hadoop on CDH 4.2 on Ubuntu.
Specifically: Hadoop 2.0.0-cdh4.2.0 Subversion file:///var/lib/jenkins/workspace/generic-package-ubuntu64-12-04/CDH4.2.0-Packaging-Hadoop-2013-02-15_10-38-54/hadoop-2.0.0+922-1.cdh.2.2.0.p0.12~precise/src/hadoop-common-project/hadoop-common -r 8bce4bd28a464e0a92950c50ba01a9deb1d85686
I previously resolved all the dependencies on aws-java-sdk-1.4.1.jar and s3distcp.jar by copying them into the Hadoop classpath. libsnappy1 is also installed.
But when I run:
hdfs@test-cdh-03-master:/home/ubuntu$ hadoop jar /usr/lib/hadoop/lib/s3distcp.jar --src 's3n://workdir-XXXX-YYYYlogs/production-YYYYYlogs/Log-FFFFFFF-click/' --dest 'hdfs:///test/' --groupBy 'Log-FFFFF(.*)'
I get the following error stack:
13/04/08 14:36:30 INFO s3distcp.S3DistCp: Using output path 'hdfs:/tmp/ab7c0a09-07ba-4592-b354-bcd0dd3d6a07/output'
13/04/08 14:36:36 INFO s3distcp.S3DistCp: Created 0 files to copy 0 files
13/04/08 14:36:36 WARN mapred.JobClient: Use GenericOptionsParser for parsing the arguments. Applications should implement Tool for the same.
13/04/08 14:36:37 INFO mapred.JobClient: Cleaning up the staging area hdfs://test-cdh-03-master.extc.test-cdh-03.adswizz.com/tmp/hadoop-temp/mapred/staging/hdfs/.staging/job_201304041515_0016
13/04/08 14:36:37 ERROR security.UserGroupInformation: PriviledgedActionException as:hdfs (auth:SIMPLE) cause:org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs:/tmp/ab7c0a09-07ba-4592-b354-bcd0dd3d6a07/files
13/04/08 14:36:37 INFO s3distcp.S3DistCp: Try to recursively delete hdfs:/tmp/ab7c0a09-07ba-4592-b354-bcd0dd3d6a07/tempspace
Exception in thread "main" java.lang.RuntimeException: Error running job
at com.amazon.external.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:586)
at com.amazon.external.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:216)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:84)
at com.amazon.external.elasticmapreduce.s3distcp.Main.main(Main.java:12)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:39)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:597)
at org.apache.hadoop.util.RunJar.main(RunJar.java:208)
Caused by: org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: hdfs:/tmp/ab7c0a09-07ba-4592-b354-bcd0dd3d6a07/files
at org.apache.hadoop.mapred.FileInputFormat.listStatus(FileInputFormat.java:194)
at org.apache.hadoop.mapred.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:40)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:205)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1091)
at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1083)
at org.apache.hadoop.mapred.JobClient.access$600(JobClient.java:174)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:993)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:946)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:396)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:946)
at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:920)
at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:1369)
at com.amazon.external.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:568)
... 9 more
Is there anything else I should try? Is there a problem with my regex that I'm not seeing?
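For what it's worth, here is how I sanity-checked the --groupBy regex locally. This is just a sketch: the key names below are the placeholder names from my command above (the real keys differ), and I'm assuming s3distcp simply skips any file whose name the pattern does not match, which would explain the "Created 0 files to copy 0 files" line.

```python
import re

# The --groupBy pattern from the command above (placeholder name as given).
group_by = r"Log-FFFFF(.*)"

# Hypothetical file names modeled on the --src prefix; the real S3 keys may differ.
sample_names = [
    "Log-FFFFFFF-click/part-00000",
    "Log-FFFFFFF-click/part-00001",
]

# Files whose names don't match the pattern would be excluded from the copy,
# so every sample here should report a match with a non-empty group.
for name in sample_names:
    m = re.match(group_by, name)
    print(name, "->", m.group(1) if m else "NO MATCH")
```

On these placeholder names the pattern does match, so if the real keys follow the same shape the problem may be elsewhere (e.g. the source listing itself coming back empty).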