s3distcp will not copy multiple files from HDFS to S3

Date: 2019-05-29 20:13:17

Tags: apache-spark amazon-s3 hdfs amazon-emr s3distcp

An EMR job writes partitioned Parquet output to HDFS under a base path (e.g. sparkOutput/walk_output_table), so the files land in many partition directories such as hdfs:///sparkOutput/walk_output_table/issue_date_year=2019/sub_product_type=Student/industry_type=Bank/subject_key_partition=4/part-00144-af3420c7-85f3-4878-a039-1ee3a50bcb76.c000.snappy.parquet. I have tried the --srcPattern flag in various forms, and even tried copying just a single file, with no luck. Below is an example; for privacy reasons, baseBucket stands in for the real bucket name:

s3-dist-cp --src hdfs:///sparkOutput/walk_output_table --dest s3://baseBucket/output/hive/staging.3/loan_processed.4/append_month=2019-02-28 --srcPattern .*\.snappy.parquet --s3ServerSideEncryption
19/05/29 19:42:39 INFO s3distcp.S3DistCp: Running with args: -libjars /usr/share/aws/emr/s3-dist-cp/lib/commons-httpclient-3.1.jar,/usr/share/aws/emr/s3-dist-cp/lib/commons-logging-1.0.4.jar,/usr/share/aws/emr/s3-dist-cp/lib/guava-18.0.jar,/usr/share/aws/emr/s3-dist-cp/lib/s3-dist-cp-2.10.0.jar,/usr/share/aws/emr/s3-dist-cp/lib/s3-dist-cp.jar --src hdfs:///sparkOutput/walk_output_table --dest s3://baseBucket/output/hive/staging.3/loan_processed.4/append_month=2019-02-28 --srcPattern .*.snappy.parquet --s3ServerSideEncryption
19/05/29 19:42:40 INFO s3distcp.S3DistCp: S3DistCp args: --src hdfs:///sparkOutput/walk_output_table --dest s3://baseBucket/output/hive/staging.3/loan_processed.4/append_month=2019-02-28 --srcPattern .*.snappy.parquet --s3ServerSideEncryption
19/05/29 19:42:40 INFO s3distcp.S3DistCp: Using output path 'hdfs:/tmp/b91eaaf0-0c5e-456c-b06d-ff6cedde45f5/output'
19/05/29 19:42:40 INFO s3distcp.S3DistCp: GET http://169.254.169.254/latest/meta-data/placement/availability-zone result: us-east-1a
19/05/29 19:42:41 ERROR s3distcp.S3DistCp: Failed to get source file system
java.io.FileNotFoundException: File does not exist: hdfs:/sparkOutput/walk_output_table
    at org.apache.hadoop.hdfs.DistributedFileSystem$27.doCall(DistributedFileSystem.java:1444)
    at org.apache.hadoop.hdfs.DistributedFileSystem$27.doCall(DistributedFileSystem.java:1437)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1452)
    at com.amazon.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:795)
    at com.amazon.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:705)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
    at com.amazon.elasticmapreduce.s3distcp.Main.main(Main.java:22)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:234)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:148)
Exception in thread "main" java.lang.RuntimeException: Failed to get source file system
    at com.amazon.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:798)
    at com.amazon.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:705)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:76)
    at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:90)
    at com.amazon.elasticmapreduce.s3distcp.Main.main(Main.java:22)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at org.apache.hadoop.util.RunJar.run(RunJar.java:234)
    at org.apache.hadoop.util.RunJar.main(RunJar.java:148)
Caused by: java.io.FileNotFoundException: File does not exist: hdfs:/sparkOutput/walk_output_table
    at org.apache.hadoop.hdfs.DistributedFileSystem$27.doCall(DistributedFileSystem.java:1444)
    at org.apache.hadoop.hdfs.DistributedFileSystem$27.doCall(DistributedFileSystem.java:1437)
    at org.apache.hadoop.fs.FileSystemLinkResolver.resolve(FileSystemLinkResolver.java:81)
    at org.apache.hadoop.hdfs.DistributedFileSystem.getFileStatus(DistributedFileSystem.java:1452)
    at com.amazon.elasticmapreduce.s3distcp.S3DistCp.run(S3DistCp.java:795)
    ... 10 more
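One thing worth noting from the log above: the backslash in the unquoted --srcPattern argument is consumed by the shell, so s3-dist-cp receives `.*.snappy.parquet` rather than `.*\.snappy\.parquet`. Assuming --srcPattern is applied as a full-path regular expression (as the AWS S3DistCp documentation describes), a quick sketch in Python's `re` module (whose `fullmatch` is analogous to a Java full-pattern match) suggests both variants would still match the file path from the question, so the pattern itself may not be the cause of the FileNotFoundException:

```python
import re

# Full HDFS path of one of the partitioned Parquet files from the question.
path = ("hdfs:///sparkOutput/walk_output_table/issue_date_year=2019/"
        "sub_product_type=Student/industry_type=Bank/subject_key_partition=4/"
        "part-00144-af3420c7-85f3-4878-a039-1ee3a50bcb76.c000.snappy.parquet")

# Pattern as received after the shell strips the backslash (seen in the log):
unescaped = re.fullmatch(r".*.snappy.parquet", path)
# Pattern as intended, with the dots escaped:
escaped = re.fullmatch(r".*\.snappy\.parquet", path)

print(bool(unescaped), bool(escaped))  # -> True True
```

Since both match, the error message ("File does not exist: hdfs:/sparkOutput/walk_output_table") points at the --src directory itself not being found on the cluster where s3-dist-cp runs, which is a separate issue from the pattern.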

Any workarounds or suggestions, or a pointer to a silly mistake on my part, would be greatly appreciated!

0 Answers