I am copying some data from HDFS to S3 with the following command:
$ hadoop distcp -m 1 /user/hive/data/test/test_folder=2015_09_19_03_30 s3a://data/Test/buc/2015_09_19_03_30

The 2015_09_19_03_30 bucket does not exist in S3 beforehand. The command successfully copies the data of the /user/hive/data/test/test_folder=2015_09_19_03_30 directory into the S3 2015_09_19_03_30 bucket, but when I run the same command again it creates another bucket in S3. I want both sets of files to end up in the same folder.
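To see exactly where each run puts the data, the destination can be listed after every distcp call; a minimal check, using the same destination path as above (the actual output will depend on what is in the bucket):

# list the destination prefix after each distcp run to see where files landed
$ hadoop fs -ls s3a://data/Test/buc/2015_09_19_03_30/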
Answer 0 (score: 1)
Here is what happens when I try the same thing you are doing; it puts the new files into the same bucket:
# first, there is no data in the bucket
$ hadoop fs -ls s3n://testing/
$
# then distcp the data in the local dir input/ into the testing bucket
$ hadoop distcp input/ s3n://testing/
$ hadoop fs -ls s3n://testing/
Found 1 items
drwxrwxrwx - 0 1970-01-01 00:00 s3n://testing/input
$ hadoop fs -ls s3n://testing/input/
Found 3 items
-rw-rw-rw- 1 1670 2016-09-23 13:23 s3n://testing/input/output
-rw-rw-rw- 1 541 2016-09-23 13:23 s3n://testing/input/some.txt
-rw-rw-rw- 1 1035 2016-09-23 13:23 s3n://testing/input/some2.txt
$
# added a new file a.txt under the input path
# and executed the same command again
$ hadoop distcp input/ s3n://testing/
$ hadoop fs -ls s3n://testing/input/
Found 4 items
-rw-rw-rw- 1 6 2016-09-23 13:26 s3n://testing/input/a.txt
-rw-rw-rw- 1 1670 2016-09-23 13:23 s3n://testing/input/output
-rw-rw-rw- 1 541 2016-09-23 13:23 s3n://testing/input/some.txt
-rw-rw-rw- 1 1035 2016-09-23 13:23 s3n://testing/input/some2.txt
$
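If the goal is to have repeated runs land in the same destination directory, the -update flag is also worth noting: per the Hadoop DistCp documentation, with -update (or -overwrite) the contents of the source path are copied into the target rather than the source directory being nested under it, and files already present with a matching size and checksum are skipped. A minimal sketch using the paths from the question (the -m 1 setting is carried over from there; adjust to your setup):

# with -update, repeated runs copy into the existing target instead of nesting
$ hadoop distcp -update -m 1 /user/hive/data/test/test_folder=2015_09_19_03_30 s3a://data/Test/buc/2015_09_19_03_30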