Copy to an S3 location using the distcp command

Time: 2016-09-23 07:26:31

Tags: hadoop amazon-s3 s3distcp

I am copying some data from HDFS to S3 using the following command:

$ hadoop distcp -m 1 /user/hive/data/test/test_folder=2015_09_19_03_30 s3a://data/Test/buc/2015_09_19_03_30

The 2015_09_19_03_30 bucket does not exist in S3. The command successfully copies the data of the /user/hive/data/test/test_folder=2015_09_19_03_30 directory into the 2015_09_19_03_30 bucket in S3, but when I execute the same command again, it creates another bucket in S3.

I want both files to be in the same folder.

1 Answer:

Answer 0 (score: 1)


Here is the scenario you are trying, as it places the new files in the same bucket:

// first there is no data
$ hadoop fs -ls s3n://testing/
$

// then distcp the data in the input dir to the testing bucket
$ hadoop distcp input/ s3n://testing/
$ hadoop fs -ls s3n://testing/
Found 1 items
drwxrwxrwx   -          0 1970-01-01 00:00 s3n://testing/input
$ hadoop fs -ls s3n://testing/input/
Found 3 items
-rw-rw-rw-   1       1670 2016-09-23 13:23 s3n://testing/input/output
-rw-rw-rw-   1        541 2016-09-23 13:23 s3n://testing/input/some.txt
-rw-rw-rw-   1       1035 2016-09-23 13:23 s3n://testing/input/some2.txt
$
// added a new file a.txt to the input path
// and executed the same command
$ hadoop distcp input/ s3n://testing/
$ hadoop fs -ls s3n://testing/input/
Found 4 items
-rw-rw-rw-   1          6 2016-09-23 13:26 s3n://testing/input/a.txt
-rw-rw-rw-   1       1670 2016-09-23 13:23 s3n://testing/input/output
-rw-rw-rw-   1        541 2016-09-23 13:23 s3n://testing/input/some.txt
-rw-rw-rw-   1       1035 2016-09-23 13:23 s3n://testing/input/some2.txt
$
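A side note that is not part of the original answer: per the DistCp documentation, the -update option changes the copy semantics so that the contents of the source directory, rather than the directory itself, are copied to the target, and files that already exist at the destination with the same size (and checksum, where comparable) are skipped. A minimal sketch, reusing the input/ directory and testing bucket from the listing above:

// with -update, the contents of input/ land directly under the target path,
// and files already present with matching size are skipped on re-runs
$ hadoop distcp -update input/ s3n://testing/input/
$ hadoop fs -ls s3n://testing/input/

The trade-off is the path semantics: without -update, distcp copies the source directory itself under the target, whereas with -update it copies only the directory's contents, so the destination path should name the intended folder explicitly.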