Downloading a tarball from HDFS and extracting it on the fly

Asked: 2018-04-24 21:17:46

Tags: hadoop hdfs tar

I have a large dataset stored in HDFS as an (uncompressed) tarball. The tarball is about 250 GiB in size.

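I want to download this tarball and extract it on the fly, to save space on my machine's fast SSD. I would like to avoid first fetching it with hadoop fs -get ... and then untarring it locally.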

Currently, I fetch it with hadoop fs -cat and pipe it into tar, using pv as a progress bar:

hadoop fs -cat my_big_tar.tar | pv -s "$TAR_SIZE" | tar xf -

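However, when I do this, I get a number of (non-fatal) errors while untarring. The final output looks fine, but some of the data is missing (tens of GiB). The errors look like this: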

Grabbing data from hadoop and untaring on the fly...                                                                     
tar: Skipping to next header============================================================>               ] 81% ETA 0:05:40
tar: Archive contains ‘\201\021\260e\210\333\357J\201\200\av’ where numeric off_t value expected        ] 81% ETA 0:05:37
tar: Archive contains ‘W\341\034\267\t\0ꑻ\317{\374’ where numeric off_t value expected                                   
tar: Archive contains ‘s{AZ\224\235 F.\317\342d’ where numeric off_t value expected======>              ] 82% ETA 0:05:10
tar: Archive contains ‘\264\357\036\272ud.W\235cL\204’ where numeric off_t value expected====>          ] 86% ETA 0:03:50
tar: Archive contains ‘\251\203\204\236\207\374\246"\255\240i\017’ where numeric off_t value expected                    
tar: Archive contains ‘T\242\b[(\372\357*e\032\255S’ where numeric off_t value expected======>          ] 87% ETA 0:03:46
tar: Archive contains ‘\300굕\277t\025o\207\373CK’ where numeric off_t value expected========>           ] 87% ETA 0:03:37
tar: Archive base-256 value is out of off_t range=============================================>         ] 88% ETA 0:03:26
tar: Archive contains ‘\204\274\234\366z\335<D\201-\306\361’ where numeric off_t value expected         ] 88% ETA 0:03:24
tar: Archive contains ‘\341ֶ\207\334-5\034\267C\v\017’ where numeric off_t value expected======>        ] 88% ETA 0:03:18
tar: Archive contains ‘c\3307\247\343ჯ\033瓸’ where numeric off_t value expected===============>         ] 89% ETA 0:03:11
tar: Archive contains ‘Vj+&!\242f$\212\374_\276’ where numeric off_t value expected=============>       ] 91% ETA 0:02:35
tar: Archive contains ‘\v5\374\273\375\302e\251ݝ\247O’ where numeric off_t value expected=======>       ] 91% ETA 0:02:33
tar: Archive contains ‘\027ȷJ\316j\203\025\027\033\264R’ where numeric off_t value expected=====>       ] 91% ETA 0:02:21
tar: Archive contains ‘Ks[L\325x\005\341\301’ where numeric off_t value expected================>       ] 92% ETA 0:02:19
tar: Archive contains obsolescent base-64 headers================================================>      ] 92% ETA 0:02:12
<snip>
tar: Archive contains ‘\177q\375\230Y<QE\0\367\242\207’ where numeric off_t value expected=============> ] 99% ETA 0:00:1
tar: Archive contains ‘\264e\260k\340,d\206\242^\022\032’ where numeric off_t value expected===========> ] 99% ETA 0:00:0
tar: Exiting with failure status due to previous errors================================================> ] 99% ETA 0:00:0
 260GB 0:28:54 [ 154MB/s] [============================================================================>] 100%   

Copying the data out of Hadoop first with hadoop fs -get my_tar.tar and then untarring it works fine.
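
Since the copy fetched with hadoop fs -get extracts cleanly, it could serve as a reference for the streamed bytes. The following is a minimal sketch, assuming that local copy is still on disk; the function name compare_stream_to_file and the path in the usage comment are hypothetical. It relies on cmp accepting "-" to mean standard input, and reports the first byte offset at which the stream and the reference differ.

```shell
# compare_stream_to_file REFERENCE
# Reads a byte stream from stdin and compares it byte-for-byte with REFERENCE.
# Exits 0 if identical, non-zero otherwise (cmp prints the first differing offset,
# or reports EOF if one side is shorter than the other).
compare_stream_to_file() {
    cmp - "$1"
}

# Usage (hypothetical path to the known-good copy):
#   hadoop fs -cat my_big_tar.tar | compare_stream_to_file /path/to/my_tar.tar
```

If cmp reports a difference, the offset shows how far into the stream the two copies diverge, which can be compared against the point where the tar errors begin (around 81% in the output above).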

Here is my hadoop version output:

Hadoop 2.7.2
Subversion Unknown -r Unknown
Compiled by <redacted> on 2016-04-21T22:04Z
Compiled with protoc 2.5.0
From source with checksum d0fda26633fa762bff87ec759ebe689c
This command was run using /opt/hadoop/hadoop-bin-2.7.2-1/share/hadoop/common/hadoop-common-2.7.2.jar  

The full script is here: https://github.com/AndreiBarsan/dotfiles/blob/master/bin/get-hdfs-tar.sh

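What could be causing these errors when using hadoop fs -cat? (Perhaps some stray Hadoop log output is getting mixed into the pipe that tar reads from? How could I check that?)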
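
One way to test the stray-output hypothesis is to count the bytes that actually arrive on the pipe. A pipe carries only standard output, so anything the command writes to standard error never reaches tar; extra data would have to be written to stdout and would make the stream longer than the file. The sketch below assumes the expected size in bytes is already known (for example the $TAR_SIZE value passed to pv above); the function name check_stream_size and the log file name in the usage comment are hypothetical.

```shell
# check_stream_size EXPECTED_BYTES
# Reads stdin to the end, counts the bytes, and compares the count with
# EXPECTED_BYTES. Returns 0 on a match, 1 otherwise.
check_stream_size() {
    expected="$1"
    # wc -c may pad its output with spaces on some systems, so strip them.
    actual=$(wc -c | tr -d ' ')
    if [ "$actual" -eq "$expected" ]; then
        echo "stream size matches: $actual bytes"
        return 0
    fi
    echo "stream size mismatch: got $actual bytes, expected $expected" >&2
    return 1
}

# Usage: keep stderr in a separate log so it can be inspected afterwards.
#   hadoop fs -cat my_big_tar.tar 2>hadoop_stderr.log | check_stream_size "$TAR_SIZE"
```

A matching size does not prove that the contents are intact, only that nothing was added to or dropped from the stream; a byte-for-byte comparison or a checksum would be needed for that.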

0 answers:

No answers yet