我希望将遍布多个子目录的大量数据压缩到一个存档中。我不能简单地使用内置的tar函数,因为我需要我的Perl脚本在Windows和Linux环境中工作。我找到了Archive::Tar
模块,但their documentation发出警告:
请注意,此方法[
create_archive()
]不会写on the fly
原样;在写入之前,它仍会将所有文件读入内存 存档。如果这是一个问题,请参阅下面的常见问题解答。
由于数据的庞大规模,我想“即时”编写。但我在常见问题解答中找不到有关编写文件的有用信息。他们建议使用迭代器iter()
:
返回一个迭代器函数,该函数在不加载的情况下读取tar文件 这一切都在记忆中。每次调用该函数时,它都会返回 tarball中的下一个文件。
my $next = Archive::Tar->iter( "example.tar.gz", 1, {filter => qr/\.pm$/} ); while( my $f = $next->() ) { print $f->name, "\n"; $f->extract or warn "Extraction failed"; # .... }
但这只讨论了文件的读取,而不是压缩存档的写入。所以我的问题是,我如何获取目录$dir
并以内存友好的方式递归地将其添加到存档archive.tar.bz2
,即没有先将整个树加载到内存中?
我尝试使用Archive::Tar::Streamed
和IO::Compress::Bzip2
使用评论中的建议构建我自己的脚本,但无济于事。
use strict;
use warnings;
use Archive::Tar::Streamed;
use File::Spec qw(catfile);
use IO::Compress::Bzip2 qw(bzip2 $Bzip2Error);
my ($in_d, $out_tar, $out_bz2) = @ARGV;
open(my $out_fh,'>', $out_tar) or die "Couldn't create archive";
binmode $out_fh;
my $tar = Archive::Tar::Streamed->new($out_fh);
opendir(my $in_dh, $in_d) or die "Could not opendir '$in_d': $!";
while (my $in_f = readdir $in_dh) {
next unless ($in_f =~ /\.xml$/);
print STDOUT "Processing $in_f\r";
$in_f = File::Spec->catfile($in_d, $in_f);
$tar->add($in_f);
}
print STDOUT "\nBzip'ing $out_tar\r";
bzip2 $out_tar => $out_bz2
or die "Bzip2 failed: $Bzip2Error\n";
很快,我的系统内存不足。我目前的系统有32GB可用,但它几乎立即被淹没。我尝试添加到存档的目录中的某些文件超过32GB。
所以我想知道即使在Streamed
类中,每个文件在添加到存档之前都必须在内存中完全读取?我假设文件本身将在缓冲区中流式传输到存档,但也许只是不是先将所有文件保存在内存中,Streamed
只允许在内存中完全需要一个文件,然后将其添加到存档中,一个接一个?
答案 0 :(得分:1)
不幸的是,你想在Perl中使用is not possible:
我同意,如果这个模块能够以块的形式编写文件,然后重写头文件(以保持Archive :: Tar的写作关系),那将是很好的。您可以向后移动存档,因为您知道将文件拆分为
N
条目,删除额外的标题,并使用其大小的总和更新第一个标题。目前唯一的选项是:使用
Archive::Tar::File
,将数据拆分为perl
以外的可管理大小,或直接使用tar
命令(从{{1使用它) CPAN上有一个很好的包装器:perl
)。我认为我们根据
Archive::Tar::Wrapper
在Perl中实现了一个真正的非内存驻留tar
实现。说实话,Archive::Tar
本身需要被其他东西重写或成功。
答案 1 :(得分:1)
这是我的解决方案的原始版本,它仍然将整个文件存储在内存中。我今天可能没有时间添加只存储部分文件的更新,因为Archive::Tar
模块没有最友好的API
use strict;
use warnings 'all';
use autodie; # Remove need for checks on IO calls
use File::Find 'find';
use Archive::Tar::Streamed ();
use Compress::Raw::Bzip2;
use Time::HiRes qw/ gettimeofday tv_interval /;
# Set a default root directory for testing
#
BEGIN {
our @ARGV;
@ARGV = 'E:\test' unless @ARGV;
}
use constant ROOT_DIR => shift;
use constant KB => 1024;
use constant MB => KB * KB;
use constant GB => MB * KB;
STDOUT->autoflush; # Make sure console output isn't buffered
my $t0 = [ gettimeofday ];
# Create a pipe, and fork a child that will build a tar archive
# from the files and pass the result to the pipe as it is built
#
# The parent reads from the pipe and passes each chunk to the
# module for compression. The result of zipping each block is
# written directly to the bzip2 file
#
pipe( my $pipe_from_tar, my $pipe_to_parent ); # Make our pipe
my $pid = fork; # fork the process
if ( $pid == 0 ) { # child builds tar and writes it to the pipe
$pipe_from_tar->close; # Close the parent side of the pipe
$pipe_to_parent->binmode;
$pipe_to_parent->autoflush;
# Create the ATS object, specifiying that the tarred output
# will be passed straight to the pipe
#
my $tar = Archive::Tar::Streamed->new( $pipe_to_parent );
find(sub {
my $file = File::Spec->canonpath( $File::Find::name );
$tar->add( $file );
print "Processing $file\n" if -d;
}, ROOT_DIR );
$tar->writeeof; # This is undocumented but essential
$pipe_to_parent->close;
}
else { # parent reads the tarred data, bzips it, and writes it to the file
$pipe_to_parent->close; # Close the child side of the pipe
$pipe_from_tar->binmode;
open my $bz2_fh, '>:raw', 'T:\test.tar.bz2';
$bz2_fh->autoflush;
# The first parameter *must* have a value of zero. The default
# is to accumulate each zipped chunnk into the output variable,
# whereas we want to write each chunk to a file
#
my ( $bz, $status ) = Compress::Raw::Bzip2->new( 0 );
defined $bz or die "Cannot create bunzip2 object: $status\n";
my $zipped;
while ( my $len = read $pipe_from_tar, my $buff, 8 * MB ) {
$status = $bz->bzdeflate( $buff, $zipped );
$bz2_fh->print( $zipped ) if length $zipped;
}
$pipe_from_tar->close;
$status = $bz->bzclose( $zipped );
$bz2_fh->print( $zipped ) if length $zipped;
$bz2_fh->close;
my $elapsed = tv_interval( $t0 );
printf "\nProcessing took %s\n", hms($elapsed);
}
use constant MINUTE => 60;
use constant HOUR => MINUTE * 60;
sub hms {
my ($s) = @_;
my @ret;
if ( $s > HOUR ) {
my $h = int($s / HOUR);
$s -= $h * HOUR;
push @ret, "${h}h";
}
if ( $s > MINUTE or @ret ) {
my $m = int($s / MINUTE);
$s -= $m * MINUTE;
push @ret, "${m}m";
}
push @ret, sprintf "%.1fs", $s;
"@ret";
}