Perl: How do I get "bytes read" from Digest::MD5 addfile()?

Date: 2019-04-16 17:28:01

Tags: perl md5

I am using Digest::MD5 to compute the MD5 of a data stream; namely gzipped files (3000 of them, to be precise) that are far too large to fit in RAM. So I am doing this:

 use Digest::MD5 qw(md5_base64);

 my ($filename) = @_;                # this is in a sub
 my $ctx = Digest::MD5 -> new;

 $openme = $filename;        # Usually, it's a plain file
 $openme = "gunzip -c '$filename' |" if ($filename =~ /\.gz$/); # is gz

 open (FILE, $openme); # gunzip to STDOUT
 binmode(FILE);
 $ctx -> addfile(*FILE);   # passing filehandle
 close(FILE);

This works. addfile neatly slurps in gunzip's output and gives the correct MD5.

However, I would really like to know the size of the gunzipped data (the decompressed "file", in this case).

I could add an extra

  $size = 0 + `gunzip -c very/big-file.gz | wc -c`;

but that would mean reading the file twice.

Is there any way to extract the number of bytes that Digest::MD5 consumed? I tried capturing the result, $result = $ctx->addfile(*FILE);, and ran Data::Dumper on both $result and $ctx, but nothing interesting turned up.

Edit: the files are often not compressed. Added code to show what I am actually doing.

2 answers:

Answer 0: (score: 3)

I would do it all in Perl, without relying on an external program for the decompression:

#!/usr/bin/perl
use warnings;
use strict;
use feature qw/say/;
use IO::Uncompress::Gunzip qw/$GunzipError/;
use Digest::MD5;

my $filename = shift or die "Missing gzip filename!\n";

my $md5 = Digest::MD5->new;
# Allow for reading both gzip format files and uncompressed files.
# This is the default behavior, but might as well be explicit about it.
my $z = IO::Uncompress::Gunzip->new($filename, Transparent => 1)
  or die "Unable to open $filename: $GunzipError\n";
my $len = 0;

while ((my $blen = $z->read(my $block)) > 0) {
  $len += $blen;
  $md5->add($block);
}
die "There was an error reading the file: $GunzipError\n" unless $z->eof;

say "Total uncompressed length: $len";
say "MD5: ", $md5->hexdigest;

If you want to use gunzip instead of the core IO::Uncompress::Gunzip module, you can do something similar, using read to fetch one chunk of data at a time:

#!/usr/bin/perl
use warnings;
use strict;
use autodie; # So we don't have to explicitly check for i/o related errors
use feature qw/say/;
use Digest::MD5;

my $filename = shift or die "Missing gzip filename!\n";

my $md5 = Digest::MD5->new;
# Note use of lexical file handle and safer version of opening a pipe
# from a process that eliminates shell shenanigans. Also uses the :raw
# perlio layer instead of calling binmode on the handle (which has the
# same effect)
open my $z, "-|:raw", "gunzip", "-c", $filename;
# Non-compressed version
# open my $z, "<:raw", $filename;
my $len = 0;

while ((my $blen = read($z, my $block, 4096)) > 0) {
  $len += $blen;
  $md5->add($block);
}

say "Total uncompressed length: $len";
say "MD5: ", $md5->hexdigest;
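
Since the question mentions about 3000 files, only some of them gzipped, the IO::Uncompress::Gunzip approach above could be wrapped in a sub and called once per file. The following is a minimal sketch under that assumption; the name md5_and_size is hypothetical and does not come from the answer or from any module.

```perl
#!/usr/bin/perl
use warnings;
use strict;
use Digest::MD5;
use IO::Uncompress::Gunzip qw/$GunzipError/;

# Hypothetical helper (name invented for illustration): returns the hex MD5
# and the uncompressed byte count of a gzipped or plain file in one pass.
sub md5_and_size {
    my ($filename) = @_;
    my $md5 = Digest::MD5->new;
    # Transparent => 1 lets non-gzip input pass through unchanged.
    my $z = IO::Uncompress::Gunzip->new($filename, Transparent => 1)
        or die "Unable to open $filename: $GunzipError\n";
    my ($block, $blen);
    my $len = 0;
    # read() returns the number of bytes placed in $block, 0 at end of
    # input, and a negative value on error.
    while (($blen = $z->read($block)) > 0) {
        $len += $blen;
        $md5->add($block);
    }
    die "Error reading $filename: $GunzipError\n" unless $z->eof;
    $z->close;
    return ($md5->hexdigest, $len);
}

# Print one tab-separated line per file named on the command line.
for my $file (@ARGV) {
    my ($digest, $size) = md5_and_size($file);
    print "$file\t$size\t$digest\n";
}
```

One caveat worth checking against your data: gunzip -c decompresses every member of a concatenated gzip file, whereas IO::Uncompress::Gunzip stops after the first member unless MultiStream => 1 is passed to the constructor.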

Answer 1: (score: 2)

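You can read the content yourself and feed it into $ctx->add($data), keeping a running count of how much data you have passed through. Whether you add all the data in a single call or spread it over several calls makes no difference to the underlying algorithm. The documentation includes: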

    All these lines will have the same effect on the state of the $md5 object:

        $md5->add("a"); $md5->add("b"); $md5->add("c");
        $md5->add("a")->add("b")->add("c");
        $md5->add("a", "b", "c");
        $md5->add("abc");

which indicates that you can simply feed the data in one piece at a time.
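
As a rough illustration of that point, the sketch below reads from a handle in 4096-byte chunks, counts the bytes, and compares the resulting digest with a one-shot md5_hex over the same data. The sample data and the in-memory filehandle are assumptions made only so the example is self-contained; in practice the handle would be the file or the gunzip pipe.

```perl
#!/usr/bin/perl
use warnings;
use strict;
use Digest::MD5 qw(md5_hex);

# Illustrative data only; an in-memory handle stands in for the real stream.
my $data = join '', map { "line $_\n" } 1 .. 10_000;
open(my $fh, '<', \$data) or die "Cannot open in-memory handle: $!";

my $ctx = Digest::MD5->new;
my $len = 0;
while (1) {
    # read() returns the byte count, 0 at end of file, undef on error.
    my $n = read($fh, my $chunk, 4096);
    die "Read error: $!" unless defined $n;
    last if $n == 0;
    $len += $n;          # running total of bytes fed to the digest
    $ctx->add($chunk);
}
close $fh;

my $chunked = $ctx->hexdigest;   # digest built up chunk by chunk
my $oneshot = md5_hex($data);    # digest of the same data in one call

print "bytes: $len\n";
print "digests match\n" if $chunked eq $oneshot;
```

The chunk size affects only how many add calls are made, not the digest, so it can be tuned for I/O efficiency.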