Question

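I want to profile 100,000 files. Specifically, I want to compute the percentage of printable characters from a sample of each file, where the sample size is arbitrary. Some of these files come from mainframes, Windows, Unix, etc., so they are likely to contain binary and control characters.

I started with the Linux "file" command, but it does not provide enough detail for my purposes. The following code conveys what I am trying to do, but it does not always work.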

    #!/usr/bin/perl -n

    use strict;
    use warnings;

    my $cnt_n_print = 0;
    my $cnt_print = 0;
    my $cnt_total = 0;
    my $prc_print = 0;

    #Count the number of non-printable characters
    while ($_ =~ m/[^[:print:]]/g) {$cnt_n_print++};

    #Count the number of printable characters
    while ($_ =~ m/[[:print:]]/g) {$cnt_print++};

    $cnt_total = $cnt_n_print + $cnt_print;
    $prc_print = $cnt_print/$cnt_total;

    #Print the # total number of bytes read followed by the % printable
    print "$cnt_total|$prc_print\n"

Here is a test call that works:

    echo "test_string of characters" | /home/user/scripts/prl/s16_count_chars.pl

Here is how I intend to call it, and it works for a single file:

    find /fct/inbound/trans/ -name "TRNST.20121115231358.xf2" -type f -print0 | xargs -0 head -c 2000 | /home/user/scripts/prl/s16_count_chars.pl

This does not work correctly:

    find /fct/inbound/trans/ -type f -print0 | xargs -0 head -c 2000 | /home/user/scripts/prl/s16_count_chars.pl

Neither does this:

    find /fct/inbound/trans/ -type f -print0 | xargs -0 head -c 2000 | perl -0 /home/user/scripts/prl/s16_count_chars.pl

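Instead of executing the script once for EACH line returned by find, it executes ONCE for ALL the results.

Thanks in advance.

Research so far:

Pipe and XARGS and separators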

http://help.lockergnome.com/linux/help-understand-pipe-xargs--ftopict549399.html

http://en.wikipedia.org/wiki/Xargs#The_separator_problem

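Clarification(s):

1.) Desired output: if there are 932 files in a directory, the output would be a 932-line list giving each file name, the total number of bytes read from the file, and the percentage of those bytes that are printable characters.
2.) Many of the files are binary. The script needs to handle embedded binary EOL or EOF sequences.
3.) Many of the files are large, so I only want to read the first/last xx bytes. I have been trying to use head -c 256 or tail -c 128 to read the first 256 bytes or the last 128 bytes, respectively. The solution could either work in a pipeline or limit the bytes within the Perl script.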

Answer 1

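The -n option wraps your entire code in a while (defined($_ = <ARGV>)) { ... } block. This means that your my $cnt_print and other variable declarations are repeated for every line of input, which in effect resets all the variable values.

The workaround is to use global variables (declare them with our if you want to keep using use strict), and not to initialize them to 0, since they would be reinitialized for every line of input. You could say something like

    our $cnt_print //= 0;

if you do not want $cnt_print and its friends to be undefined for the first line of input.

See this recent question, which had a similar problem.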
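As a rough illustration of the difference, the sketch below simulates the loop that -n adds with an explicit for loop. The sample lines and the variable names $cnt_my, $cnt_our and $last_my are made up for the demonstration; they are not part of the original script. It needs Perl 5.10 or later for the //= operator.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical sample input standing in for the lines that -n would read.
my @lines = ("abc\n", "de\x01\n");

my $last_my;
for (@lines) {              # stands in for the loop that -n wraps around the script
    my $cnt_my = 0;         # lexical: re-created and reset on every line
    our $cnt_our //= 0;     # package variable: set to 0 only while still undefined
    $cnt_my  += length;
    $cnt_our += length;
    $last_my = $cnt_my;     # remember what the lexical counter held
}

# $main::cnt_our is the same package variable declared with our in the loop.
print "my: $last_my, our: $main::cnt_our\n";
```

With these two four-byte lines, the lexical counter only ever reflects the current line, while the package variable accumulates across lines.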

Answer 2

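You can have find pass one argument at a time.

    find /fct/inbound/trans/ -type f -exec perl script.pl {} \;

But I would keep passing multiple files at a time, either through xargs or using GNU find's -exec +.

    find /fct/inbound/trans/ -type f -exec perl script.pl {} +

The following code supports both. It continues to read one line at a time; reading a whole file at a time is the alternative.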

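    #!/usr/bin/perl

    use strict;
    use warnings;

    my $cnt_total   = 0;
    my $cnt_n_print = 0;

    # <> reads line by line from every file named on the command line
    while (<>) {
        $cnt_total += length;
        ++$cnt_n_print while /[^[:print:]]/g;
    } continue {
        # eof (without parentheses) is true at the end of each input file
        if (eof) {
            my $cnt_print = $cnt_total - $cnt_n_print;
            my $prc_print = $cnt_print/$cnt_total;

            # $ARGV holds the name of the file currently being read
            print "$ARGV: $cnt_total|$prc_print\n";

            # reset the counters for the next file
            $cnt_total   = 0;
            $cnt_n_print = 0;
        }
    }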
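The whole-file variant is not shown in this thread, so here is a minimal sketch of what it might look like, assuming the same output format as the line-based version. The helper name printable_stats is hypothetical, and the if (@ARGV) guard is only there so the script does nothing (rather than waiting on standard input) when no file names are given.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns (total bytes, fraction printable) for a string.
sub printable_stats {
    my ($data) = @_;
    my $cnt_total = length $data;
    # Assigning the match list to () in scalar context yields the match count.
    my $cnt_print = () = $data =~ /[[:print:]]/g;
    # Guard against dividing by zero for an empty file.
    my $prc_print = $cnt_total ? $cnt_print / $cnt_total : 0;
    return ($cnt_total, $prc_print);
}

if (@ARGV) {
    local $/;    # undefined record separator: each <> read returns a whole file
    while (defined(my $data = <>)) {
        my ($cnt_total, $prc_print) = printable_stats($data);
        print "$ARGV: $cnt_total|$prc_print\n";
    }
}
```

Slurping loads each file into memory in full, so for very large files a bounded read of the first bytes is likely the better fit.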

Answer 3

Based on the feedback provided, here is my working solution.

I would appreciate any further feedback on form or on more efficient methods:

    #!/usr/bin/perl

    use strict;
    use warnings;

    # This program receives one or more file paths on the command line.
    # For each file it attempts to read the first 2000 bytes.
    # The output is a list of files, the number of bytes
    # actually read and the fraction of the bytes that are
    # ASCII "printable" aka [\x20-\x7E].

    @ARGV or die "Pass the file name on the command line.\n";

    # loop through each file; do not shift @ARGV inside the loop, because
    # modifying an array while foreach iterates over it skips elements
    foreach my $file_name (@ARGV) {

       # open the file read only with the three-argument form of open
       open(my $fh, '<', $file_name) or die "Can't open $file_name: $!";

       # open each file in binary mode to handle non-printable characters
       binmode $fh;

       # try to read 2000 bytes from the file, save the results in $data and
       # the actual number of bytes read in $n_bytes
       my $data;
       my $n_bytes = read $fh, $data, 2000;
       defined $n_bytes or die "Can't read $file_name: $!";

       # count the number of non-printable characters
       my $cnt_n_print = 0;
       ++$cnt_n_print while ($data =~ m/[^[:print:]]/g);

       my $cnt_print = $n_bytes - $cnt_n_print;

       # an empty file yields 0 bytes, so avoid dividing by zero
       my $prc_print = $n_bytes ? $cnt_print/$n_bytes : 0;

       print "$file_name|$n_bytes|$prc_print\n";
       close($fh);
    }

Here is an example of how to call the above script:

    find /some/path/to/files/ -type f -exec perl this_script.pl {} +
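The question also mentions sampling the last bytes of a file (the tail -c 128 case). One way this might be done inside Perl is with seek, as in the sketch below. The helper name read_tail and the 128-byte sample size are assumptions for illustration, not part of the solution above.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Fcntl qw(SEEK_SET);

# Hypothetical helper: returns up to $want bytes from the end of $path.
sub read_tail {
    my ($path, $want) = @_;
    open(my $fh, '<', $path) or die "Can't open $path: $!";
    binmode $fh;

    # Start $want bytes before the end, or at the beginning for short files.
    my $size  = -s $fh;
    my $start = $size > $want ? $size - $want : 0;
    seek($fh, $start, SEEK_SET) or die "Can't seek in $path: $!";

    my $n_bytes = read($fh, my $data, $want);
    defined $n_bytes or die "Can't read $path: $!";
    close($fh);
    return $data;
}

# Same report format as the script above, but for the last 128 bytes.
for my $path (@ARGV) {
    my $data      = read_tail($path, 128);
    my $n_bytes   = length $data;
    my $cnt_print = () = $data =~ /[[:print:]]/g;    # number of printable bytes
    my $prc_print = $n_bytes ? $cnt_print / $n_bytes : 0;
    print "$path|$n_bytes|$prc_print\n";
}
```

It can be invoked the same way, with find ... -exec perl script.pl {} +, since it also loops over every file name it is given.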

Here is a list of references that I found useful:

POSIX Bracket Expressions
Opening files in binmode
Read function
Open file read only

Perl to count the number of non-printable characters
