
Question

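I'm looking for a way to find files that reside in several directories under a given path. In other words, those directories contain files with the same filename. My script seems to have a hierarchy problem when it looks for the correct path in which to grep the filenames for processing. I have a fixed path as input, and the script needs to look into that path and find the files from there. However, my script seems to get stuck two levels up and processes from there, instead of looking into the last directory in the hierarchy (in my case, it processes "ln" and "nn" and starts running the subroutines).

The fixed input path is: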

/nfs/disks/version_2.0/

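The files I want to post-process through the subroutines exist under several directories, as shown below. Basically, I want to check whether file1.abc exists in all of the directories temp1, temp2 & temp3 under the ln directory. The same goes for file2.abc: whether it exists in temp1, temp2 and temp3 under the nn directory.

The full paths of the files I want to check look like this: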

/nfs/disks/version_2.0/dir_a/ln/temp1/file1.abc
/nfs/disks/version_2.0/dir_a/ln/temp2/file1.abc
/nfs/disks/version_2.0/dir_a/ln/temp3/file1.abc

/nfs/disks/version_2.0/dir_a/nn/temp1/file2.abc
/nfs/disks/version_2.0/dir_a/nn/temp2/file2.abc
/nfs/disks/version_2.0/dir_a/nn/temp3/file2.abc

My script is as follows:

#! /usr/bin/perl -w 
my $dir = '/nfs/fm/disks/version_2.0/' ;
opendir(TEMP, $dir) || die $! ;
foreach my $file (readdir(TEMP)) {
    next if ($file eq "." || $file eq "..") ;
    if (-d "$dir/$file") {
        my $d = "$dir/$file";   
        print "Directory:- $d\n" ;
        &getFile($d);
        &compare($file) ;
    }
}

Note that I put print "Directory:- $d\n" ; in there for debugging, and it prints:

/nfs/disks/version_2.0/dir_a/
/nfs/disks/version_2.0/dir_b/

So I know it is passing the wrong paths on to the subroutines below.

Can someone help point out where the error in my script is? Thanks!

Answer 1

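To be clear: the script is supposed to recurse through the directories and look for files with a particular filename? In that case, I think the following code is the problem: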

if (-d "$dir/$file") {
    my $d = "$dir/$file";   
    print "Directory:- $d\n" ;
    &getFile($d);
    &compare($file) ;
}

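I assume &getFile($d) is meant to step into a directory (i.e. the recursive step). That's fine. However, it looks like &compare($file) is the action you want to perform when the object you're looking at is not a directory. So that block should look something like this: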

if (-d "$dir/$file") {
    &getFile("$dir/$file");  # the recursive step, for directories inside of this one
} elsif( -f "$dir/$file" ){
    &compare("$dir/$file");  # the action on files inside of the current directory
}

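The general pattern should look like this (doSomethingWithFile stands in for whatever you do with each file):

sub myFind {
    my $dir = shift;
    opendir(my $dh, $dir) or die "Cannot open $dir: $!";
    my @entries = readdir($dh);    # read the listing before recursing
    closedir($dh);
    foreach my $file (@entries) {
        next if $file eq "." || $file eq "..";
        my $obj = "$dir/$file";
        if( -d $obj ){
            myFind( $obj );                # recurse into subdirectories
        } elsif( -f $obj ){
            doSomethingWithFile( $obj );   # placeholder: your per-file action
        }
    }
}
myFind( "/nfs/fm/disks/version_2.0" );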

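As a side note: this script is reinventing the wheel. You only need to write a script that processes a single file. You can do the rest entirely from the shell: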

find /nfs/fm/disks/version_2.0 -type f -name "the-filename-you-want" -exec your_script.pl {} \;
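For illustration, a per-file script of the kind that find command would call might look like the sketch below. The name process_file and the line-counting action are assumptions made up for the example; the real processing is whatever the compare step in the question is meant to do.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical per-file action: count the lines in one file.
sub process_file {
    my ($path) = @_;
    open(my $fh, "<", $path) or die "Cannot open $path: $!";
    my $lines = 0;
    $lines++ while <$fh>;
    close $fh;
    return $lines;
}

# With "-exec your_script.pl {} \;" find passes one path per invocation,
# so @ARGV normally holds a single file name.
for my $path (@ARGV) {
    printf "%s: %d lines\n", $path, process_file($path);
}
```

Keeping the work in a subroutine means the same code can later be called from a File::Find callback without changes.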

Answer 2

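Wow, this is like reliving the 1990s! Perl code has evolved a bit, and you really need to learn the new stuff. It looks like you learned Perl in version 3.0 or 4.0. Here are some pointers:

Put use warnings; in the script instead of using -w on the command line.
Add use strict;. This requires you to predeclare variables with my, which scopes them to the local block, or to the file if they're not in a local block. This helps catch a lot of errors.
Don't put & in front of subroutine names.
Use and, or, and not instead of &&, ||, and !.
Learn about the Perl modules that can save you a lot of time and effort.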
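Applied to the directory loop from the question, those pointers might look roughly like the sketch below. The helper name list_subdirs and the temporary directory used for the demonstration are assumptions for illustration, not part of the original script.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: return the immediate subdirectories of $dir.
sub list_subdirs {
    my ($dir) = @_;
    # Lexical directory handle instead of the bareword TEMP
    opendir(my $dh, $dir) or die "Cannot open $dir: $!";
    my @subdirs = grep { -d "$dir/$_" }
                  grep { $_ ne "." and $_ ne ".." }
                  readdir($dh);
    closedir($dh);
    my @sorted = sort @subdirs;
    return @sorted;
}

# Demonstration on a throwaway tree instead of /nfs/disks/version_2.0
my $root = tempdir(CLEANUP => 1);
for my $name (qw(dir_a dir_b)) {
    mkdir "$root/$name" or die "Cannot create $root/$name: $!";
}

print "Directory: $root/$_\n" for list_subdirs($root);
```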

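When someone says detect duplicates, I immediately think of hashes. If you use a hash keyed by file name, you can easily see whether there are duplicate files.

Of course, a hash can only hold one value per key. Fortunately, in Perl 5.x, that value can be a reference to another data structure.

So, I recommend you use a hash that contains references to lists (arrays, in old parlance). You can push each instance of the file onto that list.

Using your example, you'd have a data structure that looks like this: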

my %file_hash = (
    "file1.abc" => [
        "/nfs/disks/version_2.0/dir_a/ln/temp1",
        "/nfs/disks/version_2.0/dir_a/ln/temp2",
        "/nfs/disks/version_2.0/dir_a/ln/temp3",
    ],
    "file2.abc" => [
        "/nfs/disks/version_2.0/dir_a/nn/temp1",
        "/nfs/disks/version_2.0/dir_a/nn/temp2",
        "/nfs/disks/version_2.0/dir_a/nn/temp3",
    ],
);
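As a minimal sketch of how such a structure gets filled, the snippet below pushes each directory onto the list for its file name. The paths are the ones from the question; the variable names are illustrative. Perl creates the array reference on first use (autovivification), so no explicit initialisation is needed here.

```perl
use strict;
use warnings;
use File::Basename qw(basename dirname);

my @paths = qw(
    /nfs/disks/version_2.0/dir_a/ln/temp1/file1.abc
    /nfs/disks/version_2.0/dir_a/ln/temp2/file1.abc
    /nfs/disks/version_2.0/dir_a/ln/temp3/file1.abc
    /nfs/disks/version_2.0/dir_a/nn/temp1/file2.abc
    /nfs/disks/version_2.0/dir_a/nn/temp2/file2.abc
    /nfs/disks/version_2.0/dir_a/nn/temp3/file2.abc
);

my %file_hash;
for my $path (@paths) {
    # Key by file name, collect the directories it appears in.
    push @{ $file_hash{ basename($path) } }, dirname($path);
}

for my $name (sort keys %file_hash) {
    my $count = @{ $file_hash{$name} };    # array in scalar context = element count
    print "$name appears in $count directories\n";
}
```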

And, here's a program:

#! /usr/bin/env perl
#
use strict;
use warnings;
use feature qw(say);        #Can use `say` which is like `print "\n"`;

use File::Basename; #imports `dirname` and `basename` commands
use File::Find;             #Implements Unix `find` command.

use constant DIR => "/nfs/disks/version_2.0";

# Find all duplicates
my %file_hash;
find (\&wanted, DIR);

# Print out all the duplicates
foreach my $file_name (sort keys %file_hash) {
    if (scalar (@{$file_hash{$file_name}}) > 1) {
        say qq(Duplicate File: "$file_name");
        foreach my $dir_name (@{$file_hash{$file_name}}) {
            say "    $dir_name";
        }
    }
}

sub wanted {
    return if not -f $_;    

    if (not exists $file_hash{$_}) {
        $file_hash{$_} = [];
    }
    push @{$file_hash{$_}}, $File::Find::dir;
}

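Here are a few things about File::Find:

The work is done in the subroutine wanted.
$_ is the name of the file, and I can use it to see whether this is a file or a directory.
$File::Find::name is the full name of the file, including the path.
$File::Find::dir is the name of the directory.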
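To see what these three variables hold while find is running, here is a small sketch. It builds a throwaway directory tree with File::Temp in place of /nfs/disks/version_2.0, which is an assumption made only so the example can run anywhere.

```perl
use strict;
use warnings;
use File::Find;
use File::Path qw(make_path);
use File::Temp qw(tempdir);

# Throwaway tree standing in for the real directory (demo assumption).
my $root = tempdir(CLEANUP => 1);
make_path("$root/dir_a/ln/temp1");
open(my $fh, ">", "$root/dir_a/ln/temp1/file1.abc") or die "Cannot create file: $!";
close $fh;

my @seen;
find(
    sub {
        # $_ is the bare file name; find() has already changed into its directory.
        return unless -f $_;
        push @seen, {
            name => $_,
            dir  => $File::Find::dir,
            full => $File::Find::name,
        };
    },
    $root
);

for my $entry (@seen) {
    print "name: $entry->{name}\n";
    print "dir:  $entry->{dir}\n";
    print "full: $entry->{full}\n";
}
```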

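If the array reference doesn't exist, I create it with $file_hash{$_} = [];. This isn't necessary, but I find it comforting, and it can prevent errors. To use $file_hash{$_} as an array, I have to dereference it. I do this by putting a @ in front of it and wrapping the expression in braces: @{$file_hash{$_}}. (The braces matter: @$file_hash{$_} would be parsed as a slice of a hash referenced by a scalar named $file_hash.)

Once I've found all the files, I can print out the whole structure. The only thing I do is check that there is more than one member in each array. If there's only one member, there are no duplicates.

Response to Grace

Hi David W., thank you very much for your explanation and sample script. Sorry, maybe my problem statement wasn't clear. I don't think I can use a hash as the data structure for the path finding, because there are a few hundred *.abc files, and even though the *.abc files share the same file names, their contents actually differ in each directory structure.

For example, the contents of file1.abc under "/nfs/disks/version_2.0/dir_a/ln/temp1" differ from those of file1.abc under "/nfs/disks/version_2.0/dir_a/ln/temp2" and "/nfs/disks/version_2.0/dir_a/ln/temp3". My goal is to grep the list of *.abc files in each directory structure (temp1, temp2 and temp3) and compare that list of file names against a master list. Could you help explain how to solve this? Thanks. - Grace, yesterday

I was only printing the files in my sample code, but instead of printing them, you could open them and process them. After all, you now have the file name and the directory. This is the heart of my program. This time, I open the file and look at the contents:

foreach my $file_name (sort keys %file_hash) {
    if (scalar (@{$file_hash{$file_name}}) > 1) {
        #say qq(Duplicate File: "$file_name");
        foreach my $dir_name (@{$file_hash{$file_name}}) {
            #say "    $dir_name";
            open (my $fh, "<", "$dir_name/$file_name")
                or die qq(Can't open file "$dir_name/$file_name" for reading: $!);
            # Process your file here...
            close $fh;
        }
    }
}

If you're only looking for certain files, you can modify the wanted function to skip the files you don't want. For example, here I'm only looking for files that match the file*.txt pattern. Note that I use the regular expression /^file.*\.txt$/ to match the file name. As you can see, it's the same as the previous wanted subroutine. The only difference is my test: I'm looking for something that is a file (-f) and has the right name (file*.txt):

sub wanted {
    # Skip anything that is not a plain file with a matching name.
    return unless -f $_ and /^file.*\.txt$/;

    if (not exists $file_hash{$_}) {
        $file_hash{$_} = [];
    }
    push @{$file_hash{$_}}, $File::Find::dir;
}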
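For the goal Grace describes, comparing the *.abc file names found in each directory against a master list, one possible approach is sketched below. The master list contents, the %missing report, and the throwaway directory tree are all assumptions for illustration; the sketch compares names only, not file contents.

```perl
use strict;
use warnings;
use File::Find;
use File::Path qw(make_path);
use File::Temp qw(tempdir);

# Hypothetical master list: names every directory is expected to contain.
my @master_list = qw(file1.abc file2.abc);

# Throwaway tree for the demo: temp2 deliberately lacks file2.abc.
my $root = tempdir(CLEANUP => 1);
make_path("$root/dir_a/ln/$_") for qw(temp1 temp2);
for my $file ("temp1/file1.abc", "temp1/file2.abc", "temp2/file1.abc") {
    open(my $fh, ">", "$root/dir_a/ln/$file") or die "Cannot create $file: $!";
    close $fh;
}

# Map each directory to the set of *.abc names found directly in it.
my %names_in;
find(
    sub {
        return unless -f $_ and /\.abc$/;
        $names_in{$File::Find::dir}{$_} = 1;
    },
    $root
);

# For each directory, list the master names it does not contain.
my %missing;
for my $dir (sort keys %names_in) {
    $missing{$dir} = [ grep { not exists $names_in{$dir}{$_} } @master_list ];
    print "$dir is missing: @{ $missing{$dir} }\n" if @{ $missing{$dir} };
}
```

Note that a directory containing no *.abc files at all never appears in %names_in, so it would need to be checked separately.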

If you're looking at file contents, you can use an MD5 hash to determine whether the contents match or not. This reduces each file to a 128-bit digest (32 hexadecimal characters as returned by digest_file_hex), which can even be used as the hash key instead of the file name. That way, files with matching MD5 hashes (and thus, almost certainly, matching contents) end up in the same hash list.

You talk about a "master list" of files, and it seems you're thinking that this master list needs to match the contents of the files you're looking for. So, I've made a slight modification to my program. I first take the master list you talked about and generate an MD5 sum for each file. Then I look at all the files in that directory, but only pick the ones that have a matching MD5 hash...

By the way, this has not been tested:

#! /usr/bin/env perl
#
use strict;
use warnings;
use feature qw(say);        #Can use `say` which is like `print "\n"`;

use File::Find;             #Implements Unix `find` command.
use Digest::file qw(digest_file_hex);

use constant DIR             => "/nfs/disks/version_2.0";
use constant MASTER_LIST_DIR => "/some/directory";

# First, I'm going through the MASTER_LIST_DIR directory
# and finding all of the master list files. I'm going to take
# the MD5 hash of those files, and store them in a Perl hash
# that's keyed by the name of the file. Thus, when I find a
# file with a matching name, I can compare the MD5 of that file
# and the master file. If they match, the files are the same. If
# not, they're different.

# In this example, I'm inlining the function I use to find the files
# instead of making it a separate function.

my %master_hash;
find (
    sub {
        $master_hash{$_} = digest_file_hex($_, "MD5") if -f $_;
    },
    MASTER_LIST_DIR
);

# Now I have the MD5 of all the master files, I'm going to search my
# DIR directory for the files that have the same MD5 hash as the
# master list files did. If they do have the same MD5 hash, I'll
# print out their names as before.

my %file_hash;
find (\&wanted, DIR);

# Print out all the duplicates
foreach my $file_name (sort keys %file_hash) {
    if (scalar (@{$file_hash{$file_name}}) > 1) {
        say qq(Duplicate File: "$file_name");
        foreach my $dir_name (@{$file_hash{$file_name}}) {
            say "    $dir_name";
        }
    }
}

# The wanted function has been modified since the last example.
# Here, I'm only going to put files in the %file_hash if they
# have the same MD5 hash as the master list file of the same name.

sub wanted {
    return if not -f $_;
    return if not exists $master_hash{$_};    # no master file with this name
    if ($master_hash{$_} eq digest_file_hex($_, "MD5")) {
        $file_hash{$_} //= [];    #Using TLP's syntax hint
        push @{$file_hash{$_}}, $File::Find::dir;
    }
}
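The idea of using the MD5 digest itself as the hash key, so that files with identical contents land in the same list, can be sketched as follows. The file names, their contents, and the temporary directory are invented for the demonstration; only digest_file_hex and find come from the modules used in the answer.

```perl
use strict;
use warnings;
use Digest::file qw(digest_file_hex);
use File::Find;
use File::Temp qw(tempdir);

# Throwaway files: a.abc and b.abc share contents, c.abc differs.
my $root = tempdir(CLEANUP => 1);
my %content = (a => "same\n", b => "same\n", c => "different\n");
for my $name (sort keys %content) {
    open(my $fh, ">", "$root/$name.abc") or die "Cannot create $name.abc: $!";
    print {$fh} $content{$name};
    close $fh or die "Cannot close $name.abc: $!";
}

# Key the hash by digest instead of by file name.
my %by_digest;
find(
    sub {
        return unless -f $_;
        push @{ $by_digest{ digest_file_hex($_, "MD5") } }, $File::Find::name;
    },
    $root
);

# Any digest with more than one path marks files with identical contents.
for my $digest (sort keys %by_digest) {
    my @files = sort @{ $by_digest{$digest} };
    print "$digest: @files\n" if @files > 1;
}
```

MD5 is adequate for spotting accidental duplicates like this; it should not be relied on where someone might deliberately craft colliding files.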

How to find files that exist in different directories under a given path in Perl
