Problem comparing 2 strings in Perl

Question

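I want to use Perl to count the number of lines that 2 files have in common.

I have 1 base file that I use to check whether all of its lines (separated by the newline \n) are present in fileA. What I did was put all the lines from the base file into the %base_config hash and the lines from fileA into the %config hash. For every key in %config, I want to check whether it can also be found among the keys of %base_config. To compare the keys more efficiently, I sorted the keys of %base_config and put them into @sorted_base_config.

However, for some files that contain exactly the same lines in a different order, I can't get the right count. For example, the base file contains: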

hello
hi
tired
sleepy

while fileA contains:

hi
tired
sleepy
hello

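I am able to read the values from the files and put them into their respective hashes and arrays. Below is the part of the code that goes wrong: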

$count=0;
while(($key,$value)=each(%config))
{
    foreach (@sorted_base_config) 
    {
        print "config: $config{$key}\n";
        print "\$_: $_\n";
        if($config{$key} eq $_)
        {
            $count++;
        }
    }
}

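Could someone tell me whether I've made a mistake? The count is supposed to be 4, but it always prints 2.

EDIT: Here is my original code, which doesn't work. It looks quite different because I was trying different approaches to the problem. However, I'm still stuck with the same problem.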

#open base config file and load them into the base_config hash
open BASE_CONFIG_FILE, "< script/base.txt" or die;
my %base_config;
while (my $line=<BASE_CONFIG_FILE>) {
   (my $word1,my $word2) = split /\n/, $line;
   $base_config{$word1} = $word1;
}
#sort BASE_CONFIG_FILE
@sorted_base_config = sort keys %base_config;

#open config file and load them into the config hash
open CONFIG_FILE, "< script/hello.txt" or die;
my %config;
while (my $line=<CONFIG_FILE>) {
   (my $word1,my $word2) = split /\n/, $line;
   $config{$word1} = $word1;
}
#sort CONFIG_FILE
@sorted_config = sort keys %config;

%common={};
$count=0;
while(($key,$value)=each(%config))
{
    $num=keys(%base_config);
    $num--;#to get the correct index
    #print "$num\n";
    while($num>=0)
    {
        #check if all the strings in BASE_CONFIG_FILE can be found in CONFIG_FILE
        $common{$value}=$value if exists $base_config{$key};
        #print "yes!\n" if exists $base_config{$key};
        $num--;
    }
}
print "count: $count\n";

while(($key,$value)=each(%common))
{
    print "key: ".$key."\n";
    print "value: ".$value."\n";
}
$num=keys(%common)-1;
print "common lines: ".$num;

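Previously, I pushed the keys common to the base_config file and fileA into %common. Later I'd like to print the common keys to a txt file, and whatever is found in fileA but not in the base_config file will be output to another txt file. However, I'm already stuck at the initial stage of finding the common keys.

I'm splitting on "\n" to get the keys for storage, so I can't use the chomp function, which would remove the "\n".

EDIT 2: I just realized what was wrong with my code. At the end of my txt file, I need to add a "\n" for it to work. Thanks for your help! :D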

Answer 1

I think your attempts at efficiency are actually slowing things down.

my %listA;

# Read first file (name in $NameA)
{
    open my $fileA, '<', "$NameA" or die $!;
    while (<$fileA>)
    {
        chomp;
        $listA{$_}++;
    }
}

# Read second file (name in $NameB)
{
    open my $fileB, '<', "$NameB" or die $!;
    while (<$fileB>)
    {
        chomp;
        if ($listA{$_})
        {
            print "Line appears in $NameB once and $listA{$_} times in $NameA: $_\n";
        }
    }
}

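Now, any line that appears in both files is listed.

If you want to read the second file into a hash as well, that works too, as in the snippet that follows. Note that even though it presents the keys in sorted order, it still uses a hash lookup, because that is quicker than shuffling through two sorted arrays. Of course, you'd be hard pressed to measure any difference for 4-line files, and for big files the I/O time for reading the files and printing the results will likely dominate the lookup time.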

my %listB;

# Read second file (name in $NameB)
{
    open my $fileB, '<', "$NameB" or die $!;
    while (<$fileB>)
    {
        chomp;
        $listB{$_}++;
    }
}

foreach my $key (sort keys %listA)
{
    if ($listB{$key})
    {
        print "$NameA: $listA{$key}; $NameB: $listB{$key}; $key\n";
    }
}

Reorganize the output to suit your requirements.

~~Untested code!~~ Now tested code; see below.

Converted to tested code

Data: fileA

hello
hi
tired
sleepy

Data: fileB

hi
tired
sleepy
hello

Program: ppp.pl

#!/usr/bin/env perl
use strict;
use warnings;

my $NameA = "fileA";
my $NameB = "fileB";

my %listA;

# Read first file (name in $NameA)
{
    open my $fileA, '<', "$NameA" or die "Failed to open $NameA: $!\n";
    while (<$fileA>)
    {
        chomp;
        $listA{$_}++;
    }
}

# Read second file (name in $NameB)
{
    open my $fileB, '<', "$NameB" or die "Failed to open $NameB: $!\n";
    while (<$fileB>)
    {
        chomp;
        if ($listA{$_})
        {
            print "Line appears in $NameB once and $listA{$_} times in $NameA: $_\n";
        }
    }
}

Output:

$ perl ppp.pl
Line appears in fileB once and 1 times in fileA: hi
Line appears in fileB once and 1 times in fileA: tired
Line appears in fileB once and 1 times in fileA: sleepy
Line appears in fileB once and 1 times in fileA: hello
$

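Note that this lists things in the order of fileB, as it should, given that the loop reads through fileB and checks each line in turn.

Code: qqq.pl

This is the second snippet turned into a complete working program.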

#!/usr/bin/env perl
use strict;
use warnings;

my $NameA = "fileA";
my $NameB = "fileB";

my %listA;

# Read first file (name in $NameA)
{
    open my $fileA, '<', "$NameA" or die "Failed to open $NameA: $!\n";
    while (<$fileA>)
    {
        chomp;
        $listA{$_}++;
    }
}

my %listB;

# Read second file (name in $NameB)
{
    open my $fileB, '<', "$NameB" or die "Failed to open $NameB: $!\n";
    while (<$fileB>)
    {
        chomp;
        $listB{$_}++;
    }
}

foreach my $key (sort keys %listA)
{
    if ($listB{$key})
    {
        print "$NameA: $listA{$key}; $NameB: $listB{$key}; $key\n";
    }
}

Output:

$ perl qqq.pl
fileA: 1; fileB: 1; hello
fileA: 1; fileB: 1; hi
fileA: 1; fileB: 1; sleepy
fileA: 1; fileB: 1; tired
$

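Note that the keys are listed in sorted order, not in the order they appear in fileA or fileB.

Small miracles do occasionally happen! Apart from adding the 5 lines of preamble (shebang, 2 x use, 2 x my), the code for the two program fragments was correct as I first wrote it, for both programs. (Oh, and I improved the error message for failing to open a file, at least to the point of identifying which file could not be opened. And ikegami edited my code (thanks!) to add the chomp calls consistently, along with newlines in the print operations, which now need an explicit newline.)

I wouldn't claim that this is great Perl code; it certainly won't win a (code) golf contest. But it does seem to work.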
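Since the question asked for a count, here is a minimal sketch of the same hash-lookup idea wrapped in a single reading sub and reduced to a number. The sub name read_lines and the in-memory filehandles standing in for fileA and fileB are assumptions made for illustration; they are not part of ppp.pl or qqq.pl.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: read every line from an open filehandle into a
# hash of line => number of occurrences, with the newline removed.
sub read_lines {
    my ($fh) = @_;
    my %seen;
    while (my $line = <$fh>) {
        chomp $line;
        $seen{$line}++;
    }
    return \%seen;
}

# In-memory stand-ins for fileA and fileB (same words as the data above).
my $dataA = "hello\nhi\ntired\nsleepy\n";
my $dataB = "hi\ntired\nsleepy\nhello";    # no trailing newline, on purpose

open my $fhA, '<', \$dataA or die "Failed to open in-memory fileA: $!\n";
open my $fhB, '<', \$dataB or die "Failed to open in-memory fileB: $!\n";

my $listA = read_lines($fhA);
my $listB = read_lines($fhB);

# grep in scalar context returns how many keys passed the test, i.e. the
# number of distinct lines of fileB that also occur in fileA.
my $common = grep { exists $listA->{$_} } keys %$listB;
print "common lines: $common\n";
```

Because chomp only removes a newline that is actually there, the last line counts as a match whether or not the file ends with "\n".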

Analysis of the code in the question

open BASE_CONFIG_FILE, "< script/base.txt" or die;
my %base_config;
while (my $line=<BASE_CONFIG_FILE>) {
   (my $word1,my $word2) = split /\n/, $line;
   $base_config{$word1} = $word1;
}

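The split is odd... you have a line that ends with a newline, and you split at the newline, so $word2 ends up with nothing in it (undefined, in fact, since split drops trailing empty fields) and $word1 contains the rest of the line. You then store the value $word1 (not $word2, as I thought at first glance) into the base config. So the key and the value are the same for each entry. Unusual. Not actually wrong, but ... unusual. The second loop is essentially the same (we should all be shot for not using a single sub to do the reading for us).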
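To see what that split actually returns, a small sketch may help. The sample string is an assumption standing in for one line read from the file; the sketch compares the split from the question with a plain chomp.

```perl
use strict;
use warnings;

my $line = "hello\n";    # assumed sample: one line as read from a file

# split drops trailing empty fields, so only one field comes back
# and the second variable is left undefined.
my ($word1, $word2) = split /\n/, $line;
print "word1 = <<$word1>>\n";
print "word2 is ", (defined $word2 ? "defined" : "undefined"), "\n";

# chomp removes the trailing newline in place and leaves the same text.
my $chomped = $line;
chomp $chomped;
print "chomp gives the same key\n" if $chomped eq $word1;
```

So the split is doing the job of chomp here, just less directly.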

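You aren't using use strict; and use warnings; - note that the first thing I did with your code was add them. I've only been programming in Perl for about 20 years, and I know I don't know enough to risk running code without them. Your sorted arrays, %common, $count, $num, $key and $value are not my'd. It probably doesn't do much harm this time, but ... it's a bad sign. Always, but always, use use strict; use warnings; until you know enough about Perl not to need to ask questions about it (and don't expect that to be any time soon).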
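As a rough illustration of what strict buys you, the sketch below compiles a tiny snippet containing a misspelled variable name. The snippet and the typo ($coutn for $count) are invented for this example, and the string eval is there only so the failure can be shown without stopping the script.

```perl
use strict;
use warnings;

# Compile a tiny snippet with a misspelled variable name ($coutn for
# $count). Under strict this is a compile-time error, not a silent bug.
my $ok = eval q{
    use strict;
    my $count = 0;
    $coutn++;
    1;
};
my $error = $@;    # keep the message before anything can overwrite it

print "strict rejected the snippet:\n$error" unless $ok;
```

Without strict, the same snippet would quietly create a new global $coutn and leave $count at 0.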

When I run it, with this in place:

my %common={};  # line 32 - I added diagnostic printing
my $count=0;

Perl tells me:

Reference found where even-sized list expected at rrr.pl line 32, <CONFIG_FILE> line 4.

Oops: those {} should be an empty list (). See why you run with warnings enabled!

Then, at

 50 while(my($key,$value)=each(%common))
 51 {
 52     print "key: ".$key."\n";
 53     print "value: ".$value."\n";
 54 }

Perl tells me:

key: HASH(0x100827720)
Use of uninitialized value $value in concatenation (.) or string at rrr.pl line 53, <CONFIG_FILE> line 4.

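That is the first entry in %common, the stringified hash reference left there by the {} initializer, throwing things for a loop.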
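The difference between the two initializers can be reproduced in isolation. This is a minimal sketch; the hash names %wrong and %right are made up for the example, and the warning is collected into an array rather than printed straight to STDERR.

```perl
use strict;
use warnings;

my @warnings;
my (%wrong, %right);
{
    # Collect warnings instead of printing them, for this block only.
    local $SIG{__WARN__} = sub { push @warnings, $_[0] };

    %wrong = {};    # a one-element list: a reference to an empty hash
    %right = ();    # an empty list: a genuinely empty hash
}

print "wrong has ", scalar(keys %wrong), " key(s): ", join(", ", keys %wrong), "\n";
print "right has ", scalar(keys %right), " key(s)\n";
print "warning: $_" for @warnings;
```

The single key in %wrong is the reference turned into a string such as HASH(0x...), with an undefined value, which matches the odd first entry printed above.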

Fixed code: `rrr.pl`

#!/usr/bin/env perl
use strict;
use warnings;

#open base config file and load them into the base_config hash
open BASE_CONFIG_FILE, "< fileA" or die;
my %base_config;
while (my $line=<BASE_CONFIG_FILE>) {
   (my $word1,my $word2) = split /\n/, $line;
   $base_config{$word1} = $word1;
   print "w1 = <<$word1>>; w2 = <<$word2>>\n";
}

{ print "First file:\n"; foreach my $key (sort keys %base_config) { print "$key => $base_config{$key}\n"; } }

#sort BASE_CONFIG_FILE
my @sorted_base_config = sort keys %base_config;

#open config file and load them into the config hash
open CONFIG_FILE, "< fileB" or die;
my %config;
while (my $line=<CONFIG_FILE>) {
   (my $word1,my $word2) = split /\n/, $line;
   $config{$word1} = $word1;
   print "w1 = <<$word1>>; w2 = <<$word2>>\n";
}
#sort CONFIG_FILE
my @sorted_config = sort keys %config;

{ print "Second file:\n"; foreach my $key (sort keys %config) { print "$key => $config{$key}\n"; } }

my %common=();
my $count=0;
while(my($key,$value)=each(%config))
{
    print "Loop: $key = $value\n";
    my $num=keys(%base_config);
    $num--;#to get the correct index
    #print "$num\n";
    while($num>=0)
    {
        #check if all the strings in BASE_CONFIG_FILE can be found in CONFIG_FILE
        $common{$value}=$value if exists $base_config{$key};
        #print "yes!\n" if exists $base_config{$key};
        $num--;
    }
}
print "count: $count\n";

while(my($key,$value)=each(%common))
{
    print "key: $key -- value: $value\n";
}
my $num=keys(%common);
print "common lines: $num\n";

Output:

$ perl rrr.pl
w1 = <<hello>>; w2 = <<>>
w1 = <<hi>>; w2 = <<>>
w1 = <<tired>>; w2 = <<>>
w1 = <<sleepy>>; w2 = <<>>
First file:
hello => hello
hi => hi
sleepy => sleepy
tired => tired
w1 = <<hi>>; w2 = <<>>
w1 = <<tired>>; w2 = <<>>
w1 = <<sleepy>>; w2 = <<>>
w1 = <<hello>>; w2 = <<>>
Second file:
hello => hello
hi => hi
sleepy => sleepy
tired => tired
Loop: hi = hi
Loop: hello = hello
Loop: tired = tired
Loop: sleepy = sleepy
count: 0
key: hi -- value: hi
key: tired -- value: tired
key: hello -- value: hello
key: sleepy -- value: sleepy
common lines: 4
$

Answer 2

Maybe this isn't the approach you want, but what if you did something like this instead:

#!/usr/bin/perl
use Data::Dumper;
use warnings;
use strict;

my @sorted_base_config = qw(hello hi tired sleepy);
my @file_a = qw(hi tired sleepy hello);
my @found_in_both = ();

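foreach my $item (@sorted_base_config) {
  # Name the loop variable: inside grep, $_ is each element of @file_a,
  # so /$_/ would match every element against itself.
  if (grep { $_ eq $item } @file_a) {
    push(@found_in_both, $item);
  }
}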

print "These items were found in file_a:\n";
print Dumper(@found_in_both);

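Basically, instead of doing the key/value hash thing... why not try two arrays, with a foreach over the base file array? As you go through each line of @sorted_base_config, check whether that string can be found in @file_a.

How you get the files into the @sorted_base_config and @file_a arrays (and how you deal with newlines or line breaks) is up to you. But at least this way it seems to check more accurately which words match.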
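One thing to watch when checking a string against an array with grep is whether the test is a pattern match or a string comparison. The sketch below uses made-up data (the line "hi" is replaced by "this" in @file_a) to show how the two differ; the variable names are illustrative only.

```perl
use strict;
use warnings;

my @base   = qw(hello hi tired sleepy);
my @file_a = qw(this tired sleepy hello);   # assumed data: no "hi" line

my (@regex_hits, @exact_hits);
foreach my $word (@base) {
    # Pattern match: true if $word occurs anywhere inside an element,
    # so "hi" is reported because of "this".
    push @regex_hits, $word if grep { /\Q$word\E/ } @file_a;

    # String equality: true only for an identical element.
    push @exact_hits, $word if grep { $_ eq $word } @file_a;
}

print "regex match: @regex_hits\n";
print "exact match: @exact_hits\n";
```

For whole-line comparison, the eq form is the one that matches what the question is counting.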

Answer 3

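Without seeing how you define and populate the %config and @sorted_base_config variables, I'm not sure what is causing your code to fail. It would be more obvious if you provided the output of running the code above.

Rather than offering an entirely new approach like the other answers, I tried to "fix" yours, but mine works without any problems. That means the error actually lies in how you populate the variables, not in how you check them.

To simplify the matching code, I assign both the key and the value to be what is read from the file.

This code: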

#!C:\Perl\bin\perl
use strict;
use warnings;

my $f1 = $ARGV[0];
my $f2 = $ARGV[1];
my %config_base;
my %config;
my $line;
print "F1 = $f1\nF2 = $f2\n";

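# "or", not "||": "||" binds tighter than the comma, so die would never run
open F1, '<', $f1 or die "Failed to open $f1: $!\n";
while ($line = <F1>) {
    chomp $line;
    print "adding $line\n";
    $config_base{$line}=$line;
}
close F1;
open F2, '<', $f2 or die "Failed to open $f2: $!\n";
while ($line = <F2>) {
    chomp $line;
    print "adding $line\n";
    $config{$line}=$line;
}
close F2;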
my $count=0;
my $key; my $value;
my @sorted_base_config = sort keys %config_base;
while(($key,$value)=each(%config))
{
    foreach (@sorted_base_config) 
    {
        print "config: $config{$key}\n";
        print "\$_: $_\n";
        if($config{$key} eq $_)
        {
            $count++;
        }
    }
}
print "Count = $count\n";

produces this output:

F1 = config_base.txt
F2 = config.txt
adding hello
adding hi
adding tired
adding sleepy
adding hi
adding tired
adding sleepy
adding hello
config: hi
$_: hello
config: hi
$_: hi
config: hi
$_: sleepy
config: hi
$_: tired
config: hello
$_: hello
config: hello
$_: hi
config: hello
$_: sleepy
config: hello
$_: tired
config: tired
$_: hello
config: tired
$_: hi
config: tired
$_: sleepy
config: tired
$_: tired
config: sleepy
$_: hello
config: sleepy
$_: hi
config: sleepy
$_: sleepy
config: sleepy
$_: tired
Count = 4

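However, Johnathan's answer is a better approach than the one you started with. At the very least, using exists to compare the keys of the 2 input hashes is much better than a nested loop against an array of keys. The loop defeats the efficiency of using a hash in the first place.

In that case, you would have something like: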

foreach my $key (keys %config_base)
{
    print "base key: $key\n";
    if(exists $config{$key})
    {
        # only look the value up once the key is known to exist
        print "config: $config{$key}\n";
        $count++;
    }
}
print "Count = $count\n";
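The question also mentions wanting the common keys in one file and the keys found only in fileA in another. A minimal sketch of that split, using exists, might look like the following. The sample data (with "grumpy" added so that one key is not shared) and the array names are assumptions, and the results are printed rather than written to files.

```perl
use strict;
use warnings;

# Assumed sample data; "grumpy" is added so that one key is not shared.
my %config_base = map { $_ => $_ } qw(hello hi tired sleepy);
my %config      = map { $_ => $_ } qw(hi tired sleepy hello grumpy);

my (@common, @only_in_config);
foreach my $key (sort keys %config) {
    if (exists $config_base{$key}) {
        push @common, $key;            # present in both hashes
    }
    else {
        push @only_in_config, $key;    # present in %config only
    }
}

print "Count = ", scalar(@common), "\n";
print "common: @common\n";
print "only in config: @only_in_config\n";
```

Each key costs a single hash lookup, so the work grows with the number of lines rather than with the product of the two file sizes.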

Answer 4

Use List::Compare.
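A minimal sketch of how that might look, assuming the List::Compare module from CPAN is installed (it is not part of core Perl). The sample arrays, including the extra "grumpy" line, are made up for illustration; in practice they would hold the chomped lines of the two files.

```perl
use strict;
use warnings;
use List::Compare;    # CPAN module; not part of core Perl

# Assumed sample data; "grumpy" is added so that one line is not shared.
my @base   = qw(hello hi tired sleepy);
my @file_a = qw(hi tired sleepy hello grumpy);

my $lc = List::Compare->new(\@base, \@file_a);

my @common      = $lc->get_intersection;   # in both lists
my @only_base   = $lc->get_unique;         # only in the first list
my @only_file_a = $lc->get_complement;     # only in the second list

print "common lines: ", scalar(@common), "\n";
print "only in file_a: @only_file_a\n";
```

This covers both the count and the "in fileA but not in the base file" list that the question asks about.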
