Comparing all elements of two large files

Question

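How can I compare all elements of one file against all elements of another file, using C or Perl, when the data is large? File 1 contains 100,000 such numbers and file 2 contains 500,000 elements.

I used a foreach inside a foreach, splitting each element into arrays. It works fine in Perl, but checking every element of one column of File2 against File1 and printing the result takes 40 minutes. There are 28 such columns.

Is there any way to reduce the time, or should I use another language such as C?

File 1: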

0.1
0.11
0.12
0.13
0.14
0.15
0.16
0.17
0.18
0.19
0.2

File 2:

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.1 0.11    0.12    0.13    0.14    0.15    0.16    0.17    0.18    0.19    0.2 0.21    0.22    0.23    0.24    0.25    0.26    0.27    0.28
1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 1.1 1.11    1.12    1.13    1.14    1.15    1.16    1.17    1.18    1.19    1.2 1.21    1.22    1.23    1.24    1.25    1.26    1.27    1.28

Edit:

Expected output:

Print the column number if the element from File 2 matches, and print '0' if it does not.

1  2  0  0  0  0  0  0  0  10  11  12  13  14  15  16  17  18  19  20  0   0  0  0  0  0  0  0   
0  0  0  0  0  0  0  0  0   0   0  0   0   0   0   0   0   0   0   0  0   0  0  0  0  0  0  0

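Here is the code I am using. Note: it checks File2 against File1 column by column, printing the column number if there is a match and '0' if there is not. It prints the output for each column into 28 different files.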

#!/usr/bin/perl-w
chomp($file = "File1.txt");
open(FH, $file);
@k_org = <FH>;
chomp($hspfile = 'file2.txt');
open(FH1, $hspfile);
@hsporg = <FH1>;
for $z (1 .. 28) {
  open(OUT, ">$z.txt");
  foreach (@hsporg) {
    $i = 0;
    @h_org = split('\t', $_);
    chomp ($h_org[0]);
    foreach(@k_org) {
      @orginfo = split('\t', $_);
      chomp($orginfo[0]);
      if($h_org[0] eq $orginfo[0]) {
        print OUT "$z\n";
        $i = 1;
        goto LABEL;
      } elsif ($h_org[0] ne $orginfo[0]) {
        if($h_org[0]=~/(\w+\s\w+)\s/) {
          if($orginfo[0] eq $1) {
            print  OUT "0";
            $i = 1;
            goto LABEL;
          }
        }
      }
    }
    if ($i == 0) {
      print OUT "0";
    }
    LABEL: 
  }
}
close FH;
close FH1;
close OUT;

Answer 1

If you sort(1) the files, you can then check them in a single pass. That should not take more than a few seconds (including the sort).

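Another way is to load all values from file1 into a hash. This consumes more memory, especially if file1 is large, but it should be fast (again, no more than a few seconds).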

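I would choose Perl over C for this kind of job, even though I am more proficient in C than in Perl. This kind of job is much faster to code in Perl, less error-prone, and it runs fast enough.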

Answer 2

This script runs a test case. Note that your expected output is objectively wrong: in file 2, line 1, column 20, the value 0.2 does exist.

#!perl

use 5.010; # just for `say`
use strict; use warnings;
use Test::More;

# define input files + expected outcome
my $file_1_contents = <<'_FILE1_';
0.1
0.11
0.12
0.13
0.14
0.15
0.16
0.17
0.18
0.19
0.2
_FILE1_

my $file_2_contents = <<'_FILE2_';
0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 0.1 0.11 0.12 0.13 0.14 0.15 0.16 0.17 0.18 0.19 0.2 0.21 0.22 0.23 0.24 0.25 0.26 0.27 0.28
1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 1.1 1.11 1.12 1.13 1.14 1.15 1.16 1.17 1.18 1.19 1.2 1.21 1.22 1.23 1.24 1.25 1.26 1.27 1.28
_FILE2_

my $expected_output = <<'_OUTPUT_';
1 2 0 0 0 0 0 0 0 10 11 12 13 14 15 16 17 18 19 20 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
_OUTPUT_

# open the filehandles
open my $file1, "<", \$file_1_contents or die "$!";
open my $file2, "<", \$file_2_contents or die "$!";
open my $expected, "<", \$expected_output or die "$!";

my %file1 = map { chomp; 0+$_ => undef } <$file1>;

while (<$file2>) {
    chomp;
    my @vals = split;
    # If value exists in file1, print the col number.
    my $line = join " " => map { exists $file1{0+$vals[$_]} ? $_+1 : 0 } 0 .. $#vals;
    chomp(my $expected_line = <$expected>);
    is $line, $expected_line;
}
done_testing;

To print exactly the same output to 28 files, remove the testing code and use the following instead:

say {$_} $line for @filehandles;

Old answer

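Your existing code is quite inefficient and unidiomatic. Let me help you fix that.

First, start all your Perl scripts with use strict; use warnings;. If you have a modern perl (v5.10 or later), you can also use 5.010; (or whatever your version is) to activate additional features.

The chomp call takes a variable and removes the current value of $/ (usually a newline) from the end of the string. This matters because the readline operator does not do that for us. It serves no purpose when declaring a constant variable. Instead, do

my $file    = "File1.txt";
my $hspfile = "file2.txt";

use strict forces you to declare your variables properly, which you can do with my.

To open a file, you should use the following idiom: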

open my $fh, "<", $filename or die "Can't open $filename: $!";

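Instead of or die ..., you can use autodie at the top of your script.

This aborts the script if the file cannot be opened, tells you the reason ($!), and specifies an explicit open mode (here: < = read). That avoids bugs with special characters in filenames.

Lexical filehandles (held in my variables, as opposed to bareword filehandles) are properly scoped and close themselves. There are various other reasons why you should use them.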

The split function takes a regex, not a string, as its first argument. If you inspect your program closely, you will see that you split each element of @hsporg 28 times, and each element of @k_org 28 × @hsporg times (that is, 28 times the number of elements in @hsporg). This is extremely slow, and unnecessary, because we can do the splitting beforehand.

If a condition is false, you do not need to test explicitly for its falseness again in

if ($h_org[0] eq $orginfo[0]) { ...; } elsif ($h_org[0] ne $orginfo[0]) { ...; }

because $a ne $b is exactly equivalent to not $a eq $b.

Using goto in Perl is quite unidiomatic, and jumping to a label somewhere is not especially fast either. Labels are mainly used for loop control:

# random example
LOOP: for my $i (1 .. 10) {
  for my $j (1 .. 5) {
    next if $i == $j;           # start next iteration of current loop
    next LOOP if 2 * $i == $j;  # start next iteration of labeled loop
    last LOOP if $i + $j == 13; # like `break` in C
  }
}

The redo loop control verb is similar to next, but it does not recheck the loop condition, if there is one.

Because of these loop control facilities, and the ability to break out of any enclosing loop, maintaining flags or elaborate gotos is usually quite unnecessary.

Here is a cleaned-up version of your script, without fixing too much of the actual algorithm:

#!/usr/bin/perl
use strict; use warnings;
use autodie; # automatic error messages

my ($file, $hspfile) = ("File1.txt", "file2.txt");

open my $fh1, "<", $file;
open my $fh2, "<", $hspfile;
my @k_org  = <$fh1>;
my @hsporg = <$fh2>;

# Presplit the contents of the arrays:
for my $arr (\@k_org, \@hsporg) {
  for (@$arr) {
    chomp;
    $_ = [ split /\t/ ]; # put an *anonymous arrayref* into each slot
  }
}

my $output_files = 28;
for my $z (1 .. $output_files) {
  open my $out, ">", "$z.txt";

  H_ORG:
  for my $h_org (@hsporg) {
    my $i = 0;

    ORGINFO:
    for my $orginfo (@k_org) {
      # elements in array references are accessed like $arrayref->[$i]
      if($h_org->[0] eq $orginfo->[0]) {
        print $out "$z\n";
        $i = 1;
        last ORGINFO; # break out of this loop
      } elsif($h_org->[0] =~ /(\w+\s\w+)\s/ and $orginfo->[0] eq $1) {
        print $out "0";
        $i = 1;
        last ORGINFO;
      }
    }
    print $out "0" if not $i;
  }
}
# filehandles are closed automatically.

Now we can optimize further: in each line, you only ever use the first element. This means we do not have to store the rest:

...;
  for (@$arr) {
    chomp;
    $_ = (split /\t/, $_, 2)[0]; # save just the first element
  }
...;
    ORGINFO:
    for my $orginfo (@k_org) {
      # each element is now a plain scalar, so no ->[0] is needed
      if($h_org eq $orginfo) {
        ...;
      } elsif($h_org =~ /(\w+\s\w+)\s/ and $orginfo eq $1) {
        ...;
      }
    }

Also, accessing scalars is a bit faster than accessing array elements.

The third argument to split limits the number of resulting fragments. Because we are only interested in the first field, we do not have to split the rest either.

Next, we last out of the ORGINFO loop and then check a flag. This is unnecessary: we can jump directly to the next iteration of the H_ORG loop instead of setting a flag. If we drop out of the ORGINFO loop naturally, the flag is guaranteed not to be set, so we can execute the print unconditionally:

H_ORG:
for my $h_org (@hsporg) {
  for my $orginfo (@k_org) {
    if($h_org eq $orginfo) {
      print $out "$z\n";
      next H_ORG;
    } elsif($h_org =~ /(\w+\s\w+)\s/ and $orginfo eq $1) {
      print $out "0";
      next H_ORG;
    }
  }
  print $out "0";
}

Then, you compare the same data 28 times just to print it to different files. Better: define two subs, print_index and print_zero. Inside these, you loop over the output filehandles:

# make this initialization *before* you use the subs!
my @filehandles = map {open my $fh, ">", "$_.txt"; $fh} 1 .. $output_files;

...; # the H_ORG loop

sub print_index {
  for my $i (0 .. $#filehandles) {
    print {$filehandles[$i]} $i+1, "\n";
  }
}
sub print_zero {
  print {$_} 0 for @filehandles;
}

Then:

  # no enclosing $z loop!
  H_ORG:
  for my $h_org (@hsporg) {
    for my $orginfo (@k_org) {
      if($h_org eq $orginfo) {
        print_index();
        next H_ORG;
      } elsif($h_org =~ /(\w+\s\w+)\s/ and $orginfo eq $1) {
        print_zero();
        next H_ORG;
      }
    }
    print_zero();
  }

This avoids checking data that you already know does not match.

Answer 3

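In C you can try using the "qsort" and "bsearch" functions.

First, you need to load the files into arrays.

Then you should run qsort() (unless you are sure the elements are already in order), and use bsearch() to perform a binary search on the array.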

http://linux.die.net/man/3/bsearch

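This will be much faster than checking all the elements one by one.

If Perl does not already provide one, you could try implementing a binary search in Perl yourself; it is a simple algorithm.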
