Question

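Hi everyone. I have to sort a file of about 10k lines. I wrote the code below, but it takes a very long time to finish. I asked someone, and he told me that using references would make it much faster, but I don't know where to use them. This is what I did in Perl: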

use strict;
use warnings;

open( IN, "dico_corpus.dic" ) or die "$!";
my @tab;
my $i;
my @tabs;
my $c;
my @tabs2;
$i   = 0;
$c   = 0;
@tab = <IN>;
# Read line by line and put the 3rd column (the one I want to sort by) into @tabs2
for ( $i = 0; $i <= $#tab; $i++ ) {
    @tabs = split( /\s+/, $tab[$i] );

    $tabs2[$c] = $tabs[2];

    $c++;

}    # here @tabs2 contains the 3rd column to sort


@tabs2 = sort(@tabs2);

open( OUT, ">>resultat.txt" );    # append the result line by line to resultat.txt

foreach my $word (@tabs2) {# For each sorted value in @tabs2, scan every line
                           # of the original file and compare its 3rd column:
                           # if it matches, print the whole line; if not,
                           # move on to the next line.

    foreach my $var (@tab) {
        @tabs = split( /\s+/, $var );

        if ( $word eq $tabs[2] ) {
            my $ligne = join( "\t", $tabs[1], $tabs[0], $tabs[2] );
            print OUT $ligne, "\n";
        }
    }
}

close(IN);
close(OUT);

Some lines from the original file:

3851 4178 de

1972 6643 la

1391 2246 à

1098 5163 et

656 8429 que

Answer 1

You can use a Schwartzian transform:


#!/usr/bin/perl
use Data::Dumper;
use Modern::Perl;


chomp (my @lines = <DATA>);

my @sorted = 
    map {$_->[0]}
    sort {$a->[1] cmp $b->[1]}
    map { my $third = (split/\s+/,$_)[2]; [$_, $third] }
        @lines;

say Dumper\@sorted;


__DATA__
3851 4178 de
1972 6643 la
1391 2246 à
1098 5163 et
656 8429 que
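
Since the question asks where the references come in: the map/sort/map chain above reads from bottom to top, and each `[ ... ]` creates an anonymous array and hands back a reference to it. As a rough sketch, the same chain can be written as three separate statements. The names `@pairs` and `@sorted_pairs` are made up for illustration, and the `à` line is left out here only to keep encoding questions out of the example.

```perl
#!/usr/bin/perl
use strict;
use warnings;

my @lines = ( "3851 4178 de", "1972 6643 la", "1098 5163 et", "656 8429 que" );

# Step 1 (decorate): pair every line with its third field.
# [ ... ] builds an anonymous array and returns a reference to it.
my @pairs = map { [ $_, ( split /\s+/, $_ )[2] ] } @lines;

# Step 2 (sort): compare the cached third field, stored at index 1.
my @sorted_pairs = sort { $a->[1] cmp $b->[1] } @pairs;

# Step 3 (undecorate): drop the cached key, keep the original line (index 0).
my @sorted = map { $_->[0] } @sorted_pairs;

print "$_\n" for @sorted;
```

The point of the decoration step is that each line is split only once, no matter how many comparisons the sort needs.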

Answer 2

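The reason it is slow is that the nested foreach loops give you a 10K x 10K inner loop. Your friend was telling you to use a hash, with the tabs2 value as the key and the record as the value ($myhash{$tabs[2]}=$tab[$i]). Then you loop once over sort keys %myhash and print $myhash{$thekey}.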
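
A minimal sketch of that idea follows, with one assumption that goes beyond the description above: the third column may contain the same word more than once, so each hash value is a reference to an array of lines instead of a single line. With a plain $myhash{$key} = $line, a later line would overwrite an earlier one that has the same key. The sample data is invented, and the sketch prints the lines unchanged instead of swapping the first two columns as the question's code does.

```perl
use strict;
use warnings;

# Stand-in for the lines read from dico_corpus.dic (invented sample data).
my @tab = ( "3851 4178 de\n", "1972 6643 la\n", "1098 5163 et\n", "12 34 de\n" );

# Key: third column. Value: reference to an array holding every line
# that has this third column.
my %myhash;
for my $line (@tab) {
    my $key = ( split /\s+/, $line )[2];
    push @{ $myhash{$key} }, $line;
}

# A single pass over the sorted keys replaces the nested loops.
my @result;
for my $thekey ( sort keys %myhash ) {
    push @result, @{ $myhash{$thekey} };
}

print @result;
```

Lines that share a key come out in the order in which they were read.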

Answer 3

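Actually, the Schwartzian transform (ST) that @toto mentioned is the way to go here. But I think it may look a bit obscure to you, so I want to show a more explicit solution. It will be slower than the ST, but it may be easier for a beginner to read.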

The first block just reads the complete input file into the array @lines. I used the recommended 3-argument open. For details, see Perl's tutorial on open.

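Perl has a built-in sort function that sorts a list (or array) lexicographically (i.e. ('c', 'a', 'b') → ('a', 'b', 'c')). If that does not fit your needs, you can supply a custom comparison function, as I do here with by_third_column. This function is called with the magic variables $a and $b set to the two items to be compared. In your case, $a and $b are two (arbitrary) complete lines of the input, and the function has to decide which line is the "greater" one.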
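
To see the difference between the default order and a custom comparison on something smaller than whole lines, here is a short sketch with plain numbers. The sub name `numerically` and the sample values are chosen only for illustration.

```perl
use strict;
use warnings;

my @numbers = ( 10, 9, 100 );

# Default sort: items are compared as strings, so "100" sorts before "9".
my @default = sort @numbers;

# Custom comparison: $a and $b are set by sort to the two items to compare.
sub numerically { return $a <=> $b }
my @numeric = sort numerically @numbers;

print "@default\n";    # prints "10 100 9"
print "@numeric\n";    # prints "9 10 100"
```

The comparison function only has to return a negative number, zero, or a positive number; cmp does that for strings and <=> for numbers.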

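So the function by_third_column splits the two given lines and picks the third item ("field") of each. That is the my $a3 = … and my $b3 = … part. Then these third fields are compared lexicographically ($a3 cmp $b3).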

Finally, we call sort on the array @lines, but supply the custom comparison function. The last block just writes (appends) the sorted output to the file 'resultat.txt'.

#!/usr/bin/env perl

use strict;
use warnings;

open( my $in, '<', 'dico_corpus.dic' ) or die "$!";
my @lines = <$in>;
close($in);

sub by_third_column
{
    my $a3 = ( split /\s+/, $a )[2];
    my $b3 = ( split /\s+/, $b )[2];
    return $a3 cmp $b3;
}

my @sorted = sort by_third_column @lines;

open( my $out, '>>', 'resultat.txt' ) or die "$!";
print $out @sorted;
close($out);

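Update: The comments on @toto's and my answers made me curious, so I ran a Benchmark. I wrapped the original code, toto's Schwartzian transform, and my custom comparison function suggestion in three subroutines. I set up an input array of 10_000 lines, each containing three random 10-letter words: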

vfkyscicki nqqnfpjylf kevurxexov
bqordmljgh nrypcmvids tvsxsqhizl
uequmgbhbg bnfdyxgcpo krwnjfuhpe
...
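
How the input was generated is not shown here. One way to build something similar might look like the sketch below; the helper `random_word` and the array name `@input` are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: one word made of $length random lowercase letters.
sub random_word {
    my ($length) = @_;
    return join '', map { ( 'a' .. 'z' )[ int rand 26 ] } 1 .. $length;
}

# 10_000 lines, each holding three random 10-letter words.
my @input;
for ( 1 .. 10_000 ) {
    push @input, join( ' ', map { random_word(10) } 1 .. 3 ) . "\n";
}

print @input[ 0 .. 2 ];    # show the first three lines
```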

The benchmark was run with:

my $speed = Benchmark::timethese(
    -250,
    {
        Custom   => \&custom,
        ST       => \&ST,
        Original => \&original,
    }
);
Benchmark::cmpthese($speed);

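I had to use that many CPU seconds because the performance difference between ST / Custom and Original is so large that I kept getting "warning: too few iterations for a reliable count". The results are: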

Benchmark: running Custom, Original, ST for at least 250 CPU seconds...
    Custom: 269 wallclock secs (268.76 usr +  0.00 sys = 268.76 CPU) @  9.86/s (n=2651)
  Original: 253 wallclock secs (252.18 usr +  0.00 sys = 252.18 CPU) @  0.02/s (n=4)
        ST: 272 wallclock secs (271.72 usr +  0.00 sys = 271.72 CPU) @ 32.82/s (n=8918)

               Rate Original   Custom       ST
Original 1.59e-02/s       --    -100%    -100%
Custom       9.86/s   62086%       --     -70%
ST           32.8/s  206817%     233%       --

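As you can see, Schwartzian transform: "douze points". It is 3-4 times faster than the custom comparison function, which in turn is about 600 times faster than the original approach for 10_000 input lines.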

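So the ST is faster than the custom comparison function (which I did not doubt), but the real improvement comes from not iterating n² times over the input.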
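
A likely reason for the gap between ST and Custom is the number of split calls: the custom comparison splits two lines on every comparison, while the ST splits each line exactly once. The sketch below counts the calls on 1_000 invented lines. The helper `third_field` and the counter variables are hypothetical, and the exact count for the custom comparison depends on Perl's sort implementation.

```perl
use strict;
use warnings;

# 1_000 invented lines whose third column is a random 5-digit string.
my @lines;
push @lines, sprintf( "x y %05d", int( rand 100_000 ) ) for 1 .. 1_000;

my ( $custom_splits, $st_splits ) = ( 0, 0 );

# Hypothetical helper: return the third field and count the call.
sub third_field {
    my ( $line, $counter ) = @_;
    $$counter++;
    return ( split /\s+/, $line )[2];
}

# Custom comparison: two splits for every single comparison.
my @custom = sort {
    third_field( $a, \$custom_splits ) cmp third_field( $b, \$custom_splits )
} @lines;

# Schwartzian transform: one split per line, done before sorting.
my @st = map { $_->[0] }
         sort { $a->[1] cmp $b->[1] }
         map { [ $_, third_field( $_, \$st_splits ) ] } @lines;

print "custom comparison: $custom_splits splits\n";
print "ST:                $st_splits splits\n";
```

Both variants return the same order; only the amount of repeated work differs.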

How to sort a file by its 3rd column using references in Perl
