Question

Private information in this post. Removed.

Answer 1

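The problem with bash scripts is that, although they are very flexible and powerful, they spawn a new process for almost anything, and forking is expensive. In each iteration of your loop you spawn 3× echo, 2× awk, 1× sed and 1× perl. Restricting yourself to one process (and thus one programming language) will improve performance.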

Then, you re-read output.txt on every invocation of perl. IO is always slow, so buffering the file in memory will be more efficient, if you have the memory.

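Multithreading could work if there are no hash collisions, but it is difficult to program. Simply converting the script to Perl will give a bigger performance gain than converting the Perl to multithreaded Perl.^{[citation needed]}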

You would probably write something like

#!/usr/bin/perl
use strict; use warnings;
open my $cracked, "<", "cracked.txt" or die "Can't open cracked";
my @data = do {
  open my $output, "<", "output.txt" or die "Can't open output";
  <$output>;
};

while(<$cracked>) {
  chomp;  # drop the trailing newline, otherwise it ends up inside $pwd
  my ($hash, $seed, $pwd) = split /:/, $_, 3;
  # transform $hash here like "$hash =~ s/foo/bar/g" if really necessary

  # say which line we are at
  print "at line $. with pwd=$pwd\n";

  # do substitutions in @data
  s/\Q$hash\E/$hash ( $pwd )/ for @data;
  # the \Q...\E makes any characters in between non-special,
  # so they are matched literally.
  # (`C++` would match many `C`s, but `\QC++\E` matches the character sequence)
}

# write @data to the output file

(untested or anything, no warranty)

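While this is still an O(n²) solution, it performs better than the bash script. Note that it can be reduced to O(n) by organizing @data into a hash that is indexed by the hash codes: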

# parse_line is a placeholder: do magic there to parse a line and return its key
my %data = map { ( parse_line($_) => $_ ) } @data;
...;
$data{$hash} =~ s/\Q$hash\E/$hash ( $pwd )/; # instead of evil for-loop

Actually, for each hash code you would store a reference to an array of all the lines that contain it, so the previous lines would rather be

my %data;
for my $line (@data) {
   my $key = parse_line($line);
   push @{ $data{$key} }, $line;  # @$data{$key} would be a slice of a hash ref $data
}
...; 
s/\Q$hash\E/$hash ( $pwd )/ for @{$data{$hash}}; # is still faster!
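To make the indexing idea above concrete, here is a minimal, self-contained sketch. The line format of output.txt is not shown in this post, so the sketch assumes that every line starts with the hash code followed by a colon. The subs build_index and annotate, and this version of parse_line, are hypothetical names introduced only for illustration. The index stores references to the elements of @data, so the original line order is preserved.

```perl
#!/usr/bin/perl
use strict; use warnings;

# ASSUMPTION: each data line looks like "hash:rest"; adapt this to the real format.
sub parse_line {
  my ($line) = @_;
  my ($key) = split /:/, $line, 2;
  return $key;
}

# Build the index: hash code => array of references to the lines containing it.
sub build_index {
  my ($data) = @_;                       # array reference of lines
  my %index;
  for my $line (@$data) {                # $line is an alias to the array element
    push @{ $index{ parse_line($line) } }, \$line;
  }
  return \%index;
}

# Annotate every line indexed under $hash; returns the number of lines touched.
sub annotate {
  my ($index, $hash, $pwd) = @_;
  my $refs = $index->{$hash} or return 0;
  for my $ref (@$refs) {
    $$ref =~ s/\Q$hash\E/$hash ( $pwd )/;
  }
  return scalar @$refs;
}

# Usage with made-up sample data
my @data  = ("abc123:user1\n", "def456:user2\n", "abc123:user3\n");
my $index = build_index(\@data);
annotate($index, "abc123", "secret");
print @data;
```

Each lookup now touches only the lines stored under that key instead of scanning all of @data.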

On the other hand, a hash with 8E7 elements might not perform well. The answer lies in benchmarking.
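One way to run such a benchmark is the core Benchmark module. The sketch below compares the full scan with the indexed lookup on synthetic data. The data size, the "hash:rest" line format and the sub names scan_all and via_index are assumptions made up for this example, so treat the numbers it prints as a rough indication only and rerun it with realistic input.

```perl
#!/usr/bin/perl
use strict; use warnings;
use Benchmark qw(cmpthese);

# ASSUMPTION: synthetic data; sizes and line format are invented.
my $n       = 2_000;
my @hashes  = map { sprintf "h%07d", $_ } 1 .. $n;
my @lines   = map { "$_:user\n" } @hashes;
my @lookups = @hashes[0 .. 99];

# Variant 1: try every hash against every line.
sub scan_all {
  my @copy = @lines;
  for my $hash (@lookups) {
    s/\Q$hash\E/$hash ( pw )/ for @copy;
  }
  return \@copy;
}

# Variant 2: index the lines by hash once, then touch only matching lines.
sub via_index {
  my @copy = @lines;
  my %index;
  for my $line (@copy) {
    my ($key) = split /:/, $line, 2;
    push @{ $index{$key} }, \$line;
  }
  for my $hash (@lookups) {
    $$_ =~ s/\Q$hash\E/$hash ( pw )/ for @{ $index{$hash} || [] };
  }
  return \@copy;
}

# Run each variant for at least one CPU second and print a comparison table.
cmpthese(-1, { scan => \&scan_all, index => \&via_index });
```

The cost of building the index is included in via_index on purpose, since the real script would have to pay it as well.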

Answer 2

If you want to improve a Perl regex, you have to pin down where your string is as precisely as possible:

Where possible, prefer /^.{12}GnaGna/ over just /GnaGna/ (in sed's basic regex syntax the first pattern is written /^.\{12\}GnaGna/).
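A small sketch of the difference, using invented sample lines. In Perl syntax the quantifier is written {12} without backslashes. The sub names match_anywhere and match_at_col13 are hypothetical. Note that the two patterns are not equivalent: the anchored one is only an option when you know the column at which the string must appear.

```perl
#!/usr/bin/perl
use strict; use warnings;

# ASSUMPTION: sample lines are invented for illustration.
my @lines = (
  "0123456789abGnaGna tail",   # GnaGna starts at column 13
  "GnaGna at the start",       # GnaGna starts at column 1
  "no match here",
);

# Unanchored: succeeds if GnaGna occurs at any position in the line.
sub match_anywhere {
  my ($line) = @_;
  return $line =~ /GnaGna/ ? 1 : 0;
}

# Anchored: succeeds only if GnaGna follows exactly 12 characters
# counted from the start of the line, so only one position is tried.
sub match_at_col13 {
  my ($line) = @_;
  return $line =~ /^.{12}GnaGna/ ? 1 : 0;
}

for my $line (@lines) {
  printf "%-26s anywhere=%d col13=%d\n",
    $line, match_anywhere($line), match_at_col13($line);
}
```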

You could try the module Regexp::Debugger from CPAN.

Answer 3

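When parsing my work logs, I did this: split the file into N parts (N = num_processors); align the split points to \n. Start N threads, one to process each part. It works very fast, but the hard disk is the bottleneck.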
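The splitting scheme described above could look roughly like this. It is a sketch under assumptions, not the code that was actually used: it uses fork and pipes instead of threads (which works on a Unix-like system without a threads-enabled perl), the per-line work is replaced by simple line counting, and the sub names chunk_ranges, count_lines_in_range and parallel_count are hypothetical.

```perl
#!/usr/bin/perl
use strict; use warnings;
use POSIX ();
use File::Temp qw(tempfile);

# Split $path into at most $n byte ranges [start, end), moving every
# split point forward so that it falls just after a "\n".
sub chunk_ranges {
  my ($path, $n) = @_;
  my $size = (-s $path) || 0;
  open my $fh, "<:raw", $path or die "Can't open $path: $!";
  my @starts = (0);
  for my $i (1 .. $n - 1) {
    my $pos = int($size * $i / $n);
    next if $pos <= $starts[-1];
    seek $fh, $pos, 0 or die "seek failed: $!";
    my $rest = <$fh>;                    # skip the rest of the current line
    my $aligned = tell $fh;
    push @starts, $aligned if $aligned > $starts[-1] && $aligned < $size;
  }
  close $fh;
  my @ranges;
  for my $i (0 .. $#starts) {
    my $end = $i < $#starts ? $starts[$i + 1] : $size;
    push @ranges, [ $starts[$i], $end ];
  }
  return @ranges;
}

# ASSUMPTION: stand-in for the real work; it only counts the lines in a range.
sub count_lines_in_range {
  my ($path, $start, $end) = @_;
  open my $fh, "<:raw", $path or die "Can't open $path: $!";
  seek $fh, $start, 0 or die "seek failed: $!";
  my $count = 0;
  while (tell($fh) < $end and defined(my $line = <$fh>)) {
    $count++;                            # real per-line processing would go here
  }
  return $count;
}

# Fork one child per range; each child reports its result through a pipe.
sub parallel_count {
  my ($path, $n) = @_;
  my @readers;
  for my $range (chunk_ranges($path, $n)) {
    pipe(my $r, my $w) or die "pipe failed: $!";
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {                     # child process
      close $r;
      print {$w} count_lines_in_range($path, @$range), "\n";
      close $w;                          # flushes the result into the pipe
      POSIX::_exit(0);                   # skip END blocks and temp file cleanup
    }
    close $w;
    push @readers, [ $pid, $r ];
  }
  my $total = 0;
  for my $reader (@readers) {
    my ($pid, $r) = @$reader;
    my $count = <$r>;
    chomp $count;
    $total += $count;
    close $r;
    waitpid $pid, 0;
  }
  return $total;
}

# Usage on a throwaway file
my ($tmp_fh, $tmp_path) = tempfile(UNLINK => 1);
print {$tmp_fh} "line $_\n" for 1 .. 1000;
close $tmp_fh;
print "lines: ", parallel_count($tmp_path, 4), "\n";
```

Because every range starts right after a newline, each line belongs to exactly one worker. As noted above, the disk can still be the limiting factor, so more workers do not necessarily help.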

sed / perl regular expressions are very slow.
