Question

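I have some huge log files that I need to sort. Every entry contains a 32-bit hex number, which is the sort key I want to use. Some entries are one-liners like: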

bla bla bla  0x97860afa bla bla

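Others are a bit more complex: they start with the same kind of line as above and expand into a block of lines delimited by curly braces, as in the example below. In this case the entire block has to move to the position defined by the hex number. Block example: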

 bla bla bla  0x97860afc bla bla  
     bla bla {  
         blabla  
            bla bla {  
                bla     
            }  
        }

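I could probably work it out myself, but maybe there is a simple perl or awk solution that would save me half a day.

Comment transferred from the OP:

Indentation can be spaces or tabs; I can enhance that on top of any proposed solution. I think Brian summarized it well: specifically, you want to sort "items", where an item is a chunk of text that starts with a line containing "0xNNNNNNNN" and includes everything up to (but not including) the next line that contains "0xNNNNNNNN" (with the N's changing, of course). No lines are interspersed between items.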

Answer 1

Something like this might work (untested):

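use strict;
use warnings;

my $lastkey;
my %data;
while (my $line = <>) {
    chomp $line;
    if ($line =~ /\b(0x\p{AHex}{8})\b/) {
        # Begin a new entry. Appending the zero-padded line number keeps
        # entries with the same hex value distinct
        # (credit to Brian Gerard for the uniqueness idea).
        my $unique_key = lc($1) . sprintf("%010d", $.);
        $data{$unique_key} = $line;
        $lastkey = $unique_key;
    } elsif (defined $lastkey) {
        # Continue an old entry. The line was chomped, so the whole
        # block ends up collapsed onto a single line.
        $data{$lastkey} .= $line;
    }
}

# The keys are fixed-width strings, so a plain string sort orders them
# by hex value first and by line number second.
print $data{$_}, "\n" for sort keys %data;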

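The catch is that you say "huge" log files, so holding the whole file in memory may be inefficient. However, if you want to sort it, I suspect you will need to do that.

If holding it in memory is not an option, you can always print the data to a file in a format that lets you sort it by some other means.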

Answer 2

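For huge data files, I'd recommend Sort::External.
It doesn't look like you need to parse the braces if the indentation does the job. In that case you act on the "breaks", that is, whenever the indent level is 0: at that point you process the last record gathered, so you are always looking ahead one line.

So: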

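use strict;
use warnings;
use Sort::External;

# Turn a buffered record into "KEY-:-RECORD" and empty the caller's buffer.
sub to_sort_form {
    my $buffer = $_[0];
    my ( $id ) = $buffer =~ m/(0x\p{AHex}{8})/; # grab the first candidate
    $_[0] = '';    # @_ aliases the caller's variable, so this resets it
    return "$id-:-$buffer";
}

# Strip the key and turn the '$--^' markers back into newlines.
sub to_source { 
    my $val = shift;
    my ( $record ) = $val =~ m/-:-(.*)/;
    $record =~ s/\$--\^/\n/g;
    return $record;
}

my $temp_directory = '/tmp';  # assumption: point this at any roomy scratch directory

my $sortex = Sort::External->new(
      mem_threshold   => 1024**2 * 16     # default: 1024**2 * 8 (8 MiB)
    , cache_size      => 100_000          # default: undef (disabled) 
    , sortsub         => sub { $Sort::External::a cmp $Sort::External::b }
    , working_dir     => $temp_directory  # where the temporary sort files go
);

my $buffer = <>;
chomp $buffer;
$buffer .= '$--^';    # the first line needs its end-of-line marker too
while ( <> ) { 
    my ( $indent ) = m/^(\s*)\S/;
    # An unindented line starts a new record, so flush the previous one.
    if ( defined $indent and not length $indent ) {
        $sortex->feed( to_sort_form( $buffer ));
    }
    chomp;
    $buffer .= $_ . '$--^';
}
$sortex->feed( to_sort_form( $buffer ));
$sortex->finish;

while ( defined( $_ = $sortex->fetch ) ) {
    print to_source( $_ );
}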

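Assumptions:

The string '$--^' does not appear in the data on its own.
You're not alarmed by two 8-digit hex strings in one record (only the first one is used as the key).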
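
To make the first assumption concrete, here is a minimal sketch of the flatten-and-restore round trip on its own, without Sort::External. The names encode_record and decode_record and the sample record are made up for illustration; only the '$--^' marker and the "-:-" separator come from the code above.

```perl
use strict;
use warnings;

my $MARK = '$--^';   # same placeholder as above; assumed absent from the data

# Flatten a multi-line record onto one line, prefixed with its sort key.
sub encode_record {
    my ($record) = @_;
    my ($id) = $record =~ /(0x\p{AHex}{8})/;   # the first hex number is the key
    (my $flat = $record) =~ s/\n/$MARK/g;
    return "$id-:-$flat";
}

# Undo encode_record: drop the key, turn the placeholders back into newlines.
sub decode_record {
    my ($encoded) = @_;
    my ($flat) = $encoded =~ /-:-(.*)/s;
    $flat =~ s/\Q$MARK\E/\n/g;
    return $flat;
}

# Hypothetical record shaped like the block example in the question.
my $record = "bla 0x97860afc bla\n    bla {\n    }\n";
my $same   = decode_record(encode_record($record)) eq $record;
print $same ? "round trip ok\n" : "round trip FAILED\n";
```

If the marker did occur in the data, decode_record would turn it into an extra newline, which is why the assumption matters.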

Answer 3

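If the files aren't too big for memory, I'd go with TLP's solution. If they are, you can modify it slightly and print to a file as he suggests. Add this before the while (all untested, ymmv, caveat programmer, etc.):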

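my $currentInFile = "";
my $currentOutFileHandle;    # left undefined so that open() can create the handle

Change the body of the while from the current if-else to

# $ARGV holds the name of the file that <> is currently reading
if ($currentInFile ne $ARGV) {
    if (defined $currentOutFileHandle) {
        if (!close($currentOutFileHandle)) {
            # whatever you want to do if you can't close the previous output file
        }
        undef $currentOutFileHandle;
    }
    $currentInFile = $ARGV;
    my $newOutFile = $ARGV . ".tagged";
    if (!open($currentOutFileHandle, ">", $newOutFile)) {
        # whatever you want to do if you can't open a new output file for writing
    }
}

if ($line =~ /\b(0x\p{AHex}{8})\b/) {    # the conditional from TLP
    $lastkey = $1;
}

if (defined $currentOutFileHandle && defined fileno($currentOutFileHandle)) {
    # Tag every line with the current hex key and its own line number.
    # add more zeroes if the files really are that large :)
    # $line was chomped earlier, so put the newline back.
    print {$currentOutFileHandle} $lastkey . " " . sprintf("%010d", $.) . "\t" . $line . "\n";
}
else {
    # whatever you want to do if $currentOutFileHandle's gone screwy
}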

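Now you have a foo.log.tagged for every foo.log you fed it; the .tagged file contains the original contents, but with "0xNNNNNNNN LLLLLLLLLL\t" (LLLLLLLLLL -> zero-padded line number) prepended to each line. sort(1) actually does a pretty good job of handling large data, but you'll want to look at its --temporary-directory argument if you think it will overflow /tmp with its temp files while chewing through what you feed it. Something like this should get you started: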

sort --output=/my/new/really.big.file --temporary-directory=/scratch/dir/on/roomy/partition *.tagged

Then trim off the tags if desired:

perl -pi -e 's/^[^\t]+\t//' /my/new/really.big.file

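FWIW, I zero-padded the line numbers so there is no need to worry about line 10 sorting before line 2 when their hex keys are identical. Since the hex numbers are the primary sort criterion, we can't just sort numerically.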
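
To see the effect of the padding, here is a small sketch that sorts two tags as plain strings, which is how a non-numeric sort compares them. The tag helper and the sample key are made up for illustration.

```perl
use strict;
use warnings;

# Build a tag from a hex key and a line number, with or without zero padding.
sub tag {
    my ($hex, $lineno, $pad) = @_;
    return $pad ? sprintf("%s %010d", $hex, $lineno) : "$hex $lineno";
}

# Two lines that share the hypothetical key 0x97860afa: line 2 and line 10.
my @plain  = sort { $a cmp $b } map { tag("0x97860afa", $_, 0) } (2, 10);
my @padded = sort { $a cmp $b } map { tag("0x97860afa", $_, 1) } (2, 10);

print "unpadded: $_\n" for @plain;    # line 10 comes out ahead of line 2
print "padded:   $_\n" for @padded;   # line 2 comes out ahead of line 10
```

Without padding the comparison stops at the first differing character, and "1" sorts before "2"; with fixed-width numbers, string order and numeric order agree.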

Answer 4

One way (untested):

perl -wne'BEGIN{ $key = " " x 10 }' \
    -e '$key = $1 if /(0x[0-9a-f]{8})/;' \
    -e 'printf "%s%.10d%s", $key, $., $_' \
    inputfile \
    | LC_ALL=C sort \
    | perl -wpe'substr($_,0,20,"")'

Answer 5

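The solution from TLP worked nicely with a few minor tweaks. Collecting each whole entry before sorting is a good idea; next I have to add a post-parse step to restore the code blocks that got collapsed, but that's an easy one. Below is the final tested version. Thanks everyone, stackoverflow is awesome.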

#!/usr/bin/perl -w
my $line;
my $lastkey;
my %data;
while($line = <>) {
  chomp $line;
  if ($line =~ /\b(0x\p{AHex}{8})\b/) {
    # Begin a new entry
    #my $unique_key = $1 . $.; # cred to [Brian Gerard][1] for uniqueness
    my $unique_key = hex($1);
    $data{$unique_key} = $line;
    $lastkey = $unique_key;
  } else {
    # Continue an old entry
    $data{$lastkey} .= $line;
  }
}
print $data{$_}, "\n" for (sort { $a <=> $b } keys %data);
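
One thing to watch, assuming two entries can ever share the same hex value: with hex($1) as the hash key, a later entry replaces an earlier one that has the same number. A minimal sketch of one way around that is to keep a list of entries per key; it also leaves the newlines in place so the blocks are not collapsed. The sub name sort_entries and the sample lines are made up for illustration.

```perl
use strict;
use warnings;

# Group lines into entries and return the entries sorted by hex value.
# Entries that share a key keep their original relative order.
sub sort_entries {
    my @lines = @_;
    my %data;       # numeric key => list of entries
    my $current;    # the list that holds the entry being built
    for my $line (@lines) {
        if ($line =~ /\b(0x\p{AHex}{8})\b/) {
            # Begin a new entry under its numeric key
            my $key = hex $1;
            push @{ $data{$key} }, $line;
            $current = $data{$key};
        } elsif ($current) {
            # Continue the current entry, newline included
            $current->[-1] .= $line;
        }
    }
    return map { @{ $data{$_} } } sort { $a <=> $b } keys %data;
}

# Hypothetical input: a block, a one-liner, and a repeated key.
my @sample = (
    "bla 0x97860afc bla\n",
    "    bla {\n",
    "    }\n",
    "bla 0x97860afa bla\n",
    "bla 0x97860afc again\n",
);
print sort_entries(@sample);
```

To run it over real files, replace @sample with the lines read from <>; like the version above, that still holds everything in memory.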

Perl sorting problem
