Question

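I have very large binary files with no lines and no field delimiters. The goal is to convert these files efficiently into tab-delimited files.

The file structure is as follows:

Every record has a fixed length of 20 bytes. The three fields have different lengths: 3, 7, and 10 bytes. Each field also holds a different data type: fields 1 and 2 are int, field 3 is char.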
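For anyone who wants to reproduce this layout, here is a minimal sketch that writes a small sample file. It assumes the two int fields are stored as zero-padded ASCII digits and the char field is padded with spaces, which the question does not actually state. The function names make_record and make_sample are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: write one 20-byte record to stdout.
# ASSUMPTION: fields 1 and 2 are zero-padded ASCII digits (3 and 7 bytes),
# field 3 is text padded or truncated to exactly 10 bytes.
make_record() {
    printf '%03d%07d%-10.10s' "$1" "$2" "$3"
}

# Hypothetical helper: write two back-to-back records, with no separators.
make_sample() {
    make_record 1 42 hello
    make_record 2 7 world
}
```

Running make_sample > sample.bin and then od -c sample.bin shows the 40 bytes with no newline anywhere, which is the situation described here.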

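What is the most efficient way to process these files? I would like to keep it as simple as possible and use Bash tools (dd, od, sed, awk), avoiding perl/python unless the performance difference is extreme.

Below is a working attempt, but it is very slow. I am new to these tools, so a detailed explanation would be much appreciated.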

binfile="binfile.BIN"

for (( i = 0 ; i <= 20000000 ; i += 20 ))
do
    # Each od call reads one field: skip to its offset, then read its length.
    field1=$( od "${binfile}" -An --skip-bytes="$((${i}))" --read-bytes=3 --format=dI )
    field2=$( od "${binfile}" -An --skip-bytes="$((${i}+3))" --read-bytes=7 --format=dI )
    field3=$( od "${binfile}" -An --skip-bytes="$((${i}+10))" --read-bytes=10 --format=c )

    # -e makes echo expand '\t' into a tab character.
    echo -e ${field1}'\t'${field2}'\t'${field3} >> output.tab
done

Answer 1

fold -b -w 20 | cut --output-delimiter $'\t' -b 1-3,4-10,11-20

If your "cut" does not support --output-delimiter, try "gcut" (GNU cut) or consider installing GNU coreutils.

(Please let us know how fast the different solutions you try turn out to be :-)
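If GNU cut is not an option, the same split can be sketched with awk's substr instead. This is only a sketch under assumptions: the function name split_records is made up, the fields are treated as plain bytes, and, like the pipeline above, it relies on fold, so it assumes the data contains no newline bytes.

```shell
#!/bin/sh
# Hypothetical helper: read fixed 20-byte records on stdin, write
# tab-separated fields (3, 7 and 10 bytes) on stdout.
# ASSUMPTION: the input has no newline bytes, because fold treats a
# newline as a line break.
split_records() {
    # LC_ALL=C makes both tools count bytes rather than characters.
    LC_ALL=C fold -b -w 20 |
    LC_ALL=C awk '{ print substr($0, 1, 3) "\t" substr($0, 4, 7) "\t" substr($0, 11, 10) }'
}
```

Usage would look like split_records < binfile.BIN > output.tab. Padding inside a field is kept as-is.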

Answer 2

# Open the file named on the command line in raw (binary) mode.
open(my $fh, '<:raw', shift) or die "Cannot open file: $!\n";

local $" = "\t";

while ( read $fh, my $rec, 20 ) {
    my @f = unpack 'a3 a7 a10', $rec;
    print "@f\n";
}

Answer 3

This reads from STDIN, writes to STDOUT, and performs error checking:

#!/usr/bin/perl

use strict;
use warnings;

use constant BLOCK_SIZE => 20;

binmode STDIN;    

while (1) {
    my $rv = read(STDIN, my $buf, BLOCK_SIZE);
    die("Error: $!\n") if !defined($rv);
    last if !$rv;
    die("Error: Insufficient data\n") if $rv != BLOCK_SIZE;
    print(join("\t", unpack('a3 a7 a10', $buf)), "\n");
}

But I am fairly sure you will find that slower than reading many records per call, so I would use the following:

#!/usr/bin/perl

use strict;
use warnings;

use constant BLOCK_SIZE => 20;

binmode STDIN;    

my $buf = '';    # start empty so length($buf) is a valid offset on the first read
while (1) {
    my $rv = sysread(STDIN, $buf, BLOCK_SIZE*64*1024, length($buf));
    die("Error: $!\n") if !defined($rv);
    last if !$rv;

    while (length($buf) >= BLOCK_SIZE) {
       print(join("\t", unpack('a3 a7 a10', substr($buf, 0, BLOCK_SIZE, ''))), "\n");
    }
}

die("Error: Insufficient data\n") if length($buf);
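The same "insufficient data" check can also be sketched in shell, as a guard to run before a pipeline of standard tools. This is a sketch only: the function name check_record_size is made up, and it checks the size of a regular file rather than a stream.

```shell
#!/bin/sh
# Record length in bytes, taken from the question.
RECORD_SIZE=20

# Hypothetical helper: succeed only if the file size is a whole
# number of 20-byte records.
check_record_size() {
    size=$(wc -c < "$1") || return 2    # file missing or unreadable
    if [ $((size % RECORD_SIZE)) -ne 0 ]; then
        echo "Error: Insufficient data" >&2
        return 1
    fi
}
```

For example, check_record_size binfile.BIN && some_pipeline < binfile.BIN runs the conversion only when no partial record is left at the end of the file (some_pipeline is a placeholder).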

Read, process continuous binary file - efficient
