Question

I have to split a very large file into N smaller files, with the following constraints:

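- I have to split on record boundaries
- The record separator can be any character
- The number of records in each of the N resulting files should be the same (+/- 1 record)
- I can only use bash and the standard coreutils (I have a working solution in Perl, but we are not allowed to install Perl / Python / etc.)
- This is not a real constraint, but, if possible, I would like to scan the original (large) file only once.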

The order of the records in the resulting files does not matter.

My working solution in Perl reads the original file and writes...

- the 1st record to the first file
- ...
- the Nth record to the Nth file
- the N+1 record back to the first file
- etc

So, in the end, a single scan of the initial file gives me several smaller files with the same number of records (+/- 1).
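For reference, the round-robin distribution described above can be sketched with GNU `split`, assuming the installed coreutils is recent enough for `split` to support both `-t`/`--separator` and the `-n r/N` round-robin mode (older versions may lack `-t`). The function name `split_records` and the default prefix `out_` are only illustrative:

```bash
#!/bin/bash
# Sketch only: assumes a GNU split that supports -t (record separator)
# and -n r/N (round-robin distribution of records over N files).
# split_records and the "out_" default prefix are hypothetical names.
split_records() {
  input=$1        # file to split
  separator=$2    # single-character record separator
  n=$3            # number of output files
  prefix=${4:-out_}

  # Records are handed out in turn to ${prefix}aa, ${prefix}ab, ...
  # Each record is written together with its trailing separator.
  split -t "$separator" -n "r/$n" -- "$input" "$prefix"
}
```

Called as `split_records big.dat A 3`, this should read the input once and produce `out_aa`, `out_ab` and `out_ac`. Note that every record keeps its trailing separator, so an output file may end with a separator character.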

For example, suppose this is the input file:

1,1,1,1A2,2,2,2A3,
3,3,3A4,4,4,4A5,5,
5,5A6,6,6,6A7,7,7,
7,A8,8,8,8A9,9,9,9
A0,0,0,0

With record separator = 'A' and N = 3, I should get three files:

# First file:
1,1,1,1A2,2,2,2A3,
3,3,3

# Second file
4,4,4,4A5,5,
5,5A6,6,6,6

# Third file:
7,7,7,
7,A8,8,8,8A9,9,9,9
A0,0,0,0

Update

Here is the Perl code. I tried to keep it as simple and readable as possible:

#!/usr/bin/perl

use warnings;
use strict;
use locale;
use Getopt::Std;

#-----------------------------------------------------------------------------
# Declaring variables
#-----------------------------------------------------------------------------
my %op = ();        # Command line parameters hash
my $line = 0;       # Output file line number
my $fnum = 0;       # Output file number
my @fout = ();      # Output file names array
my @fhnd = ();      # Output file handles array
my @ifiles = ();    # Input file names
my $i = 0;          # Loop variable

#-----------------------------------------------------------------------------
# Handling command line arguments
#-----------------------------------------------------------------------------
getopts("o:n:hvr:", \%op);
die "Usage: lfsplit [-h] -n number_of_files",
    " [-o outfile_prefix] [-r rec_sep_decimal] [-v] input_file(s)\n"
    if $op{h} ;
if ( @ARGV ) {
    @ifiles = @ARGV ;
} else {
    die "No input files...\n" ;
}
$/ = chr($op{r}) if $op{r} ;

#-----------------------------------------------------------------------------
# Setting Default values
#-----------------------------------------------------------------------------
$op{o} ||= 'out_' ;   # default only when -o is not given

#-----------------------------------------------------------------------------
# Body - split in round-robin to $op{n} files
#-----------------------------------------------------------------------------
for ( $i = 0 ; $i < $op{n} ; $i++ ) {
    local *OUT ;                # Localize file glob
    $fout[$i] = sprintf "%s_%04d.out", $op{o}, $i ;
    open ( OUT, "> $fout[$i]" ) or
        die "[lfsplit] Error writing to $fout[$i]: $!\n";
    push ( @fhnd , *OUT ) ;
}
$i = 0 ;
foreach ( @ifiles ) {
    print "Now reading $_ ..." if $op{v} ;
    open ( IN, "< $_" ) or
        die "[lfsplit] Error reading $_: $!\n" ;
    while ( <IN> ) {
        print { $fhnd[$i] } $_ ;
        $i = 0 if ++$i >= $op{n} ;
    }
    close IN ;
}
for ( $i = 0 ; $i < $op{n} ; $i++ ) {
    close $fhnd[$i] ;
}

#-----------------------------------------------------------------------------
# Exit
#-----------------------------------------------------------------------------
exit 0 ;

Answer 1

Just for kicks, here is a pure bash solution, with no external programs and no forking (I think):

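#!/bin/bash

input=$1
separator=$2
outputs=$3

i=0
# IFS= keeps leading/trailing whitespace (including newlines) in each record.
# The || test picks up a final record that is not followed by a separator.
while IFS= read -r -d "$separator" record || [[ -n $record ]]; do
  out=$((i % outputs)).txt
  if ((i < outputs)); then
    : > "$out"                          # first record for this file: truncate it
  else
    printf '%s' "$separator" >> "$out"  # later records: restore the separator
  fi
  printf '%s' "$record" >> "$out"
  ((i++))
done < "$input"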

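Unfortunately, this reopens the output file for every single write. I'm sure this could be fixed by opening the file descriptors once (with <>) and keeping them open, but working with non-literal file descriptors is a bit of a pain.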
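A minimal sketch of that idea, assuming bash 4.1 or later, where the `{varname}>` redirection form lets bash pick a free descriptor and store its number in a variable. It uses plain `>` rather than `<>` because the outputs are only written, and the function name `split_round_robin` is just illustrative:

```bash
#!/bin/bash
# Sketch only: requires bash >= 4.1 for the {varname}> redirection form.
# split_round_robin is a hypothetical name; outputs go to 0.txt, 1.txt, ...
split_round_robin() {
  local input=$1 separator=$2 outputs=$3
  local -a fds=()
  local i fd record

  # Open (and truncate) every output file once and remember its descriptor.
  for ((i = 0; i < outputs; i++)); do
    exec {fd}> "$i.txt"
    fds[i]=$fd
  done

  i=0
  while IFS= read -r -d "$separator" record || [[ -n $record ]]; do
    fd=${fds[i % outputs]}
    # From the second round onward, put the separator back between records.
    if ((i >= outputs)); then
      printf '%s' "$separator" >&"$fd"
    fi
    printf '%s' "$record" >&"$fd"
    i=$((i + 1))
  done < "$input"

  # Close every descriptor that was opened above.
  for fd in "${fds[@]}"; do
    exec {fd}>&-
  done
}
```

Called as `split_round_robin big.dat A 3`, this still scans the input only once, but each output file is opened a single time instead of once per record. It remains a pure-bash loop, so it will likely be slow on very large inputs.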
