Split a file at every Nth occurrence of a delimiter

Time: 2013-03-21 23:19:45

Tags: file unix split chunking

Is there a one-liner that splits a text file into pieces/chunks after every Nth occurrence of a delimiter?

Example: the delimiter below is "+"

entry 1
some more
+
entry 2
some more
even more
+
entry 3
some more
+
entry 4
some more
+
...

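There are millions of entries, so splitting at every occurrence of the delimiter "+" is a bad idea. I want to split at, say, every 50,000th instance of the delimiter "+".

The Unix commands "split" and "csplit" don't seem to do this...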

3 answers:

Answer 0 (score: 13)

With awk you can:

awk '/^\+$/ { delim++ } { file = sprintf("chunk%s.txt", int(delim / 50000)); print >> file; }' < input.txt 

Update

To leave the boundary delimiter (every 50,000th "+") out of the output, try this:

awk '/^\+$/ { if(++delim % 50000 == 0) { next } } { file = sprintf("chunk%s.txt", int(delim / 50000)); print > file; }' < input.txt 

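The next keyword makes awk stop applying rules to the current record and move on to the next one (line). I also changed >> to > because if you run the command more than once, you probably don't want to append to the old chunk files. In awk, > truncates a file only when it is first opened during a run; later prints to the same file append to it.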
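For a very large input, a variant that takes the chunk size as a variable and closes each chunk once it is complete may be easier to reuse. The following is a minimal sketch, not part of the original answer: the variable name n, the file name sample.txt and the chunkN.txt names are illustrative, and n is set to 2 so the behaviour can be checked on a tiny input. It keeps the delimiter lines in the output, with each Nth delimiter ending a chunk.

```shell
# Build a small sample: four entries, each terminated by a "+" line.
printf '%s\n' 'entry 1' '+' 'entry 2' '+' 'entry 3' '+' 'entry 4' '+' > sample.txt

# Write every line to the current chunk; after each n-th delimiter,
# close that chunk and switch to the next file name.
awk -v n=2 '
    BEGIN { file = "chunk0.txt" }
    { print > file }
    /^\+$/ && ++delim % n == 0 {
        close(file)
        file = sprintf("chunk%d.txt", delim / n)
    }
' sample.txt
```

Because a new file is only created when a line is actually printed to it, this sketch should not leave an empty chunk behind when the input ends exactly on a boundary.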

Answer 1 (score: 1)

If you can't find a suitable alternative, it is not very hard to do in Perl (and it will perform reasonably well):

#!/usr/bin/env perl
use strict;
use warnings;

# Configuration items - could be set by argument handling
my $prefix = "rs.";     # File prefix
my $number = 1;         # First file number
my $width  = 4;         # Number of digits to use in file name
my $rx     = qr/^\+$/;  # Match regex
my $limit  = 3;         # 50,000 in real case
my $quiet  = 0;         # Set to 1 to suppress file names

sub next_file
{
    my $name = sprintf("%s%.*d", $prefix, $width, $number++);
    open my $fh, '>', $name or die "Failed to open $name for writing: $!";
    print "$name\n" unless $quiet;
    return $fh;
}

my $fh = next_file;  # Output file handle
my $counter = 0;     # Match counter
while (<>)
{
    print $fh $_;
    $counter++ if (m/$rx/);
    if ($counter >= $limit)
    {
        close $fh;
        $fh = next_file;
        $counter = 0;
    }
}
close $fh;

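This is far from a one-liner; I'm not sure whether that counts in its favour. The configurable items are grouped together and could, for example, be set from command-line options. You can end up with an empty file; you could detect that and remove it if necessary. That needs a second counter: the existing one is a 'match counter', but you would also need a line counter, and if the line counter is zero at the end you would remove the last file. You would also need to keep the file name in order to remove it... fiddly, but not difficult.

Given the input (basically two copies of the sample data), the output from repsplit.pl (repeated split) looks like this: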
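The clean-up of an empty trailing file can also be done afterwards from the shell, without adding a second counter to the script. A minimal sketch, assuming the rs. prefix used by the script and that nothing else in the directory matches rs.*; the two files created here are only stand-ins for real output:

```shell
# Stand-ins for the script's output: one chunk with data, one left empty.
printf '%s\n' 'entry 1' '+' > rs.0001
: > rs.0002

# Remove every chunk that exists but is empty (-s is true for non-empty files).
for f in rs.*; do
    [ -e "$f" ] || continue   # the glob matched nothing
    [ -s "$f" ] || rm -- "$f"
done
```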

$ perl repsplit.pl data
rs.0001
rs.0002
rs.0003
$ cat data
entry 1
some more
+
entry 2
some more
even more
+
entry 3
some more
+
entry 4
some more
+
entry 1
some more
+
entry 2
some more
even more
+
entry 3
some more
+
entry 4
some more
+
$ cat rs.0001
entry 1
some more
+
entry 2
some more
even more
+
entry 3
some more
+
$ cat rs.0002
entry 4
some more
+
entry 1
some more
+
entry 2
some more
even more
+
$ cat rs.0003
entry 3
some more
+
entry 4
some more
+
$
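A quick way to check a split is to count the delimiter lines in each chunk. This is a small sketch rather than part of the original answer; the two files are stand-ins for chunks produced with a limit of 3, where the last chunk is short:

```shell
# Stand-ins for two chunks produced with a limit of 3.
printf '%s\n' 'a' '+' 'b' '+' 'c' '+' > rs.0001
printf '%s\n' 'd' '+' 'e' '+' > rs.0002

# Print "file:count" for each chunk; all but the last should equal the limit.
grep -c '^+$' rs.0001 rs.0002
```

In a basic regular expression, as used by grep here, an unescaped + is a literal character, so the pattern matches only lines consisting of a single "+".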

Answer 2 (score: 0)

A concise "one-liner" that uses + as the input record separator

If, as suggested in the comments, you want each chunk written to newprefix.part.$c:

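$ limit=50000 perl -053 -Mautodie -ne '
    # -053 sets the input record separator to "+", so each record is one entry.
    # Start a new output file before every $ENV{limit}-th record.
    if ($count++ % $ENV{limit} == 0) {
        close $fh if $fh;
        open $fh, ">", "newprefix.part." . $c++;
    }
    print $fh $_;
' file.txt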

$ ls -l newprefix.part.*
