How to insert \n after every nth delimiter without "memory issue" in Bash

Time: 2017-07-12 08:07:53

Tags: bash shell out-of-memory

I have a big sequence file with 'Ç' as the delimiter. We need to split it at every 40th 'Ç' into a new line.

We have tried using the perl/cut commands, but we're getting an "out of memory" error, because it's a huge file and the read/write happens all at once.

So what I want is the following:

Cut at every 40th delimiter occurrence, write/flush that chunk to the file without holding it in memory, then do the same for the next 40, and so on.

Is this achievable in Bash?

Any help would be highly appreciated.

Edit:

This is the command we used in Perl:

perl -pe 's{Ç}{++$n % 40 ? $& : "\n"}ge' <file_name>

Say the data is as follows.

123ÇasfiÇsadfÇtest1Ç123ÇasfiÇsadfÇtest1ÇmockÇdataÇtest1Ç123ÇasfiÇsadfÇtest1ÇmockÇdata

I want to cut at (say) every 3rd delimiter, start a new line there, assign the chunk to a variable or something, and flush it to the file so that memory is cleared.

Expected output

123ÇasfiÇsadf
test1Ç123Çasfi
sadfÇtest1Çmock

Note: It's a huge sequence file. We're able to achieve the desired output with the above command, but for a larger file it throws a memory exception, and hence we want to flush the chunks.

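2 answers:

Answer 0 (score: 1)

This is a bit long, but it tells Perl to treat Ç as the record separator instead of \n; you can then accumulate the "lines" as you read them, process them in batches, and output them in groups. (My Perl is rusty; there may be a simpler way.)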

 perl -ne 'BEGIN {$/="Ç"; $c=0; sub d { chomp $out; print "$out\n"; $out=""; $c=0; }}
           $out .= $_; $c++; &d if $c == 3;
           END { &d if $c }' tmp.txt

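At the start of the script, we change $/ from its default value, newline, to your delimiter; a "line" is now defined as a string ending in Ç. We initialize a counter $c to track how many lines we have read, and define a subroutine that outputs the lines accumulated in the variable $out (after chomp strips the trailing delimiter), then resets the accumulator and the counter.

For each line of input, we first append the line to the accumulator, increment the counter, and then call the output routine once the counter reaches the target group size.

Finally, we call the output routine at the end of input to flush whatever lines remain in the accumulator.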
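For readers who would rather not use Perl, the same accumulate-and-flush idea can be sketched in Python 3. This is only an illustration, not part of the answer above: the names read_records and regroup are made up, the input is assumed to be UTF-8 (so Ç is the two bytes C3 87), and streams are handled as bytes. Input is read in fixed-size chunks, so memory use is bounded by the chunk size plus the longest single record rather than by the file size.

```python
import io

DELIM = b"\xc3\x87"  # assumption: input is UTF-8, where "Ç" is these two bytes


def read_records(stream, delim, chunk_size=65536):
    """Yield delimiter-separated records from a binary stream, one at a time."""
    buf = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buf += chunk
        parts = buf.split(delim)
        buf = parts.pop()  # last piece may be incomplete; keep it for the next chunk
        for part in parts:
            yield part
    if buf:
        yield buf  # trailing record that has no delimiter after it


def regroup(src, dst, delim, group_size):
    """Write group_size records per output line, writing each line as it completes."""
    group = []
    for record in read_records(src, delim):
        group.append(record)
        if len(group) == group_size:
            dst.write(delim.join(group) + b"\n")
            group = []
    if group:  # leftover records, fewer than group_size
        dst.write(delim.join(group) + b"\n")


# Small in-memory demo with a group size of 3, as in the question's example.
demo_in = io.BytesIO(DELIM.join([b"123", b"asfi", b"sadf", b"test1", b"123"]))
demo_out = io.BytesIO()
regroup(demo_in, demo_out, DELIM, 3)
```

On a real file one would presumably call regroup(sys.stdin.buffer, sys.stdout.buffer, DELIM, 40) and redirect input and output in the shell. Like the Perl version, this treats a final newline in the input as part of the last record.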

Answer 1 (score: 0)

If Python is an option, here is a port of the C code I proposed:

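# -*- coding: latin1 -*-
import sys

def cvt(fdin, fdout, delim, count):
    """Copy fdin to fdout, replacing every count-th delim with a newline."""
    curr = count                  # delimiters left before the next newline
    while True:
        c = fdin.read(1)          # one character at a time, so memory use stays constant
        if not c:                 # an empty string means end of input
            break
        if c == delim:
            curr -= 1
            if curr == 0:
                curr = count
                c = '\n'
        fdout.write(c)

cvt(sys.stdin, sys.stdout, 'Ç', 3)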

It gives the expected output:

echo "123ÇasfiÇsadfÇtest1Ç123ÇasfiÇsadfÇtest1ÇmockÇdataÇtest1Ç123ÇasfiÇsadfÇtest1ÇmockÇdata" | python ess.py
123ÇasfiÇsadf
test1Ç123Çasfi
sadfÇtest1Çmock
dataÇtest1Ç123
asfiÇsadfÇtest1
mockÇdata
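Reading one character per call keeps memory flat, but it can be slow on a very large file. A possible variant, sketched here under assumptions, reads larger blocks and still replaces every n-th delimiter in place. It assumes Python 3, binary streams, and UTF-8 input (so the delimiter is two bytes and may straddle a block boundary); the name split_every_nth and the block size are illustrative, not taken from the answer.

```python
import io

DELIM = b"\xc3\x87"  # assumption: input is UTF-8, where "Ç" is these two bytes


def split_every_nth(src, dst, delim, n, chunk_size=1 << 20):
    """Copy src to dst, replacing every n-th occurrence of delim with a newline."""
    seen = 0                 # delimiters seen since the last inserted newline
    carry = b""              # held-back tail that might start a split delimiter
    keep = len(delim) - 1    # number of tail bytes that must be held back
    while True:
        chunk = src.read(chunk_size)
        data = carry + chunk
        if not chunk:
            dst.write(data)  # leftover is shorter than delim, so it has no match
            return
        pieces = []
        pos = 0
        hit = data.find(delim, pos)
        while hit >= 0:
            pieces.append(data[pos:hit])
            seen += 1
            if seen == n:
                pieces.append(b"\n")   # every n-th delimiter becomes a newline
                seen = 0
            else:
                pieces.append(delim)   # other delimiters are copied unchanged
            pos = hit + len(delim)
            hit = data.find(delim, pos)
        # Hold back the last `keep` bytes, but never bytes already consumed.
        cut = max(len(data) - keep, pos)
        pieces.append(data[pos:cut])
        carry = data[cut:]
        dst.write(b"".join(pieces))


# Small in-memory demo; a tiny chunk size forces delimiters across block boundaries.
demo_in = io.BytesIO(DELIM.join([b"123", b"asfi", b"sadf", b"test1", b"123"]))
demo_out = io.BytesIO()
split_every_nth(demo_in, demo_out, DELIM, 3, chunk_size=4)
```

On a real file this would presumably be wired up as split_every_nth(sys.stdin.buffer, sys.stdout.buffer, DELIM, 40). Only one block plus a few held-back bytes is in memory at a time, regardless of file size.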