I have a big sequence file that uses 'Ç' as the delimiter. We need to break it onto a new line at every 40th 'Ç'.
We have tried the perl and cut commands, but we get an "out of memory" error, because the file is huge and it is read and written all at once.
So what I want is the following:
Cut at every 40th delimiter occurrence, write/flush that chunk to the file instead of holding it in memory, then do the same for the next 40, and so on.
Is this achievable in Bash?
Any help would be highly appreciated.
Edit:
This is the Perl command we used:
perl -pe 's{Ç}{++$n % 40 ? $& : "\n"}ge' <file_name>
Say the data is as follows.
123ÇasfiÇsadfÇtest1Ç123ÇasfiÇsadfÇtest1ÇmockÇdataÇtest1Ç123ÇasfiÇsadfÇtest1ÇmockÇdata
I want to cut at, say, every 3rd delimiter (replacing it with a new line), assign the chunk to a variable or similar, and flush it to the file so that the memory is freed.
Expected output
123ÇasfiÇsadf
test1Ç123Çasfi
sadfÇtest1Çmock
Note: it's a huge sequence file. We get the desired output with the command above, but for a larger file it throws a memory exception, so we want to flush the chunks as we go.
Answer 0 (score: 1)
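This is a bit long, but it tells Perl to treat Ç as the record separator instead of \n; you can then join the "lines" as you read them, process them in batches, and output them in groups. (My Perl is rusty; there may be a simpler way.)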
perl -ne 'BEGIN {$/="Ç"; $c=0; sub d { chomp $out; print "$out\n"; $out=""; $c=0; }}
$out .= $_; $c++; &d if $c == 3;
END { &d }' tmp.txt
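At the start of the script we change $/ from its default value, newline, to your delimiter; a "line" is now defined as a string ending in Ç. We initialize a counter $c to track how many lines we have read, and define a subroutine that outputs the lines accumulated in the variable $out (after chomp strips the trailing Ç), then resets the accumulator and the counter.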
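For each line of input, we first append the line to the accumulator, increment the counter, and then call the output routine once the counter reaches the target group size (3 in this example).
Finally, we call the output routine at the end of input to flush any lines remaining in the accumulator.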
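For comparison, the same record-oriented idea can be sketched in Python. This is only an illustration, not part of the Perl answer: the names `records` and `regroup` and the `chunk_size` parameter are made up here, and the streams are assumed to be opened in binary mode so the delimiter is passed as bytes. The file is read in fixed-size chunks, so memory use is bounded by the chunk size plus the longest single field.

```python
def records(stream, sep, chunk_size=65536):
    """Yield sep-terminated records from a binary stream, separator removed.

    Hypothetical helper: plays the role of Perl's $/ plus chomp.
    """
    buf = b""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        buf += chunk
        # Everything before the last separator is complete; the remainder
        # (possibly ending in half a multi-byte separator) is carried over.
        *complete, buf = buf.split(sep)
        yield from complete
    if buf:
        yield buf  # trailing record that had no terminator


def regroup(fdin, fdout, sep, count, chunk_size=65536):
    """Write `count` records per output line, joined by `sep`."""
    group = []
    for rec in records(fdin, sep, chunk_size):
        group.append(rec)
        if len(group) == count:
            fdout.write(sep.join(group) + b"\n")
            group = []
    if group:
        # Leftover records at end of input are written as they are.
        fdout.write(sep.join(group))
```

Assuming a Latin-1 encoded file, it could be driven with `regroup(sys.stdin.buffer, sys.stdout.buffer, "Ç".encode("latin-1"), 40)` after `import sys`; for a UTF-8 file the delimiter would be encoded as UTF-8 instead.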
Answer 1 (score: 0)
If Python is an option, here is a port of the C code I proposed:
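# -*- coding: latin1 -*-
import sys

def cvt(fdin, fdout, delim, count):
    # curr counts down the delimiters left before the next line break
    curr = count
    while True:
        c = fdin.read(1)          # one character at a time: constant memory
        if not c:                 # empty read means end of input
            break
        if c == delim:
            curr -= 1
            if curr == 0:
                # every count-th delimiter becomes a newline
                curr = count
                c = '\n'
        fdout.write(c)

cvt(sys.stdin, sys.stdout, 'Ç', 3)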
It gives the expected output:
echo "123ÇasfiÇsadfÇtest1Ç123ÇasfiÇsadfÇtest1ÇmockÇdataÇtest1Ç123ÇasfiÇsadfÇtest1ÇmockÇdata" | python ess.py
123ÇasfiÇsadf
test1Ç123Çasfi
sadfÇtest1Çmock
dataÇtest1Ç123
asfiÇsadfÇtest1
mockÇdata
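The character-at-a-time loop makes one `read` call per character, which can be slow on a very large file. A possible variant, sketched here under the assumption that the streams are text-mode and the delimiter is a single character, reads a block at a time and still keeps only one block in memory. The name `cvt_blocks` and the `block_size` parameter are hypothetical.

```python
def cvt_blocks(fdin, fdout, delim, count, block_size=65536):
    """Replace every count-th `delim` with a newline, one block at a time.

    Assumes text-mode streams and a single-character delimiter, so a
    delimiter can never be split across two blocks.
    """
    remaining = count  # delimiters left before the next line break
    while True:
        block = fdin.read(block_size)
        if not block:
            break
        pieces = block.split(delim)
        out = [pieces[0]]
        # Each later piece was preceded by exactly one delimiter in the block.
        for piece in pieces[1:]:
            remaining -= 1
            if remaining == 0:
                out.append("\n")
                remaining = count
            else:
                out.append(delim)
            out.append(piece)
        fdout.write("".join(out))
```

It takes the same arguments as `cvt`, so `cvt_blocks(sys.stdin, sys.stdout, 'Ç', 3)` would be a drop-in replacement for the last line of the script, with the counter carried across block boundaries.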