I have a large file (2 GB) that looks like this:
>10GS_A
YTVVYFPVRGRCAALRMLLADQGQSWKEEVVTVETWQEGSLKASCLYGQLPKFQDGD
LTLYQSNTILRHLGRTLGLYGKDQQEAALVDMVNDGVEDLRCKYISLIYTNYEAGKD
DYVKALPGQLKPFETLLSQNQGGKTFIVGDQISFADYNLLDLLLIHEVLAPGCLDAFP
LLSAYVGRLSARPKLKAFLASPEYVNLPINGNGKQ
>11BA_A
KESAAAKFERQHMDSGNSPSSSSNYCNLMMCCRKMTQGKCKPVNTFVHESLADVKAV
CSQKKVTCKNGQTNCYQSKSTMRITDCRETGSSKYPNCAYKTTQVEKHIIVACGGKP
SVPVHFDASV
>11BG_A
KESAAKFERQHMDSGNSPSSSSNYCNLMMCCRKMTQGKCKPVNTFVHESLADVKAVCSQKKVT
CKNGQTNCYQSKSTMRITDCRETGSSKYPNCAYKTTQVEKHIIVACGGKPSVPVHFDASV
>121P_A
MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAGQEEYSAMRD
QYMRTGEGFLCVFAINNTKSFEDIHQYREQIKRVKDSDDVPMVLVGNKCDLAARTVESRQAQDLARSYG
IPYIETSAKTRQGVEDAFYTLVREIRQH
I want to split this file into smaller files at each ">" delimiter. In this case that would produce 4 files, named as follows:
10gs_A.txt
11ba_A.txt
11bg_A.txt
121p_A.txt
with the following contents:
10gs_A.txt
>10GS_A
YTVVYFPVRGRCAALRMLLADQGQSWKEEVVTVETWQEGSLKASCLYGQLPKFQDGD
LTLYQSNTILRHLGRTLGLYGKDQQEAALVDMVNDGVEDLRCKYISLIYTNYEAGKD
DYVKALPGQLKPFETLLSQNQGGKTFIVGDQISFADYNLLDLLLIHEVLAPGCLDAFP
LLSAYVGRLSARPKLKAFLASPEYVNLPINGNGKQ
11ba_A.txt
>11BA_A
KESAAAKFERQHMDSGNSPSSSSNYCNLMMCCRKMTQGKCKPVNTFVHESLADVKAV
CSQKKVTCKNGQTNCYQSKSTMRITDCRETGSSKYPNCAYKTTQVEKHIIVACGGKP
SVPVHFDASV
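...and so on.
I know the split command on Linux can break up a large text file, but it names the resulting files temp00, temp01, temp02. Is there a way to split this large file and name the pieces the way I want? Which command or function would do this?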
Answer 0: (score: 1)
How about using an awk script to split mybigfile?
splitter.awk
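BEGIN { outname = "noname.txt" }    # receives any lines before the first ">"
/^>/ {
    close(outname)                  # release the previous file so open files do not pile up
    outname = substr($0, 2, 40) ".txt"
    next                            # skip the header line itself
}
{ print > outname }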
If you want the delimiter line included in the output, use this version instead:
splitter.awk
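BEGIN { outname = "noname.txt" }    # receives any lines before the first ">"
/^>/ {
    close(outname)                  # release the previous file so open files do not pile up
    outname = substr($0, 2, 40) ".txt"
}
{ print > outname }                 # header lines are written too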
Then run the script (output files take the header text as written, e.g. 10GS_A.txt):
awk -f splitter.awk mybigfile
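The awk scripts name each file after the header exactly as it appears, whereas the question asks for a lowercased four-character prefix (10gs_A.txt rather than 10GS_A.txt). Below is a minimal Python sketch of the same line-by-line approach with that renaming. The function names `output_name` and `split_records` are hypothetical, and the rule "lowercase the first four characters of the ID" is an assumption read off the example filenames in the question.

```python
import os


def output_name(header):
    """Map a header such as '>10GS_A' to '10gs_A.txt' (assumed naming rule)."""
    parts = header[1:].split()                 # drop '>' and split on whitespace
    record_id = parts[0] if parts else "noname"
    return record_id[:4].lower() + record_id[4:] + ".txt"


def split_records(path, out_dir="."):
    """Stream `path` line by line, writing each '>' record to its own file."""
    written = []
    out = None
    try:
        with open(path) as src:
            for line in src:
                if line.startswith(">"):
                    if out is not None:
                        out.close()            # keep only one output file open
                    name = output_name(line.rstrip("\n"))
                    out = open(os.path.join(out_dir, name), "w")
                    written.append(name)
                if out is not None:            # lines before the first header are ignored
                    out.write(line)            # the header line is kept in the output
    finally:
        if out is not None:
            out.close()
    return written
```

Because the input is read one line at a time, memory use stays small even for a 2 GB file. Calling `split_records("mybigfile")` would write the pieces into the current directory; a repeated ID would overwrite the earlier file of the same name.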
Answer 1: (score: 1)
This can be done with gawk:
gawk -v RS='>' 'NF { out = $1 ".txt"; printf "%s%s", RS, $0 > out; close(out) }' InputFile
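To see why the gawk one-liner needs the NF condition and why it puts RS back in front of $0, it may help to mimic the record splitting on a tiny string. The Python sketch below is only an illustration: `split_on_marker` is a hypothetical helper, the sample text is made up, and it loads everything into memory, so it is not meant for the real 2 GB file.

```python
def split_on_marker(text, marker=">"):
    """Return (filename, content) pairs, roughly as awk sees them with RS='>'."""
    results = []
    for record in text.split(marker):      # the marker itself is stripped, as with RS
        fields = record.split()            # awk-style whitespace-separated fields
        if not fields:                     # empty record before the first marker (the NF test)
            continue
        filename = fields[0] + ".txt"      # corresponds to $1 ".txt"
        results.append((filename, marker + record))   # RS $0 restores the stripped '>'
    return results


# Made-up miniature input, for demonstration only.
sample = ">10GS_A\nYTVV\nLTLY\n>11BA_A\nKESA\n"
for filename, content in split_on_marker(sample):
    print(filename, repr(content))
```

Each record already ends with a newline, so printing it with an additional line terminator would leave a blank line at the end of every output file.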