Splitting a large file into smaller files: help with 'split'

Asked: 2012-06-01 13:41:14

Tags: shell awk split gawk

I have a large file (2 GB) that looks like this:

  >10GS_A
  YTVVYFPVRGRCAALRMLLADQGQSWKEEVVTVETWQEGSLKASCLYGQLPKFQDGD
  LTLYQSNTILRHLGRTLGLYGKDQQEAALVDMVNDGVEDLRCKYISLIYTNYEAGKD
  DYVKALPGQLKPFETLLSQNQGGKTFIVGDQISFADYNLLDLLLIHEVLAPGCLDAFP
  LLSAYVGRLSARPKLKAFLASPEYVNLPINGNGKQ 
  >11BA_A
  KESAAAKFERQHMDSGNSPSSSSNYCNLMMCCRKMTQGKCKPVNTFVHESLADVKAV
  CSQKKVTCKNGQTNCYQSKSTMRITDCRETGSSKYPNCAYKTTQVEKHIIVACGGKP
  SVPVHFDASV
  >11BG_A
  KESAAKFERQHMDSGNSPSSSSNYCNLMMCCRKMTQGKCKPVNTFVHESLADVKAVCSQKKVT
  CKNGQTNCYQSKSTMRITDCRETGSSKYPNCAYKTTQVEKHIIVACGGKPSVPVHFDASV
  >121P_A
  MTEYKLVVVGAGGVGKSALTIQLIQNHFVDEYDPTIEDSYRKQVVIDGETCLLDILDTAGQEEYSAMRD 
  QYMRTGEGFLCVFAINNTKSFEDIHQYREQIKRVKDSDDVPMVLVGNKCDLAARTVESRQAQDLARSYG
  IPYIETSAKTRQGVEDAFYTLVREIRQH

I want to split this file into smaller files on the delimiter ">". For the sample above, that means 4 files, named as follows:

10gs_A.txt
11ba_A.txt
11bg_A.txt
121p_A.txt

Their contents should be as follows.

10gs_A.txt

>10GS_A
YTVVYFPVRGRCAALRMLLADQGQSWKEEVVTVETWQEGSLKASCLYGQLPKFQDGD
LTLYQSNTILRHLGRTLGLYGKDQQEAALVDMVNDGVEDLRCKYISLIYTNYEAGKD
DYVKALPGQLKPFETLLSQNQGGKTFIVGDQISFADYNLLDLLLIHEVLAPGCLDAFP
LLSAYVGRLSARPKLKAFLASPEYVNLPINGNGKQ

11ba_A.txt

>11BA_A
KESAAAKFERQHMDSGNSPSSSSNYCNLMMCCRKMTQGKCKPVNTFVHESLADVKAV
CSQKKVTCKNGQTNCYQSKSTMRITDCRETGSSKYPNCAYKTTQVEKHIIVACGGKP
SVPVHFDASV

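...and so on. I know the Linux split command can break up a large text file, but it names the resulting files temp00, temp01, temp02. Is there a way to split this large file and give the pieces the names I want? Which splitting tool can do this?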

2 Answers:

Answer 0: (score: 1)

How about using an awk script to split mybigfile?

splitter.awk

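BEGIN { outname = "noname.txt" }       # catches any lines before the first header

/^>/  { close(outname)                 # keep only one output file open at a time
        outname = substr($0, 2, 40) ".txt"   # header without ">", at most 40 chars
        next }                         # do not copy the header line itself

      { print > outname }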

If you want to keep the delimiter lines in the output, use this version instead:

splitter.awk

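BEGIN { outname = "noname.txt" }       # catches any lines before the first header

/^>/  { close(outname)                 # keep only one output file open at a time
        outname = substr($0, 2, 40) ".txt" }  # no "next": the header line is printed too

      { print > outname }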

Then run the script:

awk -f splitter.awk mybigfile
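Note that splitter.awk names each file after the header text exactly as it appears, so the sample produces 10GS_A.txt rather than the 10gs_A.txt requested in the question. Below is a minimal sketch of one way to get the requested names. It assumes every header has the form ID_CHAIN and that only the part before the last underscore should be lowercased; the file names sample.fa and splitter_lc.awk are made up for illustration. Unlike splitter.awk, it ignores any lines before the first header instead of writing them to noname.txt.

```shell
# Build a small sample input (hypothetical file name).
cat > sample.fa <<'EOF'
>10GS_A
YTVVYFPVRGRC
LLSAYVGRLSAR
>11BA_A
KESAAAKFERQH
EOF

# Hypothetical variant of splitter.awk that lowercases the ID part of the name.
cat > splitter_lc.awk <<'EOF'
/^>/ { if (outname != "") close(outname)    # one open output file at a time
       name = substr($0, 2)                 # header text without the leading ">"
       sub(/[ \t\r]+$/, "", name)           # drop trailing whitespace, if any
       cut = match(name, /_[^_]*$/)         # position of the last underscore, 0 if none
       if (cut) name = tolower(substr(name, 1, cut - 1)) substr(name, cut)
       else     name = tolower(name)
       outname = name ".txt" }
outname != "" { print > outname }           # the header line is kept in the output
EOF

awk -f splitter_lc.awk sample.fa
ls 1*.txt
```

Only the file name is changed; the header line inside each file keeps its original case.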

Answer 1: (score: 1)

You can do this with gawk:

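# printf rather than print: each record already ends with a newline
gawk -v RS='>' 'NF { out = $1 ".txt"; printf "%s%s", RS, $0 > out; close(out) }' InputFile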
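Why this works: with RS set to ">", each record is everything between two ">" characters, so $1 is the header ID and the remaining fields are the sequence lines. The text before the first ">" forms an empty record, which the NF condition skips. The sketch below only prints what awk sees for each record, using a made-up two-record sample (sample.fa and records.txt are hypothetical names). Plain awk is used here because a single-character RS does not need gawk.

```shell
# Small made-up input with two records.
printf '>10GS_A\nYTVVYFPV\nLLSAYVGR\n>11BA_A\nKESAAAKF\n' > sample.fa

# With the default FS, fields are split on spaces, tabs and newlines,
# so NF - 1 is the number of sequence lines that follow the ID.
awk -v RS='>' 'NF { print "id=" $1 ", sequence lines=" (NF - 1) }' sample.fa > records.txt
cat records.txt
```

Because the ID is taken from $1, anything after the first whitespace on a header line would not become part of the file name.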