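Replacing spaces in a FASTA sequence file on Linux

Date: 2014-02-24 08:58:43

Tags: unix grep fasta

I have an output file that I want to convert into a file from which I can extract whole sequences.

The output looks like this (but without the " in front of the >, and with a line break after each line):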

">comp2_c0_seq1 len=265 path=[1:0-264]
GTTTGAATGGTTTGTGGTTCTGCCTTTGACAAACTGATCATAGTGGAATAATAAGGGAAC
ATGAAGAAATTCCAAGCCCATTGATTTTCTCTTGAGACCAATTAGGTAAAGTCACTCAAA
ATTTTTGAGAGTGGATGCTCAGAGGTAACACTTTGGCATAGAATTGTTAAATAGCATGCA
CTTTAATGGAAGAATAGAATCATTAAAGATTGGTTGATAACAAGTCACAGTGTATTTAAC
CATCATCACAGCAGATGTAGACAGA

">comp2_c1_seq1 len=203 path=[2794:0-202]
CAGCAGATGTAGACAGAAATGGCACCACTGCTTATGAAGGAAACTGGAACCCAGAAGCAC
ACCCTCCCCTCATTCACCATGAGCATCATGAGGAAGAAGAGACCCCACATTCTACAAGCA
CAAGTAAGCAAGATGGCGGTCGGCAGTTCTGGGTTAGATGAATTAGTAAAGACATTCCAG
CAATAGGGAAGATTTTGTTTAGA

">comp6_c0_seq1 len=424 path=[1744:0-423]
CCAGCTCCTACTCACCAGTCTCTCCGCATGGAGAAGTGGCCGTCATGGTCGACCTGTTCC
CAAGGGTGGCCTTGTGAGTGCAGGCTCTCCTCACCAGAGCTGAGGGCTTTGTGAACCTCT
GATGTCAATAGATGCCCCTCATCTTCCAGGAGGACAAAACAGGGCAAAGCAAGACATGGG
GTGAGAACAGGAGTGCATCAGTGGGGTTCCCCAAGCCTGTGTCAGGTCCGGATCTGGGTG
GGAGTTCCCTTCTGCGTCATCCAGGCCAGGCGAGTGGGCATCCTCCCTGAGCACCTGTGC
TTGGGGCTTTGCCTGTGTCAGTCAGGAAGACAGAGTACACGGAAGAGTTACCATTGCTTT
CAGAGCAAACCTTCCTTTGACATGCATTTAACACAGCACGGAGTGATTGACATGTGTCCT
TGTG

">comp7_c0_seq1 len=208 path=[22:0-207]
GGAAGGACAGCATGTTTTCCATCTCAAAGACAGGAAAGAGTTATCTCTTCCTCTGGGATC
CATCAGCATCCTGCCTACTCCTGCGTCACAGCACAGATCCTAACTGGCAAAATTATTAAT
CTCTCTTCCACTGAAATAGATACATCAGACAGATTCCTTTCTGACTGAAACTGTTCTGCT
GTGAAAGACTAACAACAAAGCAGATGCT

">comp8_c0_seq1 len=537 path=[1925:0-536]
TTAATAATTTAATTTTACTTTGAATATGTGTATATAAAATGCCTAATGTGATAAAAGTAG
AATATGCCTGGTTGAAGGAAACATAGAAAATTGAATTGCCACTGATTTGGCCTTTCCTTC
ATCTTTCATGGGGAGCCAGAGAGAATCTGGTTCAGAAGACAGACTCTAGAGTCAAGCAGC
TGGGGTTCAAATCTTGGCAACATTTCAGGGTGATTTTAAAAATATTTAACAGCTGGTAAT
GCTAGATGTCGACTTGTCAGAATGGATAAAGCCTGACATGACGTATATAGCCACACCAGC
ATATAATCAGCCCTGTCTCCACCACTTACTAGTAGTGTCTTTATCTGTAAGATAAAGATA
GCAATAGGCATTATCTCATAGGGGTTTTATGAGGATTAGGTGTAATAATATATATAAAGC
ACTTATGACAATGTTTGGAAGAAAGTGTCATTCAACATTAGATATCATCATCATTGTCAT
CATCGTGACTAATACTTGAGGAATTCCAGAATGTTATGGTTAGAATGGTAAAGTTCT

What I want is:

> ">comp2_c0_seq1 len=265 path=[1:0-264] GTTTGAATGGTTTGTGGTTCTGCCTTTGACAAACTGATCATAGTGGAATAATAAGGGAACATGAAGAAATTCCAAGCCCATTGATTTTCTCTTGAGACCAATTAGGTAAAGTCACTCAAAATTTTTGAGAGTGGATGCTCAGAGGTAACACTTTGGCATAGAATTGTTAAATAGCATGCACTTTAATGGAAGAATAGAATCATTAAAGATTGGTTGATAACAAGTCACAGTGTATTTAACCATCATCACAGCAGATGTAGACAGA

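Each sequence should be on a single line, so that I can easily extract sequences with grep. I hope this is possible.

Thanks

2 answers:

Answer 0: (score: 1)

awk should do it: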

awk '{printf (/comp/&&NR>1?"\n":"")"%s",$0}' file
">comp2_c1_seq1 len=203 path=[2794:0-202]CAGCAGATGTAGACAGAAATGGCACCACTGCTTATGAAGGAAACTGGAACCCAGAAGCACACCCTCCCCTCATTCACCATGAGCATCATGAGGAAGAAGAGACCCCACATTCTACAAGCACAAGTAAGCAAGATGGCGGTCGGCAGTTCTGGGTTAGATGAATTAGTAAAGACATTCCAGCAATAGGGAAGATTTTGTTTAGA
">comp6_c0_seq1 len=424 path=[1744:0-423]CCAGCTCCTACTCACCAGTCTCTCCGCATGGAGAAGTGGCCGTCATGGTCGACCTGTTCCCAAGGGTGGCCTTGTGAGTGCAGGCTCTCCTCACCAGAGCTGAGGGCTTTGTGAACCTCTGATGTCAATAGATGCCCCTCATCTTCCAGGAGGACAAAACAGGGCAAAGCAAGACATGGGGTGAGAACAGGAGTGCATCAGTGGGGTTCCCCAAGCCTGTGTCAGGTCCGGATCTGGGTGGGAGTTCCCTTCTGCGTCATCCAGGCCAGGCGAGTGGGCATCCTCCCTGAGCACCTGTGCTTGGGGCTTTGCCTGTGTCAGTCAGGAAGACAGAGTACACGGAAGAGTTACCATTGCTTTCAGAGCAAACCTTCCTTTGACATGCATTTAACACAGCACGGAGTGATTGACATGTGTCCTTGTG
">comp7_c0_seq1 len=208 path=[22:0-207]GGAAGGACAGCATGTTTTCCATCTCAAAGACAGGAAAGAGTTATCTCTTCCTCTGGGATCCATCAGCATCCTGCCTACTCCTGCGTCACAGCACAGATCCTAACTGGCAAAATTATTAATCTCTCTTCCACTGAAATAGATACATCAGACAGATTCCTTTCTGACTGAAACTGTTCTGCTGTGAAAGACTAACAACAAAGCAGATGCT
">comp8_c0_seq1 len=537 path=[1925:0-536]TTAATAATTTAATTTTACTTTGAATATGTGTATATAAAATGCCTAATGTGATAAAAGTAGAATATGCCTGGTTGAAGGAAACATAGAAAATTGAATTGCCACTGATTTGGCCTTTCCTTCATCTTTCATGGGGAGCCAGAGAGAATCTGGTTCAGAAGACAGACTCTAGAGTCAAGCAGCTGGGGTTCAAATCTTGGCAACATTTCAGGGTGATTTTAAAAATATTTAACAGCTGGTAATGCTAGATGTCGACTTGTCAGAATGGATAAAGCCTGACATGACGTATATAGCCACACCAGCATATAATCAGCCCTGTCTCCACCACTTACTAGTAGTGTCTTTATCTGTAAGATAAAGATAGCAATAGGCATTATCTCATAGGGGTTTTATGAGGATTAGGTGTAATAATATATATAAAGCACTTATGACAATGTTTGGAAGAAAGTGTCATTCAACATTAGATATCATCATCATTGTCATCATCGTGACTAATACTTGAGGAATTCCAGAATGTTATGGTTAGAATGGTAAAGTTCT

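A more explicit variant of the same idea, offered as a sketch rather than a drop-in replacement. It keys on a leading `>` instead of the string `comp`, skips blank lines, puts one space between the header and the sequence (as in the desired output in the question), and ends the output with a newline. The function name `linearize_fasta` is made up for illustration, and the sketch assumes that header lines start with `>` with no quote character in front, as the question describes the real file.

```shell
# linearize_fasta: hypothetical helper that prints each FASTA record on one line.
# Assumption: header lines start with ">" (no leading quote character).
linearize_fasta() {
  awk '
    /^>/ {                        # header line: a new record starts here
      if (seen) printf "\n"       # close the previous record first
      printf "%s ", $0            # header followed by a single space
      seen = 1
      next
    }
    NF { printf "%s", $0 }        # non-blank sequence line: append, no newline
    END { if (seen) printf "\n" } # terminate the last record
  ' "$@"
}
```

With the records on one line each, a single sequence can then be pulled out with grep, for example `linearize_fasta file | grep '^>comp2_c0_seq1 '`.
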
Answer 1: (score: 0)

You can try this sed:

sed '/">comp/{:loop; N; /\n">comp/{P;D}; s/\n//g; b loop;}' yourfile.txt
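
The pattern above matches a literal `">comp`, while the question says the real file has no `"` in front of the `>`. Assuming that is the case, the same loop can be anchored on a leading `>` instead. The following is a sketch under that assumption, not a tested replacement for every sed implementation. It uses `$!` so that `N` is never run on the last line, and like the command above it joins header and sequence without a separator. The function name `join_fasta_sed` is hypothetical.

```shell
# join_fasta_sed: hypothetical helper, same loop as above but keyed on "^>".
# Assumption: header lines start with ">" (no leading quote character).
join_fasta_sed() {
  sed '/^>/{
    :loop
    # only read further while the current line is not the last one
    $!{
      N
      # next header reached: print the finished record, keep the new header
      /\n>/{P;D;}
      # otherwise glue the appended line onto the record
      s/\n//
      b loop
    }
  }' "$@"
}
```

Usage would be `join_fasta_sed yourfile.txt > joined.txt`.
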