How do I join a FASTA file and a txt file?

Time: 2017-05-30 18:21:55

Tags: join awk

I have a FASTA file that looks like this:

> ASst1|LK||eukaryota|Homo sapiens
YYNRLINTLLDNGIEPIVSIYHWDLPQRLQDLGGWPNIVLAIYTENYARVLFKNFGDRVK
LWITFNEPRIFMGGYTSDTGMAPSINTPGIGDYLTSRTVLIAHANIYHMYEREFKQQQKG
KIGITLTGFWCEPLTPDFTERCERYQQFQLGLYAHPIFTGHGDYPSVVIERVDNNSKVEG
FTTSRLPKLTSEEVNYIKGTYDFFGINFYTAQVGLNGVVGGIPSRERDMGTIVLQDPNWP
> >ASstj1|TH1||eukaryota|Mus musculus 
FWLVVSQLLYFPRDAHCLADIPSEAILDNNIPLINNLTFPDGFLFGAATAAYQIEGAWN
VDGKGPSIWDEFTHTHPEIITDHSTGDDACKSYYKYKEDVQAAKTMGLDSYRFSMSWPRI
MPTGFPDNINQKGIDYYNNLINELVDNGIMPLVTMYHWDLPQNLQTYGGWLNESIVPLYV
SYARVLFENFGDRVKWWLTFNEPQFVSLGYEFRVMAPGIFTNGTGPYIASTNVLKAHA

I have another file containing taxonomy information:

Homo sapiens    9606    cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Euarchontoglires;Primates;Haplorrhini;Simiiformes;Catarrhini;Hominoidea;Hominidae;Homininae;Homo;Homo sapiens 

Mus musculus    10090   cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Euarchontoglires;Glires;Rodentia;Myomorpha;Muroidea;Muridae;Murinae;Mus;Mus;Mus musculus

I would like to join these two files so the result looks like this:

> ASst1|LK||eukaryota|Homo sapiens cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Euarchontoglires;Primates;Haplorrhini;Simiiformes;Catarrhini;Hominoidea;Hominidae;Homininae;Homo;Homo sapiens
YYNRLINTLLDNGIEPIVSIYHWDLPQRLQDLGGWPNIVLAIYTENYARVLFKNFGDRVK
LWITFNEPRIFMGGYTSDTGMAPSINTPGIGDYLTSRTVLIAHANIYHMYEREFKQQQKG
KIGITLTGFWCEPLTPDFTERCERYQQFQLGLYAHPIFTGHGDYPSVVIERVDNNSKVEG
FTTSRLPKLTSEEVNYIKGTYDFFGINFYTAQVGLNGVVGGIPSRERDMGTIVLQDPNWP
> >ASstj1|TH1||eukaryota|Mus musculus cellular organisms;Eukaryota;Opisthokonta;Metazoa;Eumetazoa;Bilateria;Deuterostomia;Chordata;Craniata;Vertebrata;Gnathostomata;Teleostomi;Euteleostomi;Sarcopterygii;Dipnotetrapodomorpha;Tetrapoda;Amniota;Mammalia;Theria;Eutheria;Boreoeutheria;Euarchontoglires;Glires;Rodentia;Myomorpha;Muroidea;Muridae;Murinae;Mus;Mus;Mus musculus
FWLVVSQLLYFPRDAHCLADIPSEAILDNNIPLINNLTFPDGFLFGAATAAYQIEGAWN
VDGKGPSIWDEFTHTHPEIITDHSTGDDACKSYYKYKEDVQAAKTMGLDSYRFSMSWPRI
MPTGFPDNINQKGIDYYNNLINELVDNGIMPLVTMYHWDLPQNLQTYGGWLNESIVPLYV
SYARVLFENFGDRVKWWLTFNEPQFVSLGYEFRVMAPGIFTNGTGPYIASTNVLKAHA

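I don't think join will work in this case. If I first parsed the headers into a separate list (i.e. grep for >) and then joined the two files, it would work, but I really need the sequences printed below each header. Any ideas would be helpful.

1 answer:

Answer 0: (score: 1)

Try the following: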

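awk -F'[\t|]' '
  # First file (lookup): map species name (field 1) to lineage (field 3).
  FNR==NR { dict[$1]=$3; next }
  # Second file (FASTA): on header lines, append the lineage for the last field.
  /^> / { $0 = $0 " " dict[$NF] }
  { print }
' fileLookup fileFasta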

Assumptions:

  • Your lookup file is tab-delimited.

  • The trailing space after Mus musculus in the FASTA sample is not present in the real data file.
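If the second assumption does not hold, a slightly more defensive variant may help. The sketch below is not the answerer's command. It is an illustration under its own assumptions: the file names lookup.tsv and seqs.fasta, the shortened lineage strings, and the extra Danio rerio record are all made up for the demonstration. It strips trailing blanks and carriage returns from header lines before the lookup, takes the species name as the text after the last "|", and leaves a header untouched when its species is missing from the lookup file.

```shell
#!/bin/sh
# Hypothetical sample files; names and contents are illustrative only.
tmp=$(mktemp -d)
printf 'Homo sapiens\t9606\tcellular organisms;Eukaryota;Homo sapiens\n' > "$tmp/lookup.tsv"
printf 'Mus musculus\t10090\tcellular organisms;Eukaryota;Mus musculus\n' >> "$tmp/lookup.tsv"
printf '> ASst1|LK||eukaryota|Homo sapiens\nYYNRLINTLL\n' > "$tmp/seqs.fasta"
# Note the trailing space after "Mus musculus", as in the question sample.
printf '> >ASstj1|TH1||eukaryota|Mus musculus \nFWLVVSQLLY\n' >> "$tmp/seqs.fasta"
# A species that is absent from the lookup file.
printf '> X1|AB||eukaryota|Danio rerio\nMKTAYIAKQR\n' >> "$tmp/seqs.fasta"

annotated=$(awk -F'\t' '
  # First file: map species name (column 1) to lineage (column 3).
  FNR == NR { lineage[$1] = $3; next }
  # Header lines: the species name is the text after the last "|".
  /^>/ {
    sub(/[ \t\r]+$/, "")          # drop trailing blanks and carriage returns
    species = $0
    sub(/.*[|]/, "", species)     # greedy match removes everything up to the last "|"
    if (species in lineage) $0 = $0 " " lineage[species]
  }
  { print }
' "$tmp/lookup.tsv" "$tmp/seqs.fasta")

printf '%s\n' "$annotated"
rm -rf "$tmp"
```

Using `species in lineage` rather than reading `lineage[species]` directly avoids appending a stray space to headers that have no match, and it does not create empty array entries as a side effect.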