I have a file (Input_fasta.txt) of the following form:
>tr|A0A089QH62|A0A089QH62_MYCTU Histidine kinase OS=Mycobacterium tuberculosis (strain ATCC 25618 / H37Rv) GN=LH57_00865 PE=4 SV=1
MTATASGIAATAPNCGEASINDVPIAESERRYLGARSASEYGQEIPLW
>tr|I6WXB4|I6WXB4_MYCTU 30S ribosomal protein S6 OS=Mycobacterium tuberculosis (strain ATCC 25618 / H37Rv) GN=rpsF PE=3 SV=1
MRPYEIMVILDPTLDERTVAPSLETFLNVVRKDGGKVEKVDIWGKRRLAYEIAKHAEGIY
VVIDVKAAPATVSELDRQLSLNESVLRTKVMRTDKH
>tr|A0A089SBT4|A0A089SBT4_MYCTU Glycosyl transferase OS=Mycobacterium tuberculosis (strain ATCC 25618 / H37Rv) GN=LH57_19775 PE=4 SV=1
MDTETHYSDVWVVIPAFNEAAVIGKVVTDVRSVFDHVVCVDDGSTDGTGDIARRSGAHLV
RHPINLGQGAAIQTGIEYARKQPGAQVFATFDGDGQHRVKDVAAMVDRLGAGDVDVVIGT
RFGRPVGKASASRPPLMKRIVLQTGARLSRRGRRLGLTDTNNGLRVFNKTVADGLNITMS
GMSHATEFIMLIAENHWRVAEEPVEVLYTEYSKSKGQPLLNGVNIIFDGFLRGRMPR
>tr|A0A089QKT1|A0A089QKT1_MYCTU TetR family transcriptional regulator OS=Mycobacterium tuberculosis (strain ATCC 25618 / H37Rv) GN=LH57_00800 PE=4 SV=1
MSLTAGRGPGRPPAAKADETRKRILHAARQVFSERGYDGATFQEIAVRADLTRPAINHYF
ANKRVLYQEVVEQTHELVIVAGIERARREPTLMGRLAVVVDFAMEADAQYPASTAFLATT
VLESQRHPELSRTENDAVRATREFLVWAVNDAIERGELAADVDVSSLAETLLVVLCGVGF
YIGFVGSYQRMATITDSFQQLLAGTLWRPPT
>tr|I6YAB3|I6YAB3_MYCTU Iron ABC transporter permease OS=Mycobacterium tuberculosis (strain ATCC 25618 / H37Rv) GN=LH57_07380 PE=4 SV=1
MARGLQGVMLRSFGARDHTATVIETISIAPHFVRVRMVSPTLFQDAEAEPAAWLRFWFPD
PNGSNTEFQRAYTISEADPAAGRFAVDVVLHDPAGPASSWARTVKPGATIAVMSLMGSSR
FDVPEEQPAGYLLIGDSASIPGMNGIIETVPNDVPIEMYLEQHDDNDTLIPLAKHPRLRV
RWVMRRDEKSLAEAIENRDWSDWYAWATPEAAALKCVRVRLRDEFGFPKSEIHAQAYWNA
GRAMGTHRATEPAATEPEVGAAPQPESAVPAPARGSWRAQAASRLLAPLKLPLVLSGVLA
ALVTLAQLAPFVLLVELSRLLVSGAGAHRLFTVGFAAVGLLGTGALLAAALTLWLHVIDA
RFARALRLRLLSKLSRLPLGWFTSRGSGSIKKLVTDDTLALHYLVTHAVPDAVAAVVAPV
GVLVYLFVVDWRVALVLFGPVLVYLTITSSLTIQSGPRIVQAQRWAEKMNGEAGSYLEGQ
PVIRVFGAASSSFRRRLDEYIGFLVAWQRPLAGKKTLMDLATRPATFLWLIAATGTLLVA
THRMDPVNLLPFMFLGTTFGARLLGIAYGLGGLRTGLLAARHLQVTLDETELAVREHPRE
PLDGEAPATVVFDHVTFGYRPGVPVIQDVSLTLRPGTVTALVGPSGSGKSTLATLLARFH
DVERGAIRVGGQDIRSLAADELYTRVGFVLQEAQLVHGTAAENIALAVPDAPAEQVQVAA
REAQIHDRVLRLPDGYDTVLGANSGLSGGERQRLTIARAILGDTPVLILDEATAFADPES
EYLVQQALNRLTRDRTVLVIAHRLHTITRADQIVVLDHGRIVERGTHEELLAAGGRYCRL
WDTGQGSRVAVAAAQDGTR
>tr|L0T545|L0T545_MYCTU PPE family protein PPE7 OS=Mycobacterium tuberculosis (strain ATCC 25618 / H37Rv) GN=PPE7 PE=4 SV=1
MSVCVIYIPFKGCVKHVSVTIPITTEHLGPYEIDASTINPDQPIDTAFTQTLDFAGSGTV
GAFPFGFGWQQSPGFFNSTTTPSSGFFNSGAGGASGFLNDAAAAVSGLGNVFTETSGFFN
AGGVGIRASKTSATCCRAGRT
and another file containing the patterns (Pattern.txt):
I6WXB4
I6WXC3
I6WXK8
I need output like this:
>tr|I6WXB4|I6WXB4_MYCTU 30S ribosomal protein S6 OS=Mycobacterium tuberculosis (strain ATCC 25618 / H37Rv) GN=rpsF PE=3 SV=1
MRPYEIMVILDPTLDERTVAPSLETFLNVVRKDGGKVEKVDIWGKRRLAYEIAKHAEGIY
VVIDVKAAPATVSELDRQLSLNESVLRTKVMRTDKH
What I have done so far is:
grep -f Pattern.txt Input_fasta.txt
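How can I extend the output past each match so that it includes the following lines, up to the next line starting with ">"?
I tried
awk '/I6WXB4/{copy=1;next} />/{copy=0;next} copy' Input_fasta.txt
which gives the output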
MRPYEIMVILDPTLDERTVAPSLETFLNVVRKDGGKVEKVDIWGKRRLAYEIAKHAEGIY
VVIDVKAAPATVSELDRQLSLNESVLRTKVMRTDKH
but the header line is missing here.
Answer 0 (score: 1)
In awk:
$ awk 'NR==FNR{a[$0]; next} $2 in a{printf ">%s", $0}' Pattern.txt FS="|" RS=">" Input_fasta.txt
>tr|I6WXB4|I6WXB4_MYCTU 30S ribosomal protein S6 OS=Mycobacterium tuberculosis (strain ATCC 25618 / H37Rv) GN=rpsF PE=3 SV=1
MRPYEIMVILDPTLDERTVAPSLETFLNVVRKDGGKVEKVDIWGKRRLAYEIAKHAEGIY
VVIDVKAAPATVSELDRQLSLNESVLRTKVMRTDKH
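The awk command above stores the IDs from the pattern file in an array and keeps each record whose second "|"-separated field is one of them. The same exact-match idea can be sketched in plain Python for readers who prefer it. This is only an illustration: the function name filter_fasta is made up here, and the sample headers are shortened versions of the ones in the question.

```python
def filter_fasta(lines, accessions):
    """Yield the lines of every FASTA record whose accession is in `accessions`."""
    keep = False
    for line in lines:
        if line.startswith(">"):
            # A header looks like ">tr|I6WXB4|I6WXB4_MYCTU ...";
            # the second "|"-separated field is the accession.
            fields = line.split("|")
            keep = len(fields) > 1 and fields[1] in accessions
        if keep:
            yield line


# Small in-memory sample (headers abbreviated for the illustration).
sample = [
    ">tr|A0A089QH62|A0A089QH62_MYCTU Histidine kinase\n",
    "MTATASGIAATAPNCGEASINDVPIAESERRYLGARSASEYGQEIPLW\n",
    ">tr|I6WXB4|I6WXB4_MYCTU 30S ribosomal protein S6\n",
    "MRPYEIMVILDPTLDERTVAPSLETFLNVVRKDGGKVEKVDIWGKRRLAYEIAKHAEGIY\n",
    "VVIDVKAAPATVSELDRQLSLNESVLRTKVMRTDKH\n",
]
wanted = {"I6WXB4", "I6WXC3", "I6WXK8"}

print("".join(filter_fasta(sample, wanted)), end="")
```

With real files, an open file handle for Input_fasta.txt can be passed as `lines`, and `accessions` can be a set built from the non-empty lines of Pattern.txt. Because the comparison is on the whole field, an ID such as I6WXB4 does not accidentally match a longer ID that merely contains it.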
Answer 1 (score: 0)
Here is a simple Python solution using BioPython:
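import re
import sys

from Bio import SeqIO

# Build one regex that matches any of the IDs listed in the pattern file.
with open('Pattern.txt', 'r') as f:
    patterns = '|'.join(re.escape(pattern.strip()) for pattern in f)

# Write every record whose ID contains one of the patterns to stdout.
for record in SeqIO.parse('Input_fasta.txt', 'fasta'):
    if re.search(patterns, record.id):
        SeqIO.write(record, sys.stdout, 'fasta')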
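Note that this requires a well-behaved Pattern.txt file, i.e. one that does not contain any blank lines. A blank line would add an empty alternative to the regex, and an empty alternative matches every record.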
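The caveat about blank lines can be checked with the standard re module alone. The following is a small sketch, not part of the original answer: build_id_regex is a hypothetical helper that drops empty lines before joining the IDs, and the ID strings are taken from the question.

```python
import re


def build_id_regex(pattern_lines):
    """Compile an alternation of the non-empty IDs, or return None if there are none."""
    ids = [line.strip() for line in pattern_lines]
    escaped = [re.escape(i) for i in ids if i]  # drop blank lines
    if not escaped:
        return None
    return re.compile("|".join(escaped))


lines = ["I6WXB4\n", "\n"]  # a pattern file with a trailing blank line
other_id = "tr|A0A089QH62|A0A089QH62_MYCTU"
wanted_id = "tr|I6WXB4|I6WXB4_MYCTU"

# Naive join: the blank line becomes an empty alternative ("I6WXB4|"),
# which matches the empty string and therefore every ID.
naive = re.compile("|".join(re.escape(p.strip()) for p in lines))
print(bool(naive.search(other_id)))   # True: unwanted record is matched

# Filtering blank lines first restores the intended behaviour.
safe = build_id_regex(lines)
print(bool(safe.search(other_id)))    # False
print(bool(safe.search(wanted_id)))   # True
```

Returning None for an empty pattern list is a design choice made for this sketch; the caller then has to decide whether "no patterns" means "print nothing" or is an error.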
Answer 2 (score: 0)
A bash and sed solution:
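while read -r pattern
do
    # Skip blank lines in the pattern file.
    if [ -n "$pattern" ] ; then
        # A plain | is a literal pipe in a basic regular expression.
        sed -n "/|$pattern|/{:loop;p;n;/>/q;bloop;}" Input_fasta.txt
    fi
done < Pattern.txt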
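This loops over the pattern file (skipping blank lines). If a pattern is found, it reads and prints lines from the input file until the end of the file or until the next line containing > is found.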