I have a file like this (10,000+ sequences, 98,000+ lines):
>DILT_0000000001-mRNA-1
MKVVKICSKLRKFIESRKDAVLPEQEEVLADLWAFEGISEFQMERFAKAAQCFQHQYELA
IKANLTEHASRSLENLGRARARLYDYQGALDAWTKRLDYEIKGIDKAWLHHEIGRAYLEL
NQYEEAIDHAATARDVADREADMEWDLNATVLIAQAHFYAGNLEEAKVYFEAAQNAAFRK
GFFKAESVLAEAIAEVDSEIRREEAKQERVYTKHSVLFNEFSQRAVWSEEYSEELHLFPF
AVVMLRCVLARQCTVHLQFRSCYNL
>DILT_0000000101-mRNA-1
MSCRRLSMNPGEALIKESSAPSRENLLKPYFDEDRCKFRHLTAEQFSDIWSHFDLDGVNE
LRFILRVPASQQAGTGLRFFGYISTEVYVHKTVKVSYIGFRKKNNSRALRRWNVNKKCSN
AVQMCGTSQLLAIVGPHTQPLTNKLCHTDYLPLSANFA
>DILT_0001999301-mRNA-1
LEHGIQPDGQMPSDKTIGGGDDSFQTFFSETGAGKHVPRAVMVDLEPTVIGEYLCVLLTS
FILFRLISTNLGPNSQLASRTLLFAADKTTLFRLLGLLPWSLLKIAVQ
>DILT_0001999401-mRNA-1
MAENGEDANMPEEGKEGNTQDQGEHQQDVQSDEPNEADSGYSSAASSDVNSQTIPITVIL
PNREAVNLSFDPNISVSELQERLNGPGITRLNENLFFTYSGKQLDPNKTLLDYKVQKSST
LYVHETPTALPKSAPNAKEEGVVPSNCLIHSGSRMDENRCLKEYQLTQNSVIFVHRPTAN
TAVQNREEKTSSLEVTVTIRETGNQLHLPINPHXXXXTVEMHVAPGVTVGDLNRKIAIKQ
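Every line that starts with '>' is an ID. The lines that follow it are the sequence for that ID.
I also have a second file containing the IDs of the sequences I want, for example: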
DILT_0000000001-mRNA-1
DILT_0000000101-mRNA-1
DILT_0000000201-mRNA-1
DILT_0000000301-mRNA-1
DILT_0000000401-mRNA-1
DILT_0000000501-mRNA-1
DILT_0000000601-mRNA-1
DILT_0000000701-mRNA-1
DILT_0000000801-mRNA-1
DILT_0000000901-mRNA-1
I want to write a script that matches the IDs and copies the sequences for those IDs, but I only get the IDs and not the sequences.
seqs = open('WBPS10.protein.fa').readlines()
ids = open('ids.txt').readlines()
for line in ids:
    for record in seqs:
        if line == record[1:]:
            print record
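I don't know what to write to get the lines after the '\n' that follows the ID, because sometimes the sequence is 2 lines long and other sequences have more, as you can see in my example.
The catch is that I'm trying to do this without Biopython, which would make it easier. I just want to learn other approaches.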
Answer 0 (score: 1)
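# Map each ID to its full sequence (wrapped lines joined together).
seqs_by_ids = {}
with open('WBPS10.protein.fa', 'r') as read_file:
    for line in read_file:
        if line.startswith('>'):
            # Header line: start a new record keyed by the ID.
            current_key = line[1:].strip()
            seqs_by_ids[current_key] = ''
        else:
            # Sequence line: append it to the current record.
            seqs_by_ids[current_key] += line.strip()

# Read the wanted IDs, dropping line endings.
with open('ids.txt') as id_file:
    ids = set(line.strip() for line in id_file)

for seq_id in ids:
    if seq_id in seqs_by_ids:
        print(seq_id)
        print('\t{}'.format(seqs_by_ids[seq_id]))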
Output:
DILT_0000000001-mRNA-1
MKVVKICSKLRKFIESRKDAVLPEQEEVLADLWAFEGISEFQMERFAKAAQCFQHQYELAIKANLTEHASRSLENLGRARARLYDYQGALDAWTKRLDYEIKGIDKAWLHHEIGRAYLELNQYEEAIDHAATARDVADREADMEWDLNATVLIAQAHFYAGNLEEAKVYFEAAQNAAFRKGFFKAESVLAEAIAEVDSEIRREEAKQERVYTKHSVLFNEFSQRAVWSEEYSEELHLFPFAVVMLRCVLARQCTVHLQFRSCYNL
DILT_0000000101-mRNA-1
MSCRRLSMNPGEALIKESSAPSRENLLKPYFDEDRCKFRHLTAEQFSDIWSHFDLDGVNELRFILRVPASQQAGTGLRFFGYISTEVYVHKTVKVSYIGFRKKNNSRALRRWNVNKKCSNAVQMCGTSQLLAIVGPHTQPLTNKLCHTDYLPLSANFA
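With 98,000+ lines the dictionary approach above is fine, but the same idea can also be written so that only one record is held in memory at a time. The following is a minimal sketch, not a definitive implementation: the function name `filter_fasta` is hypothetical, and it assumes the ID is the whole header line after '>' (as in the answer above) and that `wanted_ids` is a set of already stripped ID strings.

```python
def filter_fasta(lines, wanted_ids):
    """Yield (id, sequence) for each FASTA record whose ID is in wanted_ids.

    `lines` is any iterable of text lines (an open file works).
    `wanted_ids` is assumed to be a set of stripped ID strings.
    """
    current_id = None   # ID of the record being read, None before the first header
    chunks = []         # sequence lines collected for the current record
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith('>'):
            # A new header ends the previous record: emit it if it was wanted.
            if current_id in wanted_ids:
                yield current_id, ''.join(chunks)
            current_id = line[1:]
            chunks = []
        else:
            chunks.append(line)
    # The last record has no following header, so emit it here.
    if current_id in wanted_ids:
        yield current_id, ''.join(chunks)
```

Because a file object iterates line by line, the function could be called with the open FASTA file as `lines` and a set built from the stripped lines of `ids.txt` as `wanted_ids`; records then come out in file order rather than in set order.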
Answer 1 (score: 0)
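This should work for you. If the strings contain special characters such as \r\n, the statement if line == record[1:]: will not work. You are only interested in finding the matching IDs, so the following code should suit you.
Code example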
seqs = open('WBPS10.protein.fa').readlines()
ids = open('ids.txt').readlines()
for line in ids:
    for record in seqs:
        if line in record:
            print record
Output
>DILT_0000000001-mRNA-1
>DILT_0000000101-mRNA-1
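The output above contains only the header lines, while the question asks for the sequence lines as well. One way to get both, sketched here under stated assumptions, is to scan the FASTA file once and remember whether the most recent header was a wanted ID. The name `matching_lines` is hypothetical, and the sketch assumes the ID is the whole header after '>'; stripping both sides before comparing also sidesteps the \r\n issue mentioned above.

```python
def matching_lines(fasta_lines, wanted_ids):
    """Yield the header and sequence lines of wanted records, unchanged."""
    # Strip line endings (\n or \r\n) from the IDs and drop empty entries.
    wanted = {i.strip() for i in wanted_ids if i.strip()}
    keep = False  # True while we are inside a wanted record
    for line in fasta_lines:
        if line.startswith('>'):
            # Each header decides whether the following lines are kept.
            keep = line[1:].strip() in wanted
        if keep:
            yield line
```

Since the lines are passed through untouched, writing them to a file with `writelines` would give a smaller FASTA file with the original line wrapping preserved.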