I have downloaded some FASTA files from the NCBI database, each containing more than 10,000 sequences.
The files look like this:
>lcl|AY289593.1_prot_AAQ74417.1_1 [protein=FabH-like protein] [protein_id=AAQ74417.1] [location=complement(<1..775)] [gbkey=CDS]
MRPINDIQVDGVPNDHTIVQSDYISFTEADEPATVMATRAATEALTTSELVSADVGVLIYAAIIGDAHHF
APVCHVQRVLRAPDALAFELSAASNGGTQGIAVAANLMTADAPVKAALVCTAYRHPIDIISRWSSGMVFG
DGAAAAVLSRDGGMVRLISGYHGSLPELEVLARNRSNERLGFVLPDVGLGKYLTAIARMYQAVIAQVLEE
AQTSIAEIDYFGLIGIGIPSLTATILEPNGIPVNKTSWGLLRQMGHVG
>lcl|AY289593.1_prot_AAQ74418.1_2 [protein=type I polyketide synthase loading module] [protein_id=AAQ74418.1] [location=4126..>6747] [gbkey=CDS]
MLGDAVAVVGMSCRVPGASDPDALWALLRDGISVVDEIPSARWNLDGLVAHRLTDEQRSALRHGAFLDDV
EGFDAAFFGINPSEAGSMDPQQRLMLELTWAALEDARIVPEHLSGSSSGVFTGAMSDDYTTAVTYRAAMT
AHTFAGTHRSLIANRVSYTLGLRGPSLVIDTGQSSSLVAVHVAMESLRREETSLAIAGGIHLNLSLAAAL
SAAHFGALSPDGRCYTFDARANGYVRGEGGGVVVLKRLNDALADGNHIYCVIRGSSVNNDGATQDLTAPG
VDGQRQALLQAYERAEIDPSEVQYVELHGTGTRLGDPTEAHSLHSVFGTSTVPRSPLLVGSIKTNIGHLE
GAAGILGLIKTALAVHHRQLPPSLNYTVPNPKIPLEQLGLRVQTTLSEWPDLDKPLTAGVSSFSMGGTNA
HLILQQPPTPDTTQTPNPTTGSDPAVGSDSAVGSDPAVGVLVWPLSARSAPGLSAQAARLYQHLSAHPDL
DPIDVAHSLATTRSHHPHRATITTSIEHHSENNHDTTDALAALHALANNGTHPLLSRGLLTPQGPGKTVF
VFPGQGSQYPGMGADLYRQFPVFAHALDEVAAALNPHLDVALLEVMFSQQDTAMAQLLDQTFYAQPALFA
LGTALHRLFTHAGIHPDYLLGHSIGELTAAYAAGVLSLQDAATLVTSRGRLMQSCTPGGTMLALQASEAE
VQPLLEGLDHAVSIAAINGATSIVLSGDHDSLEQIGEHFITQDRRTTRLQVSHAFHSPHMDPILEQFRQI
AAQLTFSAPTLPILSNLTGQIARHDQLASPDYWTQQLRNTVRFHDTVAALLGAGEQVFLELSPHPVLTQA
ITDTVEQAGGGGAAVPALRKDRPDAVAFAAALGQ
>lcl|AY289596.1_prot_AAQ74421.1_1 [protein=type I polyketide synthase extender module] [protein_id=AAQ74421.1] [location=<1..>4439] [gbkey=CDS]
DTACSSSLVAIHLACQSLRNNESQLALAGGVTVMSTPAVFTEFSRQRGLAPDGRCKAFAATADGTGFGEG
AAVLVLERLSEARRNNHPVLAIVAGSAINQDGASNGLTAPHGPSQQRVINQALANAGLTHDQVDAVEAHG
TGTTLGDPIEAGALHATYGHHHTPDQPLWLGSIKSNIGHTQAAAGAVGVVKMIQAITHATLPATLHVDQP
GPHIDWSSGTVRLLTEPIQWPNTNHPRTAAVSSFGISGTNAHLILQQPPTPNPTQTPEDCSPAQSPCATI
TDAGTGLSFVPWVISAKSAEALSAQASRLLTRLDDDPVVDAIDLGWSLIATRSMFEHRAVVVGADRHQLQ
RGLAELASGNLGADVVVGRARAAGETVMVFPGQGSQRLGMGAQLYEQFPVFAAAFDDVVDALDQYLRLPL
RQVMWGDDEGLLNSTEFAQPSLFAVEVALFALLRFWGVVPDYVIGHSVGELAAAQVAGVLSLQDAAKLVS
ARGRLMQALPAGGAMVAVAASQHEVEPLLVEGVDIAALNAPGSVVISGDQAAVRLIANRLADRGYRAHEL
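I have not listed the full file here because it is large. It contains duplicates (note the string after "_prot_" in each header), so I wrote a script to remove them: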
from Bio import SeqIO
import pandas as pd
import sys

inputFile = sys.argv[1]
inputName = inputFile.split('.')[0]
idList = []  # descriptions seen so far
seqiter = SeqIO.parse(inputFile, 'fasta')
sys.stdout = open(inputName + '_nodup.fasta', 'w')
for record in seqiter:
    # keep a record only the first time its description appears
    if record.description not in idList:
        idList.append(record.description)
        SeqIO.write(record, sys.stdout, "fasta")
sys.stdout.close()
It does the job, but it is very slow.
I think there should be a smarter way. Could any expert help me? Thanks!
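One likely contributor to the slowness is the `not in idList` test: membership in a Python list is a linear scan, so the loop does work proportional to the square of the number of records. A set gives constant-time lookups on average. The sketch below illustrates that idea only and is not a drop-in replacement for the script above. It assumes a plain-Python FASTA reader instead of `SeqIO`, and the names `read_fasta`, `protein_key`, and `write_unique` as well as the sample headers are made up for illustration. The default key assumes that the duplicate marker is the accession between "_prot_" and the trailing "_<number>" in the header, which matches the headers shown but may not hold for every NCBI file.

```python
import io
import re


def read_fasta(handle):
    """Yield (header, sequence_lines) pairs from a FASTA text stream."""
    header, lines = None, []
    for line in handle:
        line = line.rstrip()
        if line.startswith(">"):
            if header is not None:
                yield header, lines
            header, lines = line[1:], []
        elif header is not None and line:
            lines.append(line)
    if header is not None:
        yield header, lines


def protein_key(header):
    """Hypothetical key: the accession between '_prot_' and the final '_<n>'.

    Falls back to the whole header when the pattern is not found.
    """
    match = re.search(r"_prot_(.+?)_\d+(?:\s|$)", header)
    return match.group(1) if match else header


def write_unique(in_handle, out_handle, key=protein_key):
    """Copy records to out_handle, keeping the first record for each key."""
    seen = set()  # set membership is O(1) on average, unlike a list
    kept = 0
    for header, lines in read_fasta(in_handle):
        k = key(header)
        if k in seen:
            continue
        seen.add(k)
        out_handle.write(">" + header + "\n")
        for line in lines:
            out_handle.write(line + "\n")
        kept += 1
    return kept


# Made-up sample: the first two records share the accession AAA1.1.
sample = io.StringIO(
    ">lcl|X1_prot_AAA1.1_1 [gbkey=CDS]\nMRPIN\n"
    ">lcl|X2_prot_AAA1.1_7 [gbkey=CDS]\nMRPIN\n"
    ">lcl|X2_prot_BBB2.1_8 [gbkey=CDS]\nMLGDA\n"
)
out = io.StringIO()
kept = write_unique(sample, out)
print(kept)  # 2
```

Passing `key=lambda h: h` compares whole header lines, which is close to what the original `record.description` check does. The same change can be made in the Biopython script by replacing the list with a set and `append` with `add`.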