How can I quickly remove duplicate sequences from a large FASTA file in Python?

Time: 2018-11-30 18:15:40

Tags: python duplicates

I have downloaded some FASTA files from the NCBI database, which contain more than 10,000 sequences.

The files look like this:

>lcl|AY289593.1_prot_AAQ74417.1_1 [protein=FabH-like protein] [protein_id=AAQ74417.1] [location=complement(<1..775)] [gbkey=CDS]
MRPINDIQVDGVPNDHTIVQSDYISFTEADEPATVMATRAATEALTTSELVSADVGVLIYAAIIGDAHHF
APVCHVQRVLRAPDALAFELSAASNGGTQGIAVAANLMTADAPVKAALVCTAYRHPIDIISRWSSGMVFG
DGAAAAVLSRDGGMVRLISGYHGSLPELEVLARNRSNERLGFVLPDVGLGKYLTAIARMYQAVIAQVLEE
AQTSIAEIDYFGLIGIGIPSLTATILEPNGIPVNKTSWGLLRQMGHVG
>lcl|AY289593.1_prot_AAQ74418.1_2 [protein=type I polyketide synthase loading module] [protein_id=AAQ74418.1] [location=4126..>6747] [gbkey=CDS]
MLGDAVAVVGMSCRVPGASDPDALWALLRDGISVVDEIPSARWNLDGLVAHRLTDEQRSALRHGAFLDDV
EGFDAAFFGINPSEAGSMDPQQRLMLELTWAALEDARIVPEHLSGSSSGVFTGAMSDDYTTAVTYRAAMT
AHTFAGTHRSLIANRVSYTLGLRGPSLVIDTGQSSSLVAVHVAMESLRREETSLAIAGGIHLNLSLAAAL
SAAHFGALSPDGRCYTFDARANGYVRGEGGGVVVLKRLNDALADGNHIYCVIRGSSVNNDGATQDLTAPG
VDGQRQALLQAYERAEIDPSEVQYVELHGTGTRLGDPTEAHSLHSVFGTSTVPRSPLLVGSIKTNIGHLE
GAAGILGLIKTALAVHHRQLPPSLNYTVPNPKIPLEQLGLRVQTTLSEWPDLDKPLTAGVSSFSMGGTNA
HLILQQPPTPDTTQTPNPTTGSDPAVGSDSAVGSDPAVGVLVWPLSARSAPGLSAQAARLYQHLSAHPDL
DPIDVAHSLATTRSHHPHRATITTSIEHHSENNHDTTDALAALHALANNGTHPLLSRGLLTPQGPGKTVF
VFPGQGSQYPGMGADLYRQFPVFAHALDEVAAALNPHLDVALLEVMFSQQDTAMAQLLDQTFYAQPALFA
LGTALHRLFTHAGIHPDYLLGHSIGELTAAYAAGVLSLQDAATLVTSRGRLMQSCTPGGTMLALQASEAE
VQPLLEGLDHAVSIAAINGATSIVLSGDHDSLEQIGEHFITQDRRTTRLQVSHAFHSPHMDPILEQFRQI
AAQLTFSAPTLPILSNLTGQIARHDQLASPDYWTQQLRNTVRFHDTVAALLGAGEQVFLELSPHPVLTQA
ITDTVEQAGGGGAAVPALRKDRPDAVAFAAALGQ

>lcl|AY289596.1_prot_AAQ74421.1_1 [protein=type I polyketide synthase extender module] [protein_id=AAQ74421.1] [location=<1..>4439] [gbkey=CDS]
DTACSSSLVAIHLACQSLRNNESQLALAGGVTVMSTPAVFTEFSRQRGLAPDGRCKAFAATADGTGFGEG
AAVLVLERLSEARRNNHPVLAIVAGSAINQDGASNGLTAPHGPSQQRVINQALANAGLTHDQVDAVEAHG
TGTTLGDPIEAGALHATYGHHHTPDQPLWLGSIKSNIGHTQAAAGAVGVVKMIQAITHATLPATLHVDQP
GPHIDWSSGTVRLLTEPIQWPNTNHPRTAAVSSFGISGTNAHLILQQPPTPNPTQTPEDCSPAQSPCATI
TDAGTGLSFVPWVISAKSAEALSAQASRLLTRLDDDPVVDAIDLGWSLIATRSMFEHRAVVVGADRHQLQ
RGLAELASGNLGADVVVGRARAAGETVMVFPGQGSQRLGMGAQLYEQFPVFAAAFDDVVDALDQYLRLPL
RQVMWGDDEGLLNSTEFAQPSLFAVEVALFALLRFWGVVPDYVIGHSVGELAAAQVAGVLSLQDAAKLVS
ARGRLMQALPAGGAMVAVAASQHEVEPLLVEGVDIAALNAPGSVVISGDQAAVRLIANRLADRGYRAHEL

I have not listed the complete file here because it is large. It contains duplicates (note the strings after "prot"), so I wrote a script to remove them:

from Bio import SeqIO
import sys

inputFile = sys.argv[1]
# Base name for the output file (the text before the first '.')
inputName = inputFile.split('.')[0]

# Descriptions (full header lines) that have already been written
idList = []

seqiter = SeqIO.parse(inputFile, 'fasta')

# Write the first record seen for each description to <inputName>_nodup.fasta
with open(inputName + '_nodup.fasta', 'w') as outFile:
    for record in seqiter:
        if record.description not in idList:
            idList.append(record.description)
            SeqIO.write(record, outFile, "fasta")

It does the job, but it is very slow.

I think there should be a smarter way. Can any expert help? Thanks!
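
One likely cause of the slowdown is the membership test against a list: `record.description not in idList` scans the whole list for every record, so the total work grows roughly quadratically with the number of sequences. A minimal sketch of an alternative follows, assuming duplicates are defined by an identical header line (the same key the script above uses via `record.description`). It swaps the list for a `set` and, so that it runs without Biopython, reads the file line by line. The function name `dedupe_fasta` is hypothetical, and unlike `SeqIO.write` this version copies sequence lines as they are rather than re-wrapping them.

```python
def dedupe_fasta(in_path, out_path):
    """Copy in_path to out_path, keeping only the first record per header line.

    Returns the number of records written.
    """
    seen = set()   # headers already written; set lookups are O(1) on average
    keep = False   # whether the record currently being read should be copied
    kept = 0
    with open(in_path) as src, open(out_path, "w") as dst:
        for line in src:
            if line.startswith(">"):
                # Header line: the text after '>' is the deduplication key
                # (an assumption that mirrors record.description).
                header = line[1:].strip()
                keep = header not in seen
                if keep:
                    seen.add(header)
                    kept += 1
            elif not line.strip():
                continue  # skip blank lines between records
            if keep:
                # Make sure the last line of the file also ends with a newline.
                dst.write(line if line.endswith("\n") else line + "\n")
    return kept
```

It could be called as `dedupe_fasta("input.fasta", "input_nodup.fasta")`, where both file names are placeholders. The same change (a `set` with `add` instead of a list with `append`) can also be made directly in the Biopython script if the `SeqIO` output formatting is preferred.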

0 answers:

No answers