处理FASTA文件以输出标题和所有匹配的子序列

时间:2017-10-22 04:13:08

标签: regex perl header unique fasta

我需要在FASTA文件中搜索DNA序列的某些区域。对于每个匹配,我需要打印序列标题,然后打印该序列中的所有匹配项。我想打印标题一次,然后是匹配的部分。

以下代码的输出很接近,但DNA的匹配区域在上方标题上打印而不是在其下方。我无法翻转这两段代码,因为这会切断第一批结果。

# First, I open my file and print a warning if it fails.
unless ( open FILE, "<", '/scratch/SampleDataFiles/test.fasta' ) {
    die "Sorry", $!;
}

$/ = ">";    # This changes the record separator from \n to >, so I can chomp it later.

my @file = <FILE>;
my $file = "@file";
chomp $file;

# To view the file I can--
# print $file;

my $count = 0;    # here I will count the matched regions

my $sequence_count = 0;    # here I will count the sequences
                           # that contain a matched region

foreach $file ( @file ) {

    # I look for each header and its following sequence
    # And count the total sequences in the file
    if ( $file =~ /(.*;.*;?\n)(\w+)/ ) {

        my $head     = $1;
        my $sequence = $2;

        $sequence_count = $sequence_count + 1;

        # Now, I use the sequences I matched and search for a
        # hydrophobic region

        while ( $sequence =~ /([VILMFWCA]{8,}?)/gi ) {

            # I want to know what the position of the match is
            my $pos = pos( $sequence ) - 7;

            print "\n", $1, " found at ", $pos;
        }

        # I use the count variable I made earlier to count up each
        # time I match a sequence that has one or more hydrophobic region

        if ( $sequence =~ /([VILMFWCA]{8,}?)/gi ) {

            print "\n",
                    "Hydrophobic region(s) found in ",
                    $head,
                    "\n",
                    "-------------------------------------",
                    "\n";

            $count = $count + 1;
        }

    }
}

print "Hydrophobic region(s) found in ",
        $count,
        " out of ",
        $sequence_count,
        " sequences.",
        "\n",
        "\n";

这是输出:

AVVAAVMW found at 325
Hydrophobic region(s) found in P30450 | Homo sapiens (Human). | 
NCBI_TaxID=9606; | 365 |    Name=HLA-A; Synonyms=HLAA;

-------------------------------------

VAVLMLCL found at 170
LLALVAIF found at 493
IWICWFAA found at 705
LALALAFA found at 970
Hydrophobic region(s) found in A7MBM2 | Homo sapiens (Human). | 
NCBI_TaxID=9606; | 1401 |    Name=DISP2; Synonyms=DISPB, KIAA1742;

-------------------------------------
Hydrophobic region(s) found in 2 out of 15 sequences.

如果我切换它,这是我得到的输出:

Hydrophobic region(s) found in P30450 | Homo sapiens (Human). | 
NCBI_TaxID=9606; | 365 |    Name=HLA-A; Synonyms=HLAA;

-------------------------------------

Hydrophobic region(s) found in A7MBM2 | Homo sapiens (Human). | 
NCBI_TaxID=9606; | 1401 |    Name=DISP2; Synonyms=DISPB, KIAA1742;


LLALVAIF found at 493
IWICWFAA found at 705
LALALAFA found at 970

Hydrophobic region(s) found in 2 out of 15 sequences.` 

根据我老师的建议,我已按如下方式调整了我的代码,以便在较大的while循环中包含所有内容,并使用计数器限制打印数量。这个新代码打印每个新标题一次,在它下面打印一个找到的DNA区域的每个实例(基本上翻转我以前的)。

新代码:

    my $count      = 0;    # here I will count the matched regions
    my $temp_count = 0;    # this I will use temporarily to count

    my $sequence_count = 0;    # here I will count the sequences
                               # that contain a matched region

    if ( $file =~ /(.*;.*;?\n)(\w+)/ ) {

        my $head     = $1;
        my $sequence = $2;

        $sequence_count = $sequence_count + 1;

        # Now I use the sequences that I found, and
        # search them for a hydrophobic region
        while ( $sequence =~ /([VILMFWCA]{8,}?)/gi ) {

            # I use the count variables I made earlier
            # I count all times I match a sequence that has one or more hydrophobic region
            $temp_count = $temp_count + 1;

            # But I don't want the header repeated for the same sequence, so I limit the
            # times that it can print
            if ( $temp_count <= 2 ) {
                print "\n", "Hydrophobic region(s) found in ", $head, "\n";
                $count = $count + 1;
            }

            # I want to know what the position of the match is
            # within the sequence
            my $pos = pos( $sequence ) - 7;
            print $1, " found at ", $pos, "\n", "\n";
        }
    }
}

print "\n",
        "\n",
        "-------------------------",
        "\n",
        "Hydrophobic region(s) found in ",
        $count,
        " out of ",
        $sequence_count,
        " sequences.",
        "\n",
        "\n";

如果有用,这是文件的样子:

>P31946 | Homo sapiens (Human). | NCBI_TaxID=9606; | 246 |    Name=YWHAB;
MTMDKSELVQKAKLAEQAERYDDMAAAMKAVTEQGHELSNEERNLLSVAYKNVVGARRSSWRVISSIEQKTERNEKKQQMGKEYREKIEAELQDICNDVLELLDKYLIPNATQPESKVFYLKMKGDYFRYLSEVASGDNKQTTVSNSQQAYQEAFEISKKEMQPTHPIRL
GLALNFSVFYYEILNSPEKACSLAKTAFDEAIAELDTLNEESYKDSTLIMQLLRDNLTLWTSENQGDEGDAGEGEN
>P62258 | Homo sapiens (Human). | NCBI_TaxID=9606; | 255 |    Name=YWHAE;
MDDREDLVYQAKLAEQAERYDEMVESMKKVAGMDVELTVEERNLLSVAYKNVIGARRASWRIISSIEQKEENKGGEDKLKMIREYRQMVETELKLICCDILDVLDKHLIPAANTGESKVFYYKMKGDYHRYLAEFATGNDRKEAAENSLVAYKAASDIAMTELPPTHPIR
LGLALNFSVFYYEILNSPDRACRLAKAAFDDAIAELDTLSEESYKDSTLIMQLLRDNLTLWTSDMQGDGEEQNKEALQDVEDENQ
>Q04917 | Homo sapiens (Human). | NCBI_TaxID=9606; | 246 |    Name=YWHAH; Synonyms=YWHA1;
MGDREQLLQRARLAEQAERYDDMASAMKAVTELNEPLSNEDRNLLSVAYKNVVGARRSSWRVISSIEQKTMADGNEKKLEKVKAYREKIEKELETVCNDVLSLLDKFLIKNCNDFQYESKVFYLKMKGDYYRYLAEVASGEKKNSVVEASEAAYKEAFEISKEQMQPTHP
IRLGLALNFSVFYYEIQNAPEQACLLAKQAFDDAIAELDTLNEDSYKDSTLIMQLLRDNLTLWTSDQQDEEAGEGN
>P30450 | Homo sapiens (Human). | NCBI_TaxID=9606; | 365 |    Name=HLA-A; Synonyms=HLAA;
MAVMAPRTLVLLLSGALALTQTWAGSHSMRYFYTSVSRPGRGEPRFIAVGYVDDTQFVRFDSDAASQRMEPRAPWIEQEGPEYWDRNTRNVKAHSQTDRANLGTLRGYYNQSEDGSHTIQRMYGCDVGPDGRFLRGYQQDAYDGKDYIALNEDLRSWTAADMAAQITQRK
WETAHEAEQWRAYLEGRCVEWLRRYLENGKETLQRTDAPKTHMTHHAVSDHEATLRCWALSFYPAEITLTWQRDGEDQTQDTELVETRPAGDGTFQKWASVVVPSGQEQRYTCHVQHEGLPKPLTLRWEPSSQPTIPIVGIIAGLVLFGAVIAGAVVAAVMWRRKSSDRK
GGSYSQAASSDSAQGSDMSLTACKV
>Q156A1 | Homo sapiens (Human). | NCBI_TaxID=9606; | 80 |    Name=ATXN8;
MQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQQ
>Q9UQB9 | Homo sapiens (Human). | NCBI_TaxID=9606; | 309 |    Name=AURKC; Synonyms=AIE2, AIK3, ARK3, STK13;
MSSPRAVVQLGKAQPAGEELATANQTAQQPSSPAMRRLTVDDFEIGRPLGKGKFGNVYLARLKESHFIVALKVLFKSQIEKEGLEHQLRREIEIQAHLQHPNILRLYNYFHDARRVYLILEYAPRGELYKELQKSEKLDEQRTATIIEELADALTYCHDKKVIHRDIKPE
NLLLGFRGEVKIADFGWSVHTPSLRRKTMCGTLDYLPPEMIEGRTYDEKVDLWCIGVLCYELLVGYPPFESASHSETYRRILKVDVRFPLSMPLGARDLISRLLRYQPLERLPLAQILKHPWVQAHSRRVLPPCAQMAS
>O75366 | Homo sapiens (Human). | NCBI_TaxID=9606; | 819 |    Name=AVIL;
MPLTSAFRAVDNDPGIIVWRIEKMELALVPVSAHGNFYEGDCYVILSTRRVASLLSQDIHFWIGKDSSQDEQSCAAIYTTQLDDYLGGSPVQHREVQYHESDTFRGYFKQGIIYKQGGVASGMKHVETNTYDVKRLLHVKGKRNIRATEVEMSWDSFNRGDVFLLDLGKV
IIQWNGPESNSGERLKAMLLAKDIRDRERGGRAEIGVIEGDKEAASPELMKVLQDTLGRRSIIKPTVPDEIIDQKQKSTIMLYHISDSAGQLAVTEVATRPLVQDLLNHDDCYILDQSGTKIYVWKGKGATKAEKQAAMSKALGFIKMKSYPSSTNVETVNDGAESAMFK
QLFQKWSVKDQTMGLGKTFSIGKIAKVFQDKFDVTLLHTKPEVAAQERMVDDGNGKVEVWRIENLELVPVEYQWYGFFYGGDCYLVLYTYEVNGKPHHILYIWQGRHASQDELAASAYQAVEVDRQFDGAAVQVRVRMGTEPRHFMAIFKGKLVIFEGGTSRKGNAEPDP
PVRLFQIHGNDKSNTKAVEVPAFASSLNSNDVFLLRTQAEHYLWYGKGSSGDERAMAKELASLLCDGSENTVAEGQEPAEFWDLLGGKTPYANDKRLQQEILDVQSRLFECSNKTGQFVVTEITDFTQDDLNPTDVMLLDTWDQVFLWIGAEANATEKESALATAQQYLH
THPSGRDPDTPILIIKQGFEPPIFTGWFLAWDPNIWSAGKTYEQLKEELGDAAAIMRITADMKNATLSLNSNDSEPKYYPIAVLLKNQNQELPEDVNPAKKENYLSEQDFVSVFGITRGQFAALPGWKQLQMKKEKGLF
>Q9UPA5 | Homo sapiens (Human). | NCBI_TaxID=9606; | 3926 |    Name=BSN; Synonyms=KIAA0434, ZNF231;
MGNEVSLEGGAGDGPLPPGGAGPGPGPGPGPGAGKPPSAPAGGGQLPAAGAARSTAVPPVPGPGPGPGPGPGPGSTSRRLDPKEPLGNQRAASPTPKQASATTPGHESPRETRAQGPAGQEADGPRRTLQVDSRTQRSGRSPSVSPDRGSTPTSPYSVPQIAPLPSSTLC
PICKTSDLTSTPSQPNFNTCTQCHNKVCNQCGFNPNPHLTQVKEWLCLNCQMQRALGMDMTTAPRSKSQQQLHSPALSPAHSPAKQPLGKPDQERSRGPGGPQPGSRQAETARATSVPGPAQAAAPPEVGRVSPQPPQPTKPSTAEPRPPAGEAPAKSATAVPAGLGATE
QTQEGLTGKLFGLGASLLTQASTLMSVQPEADTQGQPAPSKGTPKIVFNDASKEAGPKPLGSGPGPGPAPGAKTEPGARMGPGSGPGALPKTGGTTSPKHGRAEHQAASKAAAKPKTMPKERAICPLCQAELNVGSKSPANYNTCTTCRLQVCNLCGFNPTPHLVEKTEW
LCLNCQTKRLLEGSLGEPTPLPPPTSQQPPVGAPHRASGTSPLKQKGPQGLGQPSGPLPAKASPLSTKASPLPSKASPQAKPLRASEPSKTPSSVQEKKTRVPTKAEPMPKPPPETTPTPATPKVKSGVRRAEPATPVVKAVPEAPKGGEAEDLVGKPYSQDASRSPQSL
SDTGYSSDGISSSQSEITGVVQQEVEQLDSAGVTGPHPPSPSEIHKVGSSMRPLLQAQGLAPSERSKPLSSGTGEEQKQRPHSLSITPEAFDSDEELEDILEEDEDSAEWRRRREQQDTAESSDDFGSQLRHDYVEDSSEGGLSPLPPQPPARAAELTDEDFMRRQILEM
SAEEDNLEEDDTATSGRGLAKHGTQKGGPRPRPEPSQEPAALPKRRLPHNATTGYEELLPEGGSAEATDGSGTLQGGLRRFKTIELNSTGSYGHELDLGQGPDPSLDREPELEMESLTGSPEDRSRGEHSSTLPASTPSYTSGTSPTSLSSLEEDSDSSPSRRQRLEEAK
QQRKARHRSHGPLLPTIEDSSEEEELREEEELLREQEKMREVEQQRIRSTARKTRRDKEELRAQRRRERSKTPPSNLSPIEDASPTEELRQAAEMEELHRSSCSEYSPSPSLDSEAEALDGGPSRLYKSGSEYNLPTFMSLYSPTETPSGSSTTPSSGRPLKSAEEAYEE
MMRKAELLQRQQGQAAGARGPHGGPSQPTGPRGLGSFEYQDTTDREYGQAAQPAAEGTPASLGAAVYEEILQTSQSIVRMRQASSRDLAFAEDKKKEKQFLNAESAYMDPMKQNGGPLTPGTSPTQLAAPVSFSTPTSSDSSGGRVIPDVRVTQHFAKETQDPLKLHSSP
ASPSSASKEIGMPFSQGPGTPATTAVAPCPAGLPRGYMTPASPAGSERSPSPSSTAHSYGHSPTTANYGSQTEDLPQAPSGLAAAGRAAREKPLSASDGEGGTPQPSRAYSYFASSSPPLSPSSPSESPTFSPGKMGPRATAEFSTQTPSPAPASDMPRSPGAPTPSPMV
AQGTQTPHRPSTPRLVWQESSQEAPFMVITLASDASSQTRMVHASASTSPLCSPTETQPTTHGYSQTTPPSVSQLPPEPPGPPGFPRVPSAGADGPLALYGWGALPAENISLCRISSVPGTSRVEPGPRTPGTAVVDLRTAVKPTPIILTDQGMDLTSLAVEARKYGLAL
DPIPGRQSTAVQPLVINLNAQEHTFLATATTVSITMASSVFMAQQKQPVVYGDPYQSRLDFGQGGGSPVCLAQVKQVEQAVQTAPYRSGPRGRPREAKFARYNLPNQVAPLARRDVLITQMGTAQSIGLKPGPVPEPGAEPHRATPAELRSHALPGARKPHTVVVQMGEG
TAGTVTTLLPEEPAGALDLTGMRPESQLACCDMVYKLPFGSSCTGTFHPAPSVPEKSMADAAPPGQSSSPFYGPRDPEPPEPPTYRAQGVVGPGPHEEQRPYPQGLPGRLYSSMSDTNLAEAGLNYHAQRIGQLFQGPGRDSAMDLSSLKHSYSLGFADGRYLGQGLQYG
SVTDLRHPTDLLAHPLPMRRYSSVSNIYSDHRYGPRGDAVGFQEASLAQYSATTAREISRMCAALNSMDQYGGRHGSGGGGPDLVQYQPQHGPGLSAPQSLVPLRPGLLGNPTFPEGHPSPGNLAQYGPAAGQGTAVRQLLPSTATVRAADGMIYSTINTPIAATLPITT
QPASVLRPMVRGGMYRPYASGGITAVPLTSLTRVPMIAPRVPLGPTGLYRYPAPSRFPIASSVPPAEGPVYLGKPAAAKAPGAGGPSRPEMPVGAAREEPLPTTTPAAIKEAAGAPAPAPLAGQKPPADAAPGGGSGALSRPGFEKEEASQEERQRKQQEQLLQLERERV
ELEKLRQLRLQEELERERVELQRHREEEQLLVQRELQELQTIKHHVLQQQQEERQAQFALQREQLAQQRLQLEQIQQLQQQLQQQLEEQKQRQKAPFPAACEAPGRGPPLAAAELAQNGQYWPPLTHAAFIAMAGPEGLGQPREPVLHRGLPSSASDMSLQTEEQWEASR
SGIKKRHSMPRLRDACELESGTEPCVVRRIADSSVQTDDEDGESRYLLSRRRRARRSADCSVQTDDEDSAEWEQPVRRRRSRLPRHSDSGSDSKHDATASSSSAAATVRAMSSVGIQTISDCSVQTEPDQLPRVSPAIHITAATDPKVEIVRYISAPEKTGRGESLACQT
EPDGQAQGVAGPQLVGPTAISPYLPGIQIVTPGPLGRFEKKKPDPLEIGYQAHLPPESLSQLVSRQPPKSPQVLYSPVSPLSPHRLLDTSFASSERLNKAHVSPQKHFTADSALRQQTLPRPMKTLQRSLSDPKPLSPTAEESAKERFSLYQHQGGLGSQVSALPPNSLV
RKVKRTLPSPPPEEAHLPLAGQASPQLYAASLLQRGLTGPTTVPATKASLLRELDRDLRLVEHESTKLRKKQAELDEEEKEIDAKLKYLELGITQRKESLAKDRGGRDYPPLRGLGEHRDYLSDSELNQLRLQGCTTPAGQFVDFPATAAAPATPSGPTAFQQPRFQPPA
PQYSAGSGGPTQNGFPAHQAPTYPGPSTYPAPAFPPGASYPAEPGLPNQQAFRPTGHYAGQTPMPTTQSTLFPVPADSRAPLQKPRQTSLADLEQKVPTNYEVIASPVVPMSSAPSETSYSGPAVSSGYEQGKVPEVPRAGDRGSVSQSPAPTYPSDSHYTSLEQNVPRN
YVMIDDISELTKDSTSTAPDSQRLEPLGPGSSGRPGKEPGEPGVLDGPTLPCCYARGEEESEEDSYDPRGKGGHLRSMESNGRPASTHYYGDSDYRHGARVEKYGPGPMGPKHPSKSLAPAAISSKRSKHRKQGMEQKISKFSPIEEAKDVESDLASYPPPAVSSSLVSR
GRKFQDEITYGLKKNVYEQQKYYGMSSRDAVEDDRIYGGSSRSRAPSAYSGEKLSSHDFSGWGKGYEREREAVERLQKAGPKPSSLSMAHSRVRPPMRSQASEEESPVSPLGRPRPAGGPLPPGGDTCPQFCSSHSMPDVQEHVKDGPRAHAYKREEGYILDDSHCVVSD
SEAYHLGQEETDWFDKPRDARSDRFRHHGGHAVSSSSQKRGPARHSYHDYDEPPEEGLWPHDEGGPGRHASAKEHRHGDHGRHSGRHTGEEPGRRAAKPHARDLGRHEARPHSQPSSAPAMPKKGQPGYPSSAEYSQPSRASSAYHHASDSKKGSRQAHSGPAALQSKAE
PQAQPQLQGRQAAPGPQQSQSPSSRQIPSGAASRQPQTQQQQQGLGLQPPQQALTQARLQQQSQPTTRGSAPAASQPAGKPQPGPSTATGPQPAGPPRAEQTNGSKGTAKAPQQGRAPQAQPAPGPGPAGVKAGARPGGTPGAPAGQPGADGESVFSKILPGGAAEQAGK
LTEAVSAFGKKFSSFW
>Q9NSI6 | Homo sapiens (Human). | NCBI_TaxID=9606; | 2320 |    Name=BRWD1; Synonyms=C21orf107, WDR9;
MAEPSSARRPVPLIESELYFLIARYLSAGPCRRAAQVLVQELEQYQLLPKRLDWEGNEHNRSYEELVLSNKHVAPDHLLQICQRIGPMLDKEIPPSISRVTSLLGAGRQSLLRTAKDCRHTVWKGSAFAALHRGRPPEMPVNYGSPPNLVEIHRGKQLTGCSTFSTAFPG
TMYQHIKMHRRILGHLSAVYCVAFDRTGHRIFTGSDDCLVKIWSTHNGRLLSTLRGHSAEISDMAVNYENTMIAAGSCDKIIRVWCLRTCAPVAVLQGHTGSITSLQFSPMAKGSQRYMVSTGADGTVCFWQWDLESLKFSPRPLKFTEKPRPGVQMLCSSFSVGGMFLA
TGSTDHVIRMYFLGFEAPEKIAELESHTDKVDSIQFCNNGDRFLSGSRDGTARIWRFEQLEWRSILLDMATRISGDLSSEEERFMKPKVTMIAWNQNDSIVVTAVNDHVLKVWNSYTGQLLHNLMGHADEVFVLETHPFDSRIMLSAGHDGSIFIWDITKGTKMKHYFNM
IEGQGHGAVFDCKFSQDGQHFACTDSHGHLLIFGFGCSKPYEKIPDQMFFHTDYRPLIRDSNNYVLDEQTQQAPHLMPPPFLVDVDGNPHPTKYQRLVPGRENSADEHLIPQLGYVATSDGEVIEQIISLQTNDNDERSPESSILDGMIRQLQQQQDQRMGADQDTIPRG
LSNGEETPRRGFRRLSLDIQSPPNIGLRRSGQVEGVRQMHQNAPRSQIATERDLQAWKRRVVVPEVPLGIFRKLEDFRLEKGEEERNLYIIGRKRKTLQLSHKSDSVVLVSQSRQRTCRRKYPNYGRRNRSWRELSSGNESSSSVRHETSCDQSEGSGSSEEDEWRSDRK
SESYSESSSDSSSRYSDWTADAGINLQPPLRTSCRRRITRFCSSSEDEISTENLSPPKRRRKRKKENKPKKENLRRMTPAELANMEHLYEFHPPVWITDTTLRKSPFVPQMGDEVIYFRQGHEAYIEAVRRNNIYELNPNKEPWRKMDLRDQELVKIVGIRYEVGPPTLC
CLKLAFIDPATGKLMDKSFSIRYHDMPDVIDFLVLRQFYDEARQRNWQSCDRFRSIIDDAWWFGTVLSQEPYQPQYPDSHFQCYIVRWDNTEIEKLSPWDMEPIPDNVDPPEELGASISVTTDELEKLLYKPQAGEWGQKSRDEECDRIISGIDQLLNLDIAAAFAGPVD
LCTYPKYCTVVAYPTDLYTIRMRLVNRFYRRLSALVWEVRYIEHNARTFNEPESVIARSAKKITDQLLKFIKNQHCTNISELSNTSENDEQNAEDLDDSDLPKTSSGRRRVHDGKKSIRATNYVESNWKKQCKELVNLIFQCEDSEPFRQPVDLVEYPDYRDIIDTPMDF
GTVRETLDAGNYDSPLEFCKDIRLIFSNAKAYTPNKRSKIYSMTLRLSALFEEKMKKISSDFKIGQKFNEKLRRSQRFKQRQNCKGDSQPNKSIRNLKPKRLKSQTKIIPELVGSPTQSTSSRTAYLGTHKTSAGISSGVTSGDSSDSAESSERRKRNRPITNGSTLSES
EVEDSLATSLSSSASSSSEESKESSRARESSSRSGLSRSSNLRVTRTRAAQRKTGPVSLANGCGRKATRKRVYLSDSDNNSLETGEILKARAGNNRKVLRKCAAVAANKIKLMSDVEENSSSESVCSGRKLPHRNASAVARKKLLHNSEDEQSLKSEIEEEELKDENQPL
PVSSSHTAQSNVDESENRDSESESDLRVARKNWHANGYKSHTPAPSKTKFLKIESSEEDSKSHDSDHACNRTAGPSTSVQKLKAESISEEADSEPGRSGGRKYNTFHKNASFFKKTKILSDSEDSESEEQDREDGKCHKMEMNPISGNLNCDPIAMSQCSSDHGCETDLD
SDDDKIEKPNNFMKDSASQDNGLSRKISRKRVCSSDSDSSLQVVKKSSKARTGLLRITRRCAATAANKIKLMSDVEDVSLENVHTRSKNGRKKPLHLACTTAKKKLSDCEGSVHCEVPSEQYACEGKPPDPDSEGSTKVLSQALNGDSDSEDMLNSEHKHRHTNIHKIDA
PSKRKSSSVTSSGEDSKSHIPGSETDRTFSSESTLAQKATAENNFEVELNYGLRRWNGRRLRTYGKAPFSKTKVIHDSQETAEKEVKRKRSHPELENVKISETTGNSKFRPDTSSKSSDLGSVTESDIDCTDNTKTKRRKTKGKAKVVRKEFVPRDREPNTKVRTCMHNQ
KDAVQMPSETLKAKMVPEKVPRRCATVAANKIKIMSNLKETISGPENVWIRKSSRKLPHRNASAAAKKKLLNVYKEDDTTINSESEKELEDINRKMLFLRGFRSWKENAQ
>Q96KE9 | Homo sapiens (Human). | NCBI_TaxID=9606; | 485 |    Name=BTBD6; Synonyms=BDPL;
MAAELYAPASAAAADLANSNAGAAVGRKAGPRSPPSAPAPAPPPPAPAPPTLGNNHQESPGWRCCRPTLRERNALMFNNELMADVHFVVGPPGATRTVPAHKYVLAVGSSVFYAMFYGDLAEVKSEIHIPDVEPAAFLILLKYMYSDEIDLEADTVLATLYAAKKYIVPALAKACVNFLETSLEAKNACVLLSQSRLFEEPELTQRCWEVIDAQAEMALRSEGFCEIDRQTLEIIVTREALNTKEAVVFEAVLNWAEAECKRQGLPITPRNKRHVLGRALYLVRIPTMTLEEFANGAAQSDILTLEETHSIFLWYTATNKPRLDFPLTKRKGLAPQRCHRFQSSAYRSNQWRYRGRCDSIQFAVDRRVFIAGLGLYGSSSGKAEYSVKIELKRLGVVLAQNLTKFMSDGSSNTFPVWFEHPVQVEQDTFYTASAVLDGSELSYFGQEGMTEVQCGKVAFQFQCSSDSTNGTGVQGGQIPELIFYA
>P0C7T9 | Homo sapiens (Human). | NCBI_TaxID=9606; | 278 |    Name=BZW1L1;
MENSERNKLAMLTGVLLANGTLNASILNSLYNENLVKEGVSAAFAVKLFKSWINEKDINAVAASLRKVSMDNRLMELFPANKQSVEHFTKYFTEAGLKELSEYVRNQQTIGARKELQKELQEQMSRGDPFKDIILYVKEEMKKNNIPEPVVIGIVWSSVMSTVEWNKKEELVAEQAIKHLKQYSPLLAAFTTQGQSELTLLLKIQEYCYDNIHFMKAFQKIVVLFYKAEVLSEEPILKWYKDAHVAKGKSVFLEQMKKFVEWLKNAEEESESEAEEGD
>Q8IYA2 | Homo sapiens (Human). | NCBI_TaxID=9606; | 1237 |    Name=CCDC144C;
MVSWGGEKRGGAEGSPKPAVYATRKTGSVRSQEDQWYLGYPGDQWSSGFSYSWWKNSVGSESKHGEGALDQPQHDVRLEDLGELHRAARSGDVPGVEHVLVPGDTGVDKRDRKKSIQQLVPEYKEKQTPESLPQNNNPDWHPTNLTLSDETCQRSKNLKVDDKCPSVSPSMPENQSATKELGQMNLTEREKMDTGVKTSQEPEMAKDCDREDIPIYPVLPHVQKSEEMRIEQGKLEWKNQLKLVINELKQRFGEIYEKYKIPACPEEEPLLDNSTRGTDVKDIPFNLTNNIPGCEEEDASEISVSVVFETFPEQKEPSLKNIIHSYYHPYSGSQEHVCQSSSKLHLHENKLDCDNDNKPGIGHIFSTDKNFHNDASTKKARNPEVVTVEMKEDQEFDLQMTKNMNQNSDSGSTNNYKSLKPKLENLSSLPPDSDRTSEVYLHEELQQDMQKFKNEVNTLEEEFLALKKENVQLHKEVEEEMEKHRSNSTELSGTLTDGTTVGNDDDGLNQQIPRKENGEHDRLALKQENEEKRNADMLYNKDSEQLRIKEEECGKVVETKQQLKWNLRRLVKELRTVVQERNDAQKQLSEEQDARILQDQILTSKQKELEMAQKKRNPEISHRHQKEKDLFHENCMLQEEIALLRLEIDTIKNQNKQKEKKYFEDIEVVKEKNDNLQKIIKRNEETLTETILQYSGQLNNLTAENKMLNSELENGKENQERLEIEMESYRCRLAAAVHDCDQSQTARDLKLDFQRTRQEWVRLHDKMKVDMSGLQAKNEILSEKLSNAESKINSLQIQLHNTRDALGRESLILERVQRDLSQTQCQKKETEQMYQSKLKKYIAKQESVEERLSQLQSENMLLRQQLDDVHKKANSQEKTISTIQDQFHSAAKNLQAESEKQILSLQEKNKELMDEYNHLKERMDQCEKEKAGRKIDLTEAQETVPSRCLHLDAENEVLQLQQTLFSMKAIQKQCETLQKNKKQLKQEVVNLKSYMERNMLERGEAEWHKLLIEERARKEIEEKLNEAILTLQKQAAVSHEQLAQLREDNTTSIKTQMELTVIDLESEISRIKTSQADFNKTKLERYKELYLEEVKVRESLSNELSRTNEMIAEVSTQLTVEKEQTRSRSLFTAYATRPVLESPCVGNLNDSEGLNRKHIPRKKRSALKDMESYLLKMQQKLQNDLTAEVAGSSQTGLHRIPQCSSFSSSSLHLLLCSICQPFFLILQLLLNMNLDPI
>A7MBM2 | Homo sapiens (Human). | NCBI_TaxID=9606; | 1401 |    Name=DISP2; Synonyms=DISPB, KIAA1742;
MDGDSSSSSGGSGPAPGPGPEGEQRPEGEPLAPDGGSPDSTQTKAVPPEASPERSCSLHSCPLEDPSSSSGPPPTTSTLQPVGPSSPLAPAHFTYPRALQEYQGGSSLPGLGDRAALCSHGSSLSPSPAPSQRDGTWKPPAVQHHVVSVRQERAFQMPKSYSQLIAEWPVAVLMLCLAVIFLCTLAGLLGARLPDFSKPLLGFEPRDTDIGSKLVVWRALQALTGPRKLLFLSPDLELNSSSSHNTLRPAPRGSAQESAVRPRRMVEPLEDRRQENFFCGPPEKSYAKLVFMSTSSGSLWNLHAIHSMCRMEQDQIRSHTSFGALCQRTAANQCCPSWSLGNYLAVLSNRSSCLDTTQADAARTLALLRTCALYYHSGALVPSCLGPGQNKSPRCAQVPTKCSQSSAIYQLLHFLLDRDFLSPQTTDYQVPSLKYSLLFLPTPKGASLMDIYLDRLATPWGLADNYTSVTGMDLGLKQELLRHFLVQDTVYPLLALVAIFFGMALYLRSLFLTLMVLLGVLGSLLVAFFLYQVAFRMAYFPFVNLAALLLLSSVCANHTLIFFDLWRLSKSQLPSGGLAQRVGRTMHHFGYLLLVSGLTTSAAFYASYLSRLPAVRCLALFMGTAVLVHLALTLVWLPASAVLHERYLARGCARRARGRWEGSAPRRLLLALHRRLRGLRRAAAGTSRLLFQRLLPCGVIKFRYIWICWFAALAAGGAYIAGVSPRLRLPTLPPPGGQVFRPSHPFERFDAEYRQLFLFEQLPQGEGGHMPVVLVWGVLPVDTGDPLDPRSNSSLVRDPAFSASGPEAQRWLLALCHRARNQSFFDTLQEGWPTLCFVETLQRWMESPSCARLGPDLCCGHSDFPWAPQFFLHCLKMMALEQGPDGTQDLGLRFDAHGSLAALVLQFQTNFRNSPDYNQTQLFYNEVSHWLAAELGMAPPGLRRGWFTSRLELYSLQHSLSTEPAVVLGLALALAFATLLLGTWNVPLSLFSVAAVAGTVLLTVGLLVLLEWQLNTAEALFLSASVGLSVDFTVNYCISYHLCPHPDRLSRVAFSLRQTSCATAVGAAALFAAGVLMLPATVLLYRKLGIILMMVKCVSCGFASFFFQSLCCFFGPEKNCGQILWPCAHLPWDAGTGDPGGEKAGRPRPGSVGGMPGSCSEQYELQPLARRRSPSFDTSTATSKLSHRPSVLSEDLQLHDGPCCSRPPPAPASPRELLLDHQAVFSQCPALQTSSPYKQAGPSPKTRARQDSQGEEAEPLPASPEAPAHSPKAKAADPPDGFCSSASTLEGLSVSDETCLSTSEPSARVPDSVGVSPDDLDDTGQPVLERGQLNGKRDTLWLALRETVYDPSLPASHHSSLSWKGRGGPGDGSPVVLPNSQPDLPDVWLRRPSTHTSGYSS
>Q96HU8 | Homo sapiens (Human). | NCBI_TaxID=9606; | 199 |    Name=DIRAS2;
MPEQSNDYRVAVFGAGGVGKSSLVLRFVKGTFRESYIPTVEDTYRQVISCDKSICTLQITDTTGSHQFPAMQRLSISKGHAFILVYSITSRQSLEELKPIYEQICEIKGDVESIPIMLVGNKCDESPSREVQSSEAEALARTWKCAFMETSAKLNHNVKELFQELLNLEKRRTVSLQIDGKKSKQQKRKEKLKGKCVIM
>Q8N4W6 | Homo sapiens (Human). | NCBI_TaxID=9606; | 341 |    Name=DNAJC22;
MAKGLLVTYALWAVGGPAGLHHLYLGRDSHALLWMLTLGGGGLGWLWEFWKLPSFVAQANRAQGQRQSPRGVTPPLSPIRFAAQVIVGIYFGLVALISLSSMVNFYIVALPLAVGLGVLLVAAVGNQTSDFKNTLGSAFLTSPIFYGRPIAILPISVAASITAQRHRRYKALVASEPLSVRLYRLGLAYLAFTGPLAYSALCNTAATLSYVAETFGSFLNWFSFFPLLGRLMEFVLLLPYRIWRLLMGETGFNSSCFQEWAKLYEFVHSFQDEKRQLAYQVLGLSEGATNEEIHRSYQELVKVWHPDHNLDQTEEAQRHFLEIQAAYEVLSQPRKPWGSRR

Desired Output

主要问题

我觉得我的原始代码虽然颠倒了,但更可靠,因为我没有必要告诉它多少次打印标题,它只是寻找并打印出自己唯一的标题。是否有更好的方法只打印标题的新实例,但打印后面所需序列的所有匹配?我找不到指定打印唯一匹配的方法,我不确定是否尝试将所有标题和匹配区域发送到哈希(我不知道该怎么做)。

2 个答案:

答案 0 :(得分:1)

这就是我要写的内容。

我做了几处重大改变

  • 我使用list从下一个序列标题的开头删除chomp,然后检查是否还有一些非空格字符。第一个记录读取只是>,因此这将丢弃空记录

  • 我已从您的正则表达式模式中删除了分号,因为它们没有太大区别,并添加了一些可选的空格来修剪标题上的任何前导和尾随空格

    < / LI>
  • 我已经从>删除了非贪婪修饰符?,因为我没有看到它的原因。也许我错了。我也改变了它,以便即使它们重叠也能找到所有匹配的序列。再一次,也许这是一个糟糕的电话:我不是生物信息学家!

  • 将每个区域的位置计算为[VILMFWCA]{8,}是错误的,因为它取决于匹配的长度。我使用了内置数组pos() - 7@-包含捕获开始的位置$-[1]$1包含$-[2]等。 $2是整场比赛的位置

  • 我将匹配区域及其起始位置保留在数组$-[0]中。搜索完成后,我可以通过检查

  • 的大小来测试是否找到了
@regions

输出

use strict;
use warnings 'all';

my $FASTA_FILE = 'test.fasta';


open my $fh, '<', $FASTA_FILE or die qq{Unable to open "$FASTA_FILE" for input: $!};

local $/ = '>';

while ( <$fh> ) {
    chomp;
    next unless /\S/;

    next unless my ( $head, $seq ) = /\s*(.*\S)\s*\n(\w+)/;

    my @regions;

    while ( $seq =~ / (?= ( [VILMFWCA]{8,} ) ) /gxi ) {
        push @regions, [ $1, $-[1] ];
    }

    next unless @regions; # Skip this sequence if no regions found

    printf "%d Hydrophobic region%s found in %s\n",
            scalar @regions,
            @regions == 1 ? "" : "s",
            $head;

    printf "    %s found at %d\n", @$_ for @regions;
}

答案 1 :(得分:0)

我所做的是从序列匹配if语句正则表达式中删除全局修饰符,但将修饰符保留在while语句正则表达式之后。这样我就不像以前那样输掉第一场比赛,但是我仍然可以在序列匹配之前打印标题。

    unless (open FILE, "<", '/scratch/SampleDataFiles/test.fasta') {
    die "Cannot Open File", $!;
    }

    $/ = ">";       

    my @file = <FILE>;
    my $file = "@file";
    chomp $file;

    my $count = 0;    
    my $sequence_count = 0;   

    foreach $file (@file) {

               if ($file =~ /(.*;.*;?\n)(\w+)/) {

                   my $head = $1;
                   my $sequence = $2;
                   $sequence_count = $sequence_count +1;

                   if ($sequence =~ /([VILMFWCA]{8,}?)/i) {

                            print "\n", "Hydrophobic region(s) found in ", $head, "\n";
                            $count = $count +1;

                   }


                    while ($sequence =~ /([VILMFWCA]{8,}?)/gi) {

                            my $pos = pos($sequence)-7;
                            print $1, " found at ", $pos, "\n", "\n";

                    }
               }
        }


print "\n", "\n", "-------------------------", "\n", "Hydrophobic region(s) 
found in ", $count, " out of ", $sequence_count , " sequences.", "\n", "\n";



close FILE;

感谢您的帮助,伙计们!