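Perl regex - line must contain ADFHKMPRTWCEGILNQSVY and nothing else

Time: 2014-10-28 10:09:02

Tags: regex string perl

Can someone help me? I need a regular expression that matches only lines made up of the characters ADFHKMPRTWCEGILNQSVY and NOTHING else.

I need to loop over lines of text that look like this: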

>gi|46450118|gb|AAS96767.1| femAB family protein [Desulfovibrio vulgaris str. Hildenborough]
MVDLSRKKTQALLPTDILFQTPYWAQVKTRLGMESHAFDIRSSGPWGDVLVLLRRFGRHRVAIVPQGPEV
APPHEDYGVYLESFSLALAEGLGPDVAFIRYDLPWVSPYADEMHDEGWNAFPEARLRELRMNMGTRHWNL
RKSFQDLTVASSLVVDITGEEAAVLERMKPKTRYNIGLARRKGVAVREVGRESLPQFHALYRQTAIRNGF
EPCSITHFSAMFHALCDGAGSTELLFLLATHGTDILAGCIVGLAGRTANFLYGASGNVKRNLMAPYLMHW
TAMCHARDRGCHDYEMGAVPPGHDPAHPFHGLYRFKTGFGGRVALRSGSWDYPLDHAAYRDFCNAESLYR
TDAAPGRTQ

>gi|46450117|gb|AAS96766.1| iron-sulfur protein CooF [Desulfovibrio vulgaris str. Hildenborough]
MNHEELFVIQAEAEKCRACRKCELACIASHNNLTIKEAAKKRTVFAPRVHVVKTDEVKMPVQCRQCKDAP
CARVCPTRALVQDDGVVTMRAQFCAACRLCIMACPYGAISLSFIGLPEEDEAGAMHGREVAVRCDLCSEW
RAREGKSSCACVEACPTKALHMVPLAEARGRHQ

>gi|46450116|gb|AAS96765.1| hydrogenase nickel insertion protein HypA [Desulfovibrio vulgaris str. Hildenborough]
MHEASIVAGIMRIVEEEAARHDVTRIARVRLRVGLLTGVEPRTLTACFELYSEGTVAEGASLDLETVPAL
GTCHACGATFDLHRRCFACPTCGNDDITLEGGRELTIAGLEVPQPEGATA

>gi|46450115|gb|AAS96764.1| carbon monoxide-induced hydrogenase CooH, putative [Desulfovibrio vulgaris str. Hildenborough]
MSTPDSTTQTWTLPVGPLHVALEEPMYFKLDVDGEIVRNVEITAGHVHRGMEALAMRRNLFQNIVLTERV
CSLCSNSHPFTYCMAVEHLAGIEVPARADHLRVVAEEIKRTASHLFNVAILAHIIGFKSLFMHVMEVREI
MQDIKETVYGNRMDLAANCIGGVKYDVDAELLAMLLAGLDKVERNAREIYRIYASDPMVTGRTTGIGVLP
PDEARRFGVVGPVARGSGLAVDVRRDVPYAAYPQLSFDVITEEGCDVRARALVRLREVFESISIIRQCVA
TLPEGAMTVIMPEIPAGQSVARSEAPRGELMYYLRTDGTDIPNRLKWRVPSYMNWDALGVMMRDANVADI
PLIVNSIDPCISCTER

>gi|46450114|gb|AAS96763.1| hydrogenase, CooU subunit, putative [Desulfovibrio vulgaris str. Hildenborough]
MPDNALTAPLATALDALAEAEGFTWTRDAHGNAYGWLRLAERDTLPEAARLLAEGGARLATVTAYDPVRE
PGVPRQEIAYHFDVHGTTLTVTVVLDPECPSVPSITPHFRNADWNEREFMEMYDIAVPGHPNPRRLFLDE
KLDAGIMNTIIPLSTMTNGASTQNLWERILAARPGDKA

>gi|46450113|gb|AAS96762.1| hydrogenase, CooX subunit, putative [Desulfovibrio vulgaris str. Hildenborough]
MFGFLKVLARNVLKGPSTDPFPFAEAHTPARFRGQVRLDPALCVGCAICHHVCAGGAINIAEREDGSGYD
FTVWHNTCALCGLCRHYCPTGAITLSNDWHNAHLQSQKYDWCERQFVPFMQCEGCGAHIRPLPPQLAARA
YGPGGFDFASFMRLCPSCRQLAAARADVHIPEASAMPAAPAGHADEPAIREGDATAVTVKGDETPATGVQ
Q

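The header lines all start with >, so I could look for that. However, I want to be sure I am getting the right lines, so I would also like a regex that matches the lines containing only ADFHKMPRTWCEGILNQSVY.

Cheers,

Stefan

3 answers:

Answer 0 (score: 5)

Something like this: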

/^[ADFHKMPRTWCEGILNQSVY]+$/
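
As a minimal sketch of how this pattern might be combined with the check for > that the question mentions, the helper below sorts each line into one of four kinds. The name classify_line and the four labels are assumptions made for illustration; they do not come from the answer.

```perl
use strict;
use warnings;

# The pattern from the answer, compiled once.
my $RESIDUES = qr/^[ADFHKMPRTWCEGILNQSVY]+$/;

# classify_line is a hypothetical helper, not part of the original answer.
# It works on a copy of the line, so the caller's string is left untouched.
sub classify_line {
    my ($line) = @_;
    chomp $line;
    return 'header'   if $line =~ /^>/;       # FASTA header line
    return 'blank'    if $line =~ /^\s*$/;    # empty or whitespace only
    return 'sequence' if $line =~ $RESIDUES;  # only the 20 listed letters
    return 'other';                           # anything else is suspicious
}

for my $line (">gi|46450118| femAB family protein\n", "TDAAPGRTQ\n", "\n", "TDAAPGRTQ 9\n") {
    printf "%-8s %s", classify_line($line), $line;
}
```

Reporting 'other' separately, rather than silently skipping it, makes it easier to spot lines that are neither headers nor clean sequence data.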

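Answer 1 (score: 2)

You just need to construct a regular expression that allows one or more of the listed characters, and nothing else, between the start and the end of the line. Here is a sample script: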

use strict;
use warnings;

while (<DATA>) {
    if (/^[ADFHKMPRTWCEGILNQSVY]+$/) {
        print $_;
    }
}

__DATA__
>gi|46450118|gb|AAS96767.1| femAB family protein [Desulfovibrio vulgaris str. Hildenborough]
MVDLSRKKTQALLPTDILFQTPYWAQVKTRLGMESHAFDIRSSGPWGDVLVLLRRFGRHRVAIVPQGPEV
APPHEDYGVYLESFSLALAEGLGPDVAFIRYDLPWVSPYADEMHDEGWNAFPEARLRELRMNMGTRHWNL
RKSFQDLTVASSLVVDITGEEAAVLERMKPKTRYNIGLARRKGVAVREVGRESLPQFHALYRQTAIRNGF
EPCSITHFSAMFHALCDGAGSTELLFLLATHGTDILAGCIVGLAGRTANFLYGASGNVKRNLMAPYLMHW
TAMCHARDRGCHDYEMGAVPPGHDPAHPFHGLYRFKTGFGGRVALRSGSWDYPLDHAAYRDFCNAESLYR
TDAAPGRTQ

Output:

MVDLSRKKTQALLPTDILFQTPYWAQVKTRLGMESHAFDIRSSGPWGDVLVLLRRFGRHRVAIVPQGPEV
APPHEDYGVYLESFSLALAEGLGPDVAFIRYDLPWVSPYADEMHDEGWNAFPEARLRELRMNMGTRHWNL
RKSFQDLTVASSLVVDITGEEAAVLERMKPKTRYNIGLARRKGVAVREVGRESLPQFHALYRQTAIRNGF
EPCSITHFSAMFHALCDGAGSTELLFLLATHGTDILAGCIVGLAGRTANFLYGASGNVKRNLMAPYLMHW
TAMCHARDRGCHDYEMGAVPPGHDPAHPFHGLYRFKTGFGGRVALRSGSWDYPLDHAAYRDFCNAESLYR
TDAAPGRTQ

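Deconstructing the regular expression, we have:

  • ^ matches the start of the string
  • [ADFHKMPRTWCEGILNQSVY] matches any single character listed inside the square brackets
  • [ADFHKMPRTWCEGILNQSVY]+ means match that character class 1 or more times
  • $ matches the end of the string, or just before a trailing newline at the end of the string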
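
To make the role of the anchors concrete, here is a small sketch that compares three variants of the pattern. The helper names full_line, strict_end and unanchored are made up for this illustration. It relies on standard Perl behaviour: without the /m modifier, $ also matches just before a final newline, whereas \z matches only at the very end of the string.

```perl
use strict;
use warnings;

# Hypothetical helpers, each returning 1 on a match and 0 otherwise.
sub full_line  { my ($s) = @_; return $s =~ /^[ADFHKMPRTWCEGILNQSVY]+$/  ? 1 : 0 }
sub strict_end { my ($s) = @_; return $s =~ /^[ADFHKMPRTWCEGILNQSVY]+\z/ ? 1 : 0 }
sub unanchored { my ($s) = @_; return $s =~ /[ADFHKMPRTWCEGILNQSVY]/     ? 1 : 0 }

for my $s ("TDAAPGRTQ", "TDAAPGRTQ\n", "", ">gi|AAS96767.1|") {
    (my $shown = $s) =~ s/\n/\\n/g;   # make the newline visible when printing
    printf "%-18s full_line=%d strict_end=%d unanchored=%d\n",
        "'$shown'", full_line($s), strict_end($s), unanchored($s);
}
```

Because $ tolerates the trailing newline, the script above works without chomp. The unanchored variant accepts the header line too, since it only needs one listed letter somewhere in the string, which is why both anchors are needed.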

Answer 2 (score: 0)

My code:

#!/usr/bin/perl

while (<>)
{

    # The hyphen goes last in the class so it is a literal '-' and not the range '+' to ','.
    if (/[ADFHKMPRTWCEGILNQSVY]/ and !/[0-9a-z>:;+,.-]/)
    {
        chomp;

        for ($i = 0; $i < length($_); $i++)
        {

            if (substr($_,$i,1) eq "A")
            {
            $aminoacids{A}++;
            }
            elsif (substr($_,$i,1) eq "D")
            {
            $aminoacids{D}++;
            }
            elsif (substr($_,$i,1) eq "F")
            {
            $aminoacids{F}++;
            }
            elsif (substr($_,$i,1) eq "H")
            {
            $aminoacids{H}++;
            }
            elsif (substr($_,$i,1) eq "K")
            {
            $aminoacids{K}++;
            }
            elsif (substr($_,$i,1) eq "M")
            {
            $aminoacids{M}++;
            }
            elsif (substr($_,$i,1) eq "P")
            {
            $aminoacids{P}++;
            }
            elsif (substr($_,$i,1) eq "R")
            {
            $aminoacids{R}++;
            }
            elsif (substr($_,$i,1) eq "T")
            {
            $aminoacids{T}++;
            }
            elsif (substr($_,$i,1) eq "W")
            {
            $aminoacids{W}++;
            }
            elsif (substr($_,$i,1) eq "C")
            {
            $aminoacids{C}++;
            }
            elsif (substr($_,$i,1) eq "E")
            {
            $aminoacids{E}++;
            }
            elsif (substr($_,$i,1) eq "G")
            {
            $aminoacids{G}++;
            }
            elsif (substr($_,$i,1) eq "I")
            {
            $aminoacids{I}++;
            }
            elsif (substr($_,$i,1) eq "L")
            {
            $aminoacids{L}++;
            }
            elsif (substr($_,$i,1) eq "N")
            {
            $aminoacids{N}++;
            }
            elsif (substr($_,$i,1) eq "Q")
            {
            $aminoacids{Q}++;
            }
            elsif (substr($_,$i,1) eq "S")
            {
            $aminoacids{S}++;
            }
            elsif (substr($_,$i,1) eq "V")
            {
            $aminoacids{V}++;
            }
            elsif (substr($_,$i,1) eq "Y")
            {
            $aminoacids{Y}++;
            }
            else
            {
            print "BAD AMINO ACID  $i  ", substr($_,$i,1), "  ", $_, "\n";
            }
        }


    }


}

foreach $key (keys %aminoacids)
{
print "$key -> $aminoacids{$key}\n";
}
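
The same tally could probably be written more compactly by letting the hash key be the character itself. The sketch below is one possible shortening and not a drop-in replacement: count_residues is a hypothetical name, and unlike the code above it skips any line that is not made up entirely of the 20 letters instead of printing a BAD AMINO ACID message for each offending character.

```perl
use strict;
use warnings;

# count_residues is a hypothetical helper; it returns a reference to a hash
# mapping each one-letter code to the number of times it was seen.
sub count_residues {
    my (@lines) = @_;
    my %count;
    for my $line (@lines) {
        chomp(my $seq = $line);
        # Skip headers, blank lines and anything with other characters.
        next unless $seq =~ /^[ADFHKMPRTWCEGILNQSVY]+$/;
        $count{$_}++ for split //, $seq;
    }
    return \%count;
}

my $counts = count_residues(">gi|46450118| femAB family protein\n", "TDAAPGRTQ\n", "Q\n");
print "$_ -> $counts->{$_}\n" for sort keys %$counts;
```

To run it over a file, the lines read with <> could be passed in, for example count_residues(<>).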