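Find and print a specific pattern in a given string

Time: 2016-10-16 22:30:54

Tags: python perl

I am writing code to find a specific pattern in a given string using Python or Perl. I had some success finding the pattern in C, but Python or Perl is mandatory for this task, and I am new to both languages.

My string looks like this (an amino acid sequence):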

 MKTSGNQDEILVIRKGWLTINNIGIMKGGSKEYWFVLTAENLSWYKDDEEKEKKYMLSVDNLKLRDVEKGFMSSKHIFAL

The pattern I want to find is

 KXXXXXX(K\R)XR 

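Note that the number of letters between the first K and K\R is not fixed. However, there is exactly one letter between K\R and R. So, based on a "minimal pattern" search, my pattern in the given string lies between letter nos. 54 and 65 (if I counted correctly):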

  KYMLSVDNLKLR
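The positions quoted above can be checked with a few lines of Python. This is only a sanity-check sketch: `seq` and `slice_1based` are names made up for illustration, and the positions are taken to be 1-based and inclusive.

```python
seq = "MKTSGNQDEILVIRKGWLTINNIGIMKGGSKEYWFVLTAENLSWYKDDEEKEKKYMLSVDNLKLRDVEKGFMSSKHIFAL"

def slice_1based(text, first, last):
    """Return the substring between 1-based, inclusive positions first..last."""
    return text[first - 1:last]

candidate = slice_1based(seq, 54, 65)
print(candidate)
```

In zero-based terms (as Python and Perl count) the same stretch runs from offset 53 to offset 64.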

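Previously, I used C if-else conditions to break up the given string and print out the positions (not entirely successfully).

    printf("%c", word[i]);
    if ((word[i] == 'K' || word[i] == 'R') && word[i+2] == 'R') {
        printf("\n");
        printf("%d\n", i);
    }

I admit that it captures everything. It would be great if someone could help me solve this.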

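3 answers:

Answer 0: (score: 0)

Regardless of the language, this looks like a good fit for a regular expression.

Here is an example of how to do it with a regular expression in Python. If you want the index where the match starts, you can do this: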

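import re
s = "MKTSGNQDEILVIRKGWLTINNIGIMKGGSKEYWFVLTAENLSWYKDDEEKEKKYMLSVDNLKLRDVEKGFMSSKHIFAL"
m = re.search(r'K(?:[A-JL-Z]+?|K)[KR][A-Z]R', s)
print(m.start())  # prints the index where the match starts
print(m.group())  # prints the matching string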

Or, as @bunji pointed out, you can also use finditer:

for m in re.finditer(r'K(?:[A-JL-Z]+?|K)[KR][A-Z]R', s):
    print(m.start())  # prints the index where the match starts
    print(m.group())  # prints the matching string

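Answer 1: (score: 0)

You say you want the match to be non-greedy, but that doesn't make sense. I think you are trying to find the minimal match. If so, that is hard to do. Here is the regex you would need: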

/
    K
    (?: (?: [^KR] | R(?!.R) )+
    |   .
    )
    [KR]
    .
    R
/sx

However, I wouldn't be surprised if it has a bug. The only reliable way to find the minimal match is to find all possible matches.

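my $match;
# The lookahead lets candidates overlap, so every K is tried as a start.
while (/(?= ( K.+?[KR].R ) )/sxg) {
    # Keep the shortest candidate seen so far.
    if (!defined($match) || length($1) < length($match)) {
        $match = $1;
    }
}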

But this will be much slower, especially for long strings.
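For readers who prefer Python, the same idea (try every start position, keep the shortest candidate) might look roughly like the sketch below. The names `seq` and `shortest_match` are invented for illustration, and the lazy `.+?` is my assumption about what "minimal" should mean for each start position.

```python
import re

seq = "MKTSGNQDEILVIRKGWLTINNIGIMKGGSKEYWFVLTAENLSWYKDDEEKEKKYMLSVDNLKLRDVEKGFMSSKHIFAL"

def shortest_match(text):
    """Return (start, substring) for the shortest K...[KR].R match, or None."""
    best = None
    # The pattern is wrapped in a lookahead so that candidates may overlap:
    # the regex engine retries at every offset instead of skipping past a match.
    for m in re.finditer(r"(?=(K.+?[KR].R))", text, flags=re.DOTALL):
        candidate = m.group(1)
        if best is None or len(candidate) < len(best[1]):
            best = (m.start(), candidate)
    return best

result = shortest_match(seq)
print(result)
```

On ties the earliest candidate is kept, because the comparison is strict.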

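Answer 2: (score: -1)

I am only doing it this way because I hate backtracking in regexes. But I do find that it is usually faster if I match the most restrictive part first. Reversing both the input and the search pattern makes that simpler in this case. This should stop at the first (shortest) possible match, rather than finding the longest match and then looking for the shortest.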

#!/usr/bin/perl
use strict;
use warnings;

my $pattern = "MKTSGNQDEILVIRKGWLTINNIGIMKGGSKEYWFVLTAENLSWYKDDEEKEKKYMLSVDNLKLRDVEKGFMSSKHIFAL";
my $reverse = reverse $pattern;
my $length  = length $reverse;
if( $reverse =~ /(R.[KR][^K]+K)/ ) {
    my $match   = $1;
    $match      = reverse $match;
    my $start_p = $length-$+[0];
    my $end_p   = $length-$-[0]-1;
    my $where   = $start_p + length $match;
    print "FOUND ...\n";
    print "0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890\n";
    print $pattern."\n";
    printf "%${where}s\n", $match;
    print "Found pattern '$match' starting at position '$start_p' and ending at position '$end_p'\n";
    # test it
    if( $pattern =~ /$match/ ) {
        if( $start_p == $-[0] && $end_p == $+[0]-1 ) {
            print "Test successful, match found in original pattern.\n";
        } else {
            print "Test failed, you screwed something up!\n";
        }
    } else {
        print "Hmmm, pattern '$match' wasn't found in '$pattern'?\n";
    }
} else {
    print "Dang, no match was found!\n";
}

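I am not sure whether eliminating the backtracking here outweighs the performance cost of reversing the string. I guess it depends heavily on the size of the input string and the length of the possible matches.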

$> perl ./search.pl
FOUND ...
0123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890
MKTSGNQDEILVIRKGWLTINNIGIMKGGSKEYWFVLTAENLSWYKDDEEKEKKYMLSVDNLKLRDVEKGFMSSKHIFAL
                                                     KYMLSVDNLKLR
Found pattern 'KYMLSVDNLKLR' starting at position '53' and ending at position '64'
Test successful, match found in original pattern.

My apologies to those who don't understand why I start counting from zero.
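Since the question allows Python as well, here is a rough Python sketch of the same reverse-and-search trick, mainly to show how the offsets in the reversed string map back to zero-based offsets in the original. The name `rsearch_once` is hypothetical; the pattern is the one used in the Perl script above.

```python
import re

seq = "MKTSGNQDEILVIRKGWLTINNIGIMKGGSKEYWFVLTAENLSWYKDDEEKEKKYMLSVDNLKLRDVEKGFMSSKHIFAL"

def rsearch_once(text):
    """Search the reversed text and return (match, start, end) or None.

    start and end are zero-based, inclusive offsets in the original text.
    """
    m = re.search(r"R.[KR][^K]+K", text[::-1])
    if m is None:
        return None
    n = len(text)
    start = n - m.end()        # offset of the leading K in the original
    end = n - m.start() - 1    # offset of the trailing R in the original
    return text[start:end + 1], start, end

print(rsearch_once(seq))
```

Slicing `text[start:end + 1]` gives the match in its original orientation, so no second reversal is needed.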

And here is a more realistic example: it finds matches that are interleaved with one another.

#!/usr/bin/perl
use strict;
use warnings;
# NOTE THE INPUT WAS MODIFIED FROM OP
my $input = "MKTSGNQDEILVIRKKRKRRGWKLTINNIRGRIMRGRKGGSKEYWFVLTAENLSWYKDDEEKEKKYMLSVDNLKLRDVEKGFMSSKHIFALKGR";

my $rstart = length $input;
my( $match, $start, $end ) = rsearch( $input, "R.[KR].+?K" );
while( $match ) {
    print "Found matching pattern '$match' starting at offset '$start' and ending at offset $end\n";
    $input = substr $input, 0, $end;
    ( $match, $start, $end ) = rsearch( $input, "R.[KR].+?K" );
}
exit(0);

sub rsearch {
    my( $input, $pattern ) = @_;
    my $reverse = reverse $input;

    if( $reverse =~ /($pattern)/ ) {
        my $length = length $reverse;
        $match = reverse $1;
        $start = $length-$+[0];
        $end   = $length-$-[0]-1;
        return( $match, $start, $end );
    }

    return( undef );
}

perl ./search.pl
Found matching pattern 'KHIFALKGR' starting at offset '85' and ending at offset 93
Found matching pattern 'KYMLSVDNLKLR' starting at offset '64' and ending at offset 75 
Found matching pattern 'KLTINNIRGRIMRGR' starting at offset '22' and ending at offset 36
Found matching pattern 'KLTINNIRGR' starting at offset '22' and ending at offset 31
Found matching pattern 'KRKRR' starting at offset '15' and ending at offset 19 
Found matching pattern 'KKRKR' starting at offset '14' and ending at offset 18
Found matching pattern 'KTSGNQDEILVIRKKR' starting at offset '1' and ending at offset 16
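The repeated search can also be sketched in Python. This is an illustrative port under the same assumptions as the Perl script (reversed pattern `R.[KR].+?K`, input cut just before the last letter of each match); `rsearch_all` and `modified` are names chosen here, not taken from the script.

```python
import re

# Same modified input as in the Perl script above.
modified = ("MKTSGNQDEILVIRKKRKRRGWKLTINNIRGRIMRGRKGGSKEYWFVLTAENLSWYKDDEEKEKK"
            "YMLSVDNLKLRDVEKGFMSSKHIFALKGR")

def rsearch_all(text, rpattern=r"R.[KR].+?K"):
    """Collect (match, start, end) tuples, scanning from the right end."""
    results = []
    while True:
        m = re.search(rpattern, text[::-1])
        if m is None:
            break
        n = len(text)
        start, end = n - m.end(), n - m.start() - 1
        results.append((text[start:end + 1], start, end))
        # Drop the final letter of this match so the next pass must end earlier;
        # the rest of the match stays available, which allows interleaving.
        text = text[:end]
    return results

for match, start, end in rsearch_all(modified):
    print(f"Found '{match}' from offset {start} to offset {end}")
```

The loop always terminates because each pass shortens the text by at least one character.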