Finding (perfect) palindromes in multiple protein sequences with Perl

Time: 2014-08-27 06:05:39

Tags: regex algorithm perl bioinformatics palindrome

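I am new to Perl (and to regular expressions). I need an example of how to write a program that finds perfect palindromes in multiple protein sequences (say 4 sequences of about 200 amino acids each, stored in a file). I need to pick out the palindromes and report the positions at which they occur in each sequence.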

>TRE|Q47404|Q47404 (409 AA) Glycosyl transferase [Escherichia coli]
MIFDASLKKLRKLFVNPIGFFRDSWFFNSKNKAEELLSPLKIKSKNIFIVAHLGQLKKAE
LFIQKFSRRSNFLIVLATKKNTEMPRLILEQMNKKLFSSYKLLFIPTEPNTFSLKKVIWF
YNVYKYIVLNSKAKDAYFMSYAQHYAIFIWLFKKNNIRCSLIEEGTGTYKTEKKKPLVNI
NFYSWIINSIILFHYPDLKFENVYGTFPNLLKEKFDAKKIFEFKTIPLVKSSTRMDNLIH

>TRE|O06435|O06435 (492 AA) SynE [Neisseria meningitidis]
MLQKIRKALFHPKKFFQDSQWFATPLFSSFAPKSNLFIISTFAQLNQAHSLTKMQKLKNN
LLVILYTTQNMKMPKLIQKSVDKELFSVTYMFELPRKPGIVSPKKFLYIQRGYKKLLKTI
QPAHLYVMSFAGHYSSLLSLAKKMNITTHLVEEGTATYAPLLESFTYKPTKFEQRFVGNN
LHQKGYFDKFDILHVAFPEYAKKIFNANEYHRFFAHSGGISTSQSIAKIQDKYRISQNDY
IFVSQRYPVSDEVYYKTIVETLNQMSLRIEGKIFIKLHPKEMENKNIMSLFLNMVTINPR

>TRE|Q8VRL9|Q8VRL9 (492 AA) SiaD [Neisseria meningitidis]
MLQKIRKALFHPKKFFQDSQWFATPLFSSFAPKSNLFIISTFAQLNQAHSLTKMQKLKNN
LLVILYTTQNMKMPKLIQKSVDKELFSVTYMFELPRKPGIVSPKKFLYIQRGYKKLLKTI
QPAHLYVMSFAGHYSSLLSLAKKMNITTHLVEEGTATYAPLLESFTYKPTKFEQRFVGNN
LHQKGYFDKFDILHVAFPEYAKKIFNANEYHRFFAHSGGISTSQSIAKIQDKYRISQNDY

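I need to output the perfect palindromes in these sequences together with their positions. I have read many articles but could not come up with a good approach. Please suggest some techniques and a program.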

2 answers:

Answer 0 (score: 1)

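This challenge calls for three regular-expression features:

  1. perlretut - Recursive Patterns - to find the palindromes.

  2. perlretut - Positive Lookahead Assertions - to find overlapping matches.

  3. perlretut - Position Information - to determine where each match sits in the string.

Putting these together gives the following program: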

    use strict;
    use warnings;
    
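    # Recursive palindrome pattern. It is meant to be embedded in the capture
    # group used below: (?1) then recurses into that outer group, and \g{-1}
    # refers back to the character just captured by (\w).
    my $pp = qr/(?: (\w) (?1) \g{-1} | \w? )/ix;
    
    # Paragraph mode: each read returns one blank-line-separated record.
    local $/ = '';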
    
    while (<DATA>) {
        chomp;
        my ($header, @lines) = split "\n";
        my $data = join '', @lines;
    
        print "$header\n$data\n";
    
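        # The lookahead consumes nothing, so a match is attempted at every
        # position and overlapping palindromes are found. $-[0] is the
        # 0-based start offset; matches shorter than 3 characters are skipped.
        while ($data =~ /(?=($pp))/g) {
            print "$-[0] - $1\n" if length($1) > 2;
        }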
    }
    
    __DATA__
    >TRE|Q47404|Q47404 (409 AA) Glycosyl transferase [Escherichia coli]
    MIFDASLKKLRKLFVNPIGFFRDSWFFNSKNKAEELLSPLKIKSKNIFIVAHLGQLKKAE
    LFIQKFSRRSNFLIVLATKKNTEMPRLILEQMNKKLFSSYKLLFIPTEPNTFSLKKVIWF
    YNVYKYIVLNSKAKDAYFMSYAQHYAIFIWLFKKNNIRCSLIEEGTGTYKTEKKKPLVNI
    NFYSWIINSIILFHYPDLKFENVYGTFPNLLKEKFDAKKIFEFKTIPLVKSSTRMDNLIH
    
    >TRE|O06435|O06435 (492 AA) SynE [Neisseria meningitidis]
    MLQKIRKALFHPKKFFQDSQWFATPLFSSFAPKSNLFIISTFAQLNQAHSLTKMQKLKNN
    LLVILYTTQNMKMPKLIQKSVDKELFSVTYMFELPRKPGIVSPKKFLYIQRGYKKLLKTI
    QPAHLYVMSFAGHYSSLLSLAKKMNITTHLVEEGTATYAPLLESFTYKPTKFEQRFVGNN
    LHQKGYFDKFDILHVAFPEYAKKIFNANEYHRFFAHSGGISTSQSIAKIQDKYRISQNDY
    IFVSQRYPVSDEVYYKTIVETLNQMSLRIEGKIFIKLHPKEMENKNIMSLFLNMVTINPR
    
    >TRE|Q8VRL9|Q8VRL9 (492 AA) SiaD [Neisseria meningitidis]
    MLQKIRKALFHPKKFFQDSQWFATPLFSSFAPKSNLFIISTFAQLNQAHSLTKMQKLKNN
    LLVILYTTQNMKMPKLIQKSVDKELFSVTYMFELPRKPGIVSPKKFLYIQRGYKKLLKTI
    QPAHLYVMSFAGHYSSLLSLAKKMNITTHLVEEGTATYAPLLESFTYKPTKFEQRFVGNN
    LHQKGYFDKFDILHVAFPEYAKKIFNANEYHRFFAHSGGISTSQSIAKIQDKYRISQNDY
    

Output (positions are 0-based offsets into the joined sequence):

    >TRE|Q47404|Q47404 (409 AA) Glycosyl transferase [Escherichia coli]
    MIFDASLKKLRKLFVNPIGFFRDSWFFNSKNKAEELLSPLKIKSKNIFIVAHLGQLKKAELFIQKFSRRSNFLIVLATKKNTEMPRLILEQMNKKLFSSYKLLFIPTEPNTFSLKKVIWFYNVYKYIVLNSKAKDAYFMSYAQHYAIFIWLFKKNNIRCSLIEEGTGTYKTEKKKPLVNINFYSWIINSIILFHYPDLKFENVYGTFPNLLKEKFDAKKIFEFKTIPLVKSSTRMDNLIH
    6 - LKKL
    29 - KNK
    40 - KIK
    42 - KSK
    46 - IFI
    66 - SRRS
    86 - LIL
    123 - YKY
    131 - KAK
    146 - IFI
    164 - GTG
    165 - TGT
    172 - KKK
    178 - NIN
    211 - KEK
    220 - FEF
    >TRE|O06435|O06435 (492 AA) SynE [Neisseria meningitidis]
    MLQKIRKALFHPKKFFQDSQWFATPLFSSFAPKSNLFIISTFAQLNQAHSLTKMQKLKNNLLVILYTTQNMKMPKLIQKSVDKELFSVTYMFELPRKPGIVSPKKFLYIQRGYKKLLKTIQPAHLYVMSFAGHYSSLLSLAKKMNITTHLVEEGTATYAPLLESFTYKPTKFEQRFVGNNLHQKGYFDKFDILHVAFPEYAKKIFNANEYHRFFAHSGGISTSQSIAKIQDKYRISQNDYIFVSQRYPVSDEVYYKTIVETLNQMSLRIEGKIFIKLHPKEMENKNIMSLFLNMVTINPR
    26 - FSSF
    55 - KLK
    70 - MKM
    114 - KLLK
    135 - SLLS
    137 - LSL
    154 - TAT
    205 - NAN
    220 - STS
    222 - SQS
    271 - KIFIK
    272 - IFI
    280 - EME
    283 - NKN
    289 - LFL
    >TRE|Q8VRL9|Q8VRL9 (492 AA) SiaD [Neisseria meningitidis]
    MLQKIRKALFHPKKFFQDSQWFATPLFSSFAPKSNLFIISTFAQLNQAHSLTKMQKLKNNLLVILYTTQNMKMPKLIQKSVDKELFSVTYMFELPRKPGIVSPKKFLYIQRGYKKLLKTIQPAHLYVMSFAGHYSSLLSLAKKMNITTHLVEEGTATYAPLLESFTYKPTKFEQRFVGNNLHQKGYFDKFDILHVAFPEYAKKIFNANEYHRFFAHSGGISTSQSIAKIQDKYRISQNDY
    26 - FSSF
    55 - KLK
    70 - MKM
    114 - KLLK
    135 - SLLS
    137 - LSL
    154 - TAT
    205 - NAN
    220 - STS
    222 - SQS
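
The lookahead and position features listed in this answer can also be seen in isolation. The following is only a minimal sketch in Python (used because the second answer is in Python). Python's `re` module has no recursive patterns, so the pattern is deliberately limited to three-residue palindromes, and `three_letter_palindromes` is a hypothetical helper name, not part of the answer's program.

```python
import re


def three_letter_palindromes(seq):
    """Return (start, text) for every 3-character palindrome, overlaps included."""
    # (?=...) consumes nothing, so the scan advances one character at a time
    # and overlapping hits such as GTG / TGT are both reported.
    pattern = re.compile(r"(?=((\w)\w\2))")
    return [(m.start(), m.group(1)) for m in pattern.finditer(seq)]


print(three_letter_palindromes("EEGTGTYK"))
```

Without the lookahead, a pattern such as `(\w)\w\1` would consume `GTG` and resume scanning after it, so the overlapping `TGT` would be missed.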
    

Answer 1 (score: 0)

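    x = "abaasdasdusduhfikliilkjhgjhgjhgh"

    def checkpalindrome(s, i):
        # Report substrings longer than two characters that equal their reverse.
        if len(s) > 2 and s == s[::-1]:
            print(i, ":", s)

    # From each start index i, grow the substring one character at a time
    # and test every intermediate result.
    for i in range(len(x)):
        s = ""
        for k in range(i, len(x)):
            s += x[k]
            checkpalindrome(s, i)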

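This builds every substring of the input and passes each one, along with its start index, to the palindrome-checking function.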
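
To apply the same brute-force check to a file of several sequences, the records have to be split into header and sequence first. The sketch below is one possible way to do that, assuming FASTA-style input in which each record starts with a `>` header line. The names `parse_fasta` and `find_palindromes`, the `min_len` parameter, and the sample records are all made up for illustration.

```python
def parse_fasta(text):
    """Yield (header, sequence) pairs from FASTA-formatted text."""
    header, chunks = None, []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line, []
        elif line and header is not None:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def find_palindromes(seq, min_len=3):
    """Return (start, substring) for every palindromic substring of at least min_len."""
    hits = []
    for start in range(len(seq)):
        for end in range(start + min_len, len(seq) + 1):
            piece = seq[start:end]
            if piece == piece[::-1]:
                hits.append((start, piece))
    return hits


# Hypothetical sample records, not taken from the question's data file.
sample = """>seq1 hypothetical record
MIFDASLKKL
RKLF
>seq2 hypothetical record
GKIFIKL
"""

for header, seq in parse_fasta(sample):
    print(header)
    for start, piece in find_palindromes(seq):
        print(f"{start} - {piece}")
```

Unlike the regex answer, which prints at most one match per start position, this version reports every palindromic substring at each position. Checking every substring is slow for long inputs but should be acceptable for sequences of a few hundred residues.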