Find multiple substrings in a string and record their locations

Time: 2009-05-07 12:43:43

Tags: perl

The following is a script that finds consecutive substrings in a string.

use strict;
use warnings;

my $file="Sample.txt";
open(DAT, $file) || die("Could not open file!");

#worry about these later
#my $regexp1 = "motif1";
#my $regexp2 = "motif2";
#my $regexp3 = "motif3";
#my $regexp4 = "motif4";

my $sequence;

while (my $line = <DAT>) {
    if ($line =~ /(HDWFLSFKD)/g) {
        {
            print "its found index location: ",
                pos($line), "-", pos($line)+length($1), "\n";
        }
        if ($line =~ /(HD)/g) {
            print "motif found and its locations is: \n";
            pos($line), "-", pos($line)+length($1), "\n\n";
        }
        if ($line =~ /(K)/g) {
            print "motif found and its location is: \n";
            pos($line), "-", pos($line)+length($1), "\n\n";
        }
        if ($line =~ /(DD)/g) {
            print "motif found and its location is: \n";
            pos($line), "-", pos($line)+length($1), "\n\n";
        }
    } else {
        $sequence .= $line;
        print "came in else\n";
    }
}

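It matches substring1 against the string and prints the position where substring1 matched. The problem lies in finding the remaining substrings. For substring2, the search starts again from the beginning of the string instead of from the position where substring1 was found. Every time a position is calculated, it is counted from the beginning of the string rather than from the position of the previously found substring. Since the substrings are consecutive (substring1, substring2, substring3, substring4), each one must appear after the previous one.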

5 Answers:

Answer 0: (score: 2)

Try this Perl program

use strict;
use warnings;
use feature qw'say';

my $file="Sample.txt";
open( my $dat, '<', $file ) or die "Could not open '$file': $!";

my @regex = qw(
  HDWFLSFKD
  HD
  K
  DD
);

my $sequence;

while( my $line = <$dat> ){
  chomp $line;

  say 'Line: ', $.;

  # reset the position of variable $line
  # pos is an lvalue subroutine
  pos $line = 0;

  for my $regex ( @regex ){
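    # quote into a copy: $regex is an alias for the element of @regex,
    # so assigning to it would re-escape the pattern on every input line
    my $quoted = quotemeta $regex;

    if( scalar $line =~ / \G (.*?) ($quoted) /xg ){
      say $regex, ' found at location (', $-[2], '-', $+[2], ')';
      if( length $1 ){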
        say "    but skipped: \"$1\" at location ($-[1]-$+[1])";
      }
    }else{
      say 'Unable to find ', $regex;

      # end loop
      last;
    }
  }
}

Answer 1: (score: 1)

You really should read

If you need the positions, you need the special variables @- and @+. There is no need to try to compute them yourself.

#!/usr/bin/perl

use strict;
use warnings;

use List::MoreUtils qw( each_array );

my $source = 'AAAA   BBCCC   DD  E      FFFFF';
my $pattern = join '\s*', map { "($_+)" } qw( A B C D E F );



if ( $source =~ /$pattern/ ) {
    my $it = each_array @-, @+;

    $it->(); # discard overall match information;

    while ( my ($start, $end) = $it->() ) {
        printf "Start: %d - Length: %d\n", $start, $end - $start;
    }
}

Output:

Start: 0 - Length: 4
Start: 7 - Length: 2
Start: 9 - Length: 3
Start: 15 - Length: 2
Start: 19 - Length: 1
Start: 26 - Length: 5
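
The same idea can be applied to the motifs from the question. What follows is only a sketch under stated assumptions: the helper name find_in_order and the sample line are made up for illustration, and the motifs are joined with a non-greedy .*? so that each one has to occur somewhere after the previous one.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the original answer): returns one
# [start, end] pair per motif, or an empty list if the motifs do not
# all occur in the given order.
sub find_in_order {
    my ($string, @motifs) = @_;

    # One capture group per motif, separated by a non-greedy gap.
    my $pattern = join '.*?', map { '(' . quotemeta($_) . ')' } @motifs;

    return unless $string =~ /$pattern/;

    # Index 0 of @- and @+ describes the whole match; groups start at 1.
    return map { [ $-[$_], $+[$_] ] } 1 .. scalar @motifs;
}

# Made-up sample line, not taken from the question's data file.
my $line = 'MLTSHDWFLSFKDSNHDSKQDDIF';

for my $span ( find_in_order( $line, qw( HDWFLSFKD HD K DD ) ) ) {
    printf "found at %d-%d: %s\n", @$span,
        substr( $line, $span->[0], $span->[1] - $span->[0] );
}
```

Because everything is matched by a single regex, the offsets in @- and @+ all refer to the same string and no manual bookkeeping with pos() is needed.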

Answer 2: (score: 1)

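I am not a Perl expert, but you can use the @- and @+ arrays (as $-[0] and $+[0]) to track the index position of the last regex match found.
Below is code built on top of your code that illustrates this.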

use strict;
use warnings;


my $file="sample.txt";
open(DAT, $file) || die("Could not open file!");

open (OUTPUTFILE, '>data.txt');

my $sequence;
my $someVar = 0;
my $sequenceNums = 1;

my $motif1 = "(HDWFLSFKD)";
my $motif2 = "(HD)";
my $motif3 = "(K)";
my $motif4 = "(DD)";

while (my $line = <DAT>) 
{
    $someVar = 0;
    print "\nSequence $sequenceNums: $line\n";
    print OUTPUTFILE "\nSequence $sequenceNums: $line\n";
        if ($line=~ /$motif1/g)
        {
                &printStuff($sequenceNums, "motif1", $motif1, "$-[0]-$+[0]");
                $someVar = 1;
        }


        if ($line=~ /$motif2/g and $someVar == 1)
        {
                &printStuff($sequenceNums, "motif2", $motif2, "$-[0]-$+[0]");
                $someVar = 2;
        }

        if ($line=~ /$motif3/g and $someVar == 2)
        {
                &printStuff($sequenceNums, "motif3", $motif3, "$-[0]-$+[0]");
                $someVar = 3;
        }

        if ($line=~ /$motif4/g and $someVar == 3)
        {
                &printStuff($sequenceNums, "motif4", $motif4, "$-[0]-$+[0]");
        }

        else 
        {
            $sequence .= $line;

            if ($someVar == 0)
            {
                &printWrongStuff($sequenceNums, "motif1", $motif1);
            }
            elsif ($someVar == 1)
            {
            &printWrongStuff($sequenceNums, "motif2", $motif2);
            }
            elsif ($someVar == 2)
            {
            &printWrongStuff($sequenceNums, "motif3", $motif3);
            }
            elsif ($someVar == 3)
            {
            &printWrongStuff($sequenceNums, "motif4", $motif4);
            }
        }
        $sequenceNums++;
}

sub printStuff
{
            print "Sequence: $_[0] $_[1]: $_[2] index location: $_[3] \n";
            print OUTPUTFILE "Sequence: $_[0]  $_[1]: $_[2] index location: $_[3]\n";
}

sub printWrongStuff
{
            print "Sequence: $_[0] $_[1]: $_[2] was not found\n";
            print OUTPUTFILE "Sequence: $_[0] $_[1]: $_[2] was not found\n";    

}

close (OUTPUTFILE);
close (DAT);

Sample input:

MLTSHQKKF HDWFLSFKD SNNYN HD S K QNHSIK DD IFNRFNHYIYNDLGIRTIA
MLTSHQKKFSNNYNSKQNHSIKDIFNRFNHYIYNDLGIRTIA
MLTSHQKKFSNNYNSK HDWFLSFKD QNHSIKDIFNRFNHYIYNDL

Answer 3: (score: 0)

The result of a construct like
$line=~ /(HD)/g

is a list (when evaluated in list context). Use while to step through the hits one at a time.
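
A minimal sketch of that loop follows, assuming a literal motif; the helper name all_hits and the sample line are invented for illustration. In list context m//g returns all matches at once, whereas in the condition of a while loop it is evaluated in scalar context and yields one match per iteration.

```perl
use strict;
use warnings;

# Hypothetical helper: collects the [start, end] offsets of every
# occurrence of a literal motif in a string.
sub all_hits {
    my ($string, $motif) = @_;
    my @hits;

    # In scalar context, each evaluation of m//g resumes at pos($string);
    # the loop ends when no further match is found.
    while ( $string =~ /\Q$motif\E/g ) {
        push @hits, [ $-[0], $+[0] ];
    }
    return @hits;
}

# Made-up sample line for illustration.
my $line = 'AHDBBHDCHD';
print "HD at $_->[0]-$_->[1]\n" for all_hits( $line, 'HD' );
```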

Answer 4: (score: 0)

To match from where the last match left off, use \G. perldoc perlre says (but check the manual for your own installed version first):

  

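The "\G" assertion can be used to chain global matches (using "m//g"), as described in "Regexp Quote-Like Operators" in perlop. It is also useful when writing "lex"-like scanners, when you have several patterns that you want to match against consequent substrings of your string, see the previous reference. The actual location where "\G" will match can also be influenced by using "pos()" as an lvalue: see "pos" in perlfunc. Note that the rule for zero-length matches is modified somewhat, in that contents to the left of "\G" is not counted when determining the length of the match. Thus the following will not match forever: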

$str = 'ABC';
pos($str) = 1;
while (/.\G/g) {
    print $&;
}
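
The interaction between pos() as an lvalue and \G can be sketched as follows. This is an illustration only: the helper name match_at and the sample string are assumptions, not part of perlre or of the original answer.

```perl
use strict;
use warnings;

# Hypothetical helper: tries to match a literal motif exactly at the
# given offset and returns the end offset, or undef if it is not there.
sub match_at {
    my ($string, $offset, $motif) = @_;

    pos($string) = $offset;    # pos() is an lvalue: sets where \G anchors

    # The condition of ?: is scalar context, so m//g updates pos($string).
    return $string =~ /\G\Q$motif\E/g ? pos($string) : undef;
}

my $str = 'ABCDEF';            # made-up sample string
my $end = match_at( $str, 2, 'CD' );
print defined $end
    ? "CD matched, next search starts at $end\n"
    : "CD is not at offset 2\n";
```

Anchoring with \G means the motif must begin exactly at the stored position; dropping \G from the pattern would instead find the first occurrence at or after that position.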