Finding multiple substrings in a string and recording their positions

Question

Below is a script that looks for consecutive substrings in a string.

use strict;
use warnings;

my $file="Sample.txt";
open(DAT, $file) || die("Could not open file!");

#worry about these later
#my $regexp1 = "motif1";
#my $regexp2 = "motif2";
#my $regexp3 = "motif3";
#my $regexp4 = "motif4";

my $sequence;

while (my $line = <DAT>) {
    if ($line=~ /(HDWFLSFKD)/g){
        {
            print "its found index location: ",
            pos($line), "-",  pos($line)+length($1), "\n";
        }
        if ($line=~ /(HD)/g){
            print "motif found and its locations is: \n";
            pos($line), "-", pos($line)+length($1), "\n\n";
        }
        if ($line=~ /(K)/g){
            print "motif found and its location is: \n";
            pos($line), "-",pos($line)+length($1), "\n\n";
        }
        if ($line=~ /(DD)/g){
            print "motif found and its location is: \n";
            pos($line), "-", pos($line)+length($1), "\n\n";
        }
    }else {
        $sequence .= $line;
        print "came in else\n";
    }
}

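It matches substring1 against the string and prints the position where substring1 matched. The problem is finding the remaining substrings. For substring2, the search starts again from the beginning of the string instead of from the position where substring1 was found. In other words, every time a position is computed, the search starts at the beginning of the string rather than at the position of the previously found substring. Because the substrings are consecutive (substring1, substring2, substring3, substring4), each one must be found after the previous one.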
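
A note on the positions printed by the script above: after a successful m//g match in scalar context, pos($line) is the offset just past the end of the match, not its start, so pos($line)+length($1) overshoots. The following is a minimal sketch of deriving the start and end instead. The helper name match_span and the sample string are made up for illustration and are not part of the original script.

```perl
use strict;
use warnings;

# Hypothetical helper: returns (start, end) of the next match of $motif in
# $$line_ref, continuing from the string's current pos(); empty list if none.
sub match_span {
    my ($line_ref, $motif) = @_;
    if ( $$line_ref =~ /(\Q$motif\E)/g ) {
        my $end   = pos($$line_ref);      # offset just past the match
        my $start = $end - length($1);    # so the start is end minus length
        return ($start, $end);
    }
    return;
}

my $line = 'MLTSHQKKFHDWFLSFKDSNNYNHDSKQNHSIKDD';   # made-up sample
for my $motif (qw(HDWFLSFKD HD K DD)) {
    # stop at the first motif that cannot be found after the previous one
    my ($start, $end) = match_span(\$line, $motif) or last;
    print "$motif: $start-$end\n";
}
```

Because pos() is stored on the string itself, each call continues where the previous successful match ended, which is the behaviour the question asks for.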

Answer 1

Try this Perl program:

use strict;
use warnings;
use feature qw'say';

my $file="Sample.txt";
open( my $dat, '<', $file) || die("Could not open file!");

my @regex = qw(
  HDWFLSFKD
  HD
  K
  DD
);

my $sequence;

while( my $line = <$dat> ){
  chomp $line;

  say 'Line: ', $.;

  # reset the position of variable $line
  # pos is an lvalue subroutine
  pos $line = 0;

  for my $regex ( @regex ){
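    # \Q...\E escapes metacharacters at interpolation time, so the
    # elements of @regex (aliased by $regex) are not modified in place
    if( scalar $line =~ / \G (.*?) (\Q$regex\E) /xg ){
      say $regex, ' found at location (', $-[2], '-', $+[2], ')';
      if( length $1 ){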
        say "    but skipped: \"$1\" at location ($-[1]-$+[1])";
      }
    }else{
      say 'Unable to find ', $regex;

      # end loop
      last;
    }
  }
}

Answer 2

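You really should read the documentation.

If you need the positions, you need the special variables @- and @+. There is no need to compute them yourself.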

#!/usr/bin/perl

use strict;
use warnings;

use List::MoreUtils qw( each_array );

my $source = 'AAAA   BBCCC   DD  E      FFFFF';
my $pattern = join '\s*', map { "($_+)" } qw( A B C D E F );



if ( $source =~ /$pattern/ ) {
    my $it = each_array @-, @+;

    $it->(); # discard overall match information;

    while ( my ($start, $end) = $it->() ) {
        printf "Start: %d - Length: %d\n", $start, $end - $start;
    }
}

Output:

Start: 0 - Length: 4
Start: 7 - Length: 2
Start: 9 - Length: 3
Start: 15 - Length: 2
Start: 19 - Length: 1
Start: 26 - Length: 5
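
The same idea can be applied to the motifs from the question. This is only a sketch: the helper name ordered_spans, the '.*?' separator between motifs, and the sample string are assumptions for illustration, not something taken from the answer above.

```perl
use strict;
use warnings;

# Hypothetical helper: match all motifs in order with one regex and return
# a list of [start, end] pairs, one per motif (empty list if no match).
sub ordered_spans {
    my ($line, @motifs) = @_;
    # One capture group per motif, with lazily skipped text in between
    my $pattern = join '.*?', map { '(' . quotemeta($_) . ')' } @motifs;
    return unless $line =~ /$pattern/;
    # Index 0 of @- and @+ describes the whole match; groups start at 1
    return map { [ $-[$_], $+[$_] ] } 1 .. scalar @motifs;
}

my $line = 'MLTSHQKKFHDWFLSFKDSNNYNHDSKQNHSIKDD';   # made-up sample
for my $span ( ordered_spans($line, qw(HDWFLSFKD HD K DD)) ) {
    printf "Start: %d - End: %d\n", @$span;
}
```

One design difference from matching the motifs one at a time: a single combined pattern lets the regex engine backtrack, so it may pick a later occurrence of an early motif if that is the only way for the remaining motifs to match.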

Answer 3

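I'm not a Perl expert, but you can use the @- and @+ arrays (as $-[0] and $+[0]) to track the index position of the last regex match found.
Below is code built on top of your code to illustrate this.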

use strict;
use warnings;


my $file="sample.txt";
open(DAT, $file) || die("Could not open file!");

open (OUTPUTFILE, '>data.txt') || die("Could not open output file!");

my $sequence;
my $someVar = 0;
my $sequenceNums = 1;

my $motif1 = "(HDWFLSFKD)";
my $motif2 = "(HD)";
my $motif3 = "(K)";
my $motif4 = "(DD)";

while (my $line = <DAT>) 
{
    $someVar = 0;
    print "\nSequence $sequenceNums: $line\n";
    print OUTPUTFILE "\nSequence $sequenceNums: $line\n";
        if ($line=~ /$motif1/g)
        {
                &printStuff($sequenceNums, "motif1", $motif1, "$-[0]-$+[0]");
                $someVar = 1;
        }


        if ($line=~ /$motif2/g and $someVar == 1)
        {
                &printStuff($sequenceNums, "motif2", $motif2, "$-[0]-$+[0]");
                $someVar = 2;
        }

        if ($line=~ /$motif3/g and $someVar == 2)
        {
                &printStuff($sequenceNums, "motif3", $motif3, "$-[0]-$+[0]");
                $someVar = 3;
        }

        if ($line=~ /$motif4/g and $someVar == 3)
        {
                &printStuff($sequenceNums, "motif4", $motif4, "$-[0]-$+[0]");
        }

        else 
        {
            $sequence .= $line;

            if ($someVar == 0)
            {
                &printWrongStuff($sequenceNums, "motif1", $motif1);
            }
            elsif ($someVar == 1)
            {
            &printWrongStuff($sequenceNums, "motif2", $motif2);
            }
            elsif ($someVar == 2)
            {
            &printWrongStuff($sequenceNums, "motif3", $motif3);
            }
            elsif ($someVar == 3)
            {
            &printWrongStuff($sequenceNums, "motif4", $motif4);
            }
        }
        $sequenceNums++;
}

sub printStuff
{
            print "Sequence: $_[0] $_[1]: $_[2] index location: $_[3] \n";
            print OUTPUTFILE "Sequence: $_[0]  $_[1]: $_[2] index location: $_[3]\n";
}

sub printWrongStuff
{
            print "Sequence: $_[0] $_[1]: $_[2] was not found\n";
            print OUTPUTFILE "Sequence: $_[0] $_[1]: $_[2] was not found\n";    

}

close (OUTPUTFILE);
close (DAT);

Sample input:

MLTSHQKKFHDWFLSFKDSNNYNHDSKQNHSIKDDIFNRFNHYIYNDLGIRTIA
MLTSHQKKFSNNYNSKQNHSIKDIFNRFNHYIYNDLGIRTIA
MLTSHQKKFSNNYNSKHDWFLSFKDQNHSIKDIFNRFNHYIYNDL

Answer 4

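In list context, the result of a construct like

$line=~ /(HD)/g

is a list of all matches. In scalar context it finds one match per evaluation, so use while to step through the hits.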
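
A minimal sketch of stepping through the hits with while, recording each position from @- and @+. The helper name all_hits and the sample string are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns "start-end" for every occurrence of $motif.
sub all_hits {
    my ($line, $motif) = @_;
    my @hits;
    # In scalar context, each m//g evaluation finds the next occurrence
    while ( $line =~ /\Q$motif\E/g ) {
        push @hits, "$-[0]-$+[0]";
    }
    return @hits;
}

print join(', ', all_hits('MLTSHQKKFHDWFLSFKDSNNYNHDSK', 'HD')), "\n";
```

The loop ends on its own because a failed m//g match returns false and resets pos().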

Answer 5

To match at the position where the last match left off, use \G. perldoc perlre says (but consult the manual for your own installed version first):

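The "\G" assertion can be used to chain global matches (using "m//g"), as described in "Regexp Quote-Like Operators" in perlop. It is also useful when writing "lex"-like scanners, when you have several patterns that you want to match against consequent substrings of your string; see the previous reference. The actual location where "\G" will match can also be influenced by using "pos()" as an lvalue: see "pos" in perlfunc. Note that the rule for zero-length matches is modified somewhat, in that contents to the left of "\G" is not counted when determining the length of the match. Thus the following will not match forever: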
my $str = 'ABC';
pos($str) = 1;
# bind the match to $str explicitly; a bare /.\G/g would match against $_
while ($str =~ /.\G/g) {
    print $&;
}
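
To make the role of \G concrete, here is a small sketch contrasting an anchored and an unanchored m//g match after setting pos() by hand. The sample string and variable names are made up for illustration. The /c modifier is used so that a failed match does not reset pos().

```perl
use strict;
use warnings;

my $str = 'HDWFLSFKDSNNYNHD';   # made-up sample

pos($str) = 9;                  # pretend an earlier match ended at offset 9

# Anchored: \G demands that the motif begin exactly at pos(), so this fails.
# The /c modifier keeps pos() at 9 instead of resetting it on failure.
my $anchored = $str =~ /\GHD/gc ? 1 : 0;
my $pos_after_fail = pos($str);

# Unanchored: without \G the search scans forward from pos() and finds the
# HD at offsets 14-16.
my $scanned = $str =~ /HD/gc ? "$-[0]-$+[0]" : 'none';

print "anchored=$anchored pos=$pos_after_fail scanned=$scanned\n";
```

For the question's use case, where other residues may sit between motifs, either the unanchored form or a pattern such as \G.*? followed by the motif is the one that fits; a bare \G is for cases where the next motif must follow immediately.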
