Question

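I am using Perl to search for a particular string in a file that lists different sequences under different headings. I was able to write a script for the case where there is only one sequence, i.e. one heading, but I am unable to extrapolate it to several. Suppose I need to search for some string such as "FSFSD" in a given file. I am unable to search when the file has content like the following: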

Polons
  CACAGTGCTACGATCGATCGATDDASD
  HCAYCHAYCHAYCAYCSDHADASDSADASD
Seliems
  FJDSKLFJSLKFJKASFJLAKJDSADAK
  DASDNJASDKJASDJDSDJHAJDASDASDASDSAD
Teerag
  DFAKJASKDJASKDJADJLLKJ
  SADSKADJALKDJSKJDLJKLK

I am able to search when the file has a single heading, for example:

Terran
  FDKFJSKFJKSAFJALKFJLLJ
  DKDJKASJDKSADJALKJLJKL
  DJKSAFDHAKJFHAFHFJHAJJ

I need the output in the form "String xyz found under heading abc".

The code I am using is:

print "Input the file name \n";
$protein= <STDIN>;
chomp $protein;
unless (open (protein, $protein))
{
print "cant open file \n\n";
exit;
}
@prot= <protein>;
close protein;
$newprotein=join("",@prot);
$protein=~s/\s//g;
do{
print "enter the motif to be searched \n";
$motif= <STDIN>;
chomp $motif;
if ($protein =~ /motif/)
{
print "found motif \n\n";
}
else{
print "not found \n\n";
}
}
until ($motif=~/^\s*$/);
exit;

Answer 1

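Looking at your code, I would like to make a few suggestions without actually answering your question:

Always, always use strict;. For the love of whatever higher power you may (or may not) believe in, use strict;.
Every time you use strict;, you should also use warnings;.
Also, please seriously consider using some indentation.
Also, consider using obviously different names for different variables.
Lastly, your style is really inconsistent. Is this all your own code, or did you patch it together? I am not trying to insult you or anything, but I recommend against copying code you do not understand - at least try to understand it before you copy it.

Now, here is a more readable version of your code, including some fixes and some guesses at what you might have wanted to do: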

use strict;
use warnings;

print "Input the file name:\n";
my $filename = <STDIN>;
chomp $filename;
# lexical filehandle; $! reports why the open failed
open my $fh, "<", $filename or die "Can't open file '$filename': $!\n";
my $newprotein = join "", <$fh>;
close $fh;
$newprotein =~ s/\s//g;
while(1) {
  print "enter the motif to be searched:\n";
  my $motif = <STDIN>;
  last if $motif =~ /^\s*$/;
  chomp $motif;
  # here I might even use the ternary ?: operator, but whatever
  if ($newprotein =~ /$motif/) {
    print "found motif\n\n";
  }
  else {
    print "not found\n\n";
  }
}

Answer 2

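The main problem is how to distinguish headings from data. From your example I assume a line is a heading if and only if it contains a lowercase letter.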

use strict;
use warnings;
print "Enter the motif to be searched \n";
my $motif = <STDIN>;
chomp($motif);
my $header;
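# <> reads the file(s) named on the command line, one line at a time
while (<>) {
    chomp;                  # drop the newline so the heading prints cleanly
    if(/[a-z]/) {
        s/^\s+//;           # drop any leading indentation from the heading
        $header = $_;
        next;
    }
    # \Q...\E matches the motif literally, even if it contains regex metacharacters
    if (/\Q$motif\E/) {
        print "Found $motif under header $header\n";
        exit;
    }
}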
print "$motif not found\n";

Answer 3

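So you are saying that you can read one line and do this task, but you cannot do the same when the file has multiple lines?

Just loop and read the file line by line.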

my $data_file = "yourfilename.txt";
open(my $dat, '<', $data_file) or die("Could not open file '$data_file': $!");
while (my $line = <$dat>)
{
    # The same commands you run for one 'heading' go here; $line holds one line of the file.
}
close($dat);
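
The loop body above is left open. A minimal sketch of what could go there, assuming (as the sample data in the question suggests) that a heading is any line containing a lowercase letter. The helper name find_heading and the in-memory sample are made up for illustration. Note that it checks one line at a time, so a motif that is split across two lines of the same sequence would be missed.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the heading under which $motif first appears
# in the lines read from $fh, or undef if it never appears.
# Assumption: a heading is any line that contains a lowercase letter.
sub find_heading {
    my ($fh, $motif) = @_;
    my $heading = '';
    while (my $line = <$fh>) {
        chomp $line;
        $line =~ s/^\s+|\s+$//g;      # strip leading/trailing whitespace
        next if $line eq '';
        if ($line =~ /[a-z]/) {       # heading line: remember it
            $heading = $line;
            next;
        }
        # index() does a literal substring search, no regex involved
        return $heading if index($line, $motif) >= 0;
    }
    return;
}

# Usage: read a small sample through an in-memory filehandle.
my $sample = "Polons\n  CACAGTGC\nSeliems\n  FJDFSFSDK\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my $found = find_heading($fh, 'FSFSD');
print defined $found
    ? "String FSFSD found under heading $found\n"
    : "String FSFSD not found\n";
```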

Answer 4

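Edit: the example you posted has no clear delimiter; you need an unambiguous way to tell headings apart from sequences. You could use multiple line breaks or a non-alphanumeric character such as ",". Whatever you choose, let WHITESPACE in the following code stand for the delimiter you picked. If you have to keep the format you are using now, you will need to change the grammar below to ignore whitespace and delimit by letter case instead (which makes it slightly more complex).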

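The simple way (O(n^2)?) is to split the file on the whitespace delimiter, which gives an array of alternating headings and sequences (heading[i] = split_array[i * 2], sequence[i] = split_array[i * 2 + 1]). Then run the regex against each sequence.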
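
A minimal Perl sketch of the split approach, under the assumption that the file has been reformatted so that headings and sequences strictly alternate and each sequence is a single whitespace-free token. The helper name search_blocks and the sample text are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the headings whose sequence contains $motif.
# Assumption: $text alternates heading, sequence, heading, sequence, ...
# separated by whitespace, with no whitespace inside a sequence.
sub search_blocks {
    my ($text, $motif) = @_;
    my @split_array = split ' ', $text;   # split on runs of whitespace
    my @hits;
    for (my $i = 0; 2 * $i + 1 < @split_array; $i++) {
        my $heading  = $split_array[2 * $i];
        my $sequence = $split_array[2 * $i + 1];
        # \Q...\E makes the motif match literally
        push @hits, $heading if $sequence =~ /\Q$motif\E/;
    }
    return @hits;
}

my $text = "Polons CACAGTGC\nSeliems FJDFSFSDK\nTeerag DFAKJ\n";
print "String FSFSD found under heading $_\n"
    for search_blocks($text, 'FSFSD');
```

This does not work on the question's original layout, where a sequence spans several lines; those lines would first have to be joined per heading.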

The slightly harder way (O(n)), given a BNF grammar such as:

file: block
    | file block
    ;

block: heading sequence

heading: [A-Z][a-z]

sequence: [A-Z][a-z]

try recursive descent parsing (pseudocode, I do not know Perl):

GLOBAL sequenceHeading, sequenceCount
GLOBAL substringLength = 5
GLOBAL substring = "FSFSD"

FUNC file ()
    WHILE nextChar() != EOF
        block()
        printf ( "%d substrings in %s", sequenceCount, sequenceHeading )
    END WHILE
END FUNC

FUNC block ()
    heading()
    sequence()
END FUNC

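FUNC heading ()
    tempHeading = ""
    LOOP
        in = popChar()
        IF in == WHITESPACE OR in == EOF
            sequenceHeading = tempHeading
            RETURN
        END IF
        tempHeading &= in
    END LOOP
END FUNC

// NOTE: this naive matcher restarts from scratch after a mismatch, so it can
// miss a match that begins inside a failed partial match.
FUNC sequence ()
    count = 0
    i = 0
    LOOP
        in = popChar()
        IF in == WHITESPACE OR in == EOF
            sequenceCount = count
            RETURN
        END IF
        IF in == substring[i]
            i++
            IF i == substringLength
                count++
                i = 0
            END IF
        ELSE
            i = 0
        END IF
    END LOOP
END FUNC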

For more details on recursive descent parsing, see Let's Build a Compiler or Wikipedia.
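
For comparison, here is a rough Perl sketch of the same single-pass idea. It is not a literal recursive descent parser: it reads line by line, follows the heading/sequence block structure, and lets the regex engine do the counting. It assumes a heading is any line containing a lowercase letter (the case-based rule mentioned above); the helper name count_motif and the sample data are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: returns a list of [heading, count] pairs in file order.
# Assumptions: a heading is a line containing a lowercase letter; every other
# non-blank line belongs to the sequence of the most recent heading.
sub count_motif {
    my ($fh, $motif) = @_;
    my (@results, $heading, $sequence);

    # Close the current block: count non-overlapping matches and record them.
    my $flush = sub {
        return unless defined $heading;
        my $count = () = $sequence =~ /\Q$motif\E/g;
        push @results, [$heading, $count];
    };

    while (my $line = <$fh>) {
        $line =~ s/^\s+|\s+$//g;          # trim indentation and the newline
        next if $line eq '';
        if ($line =~ /[a-z]/) {           # heading: start a new block
            $flush->();
            ($heading, $sequence) = ($line, '');
        }
        elsif (defined $heading) {
            $sequence .= $line;           # a sequence may span several lines
        }
    }
    $flush->();                           # the last block has no following heading
    return @results;
}

my $sample = "Polons\n  CACAFS\n  FSDGT\nSeliems\n  FJDSKL\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
printf "%d substrings in %s\n", $_->[1], $_->[0]
    for count_motif($fh, 'FSFSD');
```

Because the lines of each sequence are joined before matching, a motif that straddles a line break is still found.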

Answer 5

use strict;
use warnings;
use autodie qw'open';

my($filename,$motif) = @ARGV;

if( @ARGV < 1 ){
  print "Please enter file name:\n";
  $filename = <STDIN>;
  chomp $filename;
}

if( @ARGV < 2 ){
  print "Please enter motif:\n";
  $motif = <STDIN>;
  chomp $motif;
}

my %data;

# fill in %data;
{
  open my $file, '<', $filename;

  my $heading;
  while( my $line = <$file> ){
    chomp $line;
    if( $line ne uc $line ){
      $heading = $line;
      next;
    }
    # .= creates the entry on first use, so no existence check is needed
    $data{$heading} .= $line;
  }
}

{
  # protect against malicious users
  my $motif_cmp = quotemeta $motif;

  for my $heading ( keys %data ){
    my $data = $data{$heading};

    if( $data =~ /$motif_cmp/ ){
      print "String $motif found under Heading $heading\n";
      exit 0;
    }
  }

  die "String $motif not found anywhere in file $filename\n";
}

How do I search for a string in a file with different headings?
