Question

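I am using Perl and regular expressions to parse entries out of a (poorly formatted) input text file. My code stores the contents of the input file in $genes, and I have defined a regex with capture groups to store the interesting bits in three variables: $number, $name, and $sequence (see the Script.pl excerpt below).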

This all works perfectly until I try to print the value of $sequence. I am trying to put quotes around the values, and my output looks like this:

Number: '132'
Name: 'rps12 AmtrCp046'
'equence: 'ATGAATCTCAATGACCAAGAATTGGCAATTGACACTGAAAGGAACTATAGAATACCTGGAATCTCACAAA

Number: '134'
Name: 'psbA AmtrCp001'
'equence: 'ATGATCCCTACCTTATTGACCGCAACTTCTGTATTTATTATCGCCTTCATTGCGGCTCCTCCAGTAGATA

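Note that the S missing from "Sequence" has been replaced by a single quote, and that the sequence itself does not have quotes around it as I expected. I cannot figure out why the print statement for $sequence behaves so strangely. I suspect something is wrong with my regex, but I have no idea what it might be. Any help would be greatly appreciated!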

Script.pl excerpt

while ($genes =~ />([0-9]+)\s+([A-Za-z]+)\|([A-Za-z]+)\|([A-Za-z0-9]*\s+[A-Za-z0-9]+)\s+([ACGT]+\s)/g) {
   # Get the value of the first capture group in the matched string (the first bit of stuff in parentheses)
   # ([0-9]+)
   $number = $1;

   # Get the value of the fourth capture group
   # ([A-Za-z0-9]*\s+[A-Za-z0-9]+)
   $name = $4;

   # Get the value of the fifth capture group
   # ([ACGT]+\s)
   $sequence = $5;

   print "Number: \'" . $number . "\'\n";
   print "Name: \'" . $name . "\'\n";
   print "Sequence: \'" . $sequence . "\'\n";
   print "\n";
}

Input file excerpt

>132          gnl|Ambtr|rps12 AmtrCp046
ATGAATCTCAATGACCAAGAATTGGCAATTGACACTGAAAGGAACTATAGAATACCTGGAATCTCACAAA
AATCTGAATTTTTAGAAATTGTTCATTCAATTAATTTCAAATAACATATTCGTGGAATACGATTCACTTT
CAAGATGCCTTGATGGTGAAATGGTAGACACGCGAGACTCAAAATCTCGTGCTAAAGAGCGTGGAGGTTC
GAGTCCTCTTCAAGGCATTGAGAATGCTCATTGAATGAGCAATTCAATAACAGAAACAGATCTCGGATCT
AATCGATATTGGCAAGTTTCATACGAAGTATTCCGGCGATCCCCACGATCCGAGTCCGAGCTGTTGTTTG
ATTTAGTTATTCAGTTAAACCA

>134          gnl|Ambtr|psbA AmtrCp001
ATGATCCCTACCTTATTGACCGCAACTTCTGTATTTATTATCGCCTTCATTGCGGCTCCTCCAGTAGATA
TTGATGGGATCCGTGAACCTGTTTCTGGTTCTCTACTTTATGGAAACAATATTCTTTCTGGTGCCATTAT
TCCAACCTCTGCAGCTATAGGTTTGCATTTTTACCCAATATGGGAAGCGGCATCCGTTGATGAATGGTTA
TACAATGGTGGTCCTTATGAGTTAATTGTCCTACACTTCTTACTTAGTGTAGCTTGTTACATGGGTCGTG
AGTGGGAACTTAGTTTCCGTCTGGGTATGCGCCCTTGGATTGCTGTTGCATATTCAGCTCCTGTTGCAGC
TGCTACTGCTGTTTTCTTGATCTACCCTATTGGTCAAGGAAGTTTCTCAGATGGTATGCCTCTAGGAATA
TCTGGTATTTTCAACTTGATGATTGTATTCCAGGCGGAGCACAACATCCTTATGCACCCATTTCACATGT
TAGGCGTAGCTGGTGTATTCGGCGGCTCCCTATTCAGTGCTATGCATGGTTCCTTGGTAACCTCTAGTTT
GATCAGGGAAACCACTGAAAATGAGTCTGCTAATGCAGGTTACAGATTCGGTCAAGAGGAAGAAACCTAT
AATATCGTAGCTGCTCATGGTTATTTTGGTCGATTGATCTTCCAATATGCTAGTTTCAACAATTCTCGTT
CCTTACATTTCTTCCTAGCTGCTTGGCCCGTAGTAGGTATTTGGTTCACTGCTTTGGGTATTAGCACTAT
GGCTTTCAACCTAAATGGTTTCAATTTCAACCAATCCGTAGTTGACAGTCAAGGTCGTGTCATCAACACT
TGGGCTGATATAATCAACCGTGCTAACCTTGGTATGGAAGTTATGCATGAACGTAATGCTCACAATTTCC
CTCTAGACTTAGCTGCTGTTGAAGCTCCATCTACAAATGGATAA

Answer 1

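The input file appears to use CR+LF line endings. The \s inside the last pair of capturing parentheses matches the CR, so the CR ends up in $sequence. When the value is printed, the CR moves the cursor back to the start of the line, and the closing quote is then printed there, overwriting the "S" of "Sequence".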
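One way to confirm this diagnosis is to print the captured value with its control characters made visible. The following is a minimal sketch, not part of the original answer: the helper names capture_sequence and show_invisible and the shortened sample record are hypothetical, and the sample assumes the real file uses CR+LF.

```perl
use strict;
use warnings;

# Return the fifth capture group of the question's original pattern,
# or nothing if the text does not match.
sub capture_sequence {
    my ($genes) = @_;
    if ($genes =~ />([0-9]+)\s+([A-Za-z]+)\|([A-Za-z]+)\|([A-Za-z0-9]*\s+[A-Za-z0-9]+)\s+([ACGT]+\s)/) {
        return $5;
    }
    return;
}

# Replace CR and LF with visible escapes so they show up in the output.
sub show_invisible {
    my ($text) = @_;
    $text =~ s/\r/\\r/g;
    $text =~ s/\n/\\n/g;
    return $text;
}

# Hypothetical, shortened record with CR+LF line endings (an assumption
# about the real input file).
my $sample = ">132          gnl|Ambtr|rps12 AmtrCp046\r\nATGAATCTCAATGACC\r\nAATCTGAATTTTTAGA\r\n";

print "Sequence: '", show_invisible(capture_sequence($sample)), "'\n";
# prints: Sequence: 'ATGAATCTCAATGACC\r'
```

The trailing \r in the output is the character that sends the cursor back to column 0 before the closing quote is written.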

Solution: do not capture the trailing whitespace in the variable.

$genes =~ />([0-9]+)\s+([A-Za-z]+)\|([A-Za-z]+)\|([A-Za-z0-9]*\s+[A-Za-z0-9]+)\s+([ACGT]+)\s/g
#                                                                                        ^^^
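As a rough illustration of the corrected pattern in a loop, the sketch below collects each record into a hash. The function name parse_genes, the hash keys, and the shortened two-record sample are assumptions made for this example; only the regex comes from the answer.

```perl
use strict;
use warnings;

# Collect number, name, and sequence for every record in the text.
# The \s after the last capture group now sits outside the parentheses,
# so a trailing CR is matched but not stored.
sub parse_genes {
    my ($genes) = @_;
    my @records;
    while ($genes =~ />([0-9]+)\s+([A-Za-z]+)\|([A-Za-z]+)\|([A-Za-z0-9]*\s+[A-Za-z0-9]+)\s+([ACGT]+)\s/g) {
        push @records, { number => $1, name => $4, sequence => $5 };
    }
    return @records;
}

# Hypothetical, shortened two-record sample with CR+LF line endings.
my $sample = ">132          gnl|Ambtr|rps12 AmtrCp046\r\nATGAATCTC\r\nAATCTGAAT\r\n"
           . "\r\n"
           . ">134          gnl|Ambtr|psbA AmtrCp001\r\nATGATCCCT\r\nTTGATGGGA\r\n";

for my $record (parse_genes($sample)) {
    print "Number: '$record->{number}'\n";
    print "Name: '$record->{name}'\n";
    print "Sequence: '$record->{sequence}'\n";
    print "\n";
}
```

As in the question's own output, only the first line of each sequence is captured, because [ACGT]+ stops at the first line break.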

Answer 2

  while ($genes =~ m/^.*?([0-9]+).*\|([\w ]+)(.+)$/simg) {

   # Get the value of the first capture group
   $number = $1;

   # Get the value of the second capture group
   $name = $2;

   # Get the value of the third capture group
   # (.+)
   $sequence = $3;

   print "Number: \'" . $number . "\'\n";
   print "Name: \'" . $name . "\'\n";
   print "Sequence: \'" . $sequence . "\'\n";
   print "\n";
}

Explanation

Options: dot matches newline; case insensitive; ^ and $ match at line breaks

Assert position at the beginning of a line (at beginning of the string or after a line break character) «^»
Match any single character «.*?»
   Between zero and unlimited times, as few times as possible, expanding as needed (lazy) «*?»
Match the regular expression below and capture its match into backreference number 1 «([0-9]+)»
   Match a single character in the range between “0” and “9” «[0-9]+»
      Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Match any single character «.*»
   Between zero and unlimited times, as many times as possible, giving back as needed (greedy) «*»
Match the character “|” literally «\|»
Match the regular expression below and capture its match into backreference number 2 «([\w ]+)»
   Match a single character present in the list below «[\w ]+»
      Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
      A word character (letters, digits, and underscores) «\w»
      The character “ ” « »
Match the regular expression below and capture its match into backreference number 3 «(.+)»
   Match any single character «.+»
      Between one and unlimited times, as many times as possible, giving back as needed (greedy) «+»
Assert position at the end of a line (at the end of the string or before a line break character) «$»
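
The breakdown above can also be written into the pattern itself with the /x modifier, which lets each piece carry a comment. This is a sketch under stated assumptions: the function name parse_record and the shortened sample are hypothetical, and stripping whitespace from the third group is an added step, needed here because with "dot matches newline" that group includes the line breaks. The sample deliberately holds a single record, since the greedy .* can run past the first header line when several records share one string.

```perl
use strict;
use warnings;

# The pattern from this answer, spread out with /x and commented.
my $re = qr{
    ^            # start of a line
    .*?          # any characters, as few as possible
    ([0-9]+)     # group 1: a run of digits
    .*           # any characters, as many as possible
    \|           # a literal pipe character
    ([\w ]+)     # group 2: word characters and spaces
    (.+)         # group 3: one or more of any character
    $            # end of a line or end of the string
}xsim;

# Parse one record; returns (number, name, sequence) or an empty list.
sub parse_record {
    my ($text) = @_;
    my ($number, $name, $sequence) = $text =~ $re
        or return;
    # Assumption: the caller wants the sequence without line breaks.
    $sequence =~ s/\s+//g;
    return ($number, $name, $sequence);
}

# Hypothetical, shortened single record with LF line endings.
my $sample = ">134          gnl|Ambtr|psbA AmtrCp001\nATGATCCC\nTTGATGGG\n";

my ($number, $name, $sequence) = parse_record($sample);
print "Number: '$number'\n";
print "Name: '$name'\n";
print "Sequence: '$sequence'\n";
```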

Perl - very strange output from a regex match, indeed
