Question

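I have a file containing lines like the following:

gi|170570902|gb|ABLA01000008.1| 0.457 24 0.581 24 0.876 11 0.744 0.669 Y 0.450 SignalP-noTM

For each line in the file, I want to extract the number in the second field (170570902 above) and push it onto an array. I am trying to grep for the number and extract it from the matching line, but I can't seem to find the right way to do it.

Here is what I came up with: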

while ($sec_gi = <IN_SIDS>){
    $sec_gi =~ s/[0-9]{5,}/$&/;
    print $sec_gi."\n";
}

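$& is supposed to be the entire matched string. With this, I get the matched line minus the matched pattern, which is the exact opposite of what I want.

Can anyone help?

Thanks!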

Answer 1

It looks like split is the simplest solution (ETA: optimized):

my @array;
while (<IN_SIDS>) {
    # A limit of 3 stops splitting once the second field is isolated
    my $nums = (split /\|/, $_, 3)[1];
    print "$nums\n";
    push @array, $nums;
}

I ran a benchmark to compare its efficiency with the regex solution:

#!/usr/bin/perl
use strict;
use warnings;

my $data = "gi|170570902|gb|ABLA01000008.1| 0.457 24 0.581 24 0.876 11 0.744 0.669 Y 0.450 SignalP-noTM";

use Benchmark qw(cmpthese);

cmpthese(shift, {
        'Regex' => \&regex,
        'Split' => \&splitting
    });

sub regex {
    if ($data =~ /^[^|]+\|(\d{5,})\|/) {
        return $1;
    }
}

sub splitting {
    return (split /\|/, $data, 3)[1];
}

The result is a tie:

tlp@ubuntu:~/perl$ perl tx.pl 1000000
           Rate Split Regex
Split 2083333/s    --   -2%
Regex 2127660/s    2%    --

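Thanks to M42 for the suggestion in the comments. I chose the split solution for simplicity and ease of maintenance rather than performance, but as of now it performs on par with the regex solution.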
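
For readers wondering what the third argument to split does here: it caps the number of fields returned, so Perl can stop scanning once the second field has been separated out. A minimal sketch, reusing the sample line from the question (the variable names are only illustrative):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $line = "gi|170570902|gb|ABLA01000008.1| 0.457 24 0.581 24 0.876 11 0.744 0.669 Y 0.450 SignalP-noTM";

# Without a limit, split breaks the line at every pipe.
my @all = split /\|/, $line;

# With a limit of 3, the third element holds the unsplit remainder.
my @limited = split /\|/, $line, 3;

print scalar(@all), " fields without a limit\n";
print scalar(@limited), " fields with a limit of 3\n";
print "second field: $limited[1]\n";
print "remainder:    $limited[2]\n";
```

Either way the second field is the same; the limit only avoids work on the part of the line that is never used.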

Answer 2

You can also do:

# Only use $1 when the match succeeded; otherwise it keeps a stale value
if ($sec_gi =~ /([0-9]{5,})/) {
    print "$1\n";
}
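
As a sketch of how this match could feed the array the question asks for: in list context a match returns its captures, and it returns an empty list on failure, so nothing is pushed for non-matching lines. The in-memory sample lines and the name @ids are assumptions for illustration; in the original script the lines come from the IN_SIDS filehandle.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Illustrative input; the second line deliberately has no 5+ digit run
my @lines = (
    "gi|170570902|gb|ABLA01000008.1| 0.457 24 0.581 24 0.876 11 0.744 0.669 Y 0.450 SignalP-noTM",
    "# comment line without an id",
    "gi|178370902|gb|ABLA01000008.1| 0.457 24 0.581 24 0.876 11 0.744 0.669 Y 0.450 SignalP-noTM",
);

my @ids;
for my $sec_gi (@lines) {
    # List-context match: yields ($1) on success, an empty list on failure
    push @ids, $sec_gi =~ /([0-9]{5,})/;
}

print "$_\n" for @ids;
```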

Answer 3

You can use:

$sec_gi =~ s/.*?\|(\d{5,}).*/$1/;

But if it is always in the second column, you can use split:

@lst = split('\|', $sec_gi );
$sec_gi = $lst[1];

Answer 4

Might as well give you answer #3:

# Declare Array outside the loop
my @my_array;
while ( $sec_gi = <IN_SIDS> ){
    chomp $sec_gi;

    # Test if this field actually exists

    if ( $sec_gi =~ /([0-9]{5,})/ ) {

        # Field exists, push it into your array (or print it)

        push @my_array, $1;
    }
    else {

        # Field doesn't exist: Take appropriate action (which might mean none)

        print "Field not found\n";
    }
}

# Array @my_array has all of your values

yadda, yadda, yadda

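By the way, this will locate the field no matter where in the line it appears. If the number is only ever in field #1 (the second field, counting from zero), you want to use split: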

my @my_array;
while ( $sec_gi = <IN_SIDS> ) {
    chomp $sec_gi;
    @sec_gi_array = split /\|/, $sec_gi;
    if ( $sec_gi_array[1] =~ /[0-9]{5,}/ ) {
         push @my_array, $sec_gi_array[1];
    }
    else {
         print "Field not found\n";
    }
}
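
To illustrate the difference between the two approaches, here is a small sketch using a hypothetical line (not from the original data) whose second field has fewer than five digits. The unanchored regex falls through to the digits inside the accession, while the field-based check reports nothing. The anchored pattern on the field is a slightly stricter assumption than the check above, and the // operator needs Perl 5.10 or later.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical line: the second field has only four digits
my $sec_gi = "gi|1234|gb|ABLA01000008.1| 0.457 24 0.581";

# Unanchored search: takes the first run of 5+ digits anywhere in the line
my ($anywhere) = $sec_gi =~ /([0-9]{5,})/;

# Field-based check: only the second pipe-delimited field is considered
my $field      = (split /\|/, $sec_gi)[1];
my $from_field = $field =~ /^[0-9]{5,}$/ ? $field : undef;

print "anywhere:   ", $anywhere   // "not found", "\n";
print "from field: ", $from_field // "not found", "\n";
```

Here the unanchored search returns 01000008, which comes from the accession rather than the id field, so the split version is the safer choice when the position of the number is known.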

Answer 5

If the value is always your second field, you can use this:

while ($sec_gi = <IN_SIDS>) {
  if ($sec_gi =~ m/^[^|]*\|([^|]+)/) {
    print "$1\n";
  }
}

If the second field is not always the one you want (i.e., you only want values of 5 or more digits, as implied), then you can be more specific:

while ($sec_gi = <IN_SIDS>) {
  if ($sec_gi =~ m/^[^|]*\|(\d{5,})/) {
    print "$1\n";
  }
}

If this is all your Perl script does, you could use the GNU coreutils cut command (man cut) instead.

Answer 6

Assuming you don't have to use grep, the following short program will work.

Hope this helps.

Caitlin

#!/usr/bin/perl
use strict;
use warnings;

my @array;

for ( <DATA> )
{
    push @array, $1 if /gi\|(\d+)\|/;
}

for (@array) {
    print "$_\n";
}

__DATA__
gi|170570902|gb|ABLA01000008.1| 0.457 24 0.581 24 0.876 11 0.744 0.669 Y 0.450 SignalP-noTM
gi|178370902|gb|ABLA01000008.1| 0.457 24 0.581 24 0.876 11 0.744 0.669 Y 0.450 SignalP-noTM
gi|170593502|gb|ABLA01000008.1| 0.457 24 0.581 24 0.876 11 0.744 0.669 Y 0.450 SignalP-noTM
gi|170578993|gb|ABLA01000008.1| 0.457 24 0.581 24 0.876 11 0.744 0.669 Y 0.450 SignalP-noTM
gi|170898368|gb|ABLA01000008.1| 0.457 24 0.581 24 0.876 11 0.744 0.669 Y 0.450 SignalP-noTM

Answer 7

You need to specify a capture group:

while ($sec_gi = <IN_SIDS>) {
    chomp $sec_gi;
    # Lazy .*? so the group captures the first run of 5+ digits in full
    $sec_gi =~ s/^.*?([0-9]{5,}).*$/$1/;
    print $sec_gi."\n";
}
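
One thing to watch for when a capture group is wrapped in .* on both sides: a greedy leading .* consumes as much of the line as it can, so the group ends up holding the tail of the last qualifying digit run rather than the first run. A minimal sketch comparing a greedy and a lazy prefix on the sample line (the variable names are illustrative):

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $line = "gi|170570902|gb|ABLA01000008.1| 0.457 24 0.581 24 0.876 11 0.744 0.669 Y 0.450 SignalP-noTM";

# Greedy prefix: .* gives back only enough to leave five digits for the group
(my $greedy = $line) =~ s/^.*([0-9]{5,}).*$/$1/;

# Lazy prefix: .*? stops at the first position where 5+ digits follow
(my $lazy = $line) =~ s/^.*?([0-9]{5,}).*$/$1/;

print "greedy: $greedy\n";
print "lazy:   $lazy\n";
```

On the sample line the greedy version yields 00008 (the end of the accession), while the lazy version yields 170570902.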

Perl: using grep to extract the pattern-matched substring from lines of a file
