Question

I am parsing a tab-delimited file line by line:

Root rootrank 1 Bacteria domain .72 Firmicutes phylum 1 Clostridia class 1 etc.

while (my $line = <$fh>) {
    chomp($line);
}

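On each line, I want to capture the first entry before and after a particular match. For example, for the match phylum, I want to capture the entries Firmicutes and 1. For the match domain, I want to capture the entries Bacteria and .72. How do I write a regular expression to do this?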

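Side note: I can't simply split each line into an array and use fixed indices, because sometimes a category is missing or there are extra categories, which shifts the entries by one or two positions. I would also like to avoid writing blocks of if statements.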

Answer 1

You can still split the input, then map each word to its index and use the index of the matching word to extract the neighbouring cells:

#!/usr/bin/perl
use warnings;
use strict;
use feature qw{ say };

my @matches = qw( phylum domain );

while (<>) {
    chomp;
    my @cells = split /\t/;
    my %indices;
    @indices{ @cells } = 0 .. $#cells;
    for my $match (@matches) {
        if (defined( my $index = $indices{$match} )) {
            say join "\t", @cells[ $index - 1 .. $index + 1 ];
        }
    }
}

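What's missing:

You should handle $index == 0 and $index == $#cells, where the match has no cell before or after it.
You should handle the case where a word is repeated within a line, because the hash keeps only the last index for each word.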
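
One way to cover both gaps is to scan the cells directly instead of building a hash. The following is only a sketch: the helper name neighbours, the sample line, and the "-" placeholder for a missing neighbour are assumptions made for illustration, not part of the answer above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical helper: returns one [before, word, after] triple for every
# occurrence of $match in @cells. A neighbour that does not exist (match in
# the first or last cell) is returned as undef.
sub neighbours {
    my ($match, @cells) = @_;
    my @found;
    for my $i (0 .. $#cells) {
        next unless $cells[$i] eq $match;
        my $before = $i > 0       ? $cells[$i - 1] : undef;
        my $after  = $i < $#cells ? $cells[$i + 1] : undef;
        push @found, [ $before, $match, $after ];
    }
    return @found;
}

# Assumed sample line: "domain" is the first cell and "phylum" occurs twice,
# the second time as the last cell.
my $line = join "\t", qw( domain .72 Firmicutes phylum 1 Clostridia phylum );

for my $match (qw( phylum domain )) {
    for my $triple (neighbours($match, split /\t/, $line)) {
        # Print "-" where a neighbour is missing.
        print join("\t", map { defined $_ ? $_ : '-' } @$triple), "\n";
    }
}
```

Scanning every cell costs one pass per keyword, which should be negligible for lines of this size, and it reports every occurrence rather than only the last one.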

Answer 2

use strict;
use warnings;

my $file = "file2.txt";
open my $fh, '<', $file or die "Unable to open the file $file for reading: $!\n";
while (my $line = <$fh>) {
    chomp $line;
    # Walk the line as successive (entry, category, number) triples.
    while ($line =~ /(\w+)\s+(\w+)\s+(\.?\d+)/g) {
        my ($before, $match, $after) = ($1, $2, $3);
        print "Before: $before  Match: $match  After: $after\n";
    }
}
close $fh;

Answer 3

You can simply use the following regular expression to capture the words before and after the matched word:

(?<LHS>[\w.]+)[\s\t](?<MATCH>.*?)[\s\t](?<RHS>[\w.]+)

See the demo / explanation.
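
As a rough sketch of how named captures like these can be read back in Perl, the snippet below puts a literal keyword in the middle position and fetches the neighbours from the %+ hash. The group names LHS and RHS, the sample line, and the keyword list are assumptions chosen for this illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Assumed sample line; real input would come from the tab-delimited file.
my $line = join "\t", qw( Root rootrank 1 Bacteria domain .72 Firmicutes phylum 1 );

for my $word (qw( phylum domain )) {
    # \Q...\E quotes the keyword so it is matched literally;
    # the named groups are available through %+ after a successful match.
    if ($line =~ /(?<LHS>[\w.]+)\t\Q$word\E\t(?<RHS>[\w.]+)/) {
        printf "%s: before=%s after=%s\n", $word, $+{LHS}, $+{RHS};
    }
}
```

Fixing the middle group to a keyword avoids the lazy .*? group matching whatever cell happens to come first on the line.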

Answer 4

You can do it like this:

#!/usr/bin/perl
use Modern::Perl;

my @words = qw(phylum domain);
while(<DATA>) {
    chomp;
    for my $word (@words) {
        my ($before, $after) = $_ =~ /(\S+)(?:\t\Q$word\E\t)(\S+)/i;
        say "word: $word\tbefore: $before\tafter: $after";
    }
}

__DATA__
Root rootrank 1 Bacteria domain .72 Firmicutes phylum 1 Clostridia class 1 etc.

Output:

word: phylum    before: Firmicutes  after: 1
word: domain    before: Bacteria    after: .72

Perl Regex - Get text before and after a match

4 answers: