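Remove a string from an array when this string matches part of a sentence - Perl

Date: 2014-11-21 20:31:55

Tags: regex perl

I am matching multiple patterns in a string to populate an array. The input file looks like this: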

I love cat [chats;chaton;chatterie] and rabbit [lapins;lapereau] # J'aime les chats et les lapins # 2.8
My father [père;parent;papa] lives in New-York # Mon père vit à New-York     # 1.8

I use this code:

use strict;
use warnings;
use Data::Dump;

open(TEXT, "<", "$ARGV[0]") 
    or die "cannot open < $ARGV[0]: $!";

while(my $text = <TEXT>)
{
    my @lines = split /\n/, $text;

    foreach my $line (@lines) {
        if ($line =~ /(^(.+)\t(.+)\t(.+)$)/){
            my $english_sentence = $2;
            my $french_sentence = $3;
            my $score = $4;

            print $english_sentence."#".$french_sentence."";

            my @data = map [ split /;/ ], $line =~ / \[ ( [^\[\]]+ ) \] /xg;
            dd \@data;
        }   
        print "\n";
    }
}
close TEXT;

This is the output:

I love cat [chats;chaton;chatterie] and rabbit [lapins;lapereau] # J'aime les chats et les lapins
Array==>[["chats", "chaton", "chatterie"], ["lapins", "lapereau"]]

My father [père;parent;papa] lives in New-York # Mon père vit à New-York
Array==>[["père", "parent", "papa"]]

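I need to remove strings from the array depending on whether they match part of the sentence: only the strings that appear in the French sentence should be kept. In the end, I would like to get this result: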

 I love cat [chats;chaton;chatterie] and rabbit [lapins;lapereau] # J'aime les chats et les lapins
 [["chats"], ["lapins"]]

 My father [père;parent;papa] lives in New-York # Mon père vit à New-York
 [["père"]]

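2 answers:

Answer 0 (score: 1)

This does what you ask. It simply uses grep with a regular expression to reduce each list to only those words that appear in the French sentence.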

use utf8;
use strict;
use warnings;
use 5.010;
use autodie;

use open qw/ :std :encoding(UTF-8) /;

use Data::Dump;

open my $fh, '<', 'sentences.txt';

while (<$fh>) {

  my @sentences = split /\s*#\s*/;
  next unless @sentences == 3;

  print join(' # ', @sentences[0,1]), "\n";

  my @data = map [ split /;/ ], $sentences[0] =~ / \[ ( [^\[\]]+ ) \] /xg;
  $_ = [ grep { $sentences[1] =~ /\b\Q$_\E\b/ } @$_ ] for @data;

  dd \@data;
  print "\n";
}

Output

I love cat [chats;chaton;chatterie] and rabbit [lapins;lapereau] # J'aime les chats et les lapins
[["chats"], ["lapins"]]

My father [père;parent;papa] lives in New-York # Mon père vit à New-York
[["p\xE8re"]]
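
The filtering step can be looked at on its own. The following is a minimal sketch, not part of the original answer: `keep_matching` is a hypothetical helper name, and the sample sentence is copied from the question. It applies the same `grep` with `\b\Q...\E\b` to a single list of candidates.

```perl
use strict;
use warnings;

# Hypothetical helper (not in the original answer): keep only the
# candidates that occur as whole words in $sentence.
sub keep_matching {
    my ($sentence, @candidates) = @_;
    # \Q...\E quotes regex metacharacters in the candidate,
    # \b anchors the match at word boundaries.
    return grep { $sentence =~ /\b\Q$_\E\b/ } @candidates;
}

my $french = "J'aime les chats et les lapins";
my @kept   = keep_matching($french, qw(chats chaton chatterie));
print join(';', @kept), "\n";
```

The `\b` anchors are what stop a shorter candidate from matching inside a longer word; without them a candidate such as `chat` would also match `chats`.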

Update

As requested, this code modifies the word lists in place so that they contain only the words that appear in the translation.

use utf8;
use strict;
use warnings;
use 5.010;
use autodie;

use open qw/ :std :utf8 /;

open my $fh, '<', 'sentences.txt';

while (<$fh>) {

  my @sentences = split /\s*#\s*/;
  next unless @sentences == 3;

  print join(' # ', @sentences[0,1]), "\n";

  $sentences[0] =~ s{ \[ ( [^\[\]]+ ) \] }{
    my @words = split /;/, $1;
    @words = grep { $sentences[1] =~ /\b\Q$_\E\b/ } @words;
    sprintf "[%s]", join ';', @words;
  }exg;

  print join(' # ', @sentences[0,1]), "\n\n";

}

Output

I love cat [chats;chaton;chatterie] and rabbit [lapins;lapereau] # J'aime les chats et les lapins
I love cat [chats] and rabbit [lapins] # J'aime les chats et les lapins

My father [père;parent;papa] lives in New-York # Mon père vit à New-York
My father [père] lives in New-York # Mon père vit à New-York
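
As a rough sketch of the same substitution outside the read loop, the version below wraps it in a function that returns a modified copy. The name `filter_brackets` and the second sample sentence are assumptions made for illustration; the `s{...}{...}exg` pattern itself follows the answer.

```perl
use strict;
use warnings;

# Hypothetical helper: return a copy of $english in which every
# [a;b;c] group keeps only the words found in $french.
sub filter_brackets {
    my ($english, $french) = @_;
    # /e evaluates the replacement as code, /x allows the spaced-out
    # pattern, /g processes every bracketed group in the string.
    $english =~ s{ \[ ( [^\[\]]+ ) \] }{
        my @words = grep { $french =~ /\b\Q$_\E\b/ } split /;/, $1;
        '[' . join(';', @words) . ']';
    }exg;
    return $english;
}

print filter_brackets(
    'I love cat [chats;chaton;chatterie] and rabbit [lapins;lapereau]',
    "J'aime les chats et les lapins",
), "\n";

# Made-up sample: none of the candidates occurs in the French sentence.
print filter_brackets('a dog [chien;toutou]', "J'aime les chats"), "\n";
```

One thing to be aware of: when none of the candidates appears in the French sentence, the group is left as empty brackets, `[]`, rather than being removed.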

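Answer 1 (score: 0)

You can also do this by building a hash of the words in the French sentence. This may be faster because it avoids the third regular expression.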

use strict;
use warnings;

while (<DATA>) {
    my ($English, $French, $repl, %FrWords);
    if ( ($English, $French) = m/^([^#]*)\#([^#]*)\#/ ) {
        @FrWords{ split /\h+/, $French } = undef;
        $English =~ s{ \[ ([^\[\]]*) \] }{
                 $repl = join( ';', grep { exists $FrWords{$_} } split /;/, $1 );
                 '['. (length($repl) ? $repl : '') .']';
            }xeg;
        print $English, '#', $French, "\n";
    }
}
__DATA__

I love cat [chats;chaton;chatterie] and rabbit [lapins;lapereau] # J'aime les chats et les lapins # 2.8
My father [père;parent;papa] lives in New-York # Mon père vit à New-York     # 1.8

Output

I love cat [chats] and rabbit [lapins] # J'aime les chats et les lapins 
My father [père] lives in New-York # Mon père vit à New-York
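
Because the hash keys above are whitespace-separated tokens, a word followed by punctuation (for example `lapins.`) would not be found. The sketch below is one possible variation, not the answer's code: `word_set` is a hypothetical helper, the punctuated sample sentence is made up, and tokenising with `/\w+/g` is an assumption swapped in for the `split /\h+/`.

```perl
use strict;
use warnings;

# Hypothetical helper: build a lookup hash of the words in a sentence.
# Tokenising with /\w+/ (an assumption; the answer splits on whitespace)
# keeps trailing punctuation out of the keys.
sub word_set {
    my ($sentence) = @_;
    my %seen;
    $seen{$_} = 1 for $sentence =~ /\w+/g;
    return \%seen;
}

my $fr_words = word_set("J'aime les chats, et les lapins.");
# Each candidate is a single hash lookup, no regex per candidate.
my @kept = grep { $fr_words->{$_} } qw(chats chaton lapins lapereau);
print join(';', @kept), "\n";
```

For accented words such as père, `\w` only matches the accented letters if the input has been decoded to characters, as the first answer does with `use open`.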