Question

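I can't figure out how to handle a particular regex problem.

Say I have a big string made up of a bunch of phrases in square brackets. Each phrase has a phrase label (e.g. S or VP), a token (e.g. w or wSf), a slash right after the token, and then a description of the token (e.g. CC or VBD_MS3).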

So here's an example string:

[S w#/CC] [VP mSf/VBD_MS3]

I want to delete the whole first bracketed phrase and put the w from it into the second phrase, like this:

[VP wmSf/VBD_MS3]

Is that even possible with regular expressions?

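Edit: OK, the pattern is:

[ <label> w#/<label>] [<label> <word>/<label> <word>/<label> <word>/<label>...]

(the second bracketed phrase can have anywhere from one to any number of <word>/<label> pairs)

where <label> can be any sequence of capital letters, possibly including underscores, and <word> can be any sequence of non-whitespace characters (i.e. digits, letters, special characters).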

Answer 1

Yes:

s|\[S w#/CC\] \[(VP) (mSf/VBD_MS3)\]|[$1 w$2]|;

Now, what pattern are you actually looking for?

You can even do this:

s|\[S (w)#/CC\] \[(VP) (mSf/VBD_MS3)\]|[$2 $1$3]|;

Answer 2

Without knowing the actual form or context, one of these might work (untested):

s{\[S (\w+)#/\w+\] (\[VP )(\w+/\w+\])}{$2$1$3}g
or
s{\[(?:S|VP) (\w+)#/\w+\] (\[(?:S|VP) )(\w+/\w+\])}{$2$1$3}g
or
s{\[(?:S|VP)\s+(\w+)#/\w+\]\s+(\[(?:S|VP)\s+)(\w+/\w+\])}{$2$1$3}g
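Since these substitutions are marked as untested, a small harness like the following could be used to check them against the sample input. This is only a sketch: `merge_with` and `@candidates` are names made up for illustration, and the patterns assume the intended alternation is `(?:S|VP)`.

```perl
use strict;
use warnings;

# Hypothetical helper: apply one candidate pattern to a copy of the input
# and return the result, leaving the original string untouched.
sub merge_with {
    my ($regex, $input) = @_;
    (my $copy = $input) =~ s{$regex}{$2$1$3}g;
    return $copy;
}

# The three candidate patterns, from most specific to most lenient.
my @candidates = (
    qr{\[S (\w+)#/\w+\] (\[VP )(\w+/\w+\])},
    qr{\[(?:S|VP) (\w+)#/\w+\] (\[(?:S|VP) )(\w+/\w+\])},
    qr{\[(?:S|VP)\s+(\w+)#/\w+\]\s+(\[(?:S|VP)\s+)(\w+/\w+\])},
);

for my $regex (@candidates) {
    print merge_with($regex, '[S w#/CC] [VP mSf/VBD_MS3]'), "\n";
}
```

Note that all three only accept `\w+` words and a single word/label pair in the second phrase, so they are narrower than the general pattern described in the question.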

Revised: since your edit now includes this pattern
[ <label> w#/<label>] [<label> <word>/<label> <word>/<label> <word>/<label>...]
it is easier to come up with a regex that should work.

Good luck!

use strict;
use warnings;

$/ = undef;
my $data = <DATA>;

my $regex = qr{
    \[\s*                        #= Start of token phrase '['
    (?&label) \s+                # <label> then whitespace's
    ((?&word))                   # Capture $1 - token word, end grp $1
    [#]/(?&label)                # '#'/<label>
    \s*
    \]                           #= End of token phrase ']'
    \s*
    (                            # Capture grp $2
       \[\s*                     #= Start of normal phrase '['
       (?&label) \s+             # <label> then whitespace's
    )                            # End grp $2
    (                            # Capture grp $3
       (?&word)/(?&label)        # First <word>/<label> pair
       (?:
          \s+(?&word)/(?&label)  # Optional, many <word>/<label> pair's
       )*
       \s*
       \]                        #= End of normal phrase ']'
    )                            # End grp $3

    (?(DEFINE)                   ## DEFINE's:
       (?<label> \w+)            # <label> - 1 or more word characters
       (?<word> [^\s\[\]]+ )     # <word> - 1 or more NOT whitespace, '[' nor ']'
    )
}x;

$data =~ s/$regex/$2$1$3/g;

print $data;

__DATA__
[S w#/CC] [VP mSf/VBD_MS3]

Output:
[VP wmSf/VBD_MS3]

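EDIT2
"If the token phrase's label is PP, and the next phrase's label is NP, then also change the next phrase's label to PP when joining. E.g. input: [PP w#/IN] [NP something/NN], output: [PP wsomething/NN]"

Sure, this can be done with a callback, without adding too many new capture groups. There are actually many ways to do it, including regex conditionals. I think the simplest is a callback, where all the label-decision logic can live.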

use strict;
use warnings;

$/ = undef;
my $data = <DATA>;

my $regex = qr{
    ( \[\s*                      # 1 - Token phrase label
       (?&label)
       \s+
    )
    (                            # 2 - Token word
       (?&word)
    )
    [#]/(?&label)
    \s*
    \]
    \s*
    ( \[\s*                      # 3 - Normal phrase label
       (?&label)
       \s+
    )
    # insert token word ($2) here
    (                            # 4 - The rest ..
       (?&word)/(?&label)
       (?: \s+ (?&word)/(?&label) )*
       \s*
       \]
    )

    (?(DEFINE)                   ## DEFINE's:
       (?<label> \w+)            # <label> - 1 or more word characters
       (?<word> [^\s\[\]]+ )     # <word> - 1 or more NOT whitespace, '[' nor ']'
    )
}x;

$data =~ s/$regex/ checkLabel($1,$3) ."$2$4"/eg;

sub checkLabel
{
    my ($p1, $p2) = @_;
    if ($p1 =~ /\[\s*PP\s/ && $p2 =~ /(\[\s*)NP(\s)/) {
        return $1.'PP'.$2;
        # To use the formatting of the token label, just 'return $p1;'
    }
    return $p2;
}

print $data;

__DATA__
[PP w#/CC] [ NP mSf/VBD_MS3]

Answer 3

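Rather than creating one magic regex to do the whole job, why not split the line into phrases, operate on them, and then put them back together? Then you can simply follow the same logic you just explained.

This is cleaner, more readable (especially if you add comments) and more robust. Of course you will need to tailor it to your needs: for example, you might want to turn the /-separated parts into key/value pairs (does order matter? if not, use a hashref); if you never need to modify the labels, maybe you don't need to split on / at all; and so on.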
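As a rough illustration of the "pairs" idea, a phrase could be parsed into a label plus an ordered list of word/label pairs and then rebuilt. This is a minimal sketch, not a definitive implementation: `parse_phrase` and `format_phrase` are hypothetical names, and it assumes order matters (so an array of pairs rather than a hash) and that each token is split on its last slash, since a word may itself contain a slash.

```perl
use strict;
use warnings;

# Hypothetical helper: parse one bracketed phrase such as "[VP mSf/VBD_MS3]"
# into { label => 'VP', pairs => [ [ 'mSf', 'VBD_MS3' ] ] }.
# Returns nothing if the text contains no bracketed phrase.
sub parse_phrase {
    my ($text) = @_;
    my ($inner) = $text =~ /\[\s*([^\]]+?)\s*\]/ or return;
    my ($label, @tokens) = split ' ', $inner;

    my @pairs;
    for my $token (@tokens) {
        # Split on the LAST slash; keep tokens without a slash as-is.
        if ($token =~ m{^(.*)/([^/]*)$}) {
            push @pairs, [ $1, $2 ];
        }
        else {
            push @pairs, [ $token ];
        }
    }
    return { label => $label, pairs => \@pairs };
}

# Inverse of parse_phrase: rebuild the bracketed text.
sub format_phrase {
    my ($phrase) = @_;
    my @tokens = map { join('/', @$_) } @{ $phrase->{pairs} };
    return '[' . join(' ', $phrase->{label}, @tokens) . ']';
}

my $phrase = parse_phrase('[VP mSf/VBD_MS3 foo/NN]');
$phrase->{pairs}[0][0] = 'w' . $phrase->{pairs}[0][0];    # prepend the token word
print format_phrase($phrase), "\n";
```

With the phrase in this shape, changing a label or a word becomes a plain data-structure update rather than another regex.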

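Edit per the comments: this takes the literal w before the #, stores it, removes that phrase, and then prepends the w to the next phrase. If that's what you need, then have at it. I'm sure there are edge cases to watch for, so back up and test first!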

#!/usr/bin/env perl

use strict;
use warnings;

while( my $line = <DATA> ) {
  #separate phrases, then split phrases into whitespace separated pieces
  my @phrases = map { [split /[\s]/] } ($line =~ /\[([^]]+)\]/g);

  my $holder; # holder for 'w' (not really needed if always 'w')
  foreach my $p (@phrases) { # for each phrase
    if ($p->[1] =~ /(w)#/) { # if the second part has 'w#'
      $holder = $1; # keep the 'w' in holder
      $p = undef; #empty to mark for cleaning later
      next; #move to next phrase
    }

    if ($holder) { #if the holder is not empty
      $p->[1] = $holder . $p->[1]; # add the contents of the holder to the second part of this phrase
      $holder = undef; # and then empty the holder
    }
  }

  #remove emptied phrases
  @phrases = grep { $_ } @phrases;

  #reconstitute the line
  print join( ' ', map { '[' . join(' ', @$_) . ']' } @phrases), "\n";
}

__DATA__
[S w#/CC] [VP mSf/VBD_MS3]

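Once again, what you can do with a single regex may look magical, but what happens when your boss comes in and says "you know, that thing you wrote to do X works great, but now it needs to do Y too"? That is why I like to keep the logic for each step cleanly separated.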
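To make that point concrete, here is one way the PP/NP relabeling request quoted in Answer 2's EDIT2 might be added to the split-based approach as a single extra line. It is a sketch under assumptions: `merge_token_phrases` is a made-up name, a token phrase is assumed to be exactly `<label> <word>#/<label>`, and a token phrase with nothing after it is simply dropped.

```perl
use strict;
use warnings;

# Hypothetical function: merge each "<label> <word>#/<label>" phrase into the
# phrase that follows it, relabeling NP to PP when the token phrase was PP.
sub merge_token_phrases {
    my ($line) = @_;

    # Separate phrases, then split each into whitespace-separated pieces.
    my @phrases = map { [ split ' ' ] } ($line =~ /\[([^\]]+)\]/g);

    my @out;
    my $pending;    # [label, word] of a token phrase waiting to be merged
    for my $p (@phrases) {
        if (@$p == 2 && $p->[1] =~ m{^(.+)#/\w+$}) {
            $pending = [ $p->[0], $1 ];
            next;    # drop the token phrase itself
        }
        if ($pending && @$p >= 2) {
            my ($token_label, $word) = @$pending;
            # Step X: prepend the stored word to the first word of this phrase.
            $p->[1] = $word . $p->[1];
            # Step Y: the new requirement, kept separate from step X.
            $p->[0] = 'PP' if $token_label eq 'PP' && $p->[0] eq 'NP';
        }
        $pending = undef;
        push @out, '[' . join(' ', @$p) . ']';
    }
    return join ' ', @out;
}

print merge_token_phrases('[S w#/CC] [VP mSf/VBD_MS3]'), "\n";
print merge_token_phrases('[PP w#/IN] [NP something/NN]'), "\n";
```

The relabeling rule lives in one line that can be changed or removed without touching the parsing or the merge step.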

Answer 4

#!/usr/bin/env perl
use strict;
use warnings;
my $str = "[S w#/CC] [VP mSf/VBD_MS3]";
$str =~ s{\[S w#/CC\]\s*(\[VP\s)(.+)}{$1w$2} and print $str;
