I have a string of text that is divided into phrases, each of which is enclosed in square brackets:
[pX textX/labelX] [pY textY/labelY] [pZ textZ/labelZ] [textA/labelA]
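Sometimes a chunk does not start with a p tag (like the last one above).
My problem is that I need to grab each chunk. Normally this is fine, but sometimes the input is malformed: for example, some chunks may have only one bracket, or none at all. So it may look like this: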
[pX textX/labelX] pY textY/labelY] textZ/labelZ
But it should come out like this:
[pX textX/labelX] [pY textY/labelY] [textZ/labelZ]
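The problem does not involve nested brackets. After digging through a large number of regex solutions from many different people (I am using regex), downloading cheat sheets, and getting a regex tool (Expresso), I still don't know how to do this. Any ideas? Maybe regex is not the right tool. But how is this kind of problem solved? I don't think it is a very unusual problem.
Here is a concrete example: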
$data= "[VP sysmH/VBD_MS3] [PP ll#/IN_DET Axryn/NNS_MP] ,/PUNC w#hm/CC_PRP_MP3] [NP AEDA'/NN] ,/PUNC [PP b#/IN m$Arkp/NN_FS] [NP >HyAnA/NN] ./PUNC";
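Here is a very compact solution from @FailedDev:
while ($data =~ m/(?:\[[^[]*?\]|[^[ ].*?\]|\[[^[ ]*)/g) {
    # matched text = $&
}
But I think two points need to be added to emphasize the problem:
However, since this case is a fixed one (i.e. a PUNCTUATION mark followed by a text/label pattern that has only a closing bracket on the right), I hard-coded it into the solution like this: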
my @stuff;
while ($data =~ m/(?:\[[^[]*?\]|[^[ ].*?\]|\[[^[ ]*)/g) {
    if ($& =~ m/(^[\S]\/PUNC )(.*\])/) {   # match a "./PUNC" mark followed by a "phrase]"
        my @bits = split(/ /, $&);         # split by space
        push(@stuff, $bits[0]);            # just grab the first chunk before the space, a PUNC
        push(@stuff, substr($&, 7));       # after that space is the other chunk
    }
    else {
        push(@stuff, $&);
    }
}
foreach (@stuff) { print "$_\n"; }         # one chunk per line
Trying this on the example I added in my edit, it works fine except for one problem. The last ./PUNC is dropped, so the output is:
[VP sysmH/VBD_MS3]
[PP ll#/IN_DET Axryn/NNS_MP]
,/PUNC
w#hm/CC_PRP_MP3]
[NP AEDA'/NN]
,/PUNC
[PP b#/IN m/NN_FS]
[NP >HyAnA/NN]
How do I keep the last chunk?
Answer 0 (score: 3)
You can use this:
/(?:\[[^[]*?]|[^[ ].*?]|\[[^[ ]*)/
Assuming your string looks like this:
[pX textX/labelX] pY textY/labelY] pY textY/labelY] pY textY/labelY] [pY textY/labelY] [3940-823490-2 [30-94823049 [32904823498]
For example: pY [[[textY/labelY]
Perl-specific solution:
while ($subject =~ m/(?:\[[^[]*?\]|[^[ ].*?\]|\[[^[ ]*)/g) {
# matched text = $&
}
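As a rough illustration of how the loop above behaves, here is a minimal sketch that runs the same pattern over the sample string from this answer. The variable names are only illustrative, and it relies on the fact that a /g match in list context returns every full match when the pattern has no capturing groups, so $& is not needed.

```perl
use strict;
use warnings;

# Sample input copied from the answer above; variable names are illustrative.
my $subject = '[pX textX/labelX] pY textY/labelY] pY textY/labelY] pY textY/labelY] '
            . '[pY textY/labelY] [3940-823490-2 [30-94823049 [32904823498]';

# No capturing groups, so list-context /g returns each whole match.
my @chunks = $subject =~ m/(?:\[[^[]*?\]|[^[ ].*?\]|\[[^[ ]*)/g;

# Print one chunk per line.
print "$_\n" for @chunks;
```

Note that none of the three alternatives can match a chunk that has no bracket at all (such as a trailing textZ/labelZ), which appears to be why the bracketless ./PUNC at the end of the question's data is dropped.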
Update:
/(?:\[[^[]*?]|[^[ ].*?]|\[[^[ ]*|\s+[^[]+?(?:\s+|$))/
This works for your updated string, but you should trim whitespace from the results if needed.
Update 2:
/(\[[^[]*?]|[^[ ].*?]|\[[^[ ]*|\s*[^[]+?(?:\s+|$))/
I suggest opening a separate question, because your original question is completely different from the latest one.
"
( # Match the regular expression below and capture its match into backreference number 1
# Match either the regular expression below (attempting the next alternative only if this one fails)
\[ # Match the character “[” literally
[^[] # Match any character that is NOT a “[”
*? # Between zero and unlimited times, as few times as possible, expanding as needed (lazy)
] # Match the character “]” literally
| # Or match regular expression number 2 below (attempting the next alternative only if this one fails)
[^[ ] # Match a single character NOT present in the list “[ ”
. # Match any single character that is not a line break character
*? # Between zero and unlimited times, as few times as possible, expanding as needed (lazy)
] # Match the character “]” literally
| # Or match regular expression number 3 below (attempting the next alternative only if this one fails)
\[ # Match the character “[” literally
[^[ ] # Match a single character NOT present in the list “[ ”
* # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
| # Or match regular expression number 4 below (the entire group fails if this one fails to match)
\s # Match a single character that is a “whitespace character” (spaces, tabs, line breaks, etc.)
* # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
[^[] # Match any character that is NOT a “[”
+? # Between one and unlimited times, as few times as possible, expanding as needed (lazy)
(?: # Match the regular expression below
# Match either the regular expression below (attempting the next alternative only if this one fails)
\s # Match a single character that is a “whitespace character” (spaces, tabs, line breaks, etc.)
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
| # Or match regular expression number 2 below (the entire group fails if this one fails to match)
$ # Assert position at the end of the string (or before the line break at the end of the string, if any)
)
)
"
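To tie the "Update 2" pattern to the concrete data from the question, here is a minimal sketch that collects capture group 1 and trims the surrounding whitespace, as suggested above. The names $data, $chunk and @chunks are illustrative, and the string is written with q{} on the assumption that "$Arkp" is meant literally (inside double quotes Perl would try to interpolate it).

```perl
use strict;
use warnings;

# q{} keeps "$Arkp" literal; this is an assumption about the intended data.
my $data = q{[VP sysmH/VBD_MS3] [PP ll#/IN_DET Axryn/NNS_MP] ,/PUNC w#hm/CC_PRP_MP3] [NP AEDA'/NN] ,/PUNC [PP b#/IN m$Arkp/NN_FS] [NP >HyAnA/NN] ./PUNC};

my @chunks;
while ($data =~ m/(\[[^[]*?]|[^[ ].*?]|\[[^[ ]*|\s*[^[]+?(?:\s+|$))/g) {
    my $chunk = $1;              # text captured by group 1
    $chunk =~ s/^\s+|\s+$//g;    # the fourth alternative can pick up surrounding spaces
    push @chunks, $chunk;
}

# Hyphens make any leftover leading/trailing whitespace visible.
print "--$_--\n" for @chunks;
```

The trimming is done on a copy ($chunk), so the match position in $data is left alone between iterations.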
Answer 1 (score: 0)
s{
\[?
(?: ([^\]/\s]+) \s+ )?
([^\]/\s]+)
/
([^\]/\s]+)
\]?
}{
'[' .
( defined($1) ? "$1 " : '' ) .
$2 .
'/' .
$3 .
']'
}xeg;
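The substitution above works on $_ and is shown without surrounding code, so here is a minimal sketch of how it might be applied to the malformed sample from the question. It assumes the first character class is meant to read [^\]/\s] like the other two, and the names $data and $fixed are illustrative.

```perl
use strict;
use warnings;

# Malformed sample from the question.
my $data = '[pX textX/labelX] pY textY/labelY] textZ/labelZ';

# Copy $data into $fixed, then rewrite every chunk with both brackets.
(my $fixed = $data) =~ s{
    \[?                       # optional opening bracket
    (?: ([^\]/\s]+) \s+ )?    # optional phrase tag followed by whitespace
    ([^\]/\s]+)               # text
    /
    ([^\]/\s]+)               # label
    \]?                       # optional closing bracket
}{
    '[' . ( defined($1) ? "$1 " : '' ) . $2 . '/' . $3 . ']'
}xeg;

print "$fixed\n";
```

On this input the result should be the repaired form shown in the question. Because the pattern allows only one text/label pair per chunk, a chunk with several pairs (such as [PP ll#/IN_DET Axryn/NNS_MP] in the concrete example) would probably be split into separate bracketed chunks.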
Answer 2 (score: 0)
This is basically the same program I applied to the previous problem; I just changed the map:
#!/usr/bin/perl
use strict;
use warnings;
my $string= "[VP sysmH/VBD_MS3] [PP ll#/IN_DET Axryn/NNS_MP] ,/PUNC w#hm/CC_PRP_MP3] [NP AEDA'/NN] ,/PUNC [PP b#/IN m\$Arkp/NN_FS] [NP >HyAnA/NN] ./PUNC";
my @items = split(/(\[.+?\])/, $string);
my @new_items = map {
if (/^\[.+\]$/) { # items in []
$_;
}
elsif (/\s/) {
grep m/\w/, split(/\s+/); # use grep to eliminate the split results that are the empty string
}
else { # discard empty strings
();    # explicit empty list, so map adds nothing for this item
}
} @items;
print "--$_--\n" for @new_items;
The output you get is this (the hyphens are only there to show that there is no leading or trailing whitespace):
--[VP sysmH/VBD_MS3]--
--[PP ll#/IN_DET Axryn/NNS_MP]--
--,/PUNC--
--w#hm/CC_PRP_MP3]--
--[NP AEDA'/NN]--
--,/PUNC--
--[PP b#/IN m$Arkp/NN_FS]--
--[NP >HyAnA/NN]--
--./PUNC--
I think this is the result you want. I don't know whether you would be satisfied with a solution that is not "regex-only"...