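I'll put it out there: I'm terrible with regular expressions. I tried to come up with one to solve my problem, but I really don't know much about them.
Imagine the following sentences:
- Hello blah blah. It's around 11 1/2" x 32".
- The dimensions are 8 x 10-3/5!
- Probably somewhere in the region of 22" x 17".
- The roll is quite large: 42 1/2" x 60 yd.
- They are all 5.76 by 8 frames.
- Yeah, maybe it's around 84cm long.
- I think about 13/19".
- No, it's probably 86 cm actually.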
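I'd like to extract the item dimensions from these sentences as cleanly as possible. In a perfect world, the regular expression would output the following:
- 11 1/2" x 32"
- 8 x 10-3/5
- 22" x 17"
- 42 1/2" x 60 yd
- 5.76 by 8
- 84cm
- 13/19"
- 86 cm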
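I imagine a world where the following rules apply:
- Valid units are {cm, mm, yd, yards, ", ', feet}, though I'd prefer a solution that works for an arbitrary set of units rather than one hard-coded for the units above.
- A dimension may consist of a fraction on its own, e.g. 4/5".
- A / separates the numerator from the denominator, and one can assume there is no space between the parts (though if someone takes that into account, great!).
- Two dimensions are joined by one of {x, by}. If a dimension is only one-dimensional, it must have a unit from the set above, i.e. 22 cm is OK, .333 is not, and neither is 4.33 oz.
To show you how useless I am with regular expressions (and to show that I at least tried!), I got this far: [1-9]+[/ ][x1-9]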
\d+(?:\.\d+)?[\s-]*(?:\d+)?(?:\/\d+)?(?:cm|mm|yd|"|'|feet)(?:\s*x\s*|\s*by\s*)?(?:\d+(?:\.\d+)?[\s*-]*(?:\d+(?:\/\d+)?)?(?:cm|mm|yd|"|'|feet)?)?
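The preference for an arbitrary set of units can be sketched as follows, in Python for illustration only. The helper names unit_alternation and single_dimension are made up for this sketch, and the lookbehind that stops a match from starting right after a dot or a word character is an added assumption, not something the rules above spell out. It covers only the one-dimensional rule (a plain or decimal number that must be followed by a unit).

```python
import re

def unit_alternation(units):
    """Return a regex fragment that matches any one of the given units."""
    parts = []
    # Longest units first, so "yards" is tried before "yd".
    for unit in sorted(units, key=len, reverse=True):
        part = re.escape(unit)
        if unit.isalpha():
            # Word boundary keeps "cm" from matching the start of a longer word.
            part += r"\b"
        parts.append(part)
    return "(?:" + "|".join(parts) + ")"

UNITS = {"cm", "mm", "yd", "yards", '"', "'", "feet"}

# One number (optional decimal part) that must be followed by a unit.
# Assumption: a match may not start right after a word character or a dot.
SINGLE = re.compile(r"(?<![\w.])\d+(?:\.\d+)?\s*" + unit_alternation(UNITS))

def single_dimension(text):
    """Return the first one-dimensional measurement in text, or None."""
    match = SINGLE.search(text)
    return match.group(0) if match else None

for sample in ("22 cm", ".333", "4.33 oz"):
    print(repr(sample), "->", single_dimension(sample))
```

Swapping in a different unit set only means passing another collection to unit_alternation; the rest of the pattern stays the same.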
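Update (2)
You folks are very fast and efficient! I'm going to add a few test cases that the regexes below don't cover:
- The last but one test case is 12 yd x.
- The last test case is 99 cm by.
- This sentence doesn't have dimensions in it: 342 / 5553 / 222.
- Three dimensions? 22" x 17" x 12 cm
- This is a product code: c720 with another number 83 x better.
- A number on its own 21.
- A volume shouldn't match 0.332 oz.
These should produce the following results (# means nothing should match):
- 12 yd
- 99 cm
- #
- 22" x 17" x 12 cm
- #
- #
- #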
I've adapted M42's answer below to:
{{1}}
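But although that resolves some of the new test cases, it now fails to match some of the others. It reports:
- 11 1/2" x 32" PASS
- (nothing) FAIL
- 22" x 17" PASS
- 42 1/2" x 60 yd PASS
- (nothing) FAIL
- 84cm PASS
- 13/19" PASS
- 86 cm PASS
- 22" PASS
- (nothing) FAIL
- (nothing) FAIL
- 12 yd x FAIL
- 99 cm by FAIL
- 22" x 17" [and, separately, '12 cm'] FAIL
- PASS
- PASS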
Answer 0 (score: 5)
New version, close to the target: 2 tests fail.
#!/usr/local/bin/perl
use Modern::Perl;
use Test::More;
my $re1 = qr/\d+(?:\.\d+)?[\s-]*(?:\d+)?(?:\/\d+)?(?:cm|mm|yd|"|'|feet)/;
my $re2 = qr/(?:\s*x\s*|\s*by\s*)/;
my $re3 = qr/\d+(?:\.\d+)?[\s-]*(?:\d+)?(?:\/\d+)?(?:cm|mm|yd|"|'|feet|frames)/;
my @out = (
'11 1/2" x 32"',
'8 x 10-3/5',
'22" x 17"',
'42 1/2" x 60 yd',
'5.76 by 8 frames',
'84cm',
'13/19"',
'86 cm',
'12 yd',
'99 cm',
'no match',
'22" x 17" x 12 cm',
'no match',
'no match',
'no match',
);
my $i = 0;
while (<DATA>) {
    chomp;
    # One dimension, optionally followed by up to two more joined by "x" or "by".
    if (/($re1(?:$re2$re3)?(?:$re2$re1)?)/) {
        ok($1 eq $out[$i], $1 . ' in ' . $_);
    } else {
        ok($out[$i] eq 'no match', ' got "no match" in ' . $_);
    }
    $i++;
}
done_testing;
__DATA__
Hello blah blah. It's around 11 1/2" x 32".
The dimensions are 8 x 10-3/5!
Probably somewhere in the region of 22" x 17".
The roll is quite large: 42 1/2" x 60 yd.
They are all 5.76 by 8 frames.
Yeah, maybe it's around 84cm long.
I think about 13/19".
No, it's probably 86 cm actually.
The last but one test case is 12 yd x.
The last test case is 99 cm by.
This sentence doesn't have dimensions in it: 342 / 5553 / 222.
Three dimensions? 22" x 17" x 12 cm
This is a product code: c720 with another number 83 x better.
A number on its own 21.
A volume shouldn't match 0.332 oz.
Output:
# Failed test ' got "no match" in The dimensions are 8 x 10-3/5!'
# at C:\tests\perl\test6.pl line 42.
# Failed test ' got "no match" in They are all 5.76 by 8 frames.'
# at C:\tests\perl\test6.pl line 42.
# Looks like you failed 2 tests of 15.
ok 1 - 11 1/2" x 32" in Hello blah blah. It's around 11 1/2" x 32".
not ok 2 - got "no match" in The dimensions are 8 x 10-3/5!
ok 3 - 22" x 17" in Probably somewhere in the region of 22" x 17".
ok 4 - 42 1/2" x 60 yd in The roll is quite large: 42 1/2" x 60 yd.
not ok 5 - got "no match" in They are all 5.76 by 8 frames.
ok 6 - 84cm in Yeah, maybe it's around 84cm long.
ok 7 - 13/19" in I think about 13/19".
ok 8 - 86 cm in No, it's probably 86 cm actually.
ok 9 - 12 yd in The last but one test case is 12 yd x.
ok 10 - 99 cm in The last test case is 99 cm by.
ok 11 - got "no match" in This sentence doesn't have dimensions in it: 342 / 5553 / 222.
ok 12 - 22" x 17" x 12 cm in Three dimensions? 22" x 17" x 12 cm
ok 13 - got "no match" in This is a product code: c720 with another number 83 x better.
ok 14 - got "no match" in A number on its own 21.
ok 15 - got "no match" in A volume shouldn't match 0.332 oz.
1..15
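It seems difficult to match 5.76 by 8 frames but not 0.332 oz: sometimes you have to match numbers with units, and sometimes numbers without units.
Sorry, I couldn't do any better.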
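One way to reconcile these two cases, sketched in Python rather than Perl: accept a number without a unit only when it is joined to another dimension by x or by, and require a unit when the number stands alone. The names build_dimension_regex and extract are made up for this sketch, and the leading lookbehind is an added assumption so that c720 or the tail of 0.332 cannot start a match. It is not a drop-in replacement for the regexes above.

```python
import re

UNITS = ["cm", "mm", "yd", "yards", '"', "'", "feet"]

def build_dimension_regex(units, separators=("x", "by")):
    """Compile a pattern for 'dim (sep dim)+' or 'number unit'."""
    # Longest units first; alphabetic units get a trailing word boundary.
    unit = "(?:" + "|".join(
        re.escape(u) + (r"\b" if u.isalpha() else "")
        for u in sorted(units, key=len, reverse=True)
    ) + ")"
    # A bare fraction (13/19), or an integer/decimal with an optional
    # fractional part attached by a space or hyphen (11 1/2, 10-3/5).
    number = r"(?:\d+/\d+|\d+(?:\.\d+)?(?:[ -]\d+/\d+)?)"
    dim = number + r"(?:\s*" + unit + ")?"          # unit optional here
    sep = r"\s*(?:" + "|".join(re.escape(s) for s in separators) + r")\s*"
    joined = dim + "(?:" + sep + dim + ")+"         # two or more dimensions
    single = number + r"\s*" + unit                 # one dimension: unit required
    # Assumption: a match may not start after a word character, dot or slash.
    return re.compile(r"(?<![\w./])(?:" + joined + "|" + single + ")")

def extract(sentence, regex):
    """Return the first dimension found in sentence, or None."""
    match = regex.search(sentence)
    return match.group(0) if match else None

regex = build_dimension_regex(UNITS)
print(extract("They are all 5.76 by 8 frames.", regex))      # 5.76 by 8
print(extract("A volume shouldn't match 0.332 oz.", regex))  # None
```

Because frames is not in the unit set, the first sentence yields 5.76 by 8, which is what the question asks for. I have only traced this against the fifteen sentences in the __DATA__ block above, so treat it as a starting point.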
Answer 1 (score: 2)
One of many possible solutions (it should be nlp compatible, since it uses only basic regex syntax):
foundMatch = Regex.IsMatch(SubjectString, @"\d+(?: |cm|\.|""|/)[\d/""x -]*(?:\b(?:by\s*\d+|cm|yd)\b)?");
This will get you your results :)
Explanation:
"
\d # Match a single digit 0..9
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
(?: # Match the regular expression below
# Match either the regular expression below (attempting the next alternative only if this one fails)
\ # Match the character “ ” literally
| # Or match regular expression number 2 below (attempting the next alternative only if this one fails)
cm # Match the characters “cm” literally
| # Or match regular expression number 3 below (attempting the next alternative only if this one fails)
\. # Match the character “.” literally
| # Or match regular expression number 4 below (attempting the next alternative only if this one fails)
"" # Match the character “"” literally (written as "" inside a C# verbatim string)
| # Or match regular expression number 5 below (the entire group fails if this one fails to match)
/ # Match the character “/” literally
)
[\d/""x -] # Match a single character present in the list below
# A single digit 0..9
# One of the characters “/"x” (the "" in the pattern is a single “"”)
# The character “ ”
# The character “-”
* # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
(?: # Match the regular expression below
\b # Assert position at a word boundary
(?: # Match the regular expression below
# Match either the regular expression below (attempting the next alternative only if this one fails)
by # Match the characters “by” literally
\s # Match a single character that is a “whitespace character” (spaces, tabs, line breaks, etc.)
* # Between zero and unlimited times, as many times as possible, giving back as needed (greedy)
\d # Match a single digit 0..9
+ # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
| # Or match regular expression number 2 below (attempting the next alternative only if this one fails)
cm # Match the characters “cm” literally
| # Or match regular expression number 3 below (the entire group fails if this one fails to match)
yd # Match the characters “yd” literally
)
\b # Assert position at a word boundary
)? # Between zero and one times, as many times as possible, giving back as needed (greedy)
"
Answer 2 (score: 2)
This is as far as I could get with a regular expression in Perl. Try to adapt it to your regex flavour:
\d.*\d(?:\s+\S+|\S+)
Explanation:
\d # One digit.
.* # Any number of characters.
\d # One digit. All joined means to find all content between first and last digit.
\s+\S+ # Some whitespace followed by non-space characters. It tries to match any unit like 'cm' or 'yd'.
| # Or. Select one of two expressions between parentheses.
\S+ # Any number of non-space characters. It tries to match double-quotes, or units joined to the
# last number.
My test:
Content of script.pl:
use warnings;
use strict;
while ( <DATA> ) {
    print qq[$1\n] if m/(\d.*\d(\s+\S+|\S+))/;
}
__DATA__
Hello blah blah. It's around 11 1/2" x 32".
The dimensions are 8 x 10-3/5!
Probably somewhere in the region of 22" x 17".
The roll is quite large: 42 1/2" x 60 yd.
They are all 5.76 by 8 frames.
Yeah, maybe it's around 84cm long.
I think about 13/19".
No, it's probably 86 cm actually.
Run the script:
perl script.pl
Result:
11 1/2" x 32".
8 x 10-3/5!
22" x 17".
42 1/2" x 60 yd.
5.76 by 8 frames.
84cm
13/19".
86 cm
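The results above keep the trailing punctuation of each sentence. A small Python sketch of the same pattern with that punctuation stripped afterwards; the helper name extract and the set of characters to strip are assumptions of this sketch, not part of the Perl script.

```python
import re

# Same idea as the Perl pattern: everything from the first digit to the last
# digit, plus whatever non-space token (or space + token) follows it.
PATTERN = re.compile(r"\d.*\d(?:\s+\S+|\S+)")

def extract(sentence, trailing=".,!?;:"):
    """Return the first match with trailing punctuation removed, or None."""
    match = PATTERN.search(sentence)
    return match.group(0).rstrip(trailing) if match else None

for sentence in (
    'Hello blah blah. It\'s around 11 1/2" x 32".',
    "The dimensions are 8 x 10-3/5!",
    "They are all 5.76 by 8 frames.",
    "A number on its own 21.",
):
    print(extract(sentence))
```

The pattern is deliberately loose: it keeps frames in the third sentence and still returns 21 for the last one, so it does not reject the negative test cases from the question's update.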