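I'm trying to scan a string for any combination of words from a list. Specifically, I want to find any "number" combination such as "two hundred eighty" or "fifty eight".
To do this, I've made a list of all the individual number words: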
numberWords = ["one", "two", "three", ...... "hundred", "thousand", "million"]
Then I joined the list with "|" and built a regular expression like this:
string.scan(/\b(#{wordList}(\s|\.|,|\?|\!))+/)
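I expected this to return a list of all the number combinations, but it only returns the words individually. For example, if the string contains "three million", it returns "three" and "million" rather than "three million". How can I fix this?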
Answer 0: (score: 7)
numberWords = ["one", "two", "three", "hundred", "thousand", "million"]
numberWords = Regexp.union(numberWords)
# => /one|two|three|hundred|thousand|million/
"foo bar three million dollars"
.scan(/\b#{numberWords}(?:(?:\s+and\s+|\s+)#{numberWords})*\b/)
# => ["three million"]
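The non-capturing groups `(?:...)` in that pattern matter: when a pattern contains capturing groups, `String#scan` returns the group contents instead of the whole match. Here is a small sketch of the difference. The `with_capture` pattern is a made-up variant for comparison only, not the asker's exact regex.

```ruby
number_words = Regexp.union(%w[one two three hundred thousand million])

# Hypothetical variant with capturing groups: scan returns one array of
# group contents per match, and a repeated group keeps only its last pass.
with_capture = /\b(#{number_words})(\s+#{number_words})*\b/

# Same shape as the pattern in the answer: only non-capturing groups,
# so scan returns each whole match as a string.
without_capture = /\b#{number_words}(?:(?:\s+and\s+|\s+)#{number_words})*\b/

text = "foo bar three million dollars"
captured = text.scan(with_capture)    # => [["three", " million"]]
whole    = text.scan(without_capture) # => ["three million"]

# The optional "and" lets longer phrases stay together.
mixed = "one million, three hundred and one dollars".scan(without_capture)
# => ["one million", "three hundred and one"]
```

The comma still splits the phrase, because only whitespace and "and" are allowed between words in this pattern.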
Answer 1: (score: 2)
Just for fun, here's a more interesting way to generate a pattern that has to match a long list:
#!/usr/bin/env perl
use Regexp::Assemble;
my $ra = Regexp::Assemble->new;
foreach (@ARGV) {
$ra->add($_);
}
print $ra->re, "\n";
Save it as "regexp_assemble.pl", install Perl's Regexp::Assemble module, then run:
perl ./regexp_assemble.pl one two three four five six seven eight nine ten eleven twelve thirteen fourteen fifteen sixteen seventeen eighteen nineteen twenty thirty forty fifty sixty seventy eighty ninety hundred thousand million ' ' '\.' ',' '?' '!'
You should see this generated:
(?^:(?:[ !,.?]|t(?:h(?:irt(?:een|y)|ousand|ree)|w(?:e(?:lve|nty)|o)|en)|f(?:o(?:ur(?:teen)?|rty)|i(?:ft(?:een|y)|ve))|s(?:even(?:t(?:een|y))?|ix(?:t(?:een|y))?)|e(?:ight(?:een|y)?|leven)|nine(?:t(?:een|y))?|hundred|million|one))
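That's Perl's version of the pattern, and it needs a few small tweaks to do what you want: remove the leading ?^: and its surrounding parentheses, add a trailing +, and, for flexibility, make it case-insensitive: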
pattern = /(?:[ !,.?]|t(?:h(?:irt(?:een|y)|ousand|ree)|w(?:e(?:lve|nty)|o)|en)|f(?:o(?:ur(?:teen)?|rty)|i(?:ft(?:een|y)|ve))|s(?:even(?:t(?:een|y))?|ix(?:t(?:een|y))?)|e(?:ight(?:een|y)?|leven)|nine(?:t(?:een|y))?|hundred|million|one)+/i
Here are some scan results:
'one dollar'.scan(pattern) # => ["one "]
'one million dollars'.scan(pattern) # => ["one million "]
'one million three hundred dollars'.scan(pattern) # => ["one million three hundred "]
'one million, three hundred!'.scan(pattern) # => ["one million, three hundred!"]
'one million, three hundred and one dollars'.scan(pattern) # => ["one million, three hundred ", " one "]
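Unfortunately, Ruby doesn't have an equivalent of Perl's Regexp::Assemble module. It's really useful for this sort of task, because the regex engine in Ruby is very fast.
The only downside is that the pattern captures leading and trailing whitespace, but that's easy to fix by calling map(&:strip) on the resulting strings: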
'one million, three hundred and one dollars'.scan(pattern).map(&:strip) # => ["one million, three hundred", "one"]
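If you'd rather stay in Ruby, the built-in `Regexp.union` can build a comparable (though unoptimized) alternation. This is only a sketch under a few assumptions: the variable names are made up, the words are sorted longest-first so that "sixteen" is tried before "six", and matches made up purely of separators are filtered out.

```ruby
number_words = %w[
  one two three four five six seven eight nine ten
  eleven twelve thirteen fourteen fifteen sixteen seventeen eighteen nineteen
  twenty thirty forty fifty sixty seventy eighty ninety
  hundred thousand million
]

# Alternation tries branches left to right, so put longer words first.
longest_first = number_words.sort_by { |word| -word.length }
words = Regexp.union(longest_first)

# Interpolate .source (a plain string) so the outer /i flag also applies
# to the words; interpolating the Regexp itself would carry its own flags.
pattern = /(?:[ !,.?]|#{words.source})+/i

text = 'One million, three hundred and sixteen dollars'
phrases = text.scan(pattern)
              .map(&:strip)
              .select { |match| match.match?(/[a-z]/i) } # drop separator-only hits
# => ["One million, three hundred", "sixteen"]
```

The `select` step is there because a run of spaces or punctuation between ordinary words also satisfies the pattern on its own.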
Answer 2: (score: 0)
I've ported Perl's Regexp::Trie to Ruby:
It's a simpler version of Regexp::Assemble, but it works well enough for me.
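To give a rough idea of the trie approach, here is a minimal sketch. It is not the code of the port mentioned above; `build_trie` and `trie_source` are hypothetical helper names. Words that share a prefix are merged into one branch, and a word that ends partway along a branch makes the rest of that branch optional.

```ruby
# Build a nested-hash trie: each key is a character, :end marks a full word.
def build_trie(words)
  words.each_with_object({}) do |word, root|
    node = root
    word.each_char { |ch| node = (node[ch] ||= {}) }
    node[:end] = true
  end
end

# Turn a trie node into regex source text.
def trie_source(node)
  branches = node.reject { |key, _| key == :end }
  return "" if branches.empty?

  alts = branches.map { |ch, child| Regexp.escape(ch) + trie_source(child) }
  if node[:end]
    "(?:#{alts.join('|')})?" # a word ends here, so the remainder is optional
  elsif alts.length == 1
    alts.first
  else
    "(?:#{alts.join('|')})"
  end
end

source = trie_source(build_trie(%w[one two three six sixteen]))
# => "(?:one|t(?:wo|hree)|six(?:teen)?)"

trie_pattern = /\b#{source}\b/i
found = "sixteen and two".scan(trie_pattern) # => ["sixteen", "two"]
```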