How do I scan for combinations of words in Ruby using a regular expression?

Asked: 2014-03-12 13:59:04

Tags: ruby regex

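I am trying to scan a string for arbitrary combinations of words from a list. Specifically, I want to find any "number" combination, such as "two hundred eighty" or "fifty eight".

To do this, I made a list of all the individual number words: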

numberWords = ["one", "two", "three", ...... "hundred", "thousand", "million"]

Then I joined the list with "|" and built a regular expression like this:

string.scan(/\b(#{wordList}(\s|\.|,|\?|\!))+/)

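I expected this to return a list of all the number combinations, but it only returns the words individually. For example, if the string contains "three million", it returns "three" and "million" rather than "three million". How can I fix this?

3 answers:

Answer 0 (score: 7)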

numberWords = ["one", "two", "three", "hundred", "thousand", "million"]
numberWords = Regexp.union(numberWords)
# => /one|two|three|hundred|thousand|million/

"foo bar three million dollars"
.scan(/\b#{numberWords}(?:(?:\s+and\s+|\s+)#{numberWords})*\b/)
# => ["three million"]
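A minimal sketch of why the original attempt split the matches, assuming the question's wordList was simply the words joined with "|" (the question does not show that step). Interpolating that raw string lets the alternation spill across the whole group, so the separator attaches only to the last word. A Regexp object, by contrast, interpolates as a self-contained (?-mix:...) group. The snake_case names below are illustrative, not from the question.

```ruby
number_words = %w[one two three hundred thousand million]

# Assumption: the question's wordList was built like this.
word_list = number_words.join("|")

# Expands to /\b(one|two|...|million(\s|\.|,|\?|\!))+/, so the separator
# group belongs only to the last alternative, "million".
broken = /\b(#{word_list}(\s|\.|,|\?|\!))+/
broken_result = "three million dollars".scan(broken)
# With capture groups present, scan returns one array of captures per match.
broken_words = broken_result.map(&:first) # => ["three", "million "]

# A Regexp object keeps its own grouping when interpolated.
union = Regexp.union(number_words)
union_source = union.to_s # => "(?-mix:one|two|three|hundred|thousand|million)"

# Non-capturing groups make scan return whole matches as strings.
fixed = /\b(?:#{union}(?:\s|\.|,|\?|!))+/
fixed_result = "three million dollars".scan(fixed) # => ["three million "]
```

The fixed variant still keeps the trailing separator in each match; the pattern in the answer above avoids that by placing the separator between words instead of after each one.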

Answer 1 (score: 2)

Just for fun, here is a more interesting way to generate a pattern that has to match a long list of words:

#!/usr/bin/env perl

use Regexp::Assemble;

my $ra = Regexp::Assemble->new;
foreach (@ARGV) {
    $ra->add($_);
}
print $ra->re, "\n";

Save it as "regexp_assemble.pl", install Perl's Regexp::Assemble module, then run:

perl ./regexp_assemble.pl one two three four five six seven eight nine ten eleven twelve thirteen fourteen fifteen sixteen seventeen eighteen nineteen twenty thirty forty fifty sixty seventy eighty ninety hundred thousand million ' ' '\.' ',' '?' '!'

You should see this generated:

(?^:(?:[ !,.?]|t(?:h(?:irt(?:een|y)|ousand|ree)|w(?:e(?:lve|nty)|o)|en)|f(?:o(?:ur(?:teen)?|rty)|i(?:ft(?:een|y)|ve))|s(?:even(?:t(?:een|y))?|ix(?:t(?:een|y))?)|e(?:ight(?:een|y)?|leven)|nine(?:t(?:een|y))?|hundred|million|one))

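That is Perl's version of the pattern, and it needs a few small tweaks to do what you want: remove the leading ?^: and its surrounding parentheses, add a trailing +, and, for flexibility, make it case-insensitive: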

pattern = /(?:[ !,.?]|t(?:h(?:irt(?:een|y)|ousand|ree)|w(?:e(?:lve|nty)|o)|en)|f(?:o(?:ur(?:teen)?|rty)|i(?:ft(?:een|y)|ve))|s(?:even(?:t(?:een|y))?|ix(?:t(?:een|y))?)|e(?:ight(?:een|y)?|leven)|nine(?:t(?:een|y))?|hundred|million|one)+/i
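The same tweaks can also be applied programmatically instead of by hand. This is only a sketch: perl_source below is a shortened, hand-written stand-in in the same shape as the generated pattern, not real output from the tool, and the variable names are made up for illustration.

```ruby
# Shortened stand-in for the generated Perl pattern (not real tool output).
perl_source = '(?^:(?:[ !,.?]|t(?:wo|en)|one))'

# Drop the leading "(?^:" and its matching final ")".
inner = perl_source.sub(/\A\(\?\^:/, "").sub(/\)\z/, "")

# Append "+" and compile the result case-insensitively.
pattern = Regexp.new("#{inner}+", Regexp::IGNORECASE)

result = "One, two dollars".scan(pattern) # => ["One, two "]
```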

Here are some scan results:

'one dollar'.scan(pattern) # => ["one "]
'one million dollars'.scan(pattern) # => ["one million "]
'one million three hundred dollars'.scan(pattern) # => ["one million three hundred "]
'one million, three hundred!'.scan(pattern) # => ["one million, three hundred!"]
'one million, three hundred and one dollars'.scan(pattern) # => ["one million, three hundred ", " one "]

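Unfortunately, Ruby doesn't have an equivalent of Perl's Regexp::Assemble module. A module like that is very useful for this sort of task, since the regex engine in Ruby is very fast.

The only downside is that the pattern captures leading and trailing whitespace, but that is easily fixed by calling map(&:strip) on the resulting strings: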

'one million, three hundred and one dollars'.scan(pattern).map(&:strip) # => ["one million, three hundred", "one"]

Answer 2 (score: 0)

I have ported Perl's Regexp::Trie to Ruby:

It is a simpler version of Regexp::Assemble, but it works well enough for me.
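The link to the port is not preserved here, so the following is not that library's API. It is a hand-rolled sketch of the general trie idea, under the assumption that the goal is to merge shared prefixes into nested groups; build_trie and trie_to_source are hypothetical helper names.

```ruby
# Hypothetical helpers for illustration; not the API of any published gem.

# Build a nested-hash trie; the :end key marks where a word finishes.
def build_trie(words)
  trie = {}
  words.each do |word|
    node = trie
    word.each_char { |ch| node = (node[ch] ||= {}) }
    node[:end] = true
  end
  trie
end

# Turn a trie node into regex source matching the rest of any word below it.
def trie_to_source(node)
  branches = node.reject { |key, _| key == :end }.map do |ch, child|
    Regexp.escape(ch) + trie_to_source(child)
  end
  return "" if branches.empty?

  group = "(?:#{branches.join('|')})"
  # If a word already ends at this node, the remaining branches are optional.
  node[:end] ? "#{group}?" : group
end

words = %w[six sixty sixteen three thirty thirteen]
source = trie_to_source(build_trie(words))
pattern = /\b#{source}\b/

matches = "sixty three or thirteen".scan(pattern)
# => ["sixty", "three", "thirteen"]
```

The generated source is more heavily nested than what Regexp::Assemble prints, because this sketch wraps every branch in its own group rather than collapsing single-character chains.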