Regex fails with scan

Time: 2019-02-15 03:18:19

Tags: regex ruby

I am trying to parse all the money amounts in a string. For example, I want to extract:

['$250,000', '$3.90', '$250,000', '$500,000']

from:

'Up to $250,000………………………………… $3.90 Over $250,000 to $500,000'

The regex:

\$\ ?(\d+\,)*\d+(\.\d*)?

seems to match all the money expressions in this link. However, when I try to use scan in Ruby, it does not give me the result I want.

s # => "Up to $250,000 $3.90 Over $250,000 to $500,000, add$3.70 Over $500,000 to $1,000,000, add..$3.40 Over $1,000,000 to $2,000,000, add...........$2.25\nOver $2,000,000 add ..$2.00"
r # => /\$\ ?(\d+\,)*\d+\.?\d*/
s.scan(r)
# => [["250,"], [nil], ["250,"], ["500,"], [nil], ["500,"], ["000,"], [nil], ["000,"], ["000,"], [nil], ["000,"], [nil]]

From the String#scan documentation, this appears to be caused by the group. How can I parse all the money amounts in the string?

2 answers:

Answer 0 (score: 2)

Let's look at your regex, which I will write in free-spacing mode so that I can document it:

r = /
    \$     # match a dollar sign
    \ ?    # optionally match a space (has no effect) 
    (      # begin capture group 1
      \d+  # match one or more digits
      ,    # match a comma (need not be escaped)
    )*     # end capture group 1 and execute it >= 0 times
    \d+    # match one or more digits
    \.?    # optionally match a period
    \d*    # match zero or more digits
    /x     # free-spacing regex definition mode

In non-free-spacing mode this would be written as follows.

r = /\$ ?(\d+,)*\d+\.?\d*/

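When a regex is defined in free-spacing mode, unescaped whitespace is removed before the regex is evaluated, which is why I had to escape the space. Escaping is not necessary when the regex is not defined in free-spacing mode.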
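To make that concrete, here is a small sketch comparing an unescaped and an escaped space under /x. The sample string "$ 42" is made up for illustration and does not come from the question.

```ruby
# Under /x the unescaped space is dropped, so the "?" attaches to "\$"
# and the pattern behaves like /\$?\d+/.
unescaped = /\$ ?\d+/x
# Under /x the escaped space survives and is matched literally.
escaped   = /\$\ ?\d+/x
# Without /x a plain space is already literal.
plain     = /\$ ?\d+/

sample = "$ 42" # hypothetical input with a space after the dollar sign

unescaped_match = sample[unescaped] # "42"
escaped_match   = sample[escaped]   # "$ 42"
plain_match     = sample[plain]     # "$ 42"

puts unescaped_match, escaped_match, plain_match
```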

There is no need to match a space after the dollar sign, so \ ? should be removed. Suppose we now have

r = /\$\d+\.?\d*/
"$2.31 cat $44. dog $33.607".scan r
  #=> ["$2.31", "$44.", "$33.607"]

That works, but it is questionable whether you want to match values that do not have exactly two digits after the decimal point.

Now write

r = /\$(\d+,)*\d+\.?\d*/
"$2.31 cat $44. dog $33.607".scan r
  #=> [[nil], [nil], [nil]]

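To see why we get this result, check the documentation for String#scan, in particular the last sentence of the first paragraph: "If the pattern contains groups, each individual result is itself an array containing one entry per group."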
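The following sketch illustrates where values such as ["250,"] and [nil] come from. The input strings are made up for illustration; the point is that a repeated capture group keeps only its last iteration, and a group that never ran yields nil.

```ruby
r = /\$(\d+,)*\d+\.?\d*/

# Group 1 ran twice ("1," then "234,"); scan reports only the last run.
with_commas = "$1,234,567".scan(r) # [["234,"]]

# Group 1 ran zero times, so its entry is nil.
no_commas = "$2.31".scan(r) # [[nil]]

# With no groups in the pattern, scan returns the full matches as strings.
no_groups = "$2.31 $7".scan(/\$\d+\.?\d*/) # ["$2.31", "$7"]

p with_commas, no_commas, no_groups
```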

We can avoid that problem by changing the capture group to a non-capture group:

r = /\$(?:\d+,)*\d+\.?\d*/
"$2.31 cat $44. dog $33.607".scan r
  #=> ["$2.31", "$44.", "$33.607"] 
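As a check against the question's own data, the sketch below applies the non-capturing pattern to the string s from the question. The string is split across several literals only for readability.

```ruby
# The string s from the question, reassembled from adjacent literals.
s = "Up to $250,000 $3.90 Over $250,000 to $500,000, add$3.70 " \
    "Over $500,000 to $1,000,000, add..$3.40 " \
    "Over $1,000,000 to $2,000,000, add...........$2.25\n" \
    "Over $2,000,000 add ..$2.00"

# Non-capturing group, so scan returns whole matches instead of group arrays.
amounts = s.scan(/\$(?:\d+,)*\d+\.?\d*/)

p amounts
#=> ["$250,000", "$3.90", "$250,000", "$500,000", "$3.70", "$500,000",
#    "$1,000,000", "$3.40", "$1,000,000", "$2,000,000", "$2.25",
#    "$2,000,000", "$2.00"]
```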

Now consider this:

"$2,241.31 cat $1,2345. dog $33.607".scan r
  #=> ["$2,241.31", "$1,2345.", "$33.607"]

That is still not quite right. Try the following.

r = /
    \$          # match a dollar sign
    \d{1,3}     # match one to three digits
    (?:,\d{3})  # match ',' then 3 digits in a nc group
    *           # execute the above nc group >=0 times
    (?:\.\d{2}) # match '.' then 2 digits in a nc group
    ?           # optionally match the above nc group
    (?![\d,.])  # no following digit, ',' or '.'
    /x          # free-spacing regex definition mode

"$2,241.31 $2 $1,234 $3,6152 $33.607 $146.27".scan r
  #=> ["$2,241.31", "$2", "$1,234", "$146.27"]

(?![\d,.]) is a negative lookahead: the match must not be followed by a digit, a comma or a period.

In normal mode this regex is written as follows.

r = /\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?(?![\d,.])/

Without the negative lookahead at the end of the regex, we would obtain the following incorrect result.

r = /\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?/
"$2,241.31 $2 $1,234 $3,6152 $33.607 $146.27".scan r
  #=> ["$2,241.31", "$2", "$1,234", "$3,615", "$33.60",
  #    "$146.27"]
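One caveat worth checking against the question's data: there, some amounts are followed directly by a comma ("$500,000, add"), and the lookahead (?![\d,.]) rejects those. The sketch below compares the regex above with a hypothetical variant (RELAXED is my own name and pattern, not part of the answer) that rejects a trailing comma or period only when a digit follows it. Whether that trade-off is acceptable depends on the input.

```ruby
# The regex from the answer: no digit, ',' or '.' may follow the amount.
STRICT = /\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?(?![\d,.])/

# Hypothetical variant: ',' or '.' is rejected only when a digit follows it.
RELAXED = /\$\d{1,3}(?:,\d{3})*(?:\.\d{2})?(?!\d|[,.]\d)/

text = "Over $250,000 to $500,000, add$3.70" # fragment of the question's string

strict_hits  = text.scan(STRICT)  # ["$250,000", "$3.70"]
relaxed_hits = text.scan(RELAXED) # ["$250,000", "$500,000", "$3.70"]

# On the answer's own test string, the variant gives the same result as STRICT.
sample = "$2,241.31 $2 $1,234 $3,6152 $33.607 $146.27"
sample_hits = sample.scan(RELAXED) # ["$2,241.31", "$2", "$1,234", "$146.27"]

p strict_hits, relaxed_hits, sample_hits
```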

Answer 1 (score: 1)

[3] pry(main)> str = <<EOF
[3] pry(main)* Up to $250,000………………………………… $3.90 Over $250,000 to $500,000, add………………$3.70 Over $500,000 to $1,000,000, add……………..$3.40 Over $1,000,000 to $2,000,000, add……...........$2.25
[3] pry(main)* Over $2,000,000 add …..………………………$2.00
[3] pry(main)* EOF
=> "Up to $250,000………………………………… $3.90 Over $250,000 to $500,000, add………………$3.70 Over $500,000 to $1,000,000, add……………..$3.40 Over $1,000,000 to $2,000,000, add……...........$2.25\nOver $2,000,000 add …..………………………$2.00\n"
[4] pry(main)> str.scan /\$\d+(?:[,.]\d+)*/
=> ["$250,000", "$3.90", "$250,000", "$500,000", "$3.70", "$500,000", "$1,000,000", "$3.40", "$1,000,000", "$2,000,000", "$2.25", "$2,000,000", "$2.00"]
[5] pry(main)>
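Note that this pattern is more permissive than the final regex in Answer 0: it accepts any run of digit groups joined by commas or periods. A minimal sketch with made-up input shows what that means in practice; it works for the question's data, but it would also accept malformed amounts.

```ruby
loose = /\$\d+(?:[,.]\d+)*/

# Hypothetical input containing malformed amounts.
hits = "$1,2345. dog $33.607 cat $1.2.3".scan(loose)

p hits #=> ["$1,2345", "$33.607", "$1.2.3"]
```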