Regular expression parsing with Nokogiri

Time: 2010-07-17 18:07:56

Tags: ruby regex nokogiri

Using Nokogiri, I need to parse the following block:

<div class="some_class">
  12 AB / 4+ CD
  <br/>
  2,600 Dollars
  <br/> 
</div>

I need to get the ab, cd, and dollars values (if they exist).

ab = p.css(".some_class").text[....some regex....]
cd = p.css(".some_class").text[....some regex....]
dollars = p.css(".some_class").text[....some regex....]

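Is this the right approach? If so, can someone help me with the regular expressions to extract the ab, cd, and dollars values?

2 answers:

Answer 0: (score: 6)

For a better answer you would have to state the exact format of the AB, CD, and Dollars values, but here is a solution based on the given example. It uses regular expression grouping with () to capture the information we are interested in. (See the bottom of this answer for details.)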

text = p.css(".some_class").text

# one or more digits followed by a space followed by AB, capture the digits
ab = text.match(/(\d+) AB/).captures[0] # => "12"

# one or more digits followed by a literal +, a space, and CD; capture the digits and the +
cd = text.match(/(\d+\+) CD/).captures[0] # => "4+"

# digits or commas followed by "Dollars"
dollars = text.match(/([\d,]+) Dollars/).captures[0] # => "2,600"

Note that String#match returns nil when there is no match, so if a value might be absent you need to check for that, e.g.:

if match = text.match(/([\d,]+) Dollars/)
  dollars = match.captures[0]
end
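
Since the question was written in the form text[...regex...], here is a minimal sketch of that style. String#[] accepts a regexp plus a capture index and returns nil when nothing matches, so no separate nil check is needed. The text literal below is an assumption standing in for p.css(".some_class").text, and the "EF" label is hypothetical, used only to show the no-match case.

```ruby
# Assumed sample text, standing in for p.css(".some_class").text so the
# snippet runs without Nokogiri.
text = "\n  12 AB / 4+ CD\n  \n  2,600 Dollars\n  \n"

# text[regexp, 1] returns the first capture group, or nil if there is no match.
ab      = text[/(\d+) AB/, 1]
cd      = text[/(\d+\+) CD/, 1]
dollars = text[/([\d,]+) Dollars/, 1]
missing = text[/(\d+) EF/, 1]   # hypothetical label that is not in the text

# Strip the thousands separator before converting, guarding against nil.
dollars_i = dollars && dollars.delete(",").to_i
```

The captured values are still strings; whether to convert them (and what to do with the trailing + on the CD value) depends on the exact format of the data.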

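Additional notes on captures

To match the number of ABs we need a pattern, /\d+ AB/, that identifies the right part of the text. However, we are only interested in the numeric part, so we wrap it in parentheses so that we can extract it, e.g.: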

irb(main):027:0> match = text.match(/(\d+) AB/)
=> #<MatchData:0x2ca3440>           # the match method returns MatchData if there is a match, nil if not
irb(main):028:0> match.to_s         # match.to_s gives us the entire text that matched the pattern
=> "12 AB"
irb(main):029:0> match.captures     
=> ["12"]
# match.captures gives us an array of the parts of the pattern that were enclosed in ()
# in our example there is just 1 but there could be multiple
irb(main):030:0> match.captures[0]
=> "12"                             # the first capture - the bit we want

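See the documentation for MatchData, in particular the captures method, for more details.

Answer 1: (score: 0)

This is an older post, but I stumbled across it. Here is how I would locate the values, along with one way to store them: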
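
As noted above, a pattern can contain more than one capture. A possible sketch, assuming the three values always appear in the order shown in the example, is a single pattern with named groups. The group names ab, cd, and dollars are chosen here for illustration, the text literal again stands in for p.css(".some_class").text, and MatchData#named_captures requires Ruby 2.4 or later.

```ruby
# Assumed sample text, standing in for p.css(".some_class").text.
text = "\n  12 AB / 4+ CD\n  \n  2,600 Dollars\n  \n"

# %r{...} avoids having to escape the literal "/" between AB and CD.
# \s+ spans the whitespace and newlines left behind by the <br/> element.
pattern = %r{(?<ab>\d+) AB / (?<cd>\d+\+) CD\s+(?<dollars>[\d,]+) Dollars}

match = text.match(pattern)

# named_captures returns a Hash of group name => captured string.
values = match ? match.named_captures : {}

ab      = values["ab"]
cd      = values["cd"]
dollars = values["dollars"]
```

The trade-off is that one combined pattern fails as a whole if any value is missing, whereas the separate patterns let each value be absent independently.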

require "nokogiri"

xml = <<EOT
<div class="some_class">
  12 AB / 4+ CD
  <br/>
  2,600 Dollars
  <br/> 
</div>
EOT

doc = Nokogiri::XML(xml)

some_class = doc.at('.some_class').text

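values = some_class
  .scan(/([\d,]+)\+? ([a-z]+)/i)                                # digits/commas, an optional "+", then the label
  .each_with_object({}){ |(v,c), h| h[c] = v.delete(",").to_i } # drop thousands separators before to_i

values # => {"AB"=>12, "CD"=>4, "Dollars"=>2600}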