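Extracting text between tags with Nokogiri?

Asked: 2013-01-07 22:19:40

Tags: html ruby parsing nokogiri

I'm trying to use Nokogiri to extract phone numbers and addresses from a site. Both sit between <br> tags. How can I do this?

In case the site is down, here is an excerpt of the HTML from which I want to extract the phone number and address: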

<table width="900" style=" margin:8px; padding:5px; font-family:Verdana, Geneva, sans-serif; font-size:12px; line-height:165%; color:#333333; border-bottom:1px solid #cccccc; "><tbody><tr valign="top"><td>
<strong>Alana's Cafe</strong><br>
<em>Cafe/Desserts </em>
<br>
650 348-0417
<br>
1408 Burlingame Ave
<br>
<a href="http://www.alanascafe.com/burlingame.html" target="_blank">http://www.alanascafe.com/burlingame.html</a>

</td><td align="right">
<a href="index.cfm?vid=44885" style="text-decoration:none; color:black">
<img src="iconmap.png" height="30" border="0"><br>
Map</a></td></tr></tbody></table>

<table width="900" style=" margin:8px; padding:5px; font-family:Verdana, Geneva, sans-serif; font-size:12px; line-height:165%; color:#333333; border-bottom:1px solid #cccccc; "><tbody><tr valign="top"><td>
<strong>Amber Moon Indian Restaurant and Bar</strong><br>
<em>Indian </em>

<br>
1425 Burlingame Ave


</td><td align="right">
<a href="index.cfm?vid=44872" style="text-decoration:none; color:black">
<img src="iconmap.png" height="30" border="0"><br>
Map</a></td></tr></tbody></table>

2 answers:

Answer 0 (score: 2)

The simplest approach is:

data = doc.search('em').map{|em| em.search('~ br').map{|br| br.next.text.strip}}
#=> [["650 348-0417", "1408 Burlingame Ave", "http://www.alanascafe.com/burlingame.html"], etc...

This means: for each em, map the text that follows each of its sibling br elements.

Update

To sort these into phone/address, you could do:

data.map{|row| {:phone => row[0][/^[\d \(\)-]+$/] ? row.shift : nil, :address => row.shift}}
#=> [{:phone=>"650 348-0417", :address=>"1408 Burlingame Ave"}, etc...
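
To see how that classification treats entries with and without a phone number, here is a small self-contained sketch. The `rows` array is hand-written sample data modeled on the two entries in the question's excerpt (an assumption, not scraped output), and `classify` is a hypothetical helper name. It applies the same character-class test but leaves the input rows unmodified instead of calling `shift` on them.

```ruby
# Hand-written sample rows modeled on the excerpt in the question (assumed data).
rows = [
  ["650 348-0417", "1408 Burlingame Ave", "http://www.alanascafe.com/burlingame.html"],
  ["1425 Burlingame Ave"]
]

# A line is treated as a phone number if it contains only digits, spaces,
# parentheses, and hyphens.
phone_pattern = /\A[\d ()-]+\z/

# Hypothetical helper: builds the hash without mutating the row.
classify = lambda do |row|
  if row.first && row.first.match?(phone_pattern)
    { phone: row[0], address: row[1] }
  else
    { phone: nil, address: row[0] }
  end
end

result = rows.map(&classify)
p result
```

The second row has no phone line, so its first element is taken as the address and `:phone` is `nil`.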

Answer 1 (score: 1)

Code

require 'nokogiri'
require 'open-uri'

# On Ruby 3.0+ open-uri no longer patches Kernel#open, so call URI.open explicitly
doc = Nokogiri.HTML(URI.open('http://map.burlingamedowntown.org/textdir.cfm?p=1213'))
addresses = doc.xpath('//td[strong][em]/br[3]/following-sibling::text()[1]')
p addresses.map(&:text).map(&:strip)
#=> ["1408 Burlingame Ave", "347 Primrose Rd", "305 California Dr", "1409 Burlingame Avenue", "260 Lorton Ave", "1219 Burlingame Avenue", "1108 Burlingame Avenue", "1212 Donnelly Ave", "1243 Howard Ave", "283 Lorton Avenue", "245 California Drive", "1107 Howard Ave", "1300 Howard Ave", "1216 Burlingame Avenue", "1310 Burlingame Ave", "322 Lorton Avenue", "203 Primrose Dr", "1125 Burlingame Avenue", "327 Lorton Avenue", "1451 Burlingame Ave", "221 Primrose Rd", "1101 Burlingame Ave", "", "1123 Burlingame Avenue", "1407 Burlingame Ave", "1318 Burlingame Avenue", "1213 Burlingame Avenue", "231 Park Road", "246 Lorton Ave", "1453 Burlingame Ave", "1309 Burlingame Avenue", "321 Primrose Road", "", "209 Park Road", "1207 Burlingame Avenue", "1090 Burlingame Avenue", "1223 Donnelly Ave", "243 California Dr", "1080 Howard Ave", "270 Lorton Ave", "1447 Burlingame Ave", "361 California Drive", "1160 Burlingame Avenue", "333 California Drive", "401 Primrose Road", "1100 Burlingame Avenue", "1100Howard Ave #D", "1309 Burlingame Avenue", "220 Lorton Ave", "", "1101 Howard Avenue", "266 Lorton Avenue", "240 Park Rd", "1118 Burlingame Ave", "221 Park Road", "1400 Howard Ave", "225 Primrose Road", "248 Lorton Avenue"]

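How it works

Because the HTML has no semantic markup, the first challenge is to find only the entries that carry addresses. Looking at the source, we know they live inside a <td> on the page, so we start there:

  • //td - find a <td> anywhere in the document...

However, this page is full of bad markup, so we need to narrow the search to the right table cells. In this case, the <strong> and <em> tags are used consistently in every entry and do not appear in any of the unwanted cells:

  • //td[strong][em] - ...but make sure the <td> has at least one <strong> and at least one <em> child element...

Now, we want the text that appears after the third <br> element, so first we select only the third <br> child of each matching <td>:

  • //td[strong][em]/br[3] - ...then find the child <br> elements, picking only the third...

Then we take the first text node that follows that <br>:

  • //td[strong][em]/br[3]/following-sibling::text()[1] - ...find all the following sibling text nodes of the <br>, then pick only the first.

This leaves us with a set of Nokogiri::XML::Text instances, so we map that set to the string text of each instance, and finally map the resulting array to one with any leading and trailing whitespace stripped. This is not the fastest approach, but it is concise and clear, and it is fast enough.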

Doing something similar for the phone numbers is left as an exercise for the reader.


Edit: Here is a slightly more robust variation that also copes with entries that have no phone number:

# Make all the `<br>` be real "\r\n".
doc.xpath('//td[strong][em]/br').each{ |br| br.replace("\r\n") }

# Get the text inside each entry
entries = doc.xpath('//td[strong][em]').map(&:text)

# Change the multi-line string into an array of lines
entries = entries.map{ |entry| entry.strip.split(/(?:\r\n)+/).map(&:strip) }

# Find the first line in each that has no letters in it
phones = entries.map{ |entry_lines| entry_lines.grep(/^[^a-z]+$/i).first }

# Find the first line in each that has a string of digits followed by a letter
addresses = entries.map{ |entry_lines| entry_lines.grep(/\d+ [a-z]/i).first }

# Zip and iterate them together
phones.zip(addresses).each do |phone,address|
  puts "For %s call %s" % [address,phone || "-"]
end

#=> For 1408 Burlingame Ave call 650 348-0417
#=> For 1425 Burlingame Ave call -
#=> For 347 Primrose Rd call 650-548-0300
#=> For 305 California Dr call 650 340-8642
#=> For 1409 Burlingame Avenue call 650 348-1204
#=> ...
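
The line-splitting and `grep` heuristics in that variation can also be checked in isolation, without Nokogiri or network access. The sketch below assumes the entry text already has every <br> replaced by "\r\n"; the two `entries` strings are hand-written approximations of the entries in the question, not real scraped text.

```ruby
# Hand-written approximations of two entries after <br> became "\r\n" (assumed data).
entries = [
  "Alana's Cafe\r\nCafe/Desserts \r\n650 348-0417\r\n1408 Burlingame Ave\r\nhttp://www.alanascafe.com/burlingame.html",
  "Amber Moon Indian Restaurant and Bar\r\nIndian \r\n\r\n1425 Burlingame Ave"
]

# Split each entry into trimmed lines, collapsing runs of "\r\n".
lines = entries.map { |entry| entry.strip.split(/(?:\r\n)+/).map(&:strip) }

# Phone: the first line that contains no letters at all.
phones = lines.map { |entry_lines| entry_lines.grep(/^[^a-z]+$/i).first }

# Address: the first line with digits, a space, then a letter.
addresses = lines.map { |entry_lines| entry_lines.grep(/\d+ [a-z]/i).first }

phones.zip(addresses).each do |phone, address|
  puts "For %s call %s" % [address, phone || "-"]
end
```

The second entry has no letter-free line, so its phone comes back as `nil` and is printed as "-".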