I have a document obtained with the command page.css("table.vc_result span a"), and I am unable to get the second and third span elements of the document:
Document
<table border="0" bgcolor="#FFFFFF" onmouseout="resDef(this)" onmouseover="resEmp(this)" class="vc_result">
<tbody>
<tr>
<td width="260" valign="top">
<table>
<tbody>
<tr>
<td width="40%" valign="top"><span><a class="cAddName" href="/USA/Illinois/Chicago/Yellow+Page+Advertising+And+Telephone+Directory+Publica/gateway-megatech_13478733">
Gateway Megatech</a></span><br>
<span class="cAddText">P.O. BOX 99682, Chicago IL 60696</span></td>
</tr>
<tr>
<td><span class="cAddText">Cook County Illinois</span></td>
</tr>
<tr>
<td><span class="cAddCategory">Yellow Page Advertising And Telephone
Directory Publica Chicago</span></td>
</tr>
</tbody>
</table>
</td>
<td width="260">
<table align="center">
<tbody>
<tr>
<td>
<table>
<tbody>
<tr>
<td>
<div style=
"background: url('images/listings.png');background-position: -0px -0px; width: 16px; height: 16px">
</div>
</td>
<td><font style="font-weight:bold">847-506-7800</font></td>
</tr>
</tbody>
</table>
</td>
</tr>
<tr>
<td>
<table>
<tbody>
<tr>
<td>
<div style=
"background: url('images/listings.png');background-position: -0px -78px; width: 16px; height: 16px">
</div>
</td>
<td><a href=
"/USA/Illinois/Chicago/Yellow+Page+Advertising+And+Telephone+Directory+Publica/gateway-megatech_13478733"
class="cAddNearby">Businesses near 60696</a></td>
</tr>
</tbody>
</table>
</td>
</tr>
<tr>
<td>
<table>
<tbody>
<tr>
<td></td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
</td>
</tr>
</tbody>
</table>
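...this is not the complete document; there are more span entries in it.
The code I am using finds the exact text, but I cannot get from the nested span a element to the text of the spans that follow it.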
require 'rubygems'
require 'nokogiri'
require 'open-uri'
name="yellow"
city="Chicago"
state="IL"
burl="http://www.sitename.com/"
url="#{burl}Business_Listings.php?name=#{name}&city=#{city}&state=#{state}&current=1&Submit=Search"
# Kernel#open no longer accepts URLs on Ruby 3.0+; open-uri provides URI.open
page = Nokogiri::HTML(URI.open(url))
rows = page.css("table.vc_result span a")
rows.each do |arow|
if arow.text == "Gateway Megatech"
puts(arow.next_element.text)
puts("Capturing the next span text")
found="Got it"
break
else
puts("Found nothing")
found="None"
end
end
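Answer 0 (score: 2)
Assuming that each business is a new <tr> in the top-level table you have provided, the following code gives you an array of hashes with the values: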
require 'nokogiri'
# html is assumed to hold the markup shown in the question
doc = Nokogiri.HTML(html)
business_rows = doc.css('table.vc_result > tbody > tr')
details = business_rows.map do |tr|
# Inside the first <td> of the row, find a <td> with a.cAddName in it
# Note the leading dot: a bare //a would search the whole document, not this <td>
business = tr.at_xpath('td[1]//td[.//a[@class="cAddName"]]')
name = business.at_css('a.cAddName').text.strip
address = business.at_css('.cAddText').text.strip
# Inside the second <td> of the row, find the first <font> tag
phone = tr.at_xpath('td[2]//font').text.strip
# Return a hash of values for this row, using the capitalization requested
{ Name:name, Address:address, Phone:phone }
end
p details
#=> [
#=> {
#=> :Name=>"Gateway Megatech",
#=> :Address=>"P.O. BOX 99682, Chicago IL 60696",
#=> :Phone=>"847-506-7800"
#=> }
#=> ]
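This is quite fragile, but it works for what you have provided, and there do not appear to be many semantic hooks to hang on to in this insane, horrific abuse of HTML.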
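Answer 1 (score: 0)
Parsing HTML with regular expressions is a bad idea, because HTML is not a regular language. Ideally, you want to parse the DOM/XML into a tree structure.
http://nokogiri.org/ is very popular.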