我想解析一个HTML文件,提取我的研究中使用的相关数据。这是HTML的一部分:
<td class="color_line1" valign="center"><a class="linkpadrao" href="javascript:Direciona('5453*SERRA@TALHADA');">Serra Talhada</a></td>
<td class="color_line" valign="center" align="center">9</td>
<td class="color_line" valign="center" align="center">2,973</td>
<td class="color_line" valign="center" align="center">0,016</td>
<td class="color_line" valign="center" align="center">2,939</td>
<td class="color_line" valign="center" align="center">3,000</td>
<td class="color_line" valign="center" align="center">0,572</td>
<td class="color_line" valign="center" align="center">2,401</td>
<td class="color_line" valign="center" align="center">0,024</td>
<td class="color_line" valign="center" align="center">2,378</td>
<td class="color_line" valign="center" align="center">2,426</td>
</tr>
更具体一点,我想获得“Serra Talhada”(作为城市名称),以及城市名称下面的所有数字(这是天然气的最高,最低和平均价格)。
到目前为止我试过这个:
require 'rubygems'
require 'mechanize'
require 'nokogiri'
require 'open-uri'
url = "http://www.anp.gov.br/preco/prc/Resumo_Por_Estado_Municipio.asp"
agent = Mechanize.new
parameters = {'selSemana' => '737*De+28%2F07%2F2013+a+03%2F08%2F2013',
'desc_semana' => 'de+28%2F07%2F2013+a+03%2F08%2F2013',
'cod_Semana' => '737',
'tipo' => '1',
'Cod_Combustivel' => 'undefined',
'selEstado' => 'PE*PERNAMBUCO',
'selCombustivel' => '487*Gasolina',
}
municipio = []
page = agent.post(url, parameters)
extrair = page.parser
extrair.css('.linkpadrao').each do |posto|
# Municipios
municipio << posto.text
end
我无法弄清楚如何获取数字,因为它们具有相同的HTML结构。
有什么想法吗?!
答案 0 :(得分:2)
由于您需要找到与城市链接相关的单元格,您应该找到它们共同的祖先 - 在这种情况下是他们的tr。
使用xpath,您可以按文字找到特定的单元格:
# This is the table that contains all of the city data
data_table = extrair.at_css('.table_padrao')
# This is the specific row that contains the specified city
row = data_table.xpath('//tr[td/a[@class="linkpadrao" and text()="Serra Talhada"]]')
# This is the data in the specific row
data = row.css(".color_line").map{|e| e.text }
#=> ["9", "2,973", "0,016", "2,939", "3,000", "0,572", "2,401", "0,024", "2,378", "2,426"]
答案 1 :(得分:1)
您可以通过以下方式获取每个帖子后面的数字:
posto.parent.search('~ td').map &:text