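Get HTML elements by attribute

Date: 2013-09-07 20:24:46

Tags: ruby css-selectors nokogiri

I'm trying to get the table that contains the MMEL code content from this site, and I'm trying to do it with CSS selectors.

What I have so far: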

require_relative 'sources/Downloader'
require 'nokogiri'

html_content = Downloader.download_page('http://www.s-techent.com/ATA100.htm')
parsed_html = Nokogiri::HTML(html_content)

tmp = parsed_html.css("tr[*]")

puts tmp.text

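I get an error when I try to select this tr by attribute. How can I get this table in a simple form? I want to parse it into JSON. I'd be happy to break it into pieces and process each one in an .each block.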


Edit: I would be thrilled if I could get something like this (see the page source):
<TR><TD WIDTH="10%" VALIGN="TOP" ROWSPAN=5>
<B><FONT FACE="Arial" SIZE=2><P ALIGN="CENTER">11</B></FONT></TD>
<TD WIDTH="40%" VALIGN="TOP"  COLSPAN=2>
<B><FONT FACE="Arial" SIZE=2><P>PLACARDS AND MARKINGS</B></FONT></TD>
<TD WIDTH="50%" VALIGN="TOP">
<FONT FACE="Arial" SIZE=2><P ALIGN="LEFT">All procurable placards, labels, etc., shall be included in the illustrated Parts Catalog.  They shall be illustrated, showing the part number, Legend and Location.  The Maintenance Manual shall provide the approximate Location (i.e., FWD -UPPER -RH) and illustrate each placard, label, marking, self -illuminating sign, etc., required for safety information, maintenance significant information or by government regulations.  Those required by government regulations shall be so identified.</FONT></TD>
</TR>

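2 answers:

Answer 0: (score: 1)

This should print all the TRs from the page source starting at line 96. The page contains three tables, and table[1] holds all the text you need: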

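require 'nokogiri'
require 'open-uri' # needed to read a URL like a file

# URI.open is the explicit form; a bare open(url) no longer fetches URLs on Ruby 3.0+
doc = Nokogiri::HTML(URI.open('http://www.s-techent.com/ATA100.htm'))
# css("table")[1] is the second table on the page (Ruby arrays index from 0)
doc.css("table")[1].css("tr").each do |tr|
  puts tr      #=> prints the row's HTML, including the TR tags themselves
  puts tr.text #=> prints only the text inside the row
end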

For example:

puts doc.css("table")[1].css("tr")[2]

prints the following:

<tr>
<td valign="TOP" colspan="3">
<b><font face="Arial" size="2"><p align="CENTER">GROUP DEFINITION - AIRCRAFT</p></font></b>
</td>
<td valign="TOP">
<font face="Arial" size="2"><p align="LEFT">The complete operational unit.  Includes dimensions and areas, lifting and shoring,    leveling and weighing, towing and taxiing, parking and mooring, required placards, servicing.</p></font>
</td>
</tr>

Answer 1: (score: 1)

You can also do the same thing with XPath:

Here is the content of the first table on the page the OP posted:

require 'nokogiri'
require 'open-uri'

doc = Nokogiri.HTML(URI.open('http://www.s-techent.com/ATA100.htm')) # URI.open: a bare open(url) no longer fetches URLs on Ruby 3.0+
doc.xpath('(//table)[1]/tr').each do |tr|
  puts tr.to_html(:encoding => 'utf-8')
end

Output:

  <tr>
  <td width="33%" valign="MIDDLE" colspan="2">
  <p><img src="S-Tech-Logo-Blue2.gif" width="274" height="127"></p>
  </td>
  <td width="67%" valign="MIDDLE">
  <b><i><font face="Arial" color="#0000ff">
  <p align="CENTER"><big>AIRCRAFT PARTS MANUFACTURING ASSISTANCE (PMA)</big><br><big>DAR SERVICES</big></p></font></i></b>
  </td>
  </tr>

Now, if you want to collect the rows of the last table, do:

require 'nokogiri'
require 'open-uri'

doc = Nokogiri.HTML(URI.open('http://www.s-techent.com/ATA100.htm')) # URI.open: a bare open(url) no longer fetches URLs on Ruby 3.0+
p doc.xpath('(//table)[3]/tr').to_a.size # => 1
doc.xpath('(//table)[3]/tr').each do |tr|
  puts tr.to_html(:encoding => 'utf-8')
end

Output:

<tr>
<td width="40%" valign="TOP" height="10">
<p align="CENTER"><b><font face="Arial" size="2" color="#0000ff">149 AZALEA CIRCLE • LIMERICK, PA 19468-1330</font></b></p>
</td>
<td width="30%" valign="TOP" height="10">
<p align="CENTER"><b><font face="Arial" size="2" color="#0000ff">610-495-6898 (Office) • 484-680-0507 (Cell)</font></b></p>
</td>
<td width="110%" valign="TOP" height="10">
<p align="CENTER"><a href="Contact.htm"><b><font face="Arial" size="2">E-mail S-Tech</font></b></a></p>
</td>
</tr>