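How can I use Ruby's scan method to parse an HTML table?

Time: 2011-02-19 21:22:35

Tags: html ruby regex

I'm trying to take an HTML table and build an array of arrays, where each inner array is a row and each of its elements is a cell. Assuming I can already split the whole table into rows, I want to split each row on its <td> tags. I have the following: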

def get_cells(one_row)
  cells = one_row.scan(/<td>.+?<\/td>/)
  cells.each do |c|
    puts c
  end
end

Here is the HTML I'm working with, in a string named one_row:

<tr>
<td>1990</td>
<td>1991</td>
<td><a href="/wiki/Gulf_War">Gulf War</a></td>
<td><span class="flagicon"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/a/aa/Flag_of_Kuwait.svg/22px-Flag_of_Kuwait.svg.png" width="22" height="11" class="thumbborder" />&#160;</span><a href="/wiki/Kuwait">Kuwait</a><br />
<span class="flagicon"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/a/a4/Flag_of_the_United_States.svg/22px-Flag_of_the_United_States.svg.png" width="22" height="12" class="thumbborder" />&#160;</span><a href="/wiki/United_States">United States</a><br />
<span class="flagicon"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/0/0d/Flag_of_Saudi_Arabia.svg/22px-Flag_of_Saudi_Arabia.svg.png" width="22" height="15" class="thumbborder" />&#160;</span><a href="/wiki/Saudi_Arabia">Saudi Arabia</a><br />
<span class="flagicon"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/a/ae/Flag_of_the_United_Kingdom.svg/22px-Flag_of_the_United_Kingdom.svg.png" width="22" height="11" class="thumbborder" />&#160;</span><a href="/wiki/United_Kingdom">United Kingdom</a><br />
<span class="flagicon"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/f/fe/Flag_of_Egypt.svg/22px-Flag_of_Egypt.svg.png" width="22" height="15" class="thumbborder" />&#160;</span><a href="/wiki/Egypt">Egypt</a><br />
<span class="flagicon"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/c/c3/Flag_of_France.svg/22px-Flag_of_France.svg.png" width="22" height="15" class="thumbborder" />&#160;</span><a href="/wiki/France">France</a><br />
<span class="flagicon"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/5/53/Flag_of_Syria.svg/22px-Flag_of_Syria.svg.png" width="22" height="15" class="thumbborder" />&#160;</span><a href="/wiki/Syria">Syria</a><br />
<span class="flagicon"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/2/2c/Flag_of_Morocco.svg/22px-Flag_of_Morocco.svg.png" width="22" height="15" class="thumbborder" />&#160;</span><a href="/wiki/Morocco">Morocco</a><br />
<span class="flagicon"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Flag_of_Oman.svg/22px-Flag_of_Oman.svg.png" width="22" height="11" class="thumbborder" />&#160;</span><a href="/wiki/Oman">Oman</a><br />
<span class="flagicon"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/3/32/Flag_of_Pakistan.svg/22px-Flag_of_Pakistan.svg.png" width="22" height="15" class="thumbborder" />&#160;</span><a href="/wiki/Pakistan">Pakistan</a><br />
<span class="flagicon"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/c/cf/Flag_of_Canada.svg/22px-Flag_of_Canada.svg.png" width="22" height="11" class="thumbborder" />&#160;</span><a href="/wiki/Canada">Canada</a><br />
<a href="/wiki/Coalition_of_Gulf_War" title="Coalition of Gulf War" class="mw-redirect">Other Coalition Forces</a></td>
<td><span class="flagicon"><a href="/wiki/Iraq" title="Iraq"><img alt="Iraq" src="http://upload.wikimedia.org/wikipedia/commons/thumb/0/04/Flag_of_Iraq_%281963-1991%29.svg/22px-Flag_of_Iraq_%281963-1991%29.svg.png" width="22" height="15" class="thumbborder" /></a></span> <a href="/wiki/Baathist_Iraq" title="Baathist Iraq">Iraq</a></td>
</tr>

However, when I call get_cells on this, it does not return an array of five elements. It returns an array of four:

<td>1990</td>
<td>1991</td>
<td><a href="/wiki/Gulf_War">Gulf War</a></td>
<td><span class="flagicon"><a href="/wiki/Iraq" title="Iraq"><img alt="Iraq" src="http://upload.wikimedia.org/wikipedia/commons/thumb/0/04/Flag_of_Iraq_%281963-1991%29.svg/22px-Flag_of_Iraq_%281963-1991%29.svg.png" width="22" height="15" class="thumbborder" /></a></span> <a href="/wiki/Baathist_Iraq" title="Baathist Iraq">Iraq</a></td>

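It seems to be skipping what should be the fourth cell. That cell contains many elements, all separated by newlines. Could that be what is messing this up? Any suggestions on how to handle it?

4 Answers:

Answer 0 (score: 5)

HTML is beyond what regular expressions can parse reliably, and even in simple cases it is rarely worth the time. If you need to parse HTML, just use an HTML parser such as Hpricot or Nokogiri. For example, Nokogiri(text).css('td').count gives 5, and Nokogiri(text).css('td').map(&:text) gives ["1990", "1991", "Gulf War", " Kuwait  United States  Saudi Arabia  United Kingdom  Egypt  France  Syria  Morocco  Oman  Pakistan  Canada Other Coalition Forces", " Iraq"]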

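Answer 1 (score: 2)

Yes, it's the newlines. By default the . (dot) metacharacter does not match them, but you can change that by adding the /m ("multiline") modifier: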

/<td>.+?<\/td>/m
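A minimal sketch of the difference, using a shortened stand-in for the row in the question (the `row` string and its contents are made up for illustration, not the real Wikipedia markup):

```ruby
# A shortened, hypothetical row; the third cell spans two lines,
# like the flag-icon cell in the question.
row = <<EOT
<tr>
<td>1990</td>
<td>1991</td>
<td><a href="/wiki/Kuwait">Kuwait</a><br />
<a href="/wiki/Oman">Oman</a></td>
<td>Iraq</td>
</tr>
EOT

# Without /m the dot stops at a newline, so the multi-line cell is skipped.
without_m = row.scan(/<td>.+?<\/td>/)

# With /m the dot also matches newlines, so every cell is found.
with_m = row.scan(/<td>.+?<\/td>/m)

puts without_m.length  # 3
puts with_m.length     # 4
```

The lazy `.+?` still matters with /m: it makes each match end at the first closing </td> rather than running on to the last one in the row.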

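FYI, most other regex flavors (Perl, Python, .NET, etc.) call this "single-line" or "dot-matches-all" mode and use /s for it. They use the /m modifier to change the meaning of the ^ and $ anchors, allowing them to match at line boundaries rather than only at the beginning and end of the text. In Ruby, ^ and $ always work that way, so no separate mode is needed.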
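The anchor behavior can be checked with a small sketch (the sample `text` is made up for illustration). When you do want to anchor to the whole string in Ruby, \A and \z do that:

```ruby
text = "first line\nsecond line"

# In Ruby, ^ and $ match at every line boundary with no modifier needed.
line_starts = text.scan(/^\w+/)   # first word of each line
line_ends   = text.scan(/\w+$/)   # last word of each line

# \A and \z anchor to the start and end of the whole string instead.
string_start = text.scan(/\A\w+/)
string_end   = text.scan(/\w+\z/)

p line_starts   # ["first", "second"]
p line_ends     # ["line", "line"]
p string_start  # ["first"]
p string_end    # ["line"]
```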

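Answer 2 (score: 1)

When dealing with XML or HTML, a parser is always the better approach for anything but the most trivial jobs.

Nokogiri is my parser of choice. It supports both XPath expressions and CSS accessors. CSS usually makes for simpler searches and is more familiar to people who write CSS. XPath is more expressive and lets the parser (libxml2 in Nokogiri's case) do some pretty amazing searches, which can replace a lot of Ruby code.

Here is how I would go about it with your data: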

html = <<EOT
<tr>
<td>1990</td>
<td>1991</td>
<td><a href="/wiki/Gulf_War">Gulf War</a></td>
<td><span class="flagicon"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/a/aa/Flag_of_Kuwait.svg/22px-Flag_of_Kuwait.svg.png" width="22" height="11" class="thumbborder" />&#160;</span><a href="/wiki/Kuwait">Kuwait</a><br />
<span class="flagicon"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/a/a4/Flag_of_the_United_States.svg/22px-Flag_of_the_United_States.svg.png" width="22" height="12" class="thumbborder" />&#160;</span><a href="/wiki/United_States">United States</a><br />
<span class="flagicon"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/0/0d/Flag_of_Saudi_Arabia.svg/22px-Flag_of_Saudi_Arabia.svg.png" width="22" height="15" class="thumbborder" />&#160;</span><a href="/wiki/Saudi_Arabia">Saudi Arabia</a><br />
<span class="flagicon"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/a/ae/Flag_of_the_United_Kingdom.svg/22px-Flag_of_the_United_Kingdom.svg.png" width="22" height="11" class="thumbborder" />&#160;</span><a href="/wiki/United_Kingdom">United Kingdom</a><br />
<span class="flagicon"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/f/fe/Flag_of_Egypt.svg/22px-Flag_of_Egypt.svg.png" width="22" height="15" class="thumbborder" />&#160;</span><a href="/wiki/Egypt">Egypt</a><br />
<span class="flagicon"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/c/c3/Flag_of_France.svg/22px-Flag_of_France.svg.png" width="22" height="15" class="thumbborder" />&#160;</span><a href="/wiki/France">France</a><br />
<span class="flagicon"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/5/53/Flag_of_Syria.svg/22px-Flag_of_Syria.svg.png" width="22" height="15" class="thumbborder" />&#160;</span><a href="/wiki/Syria">Syria</a><br />
<span class="flagicon"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/2/2c/Flag_of_Morocco.svg/22px-Flag_of_Morocco.svg.png" width="22" height="15" class="thumbborder" />&#160;</span><a href="/wiki/Morocco">Morocco</a><br />
<span class="flagicon"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Flag_of_Oman.svg/22px-Flag_of_Oman.svg.png" width="22" height="11" class="thumbborder" />&#160;</span><a href="/wiki/Oman">Oman</a><br />
<span class="flagicon"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/3/32/Flag_of_Pakistan.svg/22px-Flag_of_Pakistan.svg.png" width="22" height="15" class="thumbborder" />&#160;</span><a href="/wiki/Pakistan">Pakistan</a><br />
<span class="flagicon"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/c/cf/Flag_of_Canada.svg/22px-Flag_of_Canada.svg.png" width="22" height="11" class="thumbborder" />&#160;</span><a href="/wiki/Canada">Canada</a><br />
<a href="/wiki/Coalition_of_Gulf_War" title="Coalition of Gulf War" class="mw-redirect">Other Coalition Forces</a></td>
<td><span class="flagicon"><a href="/wiki/Iraq" title="Iraq"><img alt="Iraq" src="http://upload.wikimedia.org/wikipedia/commons/thumb/0/04/Flag_of_Iraq_%281963-1991%29.svg/22px-Flag_of_Iraq_%281963-1991%29.svg.png" width="22" height="15" class="thumbborder" /></a></span> <a href="/wiki/Baathist_Iraq" title="Baathist Iraq">Iraq</a></td>
</tr>
EOT

require 'nokogiri'
require 'pp'

doc = Nokogiri::HTML(html)

# for Ruby 1.8.7+
data = doc.css('tr').map { |tr| tr.css('td').map { |td| td.text } } 

# for Ruby 1.9+
data = doc.css('tr').map { |tr| tr.css('td').map(&:text) } 

# or using XPath
data = doc.search('//tr').map { |tr| tr.search('td').map { |td| td.text } } 

pp data
# >> [["1990",
# >>   "1991",
# >>   "Gulf War",
# >>   " Kuwait United States Saudi Arabia United Kingdom Egypt France Syria Morocco Oman Pakistan CanadaOther Coalition Forces",
# >>   " Iraq"]]
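The leading blanks in the extracted text most likely come from the &#160; entities, which a parser turns into non-breaking spaces (U+00A0). If you want tidier cell strings, one possible cleanup is sketched below in plain Ruby; the `raw` string is a hypothetical stand-in for what `td.text` might return, not actual parser output:

```ruby
# Hypothetical cell text: U+00A0 where the &#160; entities were,
# and newlines where the source lines broke.
raw = "\u00A0Kuwait\n\u00A0United States\nOther Coalition Forces"

# [[:space:]] matches Unicode whitespace, including U+00A0, which \s does not.
clean = raw.gsub(/[[:space:]]+/, ' ').strip

p clean  # "Kuwait United States Other Coalition Forces"
```

The same expression could be applied inside the map block, for example to each `td.text`, if that fits your data.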

Answer 3 (score: 0)

I would try Nokogiri together with SelectorGadget. There is a good video showing how to do this at http://railscasts.com