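I'm trying to take an HTML table and create an array of arrays, where each inner array is a row and each element of it is a cell. Assuming I can already split the whole table into rows, I want to split each row on its <td> tags. I have the following: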
def get_cells(one_row)
  cells = one_row.scan(/<td>.+?<\/td>/)
  for c in cells
    puts c
  end
end
This is the HTML I'm working with, held in a string named one_row:
<tr>
<td>1990</td>
<td>1991</td>
<td><a href="/wiki/Gulf_War">Gulf War</a></td>
<td><span class="flagicon"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/a/aa/Flag_of_Kuwait.svg/22px-Flag_of_Kuwait.svg.png" width="22" height="11" class="thumbborder" /> </span><a href="/wiki/Kuwait">Kuwait</a><br />
<span class="flagicon"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/a/a4/Flag_of_the_United_States.svg/22px-Flag_of_the_United_States.svg.png" width="22" height="12" class="thumbborder" /> </span><a href="/wiki/United_States">United States</a><br />
<span class="flagicon"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/0/0d/Flag_of_Saudi_Arabia.svg/22px-Flag_of_Saudi_Arabia.svg.png" width="22" height="15" class="thumbborder" /> </span><a href="/wiki/Saudi_Arabia">Saudi Arabia</a><br />
<span class="flagicon"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/a/ae/Flag_of_the_United_Kingdom.svg/22px-Flag_of_the_United_Kingdom.svg.png" width="22" height="11" class="thumbborder" /> </span><a href="/wiki/United_Kingdom">United Kingdom</a><br />
<span class="flagicon"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/f/fe/Flag_of_Egypt.svg/22px-Flag_of_Egypt.svg.png" width="22" height="15" class="thumbborder" /> </span><a href="/wiki/Egypt">Egypt</a><br />
<span class="flagicon"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/c/c3/Flag_of_France.svg/22px-Flag_of_France.svg.png" width="22" height="15" class="thumbborder" /> </span><a href="/wiki/France">France</a><br />
<span class="flagicon"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/5/53/Flag_of_Syria.svg/22px-Flag_of_Syria.svg.png" width="22" height="15" class="thumbborder" /> </span><a href="/wiki/Syria">Syria</a><br />
<span class="flagicon"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/2/2c/Flag_of_Morocco.svg/22px-Flag_of_Morocco.svg.png" width="22" height="15" class="thumbborder" /> </span><a href="/wiki/Morocco">Morocco</a><br />
<span class="flagicon"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Flag_of_Oman.svg/22px-Flag_of_Oman.svg.png" width="22" height="11" class="thumbborder" /> </span><a href="/wiki/Oman">Oman</a><br />
<span class="flagicon"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/3/32/Flag_of_Pakistan.svg/22px-Flag_of_Pakistan.svg.png" width="22" height="15" class="thumbborder" /> </span><a href="/wiki/Pakistan">Pakistan</a><br />
<span class="flagicon"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/c/cf/Flag_of_Canada.svg/22px-Flag_of_Canada.svg.png" width="22" height="11" class="thumbborder" /> </span><a href="/wiki/Canada">Canada</a><br />
<a href="/wiki/Coalition_of_Gulf_War" title="Coalition of Gulf War" class="mw-redirect">Other Coalition Forces</a></td>
<td><span class="flagicon"><a href="/wiki/Iraq" title="Iraq"><img alt="Iraq" src="http://upload.wikimedia.org/wikipedia/commons/thumb/0/04/Flag_of_Iraq_%281963-1991%29.svg/22px-Flag_of_Iraq_%281963-1991%29.svg.png" width="22" height="15" class="thumbborder" /></a></span> <a href="/wiki/Baathist_Iraq" title="Baathist Iraq">Iraq</a></td>
</tr>
However, when I call get_cells on this, it does not return an array of five elements. It returns an array of four elements:
<td>1990</td>
<td>1991</td>
<td><a href="/wiki/Gulf_War">Gulf War</a></td>
<td><span class="flagicon"><a href="/wiki/Iraq" title="Iraq"><img alt="Iraq" src="http://upload.wikimedia.org/wikipedia/commons/thumb/0/04/Flag_of_Iraq_%281963-1991%29.svg/22px-Flag_of_Iraq_%281963-1991%29.svg.png" width="22" height="15" class="thumbborder" /></a></span> <a href="/wiki/Baathist_Iraq" title="Baathist Iraq">Iraq</a></td>
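It seems to be skipping what should be the fourth cell. That cell contains many elements, all separated by newlines. Could that be what is breaking this? Any suggestions on how to handle it?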
Answer 0 (score: 5)
HTML is beyond what regular expressions can parse reliably - even in simple cases, it is almost never worth the time. If you need to parse HTML, just use an HTML parser such as Hpricot or Nokogiri. For example, Nokogiri(text).css('td').count
gives 5, and Nokogiri(text).css('td').map(&:text)
gives ["1990", "1991", "Gulf War", " Kuwait United States Saudi Arabia United Kingdom Egypt France Syria Morocco Oman Pakistan Canada Other Coalition Forces", " Iraq"].
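As a small illustration of how fragile the regex route is, the sketch below runs the question's pattern over a made-up row. The row string and its class attribute are assumptions for illustration, not part of the question's data. A cell whose opening tag carries an attribute is silently skipped, because the pattern only recognizes a bare <td>.

```ruby
# Hypothetical sample row: the second cell has an attribute on its <td> tag.
row = '<tr><td>1990</td><td class="year">1991</td><td>Gulf War</td></tr>'

# The pattern from the question only recognizes a bare "<td>" opening tag.
cells = row.scan(/<td>.+?<\/td>/)

# Only two of the three cells are found; the one with the attribute is missed.
puts cells.length
cells.each { |c| puts c }
```

A parser does not care how the opening tag is written, which is one reason it is the safer choice.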
Answer 1 (score: 2)
Yes, it's the newlines. By default, the . (dot) metacharacter does not match them, but you can change that by adding the /m ("multiline") modifier:
/<td>.+?<\/td>/m
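To see the difference in isolation, here is a minimal sketch using a made-up two-cell string (the row value is an assumption, not the question's data). Without /m the cell that spans two lines is missed; with /m it is captured.

```ruby
# Hypothetical row: the second cell contains a newline.
row = "<tr><td>1990</td><td>Kuwait<br />\nCanada</td></tr>"

without_m = row.scan(/<td>.+?<\/td>/)   # dot stops at the newline
with_m    = row.scan(/<td>.+?<\/td>/m)  # dot also matches the newline

puts without_m.length
puts with_m.length
```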
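FYI, most other regex flavors (Perl, Python, .NET, etc.) call this "single-line" or "dot-matches-all" mode and use /s (or an equivalent flag) for it. In those flavors the /m modifier instead changes the meaning of the ^ and $ anchors, allowing them to match at line boundaries rather than only at the beginning and end of the text. In Ruby, ^ and $ always work that way, so no separate mode is needed.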
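The anchor behavior can be checked with a short sketch (the sample text is made up). In Ruby, ^ and $ match at every line boundary without any modifier, while \A and \z are the anchors that match only at the start and end of the whole string.

```ruby
text = "first\nsecond\nthird"

# ^ and $ match at each line boundary by default in Ruby.
lines = text.scan(/^\w+$/)

# \A and \z anchor to the very start and end of the whole string.
whole_start = text[/\A\w+/]
whole_end   = text[/\w+\z/]

p lines
p whole_start
p whole_end
```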
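Answer 2 (score: 1)
When working with XML or HTML, a parser is always the better approach for anything beyond the most trivial jobs.
Nokogiri is my parser of choice. It supports both XPath expressions and CSS accessors. CSS usually results in simpler searches and is more familiar to people who write CSS. XPath is more expressive and allows some pretty amazing searches to run inside the parser itself (libxml2 in Nokogiri's case), which can replace a lot of Ruby code.
Here is how I would handle your data: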
html = <<EOT
<tr>
<td>1990</td>
<td>1991</td>
<td><a href="/wiki/Gulf_War">Gulf War</a></td>
<td><span class="flagicon"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/a/aa/Flag_of_Kuwait.svg/22px-Flag_of_Kuwait.svg.png" width="22" height="11" class="thumbborder" /> </span><a href="/wiki/Kuwait">Kuwait</a><br />
<span class="flagicon"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/a/a4/Flag_of_the_United_States.svg/22px-Flag_of_the_United_States.svg.png" width="22" height="12" class="thumbborder" /> </span><a href="/wiki/United_States">United States</a><br />
<span class="flagicon"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/0/0d/Flag_of_Saudi_Arabia.svg/22px-Flag_of_Saudi_Arabia.svg.png" width="22" height="15" class="thumbborder" /> </span><a href="/wiki/Saudi_Arabia">Saudi Arabia</a><br />
<span class="flagicon"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/a/ae/Flag_of_the_United_Kingdom.svg/22px-Flag_of_the_United_Kingdom.svg.png" width="22" height="11" class="thumbborder" /> </span><a href="/wiki/United_Kingdom">United Kingdom</a><br />
<span class="flagicon"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/f/fe/Flag_of_Egypt.svg/22px-Flag_of_Egypt.svg.png" width="22" height="15" class="thumbborder" /> </span><a href="/wiki/Egypt">Egypt</a><br />
<span class="flagicon"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/c/c3/Flag_of_France.svg/22px-Flag_of_France.svg.png" width="22" height="15" class="thumbborder" /> </span><a href="/wiki/France">France</a><br />
<span class="flagicon"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/5/53/Flag_of_Syria.svg/22px-Flag_of_Syria.svg.png" width="22" height="15" class="thumbborder" /> </span><a href="/wiki/Syria">Syria</a><br />
<span class="flagicon"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/2/2c/Flag_of_Morocco.svg/22px-Flag_of_Morocco.svg.png" width="22" height="15" class="thumbborder" /> </span><a href="/wiki/Morocco">Morocco</a><br />
<span class="flagicon"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/d/dd/Flag_of_Oman.svg/22px-Flag_of_Oman.svg.png" width="22" height="11" class="thumbborder" /> </span><a href="/wiki/Oman">Oman</a><br />
<span class="flagicon"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/3/32/Flag_of_Pakistan.svg/22px-Flag_of_Pakistan.svg.png" width="22" height="15" class="thumbborder" /> </span><a href="/wiki/Pakistan">Pakistan</a><br />
<span class="flagicon"><img alt="" src="http://upload.wikimedia.org/wikipedia/commons/thumb/c/cf/Flag_of_Canada.svg/22px-Flag_of_Canada.svg.png" width="22" height="11" class="thumbborder" /> </span><a href="/wiki/Canada">Canada</a><br />
<a href="/wiki/Coalition_of_Gulf_War" title="Coalition of Gulf War" class="mw-redirect">Other Coalition Forces</a></td>
<td><span class="flagicon"><a href="/wiki/Iraq" title="Iraq"><img alt="Iraq" src="http://upload.wikimedia.org/wikipedia/commons/thumb/0/04/Flag_of_Iraq_%281963-1991%29.svg/22px-Flag_of_Iraq_%281963-1991%29.svg.png" width="22" height="15" class="thumbborder" /></a></span> <a href="/wiki/Baathist_Iraq" title="Baathist Iraq">Iraq</a></td>
</tr>
EOT
require 'nokogiri'
require 'pp'
doc = Nokogiri::HTML(html)
# for Ruby 1.8.7+
data = doc.css('tr').map { |tr| tr.css('td').map { |td| td.text } }
# for Ruby 1.9+
data = doc.css('tr').map { |tr| tr.css('td').map(&:text) }
# or using XPath
data = doc.search('//tr').map { |tr| tr.search('td').map { |td| td.text } }
pp data
# >> [["1990",
# >> "1991",
# >> "Gulf War",
# >> " Kuwait United States Saudi Arabia United Kingdom Egypt France Syria Morocco Oman Pakistan CanadaOther Coalition Forces",
# >> " Iraq"]]
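The cell text comes back with leading spaces and whatever whitespace separated the inline elements. If tidier strings are wanted, one option is a small post-processing pass in plain Ruby. This is only a sketch: the data array below is a shortened, hand-written stand-in for the parsed result, and clean_cells is a hypothetical helper name.

```ruby
# Hand-written stand-in for the array of rows returned by the parsing step.
data = [["1990", "1991", "Gulf War", " Kuwait\n United States\n Canada", " Iraq"]]

# Hypothetical helper: collapse whitespace runs into one space and trim the ends.
def clean_cells(rows)
  rows.map { |row| row.map { |cell| cell.gsub(/\s+/, ' ').strip } }
end

cleaned = clean_cells(data)
p cleaned
```

The same gsub and strip calls could be applied directly inside the map block that extracts the text from each td.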
Answer 3 (score: 0)
I would try Nokogiri together with SelectorGadget. There is a good video showing how to do this at http://railscasts.com