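Ruby Nokogiri: parsing text inside a span tag

Date: 2015-02-05 07:48:25

Tags: ruby parsing nokogiri

I'm still new to Ruby and would like some help parsing data with Nokogiri. I want to build an app that shows the artists playing at popular venues, and I want to extract only the artists' names. Here is my code: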

data.css('.headliner').each do |artist|
  puts artist
end

This currently returns:

<span class="headliner"><span class="prepend"><i>Rescheduled Date</i></span><br>London Grammar</span>
<span class="headliner">Hozier</span>
<span class="headliner"><span class="prepend"><i>KFOG presents</i></span><br>Ben Howard<br><span class="append"><i>with special guest</i><br></span></span>
<span class="headliner">Dr. Dog</span>

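Some of the elements contain several nested span tags, and I can't get at the data I need. All I want returned is the artists' names, such as 'London Grammar', 'Hozier', 'Ben Howard' and 'Dr. Dog'.

Currently, when I run artist.text, it returns "Rescheduled DateLondon Grammar" and so on...

I hope I've explained myself well. Any help is appreciated, thanks!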


<table class="concert_calendar" cellspacing="0" width="720" style="margin-top:35px;">
    <tbody><tr><td class="noborder"><img src="images/title_date2.gif" alt="Date"></td>
    	<td class="noborder" colspan="2"><img src="images/title_show2.gif" alt="Show"></td>
        <td class="noborder"><img src="images/title_time2.gif" alt="Time"></td>
        <td class="noborder"><img src="images/title_tickets2.gif" alt="Tickets"></td></tr>
    <tr><td colspan="5" class="noborder"><hr size="1" color="#550818" noshade="" style="margin:0px; padding:0px;"></td></tr>
		<tr><td style="width:100px;" class="">Saturday,<br>February 7</td>
    	<td style="width:115px;" valign="top" class=""><a href="popartist.php?cID=4600&amp;KeepThis=true&amp;TB_iframe=true&amp;height=600&amp;width=475" class="con_img thickbox"><img src="http://www.apeconcerts.com/concertimages/LondonGrammar_100.jpg" alt="London Grammar"></a></td>
        <td valign="top" style="width:345px; padding-right:10px;" class="">
        	<a href="popartist.php?cID=4600&amp;KeepThis=true&amp;TB_iframe=true&amp;height=600&amp;width=475" style="text-decoration:none;" class="thickbox">
            	<span class="headliner"><span class="prepend"><i>Rescheduled Date</i></span><br>London Grammar</span></a>
        	<div><span class="warmup">Until The Ribbon Breaks</span><br>
            <span class="warmup"></span></div></td>
        <td style="width:80px;">show<br>8:00PM</td>
        <td style="width:80px;">
        <img src="images/cal_soldout.gif" alt="SOLD OUT - Thank you!">        </td></tr>
		<tr><td style="width:100px;">Tuesday,<br>February 10</td>
    	<td style="width:115px;" valign="top"><a href="popartist.php?cID=4733&amp;KeepThis=true&amp;TB_iframe=true&amp;height=600&amp;width=475" class="con_img thickbox"><img src="http://www.apeconcerts.com/concertimages/Hozier_1001.jpg" alt="Hozier"></a></td>
        <td valign="top" style="width:345px; padding-right:10px;" class="">
        	<a href="popartist.php?cID=4733&amp;KeepThis=true&amp;TB_iframe=true&amp;height=600&amp;width=475" style="text-decoration:none;" class="thickbox">
            	<span class="headliner">Hozier</span></a>
        	<div class=""><span class="warmup">Ásgeir</span><br>
            <span class="warmup"></span></div></td>
        <td style="width:80px;">show<br>8:00PM</td>
        <td style="width:80px;">
        <img src="images/cal_soldout.gif" alt="SOLD OUT - Thank you!">        </td></tr>

2 answers:

Answer 0 (score: 2)

  

"All I want returned is the artists' names, such as 'London Grammar', 'Hozier', 'Ben Howard' and 'Dr. Dog'"

Here is one way:

require 'nokogiri'

html = %q{
<span class="headliner"><span class="prepend"><i>Rescheduled Date</i></span><br>London Grammar</span>
<span class="headliner">Hozier</span>
<span class="headliner"><span class="prepend"><i>KFOG presents</i></span><br>Ben Howard<br><span class="append"><i>with special guest</i><br></span></span>
<span class="headliner">Dr. Dog</span>
}

html_doc = Nokogiri::HTML(html)
headliners = html_doc.css('.headliner')

headliners.each do |headliner|
  headliner.css('i').each do |i|
    i.content = ''
  end

  puts headliner.text
end

--output:--
London Grammar
Hozier
Ben Howard
Dr. Dog

Answer 1 (score: -1)

If all you're trying to do is get rid of the contents of the <i> tags, then just remove the tags entirely:

require 'nokogiri'

doc = Nokogiri::HTML(<<EOT)
<span class="headliner"><span class="prepend"><i>Rescheduled Date</i></span><br>London Grammar</span>
<span class="headliner">Hozier</span>
<span class="headliner"><span class="prepend"><i>KFOG presents</i></span><br>Ben Howard<br><span class="append"><i>with special guest</i><br></span></span>
<span class="headliner">Dr. Dog</span>
EOT

doc.search('.headliner i').each(&:remove)
puts doc.to_html
# >> <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN" "http://www.w3.org/TR/REC-html40/loose.dtd">
# >> <html><body>
# >> <span class="headliner"><span class="prepend"></span><br>London Grammar</span>
# >> <span class="headliner">Hozier</span>
# >> <span class="headliner"><span class="prepend"></span><br>Ben Howard<br><span class="append"><br></span></span>
# >> <span class="headliner">Dr. Dog</span>
# >> </body></html>

At that point it's easy to iterate over the .headliner tags and output their content:

puts doc.search('.headliner').map(&:text)

# >> London Grammar
# >> Hozier
# >> Ben Howard
# >> Dr. Dog

For a large page with many tags matching .headliner I'd probably do it differently, but this is good enough for an ordinary page.

See also "How to avoid joining all text from Nodes when scraping".