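Ruby - Scraper: concatenating strings

Date: 2016-05-21 00:45:50

Tags: arrays ruby web-scraping concatenation scraper

I'm building a Ruby web scraper to collect some information. In the HTML of the page I want to scrape, each article has 3 spans with the same class: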

<article>
   <div class="item item_contains_branding" data-adid="1234567">
      <div class="clearfix" style="display: block;">
         <div class="item-multimedia ">
            ...
         </div>
         <div class="item-info-container">
            <div class="logo-branding">
            ...
            </div>
                    <a href="/link/1" class="item-link " title="title 1" data-xiti-click="listado::enlace">title 1</a> 
            <div class="row price-row clearfix"> <span class="item-price">200<span>€</span></span> </div>
            <span class="item-detail">T2 <small></small></span> <span class="item-detail">20 <small>m²</small></span> <span class="item-detail"> <small> more details 1</small></span> 
                <p class="item-description">description...</p>
            <div class="item-toolbar clearfix">
            ...
            </div>
         </div>
      </div>
   </div>
</article>
<article>
   <div class="item item_contains_branding" data-adid="1234567">
      <div class="clearfix" style="display: block;">
         <div class="item-multimedia ">
            ...
         </div>
         <div class="item-info-container">
            <div class="logo-branding">
            ...
            </div>
                    <a href="/link/2" class="item-link " title="title 2" data-xiti-click="listado::enlace">title 2</a> 
            <div class="row price-row clearfix"> <span class="item-price">300<span>€</span></span> </div>
            <span class="item-detail">T5 <small></small></span> <span class="item-detail">50 <small>m²</small></span>
                <p class="item-description">description...</p>
            <div class="item-toolbar clearfix">
            ...
            </div>
         </div>
      </div>
   </div>
</article>
<article>
   <div class="item item_contains_branding" data-adid="1234567">
      <div class="clearfix" style="display: block;">
         <div class="item-multimedia ">
            ...
         </div>
         <div class="item-info-container">
            <div class="logo-branding">
            ...
            </div>
                    <a href="/link/3" class="item-link " title="title 3" data-xiti-click="listado::enlace">title 3</a> 
            <div class="row price-row clearfix"> <span class="item-price">500<span>€</span></span> </div>
            <span class="item-detail">T1 <small></small></span> <span class="item-detail">100 <small>m²</small></span> <span class="item-detail"> <small> more details 3</small></span> 
                <p class="item-description">description...</p>
            <div class="item-toolbar clearfix">
            ...
            </div>
         </div>
      </div>
   </div>
</article>

However, some articles don't have the last span ("more details").

So far, I have been using this code:

#first loop to find the title
page.css('a.item-link').each do |line|
    puts line.text
end
#Second loop to find the price
page.css('span.item-price').each do |line|
    puts line.text
end
#third loop to find the details
page.css('span.item-detail').each do |line|
    puts line.text
end

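I'm using the Nokogiri gem and open-uri to retrieve and parse the file.

How can I concatenate the 3 spans (some articles have only two spans with the "item-detail" class) and print them to the screen?

The output I want is: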

title 1
title 2
title 3
200€
300€
500€
T2
T5
T1
20 m²
50 m²
100 m²
more details 1
" "
more details 3

Some articles don't have the third span ("more details n"), so in that case I would print " ". My goal is to write the results to a .csv file.

1 Answer:

Answer 0: (score: 1)

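Here is code that works for the sample input, although I had to modify the input slightly, wrapping it in a single root node (<document>) so that it parses correctly as XML: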

require "nokogiri"

html = <<HTML
<document>
<article>
   <div class="item item_contains_branding" data-adid="1234567">
      <div class="clearfix" style="display: block;">
         <div class="item-multimedia ">
            ...
         </div>
         <div class="item-info-container">
            <div class="logo-branding">
            ...
            </div>
                    <a href="/link/1" class="item-link " title="title 1" data-xiti-click="listado::enlace">title 1</a>
            <div class="row price-row clearfix"> <span class="item-price">200<span>€</span></span> </div>
            <span class="item-detail">T2 <small></small></span> <span class="item-detail">20 <small>m²</small></span> <span class="item-detail"> <small> more details 1</small></span>
                <p class="item-description">description...</p>
            <div class="item-toolbar clearfix">
            ...
            </div>
         </div>
      </div>
   </div>
</article>
<article>
   <div class="item item_contains_branding" data-adid="1234567">
      <div class="clearfix" style="display: block;">
         <div class="item-multimedia ">
            ...
         </div>
         <div class="item-info-container">
            <div class="logo-branding">
            ...
            </div>
                    <a href="/link/2" class="item-link " title="title 2" data-xiti-click="listado::enlace">title 2</a>
            <div class="row price-row clearfix"> <span class="item-price">300<span>€</span></span> </div>
            <span class="item-detail">T5 <small></small></span> <span class="item-detail">50 <small>m²</small></span>
                <p class="item-description">description...</p>
            <div class="item-toolbar clearfix">
            ...
            </div>
         </div>
      </div>
   </div>
</article>
<article>
   <div class="item item_contains_branding" data-adid="1234567">
      <div class="clearfix" style="display: block;">
         <div class="item-multimedia ">
            ...
         </div>
         <div class="item-info-container">
            <div class="logo-branding">
            ...
            </div>
                    <a href="/link/3" class="item-link " title="title 3" data-xiti-click="listado::enlace">title 3</a>
            <div class="row price-row clearfix"> <span class="item-price">500<span>€</span></span> </div>
            <span class="item-detail">T1 <small></small></span> <span class="item-detail">100 <small>m²</small></span> <span class="item-detail"> <small> more details 3</small></span>
                <p class="item-description">description...</p>
            <div class="item-toolbar clearfix">
            ...
            </div>
         </div>
      </div>
   </div>
</article>
</document>
HTML

page  = Nokogiri::XML(html)
articles = page.css('article')

# Titles, read from each link's title attribute
articles.each do |article|
  article.css('a.item-link').each do |link|
    puts link[:title]
  end
end

# Prices
articles.each do |article|
  article.css('span.item-price').each do |price|
    puts price.text
  end
end

# First item-detail span of each article
articles.each do |article|
  detail_spans = article.css('span.item-detail')
  puts detail_spans[0].text
end

# Second item-detail span of each article
articles.each do |article|
  detail_spans = article.css('span.item-detail')
  puts detail_spans[1].text
end

# Third item-detail span, which may be missing; print " " (quoted) in that case
articles.each do |article|
  detail_spans = article.css('span.item-detail')
  puts(detail_spans[2] ? detail_spans[2].text.strip : ' '.inspect)
end

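This code retrieves an array of article elements, then uses each article element in the array to scope the other queries to the elements it contains. This makes fine-grained reporting of individual element values possible.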

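The final item-detail query checks whether the element exists to decide how to output a value when the element may be missing. Other queries may need the same technique, depending on the actual content of the HTML document.

The result is as follows: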

title 1
title 2
title 3
200€
300€
500€
T2 
T5 
T1 
20 m²
50 m²
100 m²
more details 1
" "
more details 3