Scrape image "alt" tags and export to CSV

Time: 2014-07-18 16:06:56

Tags: ruby-on-rails ruby nokogiri export-to-csv

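I'm trying to scrape the "alt" tags from a few hundred images on a web page and then output them to a CSV file. This is basically the entire HTML block I'm trying to scrape: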

<div class="product-card"
 id="product-35492907"
 data-element="product-card"
 data-owner="some-data-owner"
 data-product-slug="some-data-product-slug"
 data-product_id="35492907"
 data-stock-status="available"
 data-icon-enabled="false"
 data-retailer-id="2248">

<a  class="product-card-image-link"
    href="some href"


            data-lead-popup
            data-lead-popup-url="/track/lead/21716944/?ctx=2383"


>
    <img class="product-card-image draggable"
         data-pin-no-hover="true"

            src="some src"
            data-height="250" data-width="200"
            height="250" width="200"

         alt="SCRAPE ME"                      # <<<<< here's the guy I'm after
         data-product_id="35492907"
    />

</a>

Here's some of the code I'm using to scrape the elements:

require 'rubygems'
require 'nokogiri'   
require 'open-uri'
require 'csv'

url = "http://www.example.com/page"
page = Nokogiri::HTML(open(url))

CSV.open("productResults.csv", "wb") do |csv|
  page.css('.product-card-image draggable').each do |scrape|   #???  
    alt_name = scrape.at_css('alt').text                         #???  
    scrapedProducts = "#{alt_name}"

    csv << [scrapedProducts]
  end
end

1 Answer:

Answer 0 (score: 0)

Start simple, and add complexity only when necessary:

require 'nokogiri'   
require 'csv'

page = Nokogiri::HTML::DocumentFragment.parse(<<EOT)
<div class="product-card"
 id="product-35492907"
 data-element="product-card"
 data-owner="some-data-owner"
 data-product-slug="some-data-product-slug"
 data-product_id="35492907"
 data-stock-status="available"
 data-icon-enabled="false"
 data-retailer-id="2248">

<a  class="product-card-image-link"
    href="some href"


            data-lead-popup
            data-lead-popup-url="/track/lead/21716944/?ctx=2383"


>
    <img class="product-card-image draggable"
         data-pin-no-hover="true"

            src="some src"
            data-height="250" data-width="200"
            height="250" width="200"

         alt="SCRAPE ME"                      # <<<<< here's the guy I'm after
         data-product_id="35492907"
    />

</a>
EOT

Search for the appropriate <img> tag and output the value of its 'alt' attribute:

page.css('img.product-card-image').each do |img|
  puts img['alt']
end
# >> SCRAPE ME

修改它以输出到CSV文件:

CSV.open("productResults.csv", "wb") do |csv|
  page.css('img.product-card-image').each do |img|
    csv << [img['alt']]
  end
end