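I'm trying to scrape the "alt" attributes from a few hundred images on a web page and then output them to a CSV file. This is essentially the whole HTML block I'm trying to scrape: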
<div class="product-card"
id="product-35492907"
data-element="product-card"
data-owner="some-data-owner"
data-product-slug="some-data-product-slug"
data-product_id="35492907"
data-stock-status="available"
data-icon-enabled="false"
data-retailer-id="2248">
<a class="product-card-image-link"
href="some href"
data-lead-popup
data-lead-popup-url="/track/lead/21716944/?ctx=2383"
>
<img class="product-card-image draggable"
data-pin-no-hover="true"
src="some src"
data-height="250" data-width="200"
height="250" width="200"
alt="SCRAPE ME" # <<<<< here's the guy I'm after
data-product_id="35492907"
/>
</a>
Here is the code I'm using to scrape the elements:
require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'csv'
url = "http://www.example.com/page"
page = Nokogiri::HTML(open(url))
CSV.open("productResults.csv", "wb") do |csv|
  page.css('.product-card-image draggable').each do |scrape| #???
    alt_name = scrape.at_css('alt').text #???
    scrapedProducts = "#{alt_name}"
    csv << [scrapedProducts]
  end
end
Answer 0 (score: 0)
Start simple, and add complexity only when it becomes necessary:
require 'nokogiri'
require 'csv'
page = Nokogiri::HTML::DocumentFragment.parse(<<EOT)
<div class="product-card"
id="product-35492907"
data-element="product-card"
data-owner="some-data-owner"
data-product-slug="some-data-product-slug"
data-product_id="35492907"
data-stock-status="available"
data-icon-enabled="false"
data-retailer-id="2248">
<a class="product-card-image-link"
href="some href"
data-lead-popup
data-lead-popup-url="/track/lead/21716944/?ctx=2383"
>
<img class="product-card-image draggable"
data-pin-no-hover="true"
src="some src"
data-height="250" data-width="200"
height="250" width="200"
alt="SCRAPE ME" # <<<<< here's the guy I'm after
data-product_id="35492907"
/>
</a>
EOT
Search for the appropriate <img> tags and output the value of each one's 'alt' attribute:
page.css('img.product-card-image').each do |img|
  puts img['alt']
end
# >> SCRAPE ME
Modify that to output to a CSV file:
CSV.open("productResults.csv", "wb") do |csv|
  page.css('img.product-card-image').each do |img|
    csv << [img['alt']]
  end
end