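I built a web scraper that successfully pulls almost everything I need from the page I'm looking at. The goal is to extract the URL of one specific image for each coffee found at a particular URL.
The rake task I defined to do the scraping is as follows: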
mechanize = Mechanize.new
mechanize.get(url) do |page|
  page.links_with(:href => /products/).each do |link|
    coffee_page = link.click
    bean = Bean.new
    bean.acidity = coffee_page.css('[data-id="acidity"]').text.strip.gsub("acidity ","")
    bean.elevation = coffee_page.css('[data-id="elevation"]').text.strip.gsub("elevation ","")
    bean.roaster_id = "2"
    bean.harvest_season = coffee_page.css('[data-id="harvest"]').text.strip.gsub("harvest ","")
    bean.price = coffee_page.css('.price-wrap').text.gsub("$","")
    bean.roast_profile = coffee_page.css('[data-id="roast"]').text.strip.gsub("roast ","")
    bean.processing_type = coffee_page.css('[data-id="process"]').text.strip.gsub("process ","")
    bean.cultivar = coffee_page.css('[data-id="cultivar"]').text.strip.gsub("cultivar ","")
    bean.flavor_profiles = coffee_page.css('.price-wrap+ p').text.strip
    bean.country_of_origin = coffee_page.css('#pdp-order h1').text.strip
    bean.image_url = coffee_page.css('img data-featured-product-image').attr('src')
    if bean.country_of_origin == "Origin Set" || bean.country_of_origin == "Gift Card (online use only)"
      bean.destroy
    else
      ap bean
    end
  end
end
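All the information I need is on the page. What I'm after is the image URL like the one shown below, but for every individual coffee_page linked from the source page. The selector has to be generic enough to pull this image source, but nothing else. I've tried a number of different CSS selectors, but every one of them returns nil or an empty string.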
<img src="//cdn.shopify.com/s/files/1/2220/0129/products/ceremony-product-gummy-bears_480x480.jpg?v=1551455589" alt="Burundi Kiryama" data-product-featured-image style="display:none">
The coffee_page I'm working with is here: https://shop.ceremonycoffee.com/products/burundi-kiryama
Answer 0 (score: 0)
You need to change
bean.image_url = coffee_page.css('img data-featured-product-image').attr('src')
to
bean.image_url = coffee_page.css('#mobile-only>img').attr('src')
Whenever possible, use a nearby identifier to target the element you want to access.