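I built a small web-scraping application in Ruby that pulls data from a website and stores it in a CSV file. The scraping and the storing both work, but I can't get the CSV file into a "table" layout with two columns and many rows. The file should have a name column and a price column, with one row per product holding that product's name and price. Here is my code: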
    require 'open-uri'
    require 'nokogiri'
    require 'httparty'
    require 'byebug'
    require 'csv'

    def whey_scrapper
      company = 'Body+%26+fit'
      url = "https://www.bodyenfitshop.nl/eiwittenwhey/whey-proteine/?limit=81&manufacturer=#{company}"
      unparsed_page = open(url).read
      parsed_page = Nokogiri::HTML(unparsed_page)
      product_names = parsed_page.css('div.product-primary')
      name = Array.new
      product_names.each do |product_name|
        name << product_name.css('h2.product-name').text
      end
      product_prices = parsed_page.css('div.price-box')
      price = Array.new
      product_prices.each do |product_price|
        price << product_price.css('span.price').text
      end
      headers = ["name", "price"]
      item = [name, price]
      CSV.open('data/wheyprotein.csv', 'w', :col_sep => "\t|", :headers => true) do |csv|
        csv << headers
        item.each {|row| csv << row }
      end
      byebug
    end

    whey_scrapper
I add a row on every iteration, but the CSV file still comes out messy and unstructured.
This is what my CSV file looks like:
name |price
-----------------
"
Whey Perfection Body & fit
" |"
Whey Perfection® bestseller box Body & fit
" |"
Whey Perfection - Special Series Body & fit
" |"
Isolaat Perfection Body & fit
" |"
Perfect Protein Body & fit
" |"
Whey Isolaat XP Body & fit
" |"
Micellar Casein Perfection Body & fit
" |"
Low Calorie Meal Body & fit
" |"
Whey Breakfast Body & fit
" |"
Whey Perfection - Flavour Box Body & fit
" |"
Protein Breakfast Body & fit
" |"
Whey Perfection Summer Box Body & fit
" |"
Puur Whey Body & fit
" |"
Whey Isolaat Crispy Body & fit
" |"
Vegan Protein voordeel Body & fit vegan
" |"
Whey Perfection Winter Box Body & fit
" |"
Sports Breakfast Body & fit
"
€ 7,90 |€ 9,90 |€ 11,90 |€ 17,90 |€ 31,90 |€ 18,90 |€ 12,90 |€ 6,90 |€ 6,90 |€ 10,90 |€ 15,90 |€ 9,90 |€ 26,90 |€ 6,90 |€ 24,90 |€ 9,90 |€ 20,90
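Answer 0 (score: 1)
First, the product names. You are pulling too much out of the HTML. The h2 element contains whitespace and a span element, both of which should probably be ignored. You can do that like this: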
    product_names.each do |product_name|
      name << product_name.css('h2.product-name a').children[0].text.gsub(/\s{2,}/, '')
    end
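To see what the `gsub` call does to the scraped text, here is a minimal sketch using only plain strings. The sample text is an assumption about what the link's first text node looks like, and `clean_name` is a hypothetical helper, not part of the original code.

```ruby
# Sample text (an assumption) shaped like a name surrounded by markup indentation.
raw_name = "\n        Whey Perfection - Special Series\n        "

# The approach shown above: delete every run of two or more whitespace characters.
collapsed = raw_name.gsub(/\s{2,}/, '')

# Hypothetical helper: trim both ends, then squeeze inner whitespace runs to one space.
def clean_name(text)
  text.strip.gsub(/\s+/, ' ')
end

puts collapsed.inspect
puts clean_name(raw_name).inspect

# A lone trailing newline is a run of only one character, so the gsub leaves it in place.
puts "Puur Whey\n".gsub(/\s{2,}/, '').inspect
puts clean_name("Puur Whey\n").inspect
```

If the page ever delivers a name followed by a single newline, the `strip`-based variant is the safer choice, since the regex only removes runs of two or more whitespace characters.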
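Next, CSV expects each row to be passed as an array of fields. Your `item = [name, price]` is an array holding just two arrays, so the file gets only two rows: one with every name and one with every price. What you need instead is many two-element arrays, each holding one product name and its price. To build them, you can simply zip the two arrays: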
    items = name.zip(price)
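A small sketch of the difference, using two sample products rather than the scraped data. The pairing of names and prices here is only illustrative.

```ruby
# Sample values for illustration; the name/price pairing is an assumption.
name  = ["Whey Perfection", "Puur Whey"]
price = ["€ 7,90", "€ 26,90"]

# What the question's code builds: two rows, each holding a whole column.
columns = [name, price]

# What CSV needs: one [name, price] pair per product.
rows = name.zip(price)

puts columns.inspect
puts rows.inspect

# zip pads with nil when the second array is shorter, so a length check
# catches products that came back without a price.
warn "name/price counts differ" unless name.length == price.length
```

Because `zip` fills missing entries with `nil` instead of raising, comparing the two lengths first is a cheap way to notice when the two selectors matched a different number of elements.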
Then create the CSV file:
    CSV.open('data/wheyprotein.csv', 'w') do |csv|
      csv << headers
      items.each {|row| csv << row }
    end
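To preview what ends up in the file without touching the disk, the same rows can be rendered with `CSV.generate`. This is a sketch with sample rows. Note that the prices contain a comma, so with the default separator the CSV library wraps them in quotes; passing `col_sep: ';'` is one optional way to avoid that.

```ruby
require 'csv'

headers = ["name", "price"]
# Sample rows for illustration only.
items = [["Whey Perfection", "€ 7,90"], ["Puur Whey", "€ 26,90"]]

# Default comma separator: fields containing a comma are quoted automatically.
comma_csv = CSV.generate do |csv|
  csv << headers
  items.each { |row| csv << row }
end

# Semicolon separator: the decimal comma no longer collides with the delimiter.
semicolon_csv = CSV.generate(col_sep: ';') do |csv|
  csv << headers
  items.each { |row| csv << row }
end

puts comma_csv
puts semicolon_csv
```

Either form parses back into two columns. The quoting in the comma version is valid CSV, not a sign that something went wrong.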
The complete method looks like this:
    def whey_scrapper
      company = 'Body+%26+fit'
      url = "https://www.bodyenfitshop.nl/eiwittenwhey/whey-proteine/?limit=81&manufacturer=#{company}"
      # URI.open comes from open-uri; a bare open(url) no longer fetches URLs on Ruby 3.0+
      unparsed_page = URI.open(url).read
      parsed_page = Nokogiri::HTML(unparsed_page)

      # Take only the link's first text node, skipping the span inside the h2
      product_names = parsed_page.css('div.product-primary')
      name = Array.new
      product_names.each do |product_name|
        name << product_name.css('h2.product-name a').children[0].text.gsub(/\s{2,}/, '')
      end

      product_prices = parsed_page.css('div.price-box')
      price = Array.new
      product_prices.each do |product_price|
        price << product_price.css('span.price').text
      end

      headers = ["name", "price"]
      # Pair each name with its price so every CSV row has two fields
      items = name.zip(price)
      CSV.open('data/wheyprotein.csv', 'w') do |csv|
        csv << headers
        items.each {|row| csv << row }
      end
    end
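One way to confirm the file really has the two-column layout is to read it back with `headers: true`. The sketch below writes sample rows to a temporary directory rather than scraping the site. `write_products_csv` is a hypothetical helper, and the `FileUtils.mkdir_p` call is an added assumption to cover the case where the `data/` directory does not exist yet.

```ruby
require 'csv'
require 'fileutils'
require 'tmpdir'

# Hypothetical helper, not part of the answer: writes a header line plus rows.
def write_products_csv(path, rows, headers = ["name", "price"])
  FileUtils.mkdir_p(File.dirname(path)) # create the target directory if it is missing
  CSV.open(path, 'w:UTF-8') do |csv|
    csv << headers
    rows.each { |row| csv << row }
  end
end

# Sample rows for illustration only.
sample_rows = [["Whey Perfection", "€ 7,90"], ["Puur Whey", "€ 26,90"]]

# Write into a throwaway directory and parse the file back into a CSV::Table.
table = Dir.mktmpdir do |dir|
  path = File.join(dir, 'data', 'wheyprotein.csv')
  write_products_csv(path, sample_rows)
  CSV.read(path, headers: true, encoding: 'UTF-8')
end

table.each { |row| puts "#{row['name']}: #{row['price']}" }
```

If each printed line shows one name next to one price, the rows were written in the intended shape.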