Question

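I built a small web scraper in Ruby that pulls data from a website and stores it in a CSV file. The scraping and the storing both work, but I can't get the CSV file into a "table" layout with two columns and many rows. The file should have a name column and a price column, with one row per product holding that product's name and price. Here is my code: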

require 'open-uri'
require 'nokogiri'
require 'httparty'
require 'byebug'
require 'csv'

    def whey_scrapper
        company = 'Body+%26+fit'
        url = "https://www.bodyenfitshop.nl/eiwittenwhey/whey-proteine/?limit=81&manufacturer=#{company}"
        unparsed_page = open(url).read
        parsed_page = Nokogiri::HTML(unparsed_page)
        product_names = parsed_page.css('div.product-primary')
        name = Array.new
        product_names.each do |product_name| 
            name << product_name.css('h2.product-name').text
        end
        product_prices = parsed_page.css('div.price-box')
        price = Array.new
        product_prices.each do |product_price|
            price << product_price.css('span.price').text
        end
        headers = ["name", "price"]
        item = [name, price]
        CSV.open('data/wheyprotein.csv', 'w', :col_sep => "\t|", :headers => true) do |csv|
            csv << headers
            item.each {|row| csv << row }
        end
        byebug
    end   
    whey_scrapper

I write one row per iteration, but the CSV file still comes out messy and unstructured.

This is what my CSV file looks like:

name	|price
-----------------
"
                            
                                Whey Perfection                                Body & fit
                            
                        "	|"
                            
                                Whey Perfection® bestseller box                                Body & fit
                            
                        "	|"
                            
                                Whey Perfection - Special Series                                Body & fit
                            
                        "	|"
                            
                                Isolaat Perfection                                Body & fit
                            
                        "	|"
                            
                                Perfect Protein                                Body & fit
                            
                        "	|"
                            
                                Whey Isolaat XP                                Body & fit
                            
                        "	|"
                            
                                Micellar Casein Perfection                                Body & fit
                            
                        "	|"
                            
                                Low Calorie Meal                                Body & fit
                            
                        "	|"
                            
                                Whey Breakfast                                Body & fit
                            
                        "	|"
                            
                                Whey Perfection - Flavour Box                                 Body & fit
                            
                        "	|"
                            
                                Protein Breakfast                                Body & fit
                            
                        "	|"
                            
                                Whey Perfection Summer Box                                Body & fit
                            
                        "	|"
                            
                                Puur Whey                                Body & fit
                            
                        "	|"
                            
                                Whey Isolaat Crispy                                Body & fit
                            
                        "	|"
                            
                                Vegan Protein voordeel                                Body & fit vegan
                            
                        "	|"
                            
                                Whey Perfection Winter Box                                Body & fit
                            
                        "	|"
                            
                                Sports Breakfast                                Body & fit
                            
                        "
€ 7,90	|€ 9,90	|€ 11,90	|€ 17,90	|€ 31,90	|€ 18,90	|€ 12,90	|€ 6,90	|€ 6,90	|€ 10,90	|€ 15,90	|€ 9,90	|€ 26,90	|€ 6,90	|€ 24,90	|€ 9,90	|€ 20,90

Answer 1

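First, the product names. You are pulling too much out of the HTML. The h2 element contains whitespace and a span element, both of which should probably be ignored. You can do it like this: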

product_names.each do |product_name| 
  name << product_name.css('h2.product-name a').children[0].text.gsub(/\s{2,}/, '')
end
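
To see what that cleanup does without hitting the site, here is a minimal sketch on plain strings. The sample text and the helper name clean_name are made up for illustration. String#strip is shown alongside as a possible alternative, because gsub(/\s{2,}/, '') also deletes any run of two or more whitespace characters that happens to sit inside a name.

```ruby
# Hypothetical helper (not part of the original answer): removes every run
# of two or more whitespace characters, exactly as the gsub above does.
def clean_name(raw)
  raw.gsub(/\s{2,}/, '')
end

# Made-up sample resembling the text node that precedes the span.
raw = "\n        Whey Perfection        "

p clean_name(raw) # => "Whey Perfection"
p raw.strip       # => "Whey Perfection"

# Caveat: a double space inside the name is removed as well.
p clean_name("Whey  Perfection") # => "WheyPerfection"
p "Whey  Perfection".strip       # => "Whey  Perfection"
```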

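Next, CSV expects each row to be passed as an array of fields. Your item = [name, price] is only two arrays, so it writes two very long rows: all the names, then all the prices. What you need instead is many arrays of two items each (product name and price). To get that, you can simply zip the two arrays: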

items = name.zip(price)
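
As a small illustration with made-up data, the sketch below contrasts [name, price] with the zipped result. The helper pair_up and its length check are assumptions added here, not part of the answer. The check is there because zip pads with nil when the second array is shorter and silently drops extras when it is longer.

```ruby
# Hypothetical helper: pairs each name with the price at the same index.
def pair_up(names, prices)
  # zip would pad with nil (or drop extras) when the lengths differ,
  # so fail loudly instead of writing misaligned rows.
  unless names.length == prices.length
    raise ArgumentError, "#{names.length} names but #{prices.length} prices"
  end
  names.zip(prices)
end

name  = ["Whey Perfection", "Isolaat Perfection"]
price = ["€ 7,90", "€ 17,90"]

p [name, price]        # two rows: all names, then all prices
p pair_up(name, price) # one [name, price] row per product
```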

Then create the CSV file:

CSV.open('data/wheyprotein.csv', 'w') do |csv|
  csv << headers
  items.each {|row| csv << row }
end
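
To check the layout without touching the filesystem, the same block can be run against CSV.generate, which builds the CSV text in memory. This is only a sketch with made-up rows, and rows_to_csv is a hypothetical helper name.

```ruby
require 'csv'

# Hypothetical helper: renders a header row plus data rows as CSV text.
def rows_to_csv(headers, rows)
  CSV.generate do |csv|
    csv << headers
    rows.each { |row| csv << row }
  end
end

items = [["Whey Perfection", "€ 7,90"], ["Isolaat Perfection", "€ 17,90"]]
puts rows_to_csv(["name", "price"], items)
```

Because the prices contain a comma, the default comma separator makes CSV wrap them in double quotes. That is valid CSV and parsers undo it on read. If unquoted prices are preferred, passing a different col_sep (for example ";" or "\t") is an option.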

The complete method looks like this:

def whey_scrapper
    company = 'Body+%26+fit'
    url = "https://www.bodyenfitshop.nl/eiwittenwhey/whey-proteine/?limit=81&manufacturer=#{company}"
    unparsed_page = URI.open(url).read # Kernel#open no longer opens URLs as of Ruby 3.0; open-uri provides URI.open
    parsed_page = Nokogiri::HTML(unparsed_page)
    product_names = parsed_page.css('div.product-primary')
    name = Array.new
    product_names.each do |product_name| 
        name << product_name.css('h2.product-name a').children[0].text.gsub(/\s{2,}/, '')
    end
    product_prices = parsed_page.css('div.price-box')
    price = Array.new
    product_prices.each do |product_price|
        price << product_price.css('span.price').text
    end
    headers = ["name", "price"]
    items = name.zip(price)
    CSV.open('data/wheyprotein.csv', 'w+') do |csv|
        csv << headers
        items.each {|row| csv << row }
    end
end
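
One way to confirm that the output has the intended two-column shape is to parse it back with headers: true, which exposes each row by column name. The sketch below does this on an in-memory string with made-up rows rather than on data/wheyprotein.csv, and parse_products is a hypothetical helper name.

```ruby
require 'csv'

# Hypothetical helper: parses CSV text whose first line is the header row
# and returns one hash per product.
def parse_products(csv_text)
  CSV.parse(csv_text, headers: true).map(&:to_h)
end

# Made-up sample in the shape the method above is expected to write.
csv_text = <<~TEXT
  name,price
  Whey Perfection,"€ 7,90"
  Isolaat Perfection,"€ 17,90"
TEXT

parse_products(csv_text).each do |product|
  puts "#{product['name']}: #{product['price']}"
end
```

For the real file, CSV.read('data/wheyprotein.csv', headers: true) returns the same kind of table.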
