ruby / nokogiri scraping - export to multiple CSVs, then take a column from each CSV and combine them into a final CSV

Time: 2014-07-18 23:21:15

Tags: ruby-on-rails ruby web-scraping nokogiri export-to-csv

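Ruby n00b here. I am scraping the same page twice, in a slightly different way each time, and exporting the results as separate CSV files. I would like to combine the first column of CSV no. 1 with the second column of CSV no. 2 to create CSV no. 3.

The code that produces CSVs no. 1 and no. 2 works. But adding my attempt to combine the two CSVs into a third (commented out at the bottom) returns the error below: the first two CSVs populate fine, but the third stays blank and the script hangs in what looks like an infinite loop. I know those lines shouldn't be at the bottom, but I can't see where else they would go...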

alts.rb:45:in `block in <main>': undefined local variable or method `scrapedURLs1' for main:Object (NameError)
    from /Users/JammyStressford/.rvm/rubies/ruby-2.0.0-p451/lib/ruby/2.0.0/csv.rb:1266:in `open'
    from alts.rb:44:in `<main>'

The code itself:

require 'rubygems'
require 'nokogiri'   
require 'open-uri'
require 'csv'


url = "http://www.example.com/page"
page = Nokogiri::HTML(open(url))


CSV.open("results1.csv", "wb") do |csv|
  page.css('img.product-card-image').each do |scrape|
    product1 = scrape['alt']
    page.css('a.product-card-image-link').each do |scrape|
      link1 = scrape['href']

      scrapedProducts1 = "#{product1}"[0..-7]
      scrapedURLs1 = "{link1}"

      csv << [scrapedProducts1, scrapedURLs1]
    end
  end
end

CSV.open("Results2.csv", "wb") do |csv|
  page.css('a.product-card-image-link').each do |scrape|
    link2 = scrape['href']
    page.css('img.product-card-image').each do |scrape|
      product2 = scrape['alt']

      scrapedProducts2 = "#{product2}"[0..-7]
      scrapedURLs2 = "http://www.lyst.com#{link2}"

      csv << [scrapedURLs2, scrapedProducts2]
    end
  end
end

## Here is where I am trying to combine the two columns into a new CSV. ##
## It doesn't work. I suspect that this part should be further up...    ##

# CSV.open("productResults3.csv", "wb") do |csv|
  # csv << [scrapedURLs1, scrapedProducts2]
#end
puts "upload complete!"

Thanks for reading.

1 answer:

Answer 0: (score: 0)

Thanks for sharing your code and your question. I hope my input helps!

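  • Your scrapedURLs1 = "{link}" and scrapedProducts1 = "#{scrape['alt']}"[0..-7] have a 1 at the end of their names, but that is not what you call them in csv << [scrapedProducts, scrapedURLs]. That is the error you are getting.

  • I recommend combining the first two steps: skip writing to a file and collect the rows into an array of arrays instead, then write them to a file afterwards.

  • Do you realize that in the example code you provided, scrapedURLs1, scrapedProducts2 will pair the wrong URLs with the wrong products? Is that what you mean to do?

  • In the commented-out code, scrapedURLs1 and scrapedProducts2 don't exist; they haven't been declared at that point. You need to open both files for reading, iterate over one with .each do |scrapedURLs1| and over the other with .each do |scrapedProducts2|, and then those variables will exist, because the each enumerator creates them as block parameters.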
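The last point can be sketched roughly as follows. This is only a minimal illustration, not the asker's final code: combine_columns is a hypothetical helper name, the column indexes are parameters you would adjust, and the sample rows are made up. It assumes both files list the products in the same order, so that row N of one file belongs with row N of the other.

```ruby
require 'csv'
require 'tmpdir'

# Hypothetical helper: takes column col_a from the first file and column
# col_b from the second file and writes them side by side into out_path.
def combine_columns(path_a, col_a, path_b, col_b, out_path)
  rows_a = CSV.read(path_a) # array of arrays, one inner array per row
  rows_b = CSV.read(path_b)

  CSV.open(out_path, "wb") do |csv|
    # zip pairs row 1 with row 1, row 2 with row 2, and so on.
    # Extra rows in the second file are ignored; rows of the first file
    # without a partner (zip pads with nil) are skipped.
    rows_a.zip(rows_b).each do |row_a, row_b|
      next if row_b.nil?
      csv << [row_a[col_a], row_b[col_b]]
    end
  end
end

# Demo with made-up data written to a temporary directory.
Dir.mktmpdir do |dir|
  first  = File.join(dir, "results1.csv")
  second = File.join(dir, "Results2.csv")
  third  = File.join(dir, "productResults3.csv")

  CSV.open(first, "wb") do |csv|
    csv << ["Shirt", "/p/1"]
    csv << ["Shoes", "/p/2"]
  end
  CSV.open(second, "wb") do |csv|
    csv << ["http://www.example.com/p/1", "Shirt"]
    csv << ["http://www.example.com/p/2", "Shoes"]
  end

  # URL column (index 1) of the first file, product column (index 1) of the second.
  combine_columns(first, 1, second, 1, third)
  puts File.read(third)
end
```

Reading both files completely before writing keeps the third CSV.open block free of any dependency on variables that only existed inside the earlier scraping blocks.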

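It's not a good idea to reuse the same |scrape| variable in the inner iteration. Change the name to something else, such as |scrape2|. It "happens" to work because you have already grabbed what you need with product=scrape['alt'] before the second loop starts. If you rename the second loop variable, you can move the product=scrape['alt'] line into the inner loop and combine the two. For example: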

# Because the loops are nested, this writes one row for every
# product/link combination, so you may get many links per product.
# If that was your intent then that may be fine.
CSV.open("results1.csv", "wb") do |csv|
  page.css('img.product-card-image').each do |scrape|
    page.css('a.product-card-image-link').each do |scrape2|
      #      [      product       ,     link       ]
      csv << [scrape['alt'][0..-7], scrape2['href']]
      # NOTE that scrape['alt'][0..-7] and scrape2['href'] are already strings
      # so you don't need to use "#{ }"
    end
  end
end
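Note that nested loops pair every product with every link. If you want exactly one link per product, one possible approach is to pair the two lists by position with Array#zip and build an array of arrays before writing anything. The sketch below is an illustration under assumptions: the alts and hrefs arrays are made-up stand-ins for the scraped values, and it assumes the page lists images and links in the same order, one of each per product.

```ruby
require 'csv'

# Made-up stand-ins for the scraped values. With Nokogiri they would come
# from something like:
#   alts  = page.css('img.product-card-image').map { |img| img['alt'] }
#   hrefs = page.css('a.product-card-image-link').map { |a| a['href'] }
alts  = ["Blue Shirt (new)", "Red Shoes (new)"]
hrefs = ["/p/1", "/p/2"]

# zip pairs the first alt with the first href, the second with the second,
# and so on, giving one row per product instead of every combination.
rows = alts.zip(hrefs).map do |alt, href|
  [alt[0..-7], href] # [0..-7] drops the last six characters, as in the question
end

# Write the collected rows in one go. CSV.generate builds a string here;
# CSV.open("results1.csv", "wb") would write the same rows to a file.
csv_text = CSV.generate do |csv|
  rows.each { |row| csv << row }
end

puts csv_text
```

Because rows is an ordinary array defined outside any block, it stays available afterwards, so the same data can be written to several CSV files without scraping twice.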

Side note: Ruby 2.0.0 does not need the line require "rubygems".

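If you are working with CSV, I strongly recommend James Edward Gray II's faster_csv gem (since Ruby 1.9, the standard csv library is itself derived from FasterCSV). See usage examples here: https://github.com/JEG2/faster_csv/blob/master/examples/csv_writing.rb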