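Structuring Nokogiri output without HTML tags

Time: 2014-07-07 16:30:07

Tags: ruby web-scraping nokogiri watir

I have Ruby go to a website, iterate over a list of campaigns, and scrape each page for specific data. The problem I'm having now is taking that data out of the structure Nokogiri gives me and outputting it in a readable form.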

require 'watir'    # older setups used require 'watir-webdriver'
require 'nokogiri'

campaign_list = [1042360, 1042386, 1042365, 992307]

browser = Watir::Browser.new :chrome
browser.goto '<redacted>'
browser.text_field(:id => 'email').set '<redacted>'
browser.text_field(:id => 'password').set '<redacted>'
browser.send_keys :enter

file = File.new('hourlysales.csv', 'w')
data = {}

campaign_list.each do |campaign|
  browser.goto "<redacted>"

  if browser.text.include? "Application Error"
    puts "Error loading page, I recommend restarting script"
    # Possibly automatic restart of script
  else
    hourly_data = Nokogiri::HTML.parse(browser.html).text   
    # file.write data
    puts hourly_data
  end
end

This is the output I get:

{"views":[[17,145],[18,165],[19,99],[20,71],[21,31],[22,26],[23,10],[0,15],[1,1],[2,18],[3,19],[4,35],[5,47],[6,44],[7,67],[8,179],[9,141],[10,112],[11,95],[12,46],[13,82],[14,79],[15,70],[16,103]],"orders":[[17,10],[18,9],[19,5],[20,1],[21,1],[22,0],[23,0],[0,1],[1,0],[2,1],[3,0],[4,1],[5,2],[6,1],[7,5],[8,11],[9,6],[10,5],[11,3],[12,1],[13,2],[14,4],[15,6],[16,7]],"conversion_rates":[0.06870229007633588,0.05442176870748299,0.050505050505050504,0.014084507042253521,0.03225806451612903,0.0,0.0,0.06666666666666667,0.0,0.05555555555555555,0.0,0.02857142857142857,0.0425531914893617,0.022727272727272728,0.07462686567164178,0.06134969325153374,0.0425531914893617,0.044642857142857144,0.031578947368421054,0.021739130434782608,0.024390243902439025,0.05063291139240506,0.08571428571428572,0.06741573033707865]}

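The arrays have the form { views: [[hour, # of views], [hour, # of views], etc.] }. The orders array uses the same layout. I don't need the conversion rates.

I also need to add up the values for each key, so that after doing this for these 5 pages I have one key for each hour of the day, along with the total number of views for that hour. I've tried several each loops but couldn't get anywhere.

I appreciate any help you can give me.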

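1 Answer:

Answer 0 (score: 1)

It looks like the output (which I assume is the content of hourly_data) is JSON. In that case it's easy to parse it and add up the numbers. Something like this: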

require "json" # at the top of your script
# ...

def sum_hours_values(data, hours_values=nil)
  # Start with an empty hash that automatically initializes missing keys to `0`
  hours_values ||= Hash.new {|hsh,hour| hsh[hour] = 0 }

  # Iterate through the [hour, value] arrays, adding `value` to the running
  # count for that `hour`, and return `hours_values`
  data.each_with_object(hours_values) do |(hour, value), hsh|
    hsh[hour] += value
  end
end

# ... Watir/Nokogiri stuff here...

# Initialize these so they persist outside the loop
hours_views = hours_orders = nil

campaign_list.each do |campaign|
  browser.goto "<redacted>"

  if browser.text.include? "Application Error"
    # ...
  else
    # ...

    hourly_data_parsed = JSON.parse(hourly_data)

    # Pass the running totals back in so they accumulate across campaigns
    hours_views = sum_hours_values(hourly_data_parsed["views"], hours_views)
    hours_orders = sum_hours_values(hourly_data_parsed["orders"], hours_orders)
  end
  end
end

puts "Views by hour:"
puts hours_views.sort.map {|hour_views| "%2i\t%4i" % hour_views }

puts "Orders by hour:"
puts hours_orders.sort.map {|hour_orders| "%2i\t%4i" % hour_orders }
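
To try the summing logic without a browser session, here is a minimal self-contained sketch. The two JSON strings are made-up sample payloads in the same shape as the output in the question, and `total_by_hour` and the CSV-style row layout are hypothetical names and choices for illustration only, not something from the original script.

```ruby
require "json"

# Same iterative idea as the helper above: add each [hour, value] pair
# into a running per-hour total and return the totals hash.
def sum_hours_values(data, hours_values = nil)
  hours_values ||= Hash.new(0)
  data.each_with_object(hours_values) do |(hour, value), totals|
    totals[hour] += value
  end
end

# Made-up sample payloads standing in for two campaign pages (assumption).
SAMPLE_PAGES = [
  '{"views":[[17,145],[18,165]],"orders":[[17,10],[18,9]],"conversion_rates":[0.07,0.05]}',
  '{"views":[[17,5],[19,99]],"orders":[[17,1],[19,5]],"conversion_rates":[0.2,0.05]}'
].freeze

# Hypothetical wrapper: parse every page and accumulate views and orders.
# The conversion_rates key is simply never read.
def total_by_hour(pages)
  views = orders = nil
  pages.each do |page|
    parsed = JSON.parse(page)
    views  = sum_hours_values(parsed["views"], views)
    orders = sum_hours_values(parsed["orders"], orders)
  end
  [views, orders]
end

hours_views, hours_orders = total_by_hour(SAMPLE_PAGES)

# One comma-separated row per hour: hour,views,orders
csv_rows = hours_views.keys.sort.map do |hour|
  [hour, hours_views[hour], hours_orders[hour]].join(",")
end
puts csv_rows
```

With these sample payloads, hour 17 appears on both pages, so its views and orders are added together, while hours 18 and 19 each come from a single page. The rows could be written to hourlysales.csv instead of printed if a file is what you want.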

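P.S. There's a really nice recursive version of sum_hours_values that I didn't include, because the iterative version is clearer to most Ruby programmers. If you're into recursion, I'll leave it to you as an exercise. ;)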
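
For anyone curious about the exercise, one possible recursive formulation is sketched below. It is only a guess at the shape such a version might take, not necessarily the one I had in mind, and `sum_hours_values_recursive` is a hypothetical name. Ruby does not guarantee tail-call optimization, so this is only sensible for short inputs such as the 24 hourly pairs here.

```ruby
# Recursive variant: take the first [hour, value] pair, add it to the
# running totals, then recurse on the remaining pairs.
def sum_hours_values_recursive(data, hours_values = nil)
  hours_values ||= Hash.new(0)
  return hours_values if data.empty?

  (hour, value), *rest = data
  hours_values[hour] += value
  sum_hours_values_recursive(rest, hours_values)
end

# Accumulate two small made-up pages, passing the totals back in.
views = sum_hours_values_recursive([[17, 145], [18, 165]])
views = sum_hours_values_recursive([[17, 5], [19, 99]], views)
puts views.sort.map { |hour_views| "%2i\t%4i" % hour_views }
```

It takes the same arguments as the iterative version, so it can be dropped into the loop in its place.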