Splitting a subarray in a Ruby Nokogiri web scraper

Date: 2016-08-17 16:48:53

Tags: arrays ruby web-scraping nokogiri

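Hello, I just finished the following tutorial (https://github.com/ryandhaase/Web-Scraper/blob/master/airbnb_scraper.rb and https://medium.com/@tabor_francesca/web-scraper-airbnb-24d67939b08a#.mg7ny2tke), and now I am practicing. I am having trouble splitting a subarray. Everything else works, but I cannot split the city, state, and zip code into separate Excel columns.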

The following line is incorrect. How can I fix it?

city << [subarray[0], "this is not working", subarray[1]]

My guess is that another line needs to be fixed as well.

require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'csv'


url = "https://www.tesla.com/findus/list/stores/United+States"

page = Nokogiri::HTML(open(url))

page = Nokogiri::HTML(open("https://www.tesla.com/findus/list/stores/United+States"))   
puts page.class   

name = []
street_address = []
extended_address = []
city = []
state = []
zip = []


page.css('a.fn.org.url').each do |line|
  name << line.text.strip
end

page.css('span.street-address').each do |line|
  street_address << line.text
end

page.css('span.extended-address').each do |line|
  extended_address << line.text
end

page.css('span.locality').each do |line|
  subarray = line.text.strip.split(/ · /)

  if subarray.length == 3
    city << subarray
  else
    city << [subarray[0], "this is not working", subarray[1]]
  end
end



CSV.open("teslaStores.csv", "w") do |file|
  file << ["Name", "Street Address", "Street Address Continued", "City", "State", "Zip"]

  name.length.times do |i|
    file << [name[i], street_address[i], extended_address[i], city[i], city[i][0], city[i][1]]
  end
end

2 Answers:

Answer 0 (score: 0)

Just as an FYI, this is untested, but it is more idiomatic Ruby:

require 'csv'
require 'nokogiri'
require 'open-uri'

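# URI.open is the open-uri entry point on current Ruby; a bare open() stopped accepting URLs in Ruby 3.0.
page = Nokogiri::HTML(URI.open('https://www.tesla.com/findus/list/stores/United+States'))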

name = page.css('a.fn.org.url').map{ |n| n.text.strip }
street_address = page.css('span.street-address').map { |n| n.text }
extended_address = page.css('span.extended-address').map{ |n| n.text }

city = page.css('span.locality').map { |n|
  subarray = n.text.strip.split(/ · /)

  if subarray.length == 3
    subarray
  else
    [subarray[0], 'this is not working', subarray[1]]
  end

}

CSV.open('teslaStores.csv', 'w') do |file|
  file << ['Name', 'Street Address', 'Street Address Continued', 'City', 'State', 'Zip']

  name.length.times do |i|
    file << [name[i], street_address[i], extended_address[i], city[i], city[i][0], city[i][1]]
  end
end

This can be reduced further:

street_address, extended_address = [
  'span.street-address',
  'span.extended-address'
].map{ |selector|
  page.css(selector).map { |n| n.text }
}
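
In the same spirit, the index-based loop that writes the CSV could be replaced by zipping the column arrays into rows. This is a minimal sketch rather than part of the original answer: the sample values are made up so that it runs without fetching the page, and it uses `CSV.generate` to build a string instead of writing `teslaStores.csv`.

```ruby
require 'csv'

# Hypothetical sample data standing in for the scraped columns.
name             = ['Store A', 'Store B']
street_address   = ['1 Main St', '2 Oak Ave']
extended_address = ['Suite 5', 'Floor 2']

# Array#zip pairs up the i-th element of every array, giving one row per store.
rows = name.zip(street_address, extended_address)

# CSV.generate yields a CSV writer and returns the finished text as a String.
csv_text = CSV.generate do |csv|
  csv << ['Name', 'Street Address', 'Street Address Continued']
  rows.each { |row| csv << row }
end

puts csv_text
```

If one of the arrays is shorter than `name`, `zip` fills the missing cells with `nil`, which the CSV writer emits as empty fields.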

Answer 1 (score: 0)

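So, I attended a meetup.com event about Python and asked one of the instructors whether he could help, even though the class was not on this topic :). The teacher explained that I needed to split on the comma and then on the space. Previously I had been splitting on the dot.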

I had to change this:

page.css('span.locality').each do |line|
  subarray = line.text.strip.split(/ · /)

  if subarray.length == 3
    city << subarray
  else
    city << [subarray[0], "this is not working", subarray[1]]
  end
end

to this:

page.css('span.locality').each do |line|
  subarray = line.text.strip.split(',')
  subarray2 = subarray[1].split(' ')

  city << subarray[0]
  state << subarray2[0]
  zip << subarray2[1]
end
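
The comma-then-space split can also be wrapped in a small helper. This is only a sketch: `split_locality` is a hypothetical name, and the sample strings are made-up values in the `City, ST 12345` shape rather than data taken from the page. Unlike calling `subarray[1].split(' ')` directly, it returns `nil` for missing parts instead of raising `NoMethodError` when a locality has no comma.

```ruby
# Hypothetical helper: split a locality string such as "City, ST 12345"
# into [city, state, zip]. Missing parts come back as nil.
def split_locality(text)
  # Split once on the first comma: city on the left, "ST 12345" on the right.
  city, rest = text.strip.split(',', 2)
  # Split the remainder on whitespace; nil.to_s is "" so this never raises.
  state, zip = rest.to_s.split(' ', 2)
  [city, state, zip]
end

p split_locality('Scottsdale, AZ 85251')
p split_locality('  Los Angeles, CA 90048  ')
p split_locality('Toronto')
```

Inside the scraping loop this could be used as `c, s, z = split_locality(line.text)` followed by appending each value to its own array.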

Here is the complete answer:

require 'rubygems'
require 'nokogiri'
require 'open-uri'
require 'csv'


url = "https://www.tesla.com/findus/list/stores/United+States"

# Fetch and parse the page once. URI.open is the open-uri entry point on
# current Ruby; a bare open() stopped accepting URLs in Ruby 3.0.
page = Nokogiri::HTML(URI.open(url))

name = []
street_address = []
extended_address = []
city = []
state = []
zip = []


page.css('a.fn.org.url').each do |line|
  name << line.text.strip
end

page.css('span.street-address').each do |line|
  street_address << line.text
end

page.css('span.extended-address').each do |line|
  extended_address << line.text
end

page.css('span.locality').each do |line|
  subarray = line.text.strip.split(',')
  subarray2 = subarray[1].split(' ')

  city << subarray[0]
  state << subarray2[0]
  zip << subarray2[1]
end


CSV.open("teslaStores.csv", "w") do |file|
  file << ["Name", "Street Address", "Street Address Continued", "City", "State", "Zip"]

  name.length.times do |i|
    file << [name[i], street_address[i], extended_address[i], city[i], state[i], zip[i]]
  end
end