Extracting optional address components with Nokogiri

Time: 2012-09-29 11:28:40

Tags: ruby nokogiri

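This is my first attempt at parsing a web page with Nokogiri.

I am trying to extract addresses from a web page and store them in a CSV file. So far I have only managed to extract the City, State and Zip fields.

I don't know how to extract the facility name, street address, phone numbers, and company information. The address may contain one or two street components.

For the phone, there may be one or more numbers. A number can be a regular phone number or a fax number, but the two are distinguished only by the text label, not by the tag. For the company, I'd like to be able to extract both the URL and the name.

Each address on the page is marked up as follows: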

  <!-- address entry -->

  <div id='1234' class='address'> 

    <div class='address_header'> 
      <h1 class='header_name'>
        <strong><a href='{URL}'>Facility Name</a></strong>
      </h1>
      <h2 class='header_city'>
        New York
      </h2>
    </div> 

    <div class='address_details'> 
      <div class='info'> 
        <p class='address'>
      <span class='street'>123 ABC St</span><br />
      <span class='street'>Unit 1</span><br />
      <span class='city'>New York</span>, 
          <span class='state'>NY</span> 
          <span class='zip'>10022</span>
        </p>
        <p class='phone'>
          Phone: <span class='tel'>999.999.9999</span>
        </p>
        <p class='phone'>
          Fax: <span class='tel'>888.888.8888</span>
        </p>
        <p class='company'>
          Company: <a href='{URL}'>Company Name</a>
        </p>
      </div>  
    </div> 
  </div>  
  <!-- address entry -->

  <!-- address entry -->

  <div id='4567' class='address'> 

    <div class='address_header'> 
      <h1 class='header_name'>
        <strong><a href='{URL}'>Facility Name</a></strong>
      </h1>
      <h2 class='header_city'>
        New York
      </h2>
    </div> 

    <div class='address_details'> 
      <div class='info'> 
        <p class='address'>
      <span class='street'>456 DEF Rd</span><br />
      <span class='city'>New York</span>, 
          <span class='state'>NY</span> 
          <span class='zip'>10022</span>
        </p>
        <p class='phone'>
          Phone: <span class='tel'>555.555.5555</span>
        </p>
        <p class='company'>
          Company: <a href='{URL}'>Company Name</a>
        </p>
      </div>  
    </div> 
  </div>  
  <!-- address entry -->

Here is my very basic setup.

require 'nokogiri'
require 'open-uri'
require 'csv'

doc = Nokogiri::HTML(URI.open('[URL]')) # open-uri's URI.open; bare open() no longer fetches URLs on Ruby 3.0+

Cities = Array.new
States = Array.new
Zips = Array.new

doc.css("p[class='address']").css("span[class='city']").each do |city|
  Cities << city.content
end

doc.css("p[class='address']").css("span[class='state']").each do |state|
  States << state.content
end

doc.css("p[class='address']").css("span[class='zip']").each do |zip|
  Zips << zip.content
end

CSV.open("myCSV.csv", "wb") do |row|
  row << ["City", "State", "Zip"]
  (0..Cities.length - 1).each do |index|
    row << [Cities[index], States[index], Zips[index]]
  end
end

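Storing the information in separate arrays like this seems very clunky. What I'd really like is to create one row in the CSV for each address node in the source document, and then populate it with whichever fields exist: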

Facility  St_1  St_2  City  State  Zip  Phone  Fax  URL  Company
========  ===== ===== ===== ====== ==== ====== ==== ==== ============
xxxxxxxx  xxxx        xxxx  xxxxx  xxxx xxxxx       xxxx xxxxxxxx
xxxxxxxx  xxxx  xxxxx xxxx  xxxxx  xxxx xxxxx  xxxx xxxx xxxxxxxx

Can anyone help me?

2 Answers:

Answer 0: (score: 0)

You're asking for a lot, but I'll get you started:

# Assumes `doc` is the parsed Nokogiri document and `csv` is an open CSV writer.
fields = %w{street1 street2 phone fax city state zip}
doc.search('div.address').each do |div|
  address = {}
  # Missing trailing values (no second street line, no fax) simply become nil.
  address['street1'], address['street2'] = *div.search('span.street').map(&:text)
  address['phone'], address['fax'] = *div.search('span.tel').map(&:text)
  ['city', 'state', 'zip'].each{|f| address[f] = div.at("span.#{f}").text}
  csv << fields.map{|f| address[f]}
end

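Answer 1: (score: 0)

There may be edge cases this doesn't handle, but it takes care of your example. You'll need to change doc to read from the real page rather than from the DATA section, and change the csv code to write to a file instead of generating the string inline as I do here.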

require 'nokogiri'
require 'open-uri'
require 'csv'

doc = Nokogiri::HTML(DATA.read)

CompanyInfo   = Struct.new :facility, :street1, :street2, :city, :state, :zip, :phone, :fax, :url, :company
company_infos = []

doc.css("div.address").each do |address_div|
  facility         = address_div.at_css('.address_header .header_name').text.strip
  info             = address_div.css('div.address_details .info')
  street1, street2 = info.css('.street').map(&:text)
  city             = info.at_css('.city').text
  state            = info.at_css('.state').text
  zip              = info.at_css('.zip').text
  phone, fax       = info.css('.phone .tel').map(&:text)
  url              = info.at_css('.company a')['href']
  company          = info.at_css('.company a').text

  company_infos << CompanyInfo.new(facility, street1, street2, city, state, zip, phone, fax, url, company)
end

csv = CSV.generate do |csv|
  csv << %w[Facility Street1 Street2 City State Zip Phone Fax URL Company]
  company_infos.each do |company_info|
    csv << company_info.to_a
  end
end

csv # => "Facility,Street1,Street2,City,State,Zip,Phone,Fax,URL,Company\nFacility Name,123 ABC St,Unit 1,New York,NY,10022,999.999.9999,888.888.8888,{URL},Company Name\n"


__END__
<!-- address entry -->

<div id='1234' class='address'> 

  <div class='address_header'> 
    <h1 class='header_name'>
      <strong><a href='{URL}'>Facility Name</a></strong>
    </h1>
    <h2 class='header_city'>
      New York
    </h2>
  </div> 

  <div class='address_details'> 
    <div class='info'> 
      <p class='address'>
        <span class='street'>123 ABC St</span><br />
        <span class='street'>Unit 1</span><br />
        <span class='city'>New York</span>, 
        <span class='state'>NY</span> 
        <span class='zip'>10022</span>
      </p>
      <p class='phone'>
        Phone: <span class='tel'>999.999.9999</span>
      </p>
      <p class='phone'>
        Fax: <span class='tel'>888.888.8888</span>
      </p>
      <p class='company'>
        Company: <a href='{URL}'>Company Name</a>
      </p>
    </div>  
  </div> 
</div>