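This is my first attempt at parsing a web page with Nokogiri.
I am trying to extract addresses from a web page and store them in a CSV file. So far I have only been able to extract the City, State, and Zip fields.
I don't know how to extract the facility name, street address, phone numbers, and company information. An address may contain one or two street components.
For phones, there may be one or more numbers. A number can be a regular phone or a fax, but that is indicated only in the text, not in the tags. For the company, I'd like to extract both the URL and the name.
Each address on the page is contained in markup like this: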
<!-- address entry -->
<div id='1234' class='address'>
<div class='address_header'>
<h1 class='header_name'>
<strong><a href='{URL}'>Facility Name</a></strong>
</h1>
<h2 class='header_city'>
New York
</h2>
</div>
<div class='address_details'>
<div class='info'>
<p class='address'>
<span class='street'>123 ABC St</span><br />
<span class='street'>Unit 1</span><br />
<span class='city'>New York</span>,
<span class='state'>NY</span>
<span class='zip'>10022</span>
</p>
<p class='phone'>
Phone: <span class='tel'>999.999.9999</span>
</p>
<p class='phone'>
Fax: <span class='tel'>888.888.8888</span>
</p>
<p class='company'>
Company: <a href='{URL}'>Company Name</a>
</p>
</div>
</div>
</div>
<!-- address entry -->
<!-- address entry -->
<div id='4567' class='address'>
<div class='address_header'>
<h1 class='header_name'>
<strong><a href='{URL}'>Facility Name</a></strong>
</h1>
<h2 class='header_city'>
New York
</h2>
</div>
<div class='address_details'>
<div class='info'>
<p class='address'>
<span class='street'>456 DEF Rd</span><br />
<span class='city'>New York</span>,
<span class='state'>NY</span>
<span class='zip'>10022</span>
</p>
<p class='phone'>
Phone: <span class='tel'>555.555.5555</span>
</p>
<p class='company'>
Company: <a href='{URL}'>Company Name</a>
</p>
</div>
</div>
</div>
<!-- address entry -->
Here is my very basic setup.
require 'nokogiri'
require 'open-uri'
require 'csv'
# On Ruby 3.0+, open-uri needs URI.open; Kernel#open no longer accepts URLs.
doc = Nokogiri::HTML(URI.open('[URL]'))
Cities = Array.new
States = Array.new
Zips = Array.new
doc.css("p[class='address']").css("span[class='city']").each do |city|
  Cities << city.content
end
doc.css("p[class='address']").css("span[class='state']").each do |state|
  States << state.content
end
doc.css("p[class='address']").css("span[class='zip']").each do |zip|
  Zips << zip.content
end
CSV.open("myCSV.csv", "wb") do |row|
  row << ["City", "State", "Zip"]
  (0..Cities.length - 1).each do |index|
    row << [Cities[index], States[index], Zips[index]]
  end
end
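Storing the information in separate arrays like this seems very clunky. What I'd really like is to create one row in the CSV for each address node in the source document, and then populate its fields (when they exist):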
Facility St_1 St_2 City State Zip Phone Fax URL Company
======== ===== ===== ===== ====== ==== ====== ==== ==== ============
xxxxxxxx xxxx xxxx xxxxx xxxx xxxxx xxxx xxxxxxxx
xxxxxxxx xxxx xxxxx xxxx xxxxx xxxx xxxxx xxxx xxxx xxxxxxxx
Can anyone help me?
Answer 0: (score: 0)
You're asking for a lot, but I'll get you started:
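# Assumes `doc` is the parsed Nokogiri document and `csv` is an open CSV
# writer (for example, the block argument of CSV.open).
fields = %w{street1 street2 phone fax city state zip}
doc.search('div.address').each do |div|
  address = {}
  # Multiple assignment leaves street2 (or fax) nil when only one node exists.
  address['street1'], address['street2'] = *div.search('span.street').map(&:text)
  # Positional: this assumes the phone number always precedes the fax number.
  address['phone'], address['fax'] = *div.search('span.tel').map(&:text)
  ['city', 'state', 'zip'].each{|f| address[f] = div.at("span.#{f}").text}
  csv << fields.map{|f| address[f]}
end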
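Answer 1: (score: 0)
There may be some edge cases this doesn't handle, but it takes care of your example. You'll need to change doc to read from the real page rather than from the DATA segment, and you'll need to change the csv to write to a file rather than displaying it inline as I did.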
require 'nokogiri'
require 'open-uri'
require 'csv'
doc = Nokogiri::HTML(DATA.read)
CompanyInfo = Struct.new :facility, :street1, :street2, :city, :state, :zip, :phone, :fax, :url, :company
company_infos = []
doc.css("div.address").each do |address_div|
  facility = address_div.at_css('.address_header .header_name').text.strip
  info = address_div.css('div.address_details .info')
  # street2 and fax stay nil when the entry has only one street line or number.
  street1, street2 = info.css('.street').map(&:text)
  city = info.at_css('.city').text
  state = info.at_css('.state').text
  zip = info.at_css('.zip').text
  phone, fax = info.css('.phone .tel').map(&:text)
  url = info.at_css('.company a')['href']
  company = info.at_css('.company a').text
  company_infos << CompanyInfo.new(facility, street1, street2, city, state, zip, phone, fax, url, company)
end
# The block variable is named `out` so it does not shadow the outer `csv`.
csv = CSV.generate do |out|
  out << %w[Facility Street1 Street2 City State Zip Phone Fax URL Company]
  company_infos.each do |company_info|
    out << company_info.to_a
  end
end
csv # => "Facility,Street1,Street2,City,State,Zip,Phone,Fax,URL,Company\nFacility Name,123 ABC St,Unit 1,New York,NY,10022,999.999.9999,888.888.8888,{URL},Company Name\n"
__END__
<!-- address entry -->
<div id='1234' class='address'>
<div class='address_header'>
<h1 class='header_name'>
<strong><a href='{URL}'>Facility Name</a></strong>
</h1>
<h2 class='header_city'>
New York
</h2>
</div>
<div class='address_details'>
<div class='info'>
<p class='address'>
<span class='street'>123 ABC St</span><br />
<span class='street'>Unit 1</span><br />
<span class='city'>New York</span>,
<span class='state'>NY</span>
<span class='zip'>10022</span>
</p>
<p class='phone'>
Phone: <span class='tel'>999.999.9999</span>
</p>
<p class='phone'>
Fax: <span class='tel'>888.888.8888</span>
</p>
<p class='company'>
Company: <a href='{URL}'>Company Name</a>
</p>
</div>
</div>
</div>