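This is the page in question: http://phoenix.craigslist.org/cpg/
What I want to do is create an array that looks like this:
Date (captured by the h4 tags on that page) => in cell [0][0][0],
Link text => in cell [0][1][0],
Link href => in cell [0][1][1]
That is, each row stores these items for one posting.
All I have done so far is pull in all the h4 tags and store them in a hash like this: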
contents2[link[:date]] = content_page.css("h4").text
The problem is that this stores the text of every h4 tag on the page in a single cell, whereas I want one date per cell.
For example:
0 => Mon May 28 - Leads need follow up - (Phoenix) - http://phoenix.craigslist.org/wvl/cpg/3043296202.html
1 => Mon May 28 - .Net/Java Developers - (phoenix) - http://phoenix.craigslist.org/cph/cpg/3043067349.html
Any ideas on how to approach this would be greatly appreciated.
Answer 0 (score: 3)
How about this?
require 'open-uri'
require 'nokogiri'

# URI.open is needed on Ruby 3.0+, where Kernel#open no longer accepts URLs
doc = Nokogiri::HTML(URI.open("http://phoenix.craigslist.org/cpg/"))

# Postings start inside the second blockquote on the page
bq = doc.xpath('//blockquote')[1]

date  = nil  # Date of the postings currently being read
posts = []   # All postings collected so far

# Loop through all blockquote children, collecting data as we go along
bq.children.each do |nod|
  # The date is stored in the h4 nodes. Grab it from there.
  date = nod.text.strip if nod.name == "h4"

  # Skip nodes until we have a date
  next unless date

  # Skip nodes that are not p blocks. The p blocks contain the postings.
  next unless nod.name == "p"

  # We have a p block. Skip it if it holds no link; otherwise extract the data.
  anchor = nod.at_css('a')
  next unless anchor

  # Add the new posting to the array
  posts << [date, nod.text.strip, anchor['href']]
end

# Output everything we just collected
posts.each { |p| puts p.join(" - ") }
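The flat [date, text, link] rows collected above can be reshaped into the nested layout described in the question. What follows is only a minimal sketch: `sample_posts` stands in for `posts` and is filled with the two example rows from the question, and `nest_posting` is a hypothetical helper name, not something from the original code.

```ruby
# Sample rows in the [date, text, link] shape that `posts` uses above.
# The values are copied from the example output in the question.
sample_posts = [
  ["Mon May 28", "Leads need follow up - (Phoenix)",
   "http://phoenix.craigslist.org/wvl/cpg/3043296202.html"],
  ["Mon May 28", ".Net/Java Developers - (phoenix)",
   "http://phoenix.craigslist.org/cph/cpg/3043067349.html"]
]

# Hypothetical helper: turn one flat row into [[date], [text, href]],
# so that row[0][0] is the date, row[1][0] the link text, row[1][1] the href.
def nest_posting(date, text, link)
  [[date], [text, link]]
end

nested = sample_posts.map { |date, text, link| nest_posting(date, text, link) }

puts nested[0][0][0]  # date of the first posting
puts nested[0][1][0]  # link text of the first posting
puts nested[0][1][1]  # href of the first posting
```

Whether the extra nesting is worth it depends on how the data is consumed; an array of hashes with :date, :text and :href keys would probably be easier to read.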
Answer 1 (score: 2)
There are other ways, but a traversal is probably the simplest:
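date = nil  # Most recent h4 date seen so far
doc.traverse do |node|
  date = node.text if node.name == 'h4'
  next unless date                          # Ignore everything before the first date
  break if node.text['next 100 postings']   # Stop at the pagination link
  puts [date, node.parent.text, node[:href]].join(' - ') if node.name == 'a'
end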