I am trying to scrape the following website, because its XML is malformed and does not contain all the data I need:
http://www.cafebonappetit.com/menu/your-cafe/pitzer
However, when I fetch the document with Mechanize, all I get is:
{meta_refresh}
{title "Collins | Claremont McKenna Cafés | Café Bon Appétit"}
{iframes}
{frames}
{links
#<Mechanize::Page::Link "Welcome" "http://www.cafebonappetit.com/">
#<Mechanize::Page::Link "Our Approach" "javascript://">
#<Mechanize::Page::Link
"Kitchen Principles"
"http://www.cafebonappetit.com/our-approach/kitchen-principles">
.....
}
Unfortunately, what I actually need is the content of the tables (I am guessing they are iframes). Any ideas?
Thanks!
Answer 0 (score: 3)
Here is a simple Mechanize + Nokogiri script that scrapes the menu items.
require 'rubygems'
require 'mechanize'
require 'pp'

agent = Mechanize.new
url = "http://www.cafebonappetit.com/menu/your-cafe/pitzer"
page = agent.get(url)

# Grab each daily menu (one table per day)
page.search('div#menu-items > table.my-day-menu-table').each do |menu|
  # The date is the link text in the div immediately preceding the table
  day = menu.xpath('preceding-sibling::div[1]/a').text.strip
  puts day
  fare = []
  # Collect the menu items, joining the <strong> cells of each row
  menu.xpath('tr').each do |item|
    fare << item.xpath('td/strong').map(&:text).join(": ")
  end
  pp fare
end
Result (excerpt):
Sunday, May 6th, 2012
["Brunch",
"chef's table: custom omelet bar",
"main plate: chicken sanchez",
"meatless chicken and sauce",
"options: banana pancakes",
"stocks: beed barley",
"vegetable minestrone",
"Lunch",
"main plate: steamed broccoli",
"Dinner",
"chef's table: pasta bar",
"farm to fork: sauteed rainbow chard",
"options: mozzarella sticks",
"ovens: pizza bar",
"main plate: roasted herb chicken",
"baked ziti pasta",
"steamed carrots and parsnips"]
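
The array above mixes meal headers ("Brunch", "Lunch", "Dinner") with the dishes that follow them. If you would rather have the dishes grouped per meal, something along these lines might work. It is only a sketch: `group_by_meal` and the `MEALS` list are hypothetical names that are not part of the script above, and the list assumes the only headers are the three visible in the excerpt, so other days may need more entries.

```ruby
# Hypothetical helper, not part of the original script.
# Assumption: these are the only meal headers that appear in a fare array.
MEALS = ["Brunch", "Lunch", "Dinner"].freeze

# Turn a flat fare array into { "Brunch" => [...], "Lunch" => [...], ... }.
def group_by_meal(fare, meals = MEALS)
  grouped = Hash.new { |hash, key| hash[key] = [] }
  current = "Unlabeled" # bucket for dishes listed before any header
  fare.each do |entry|
    if meals.include?(entry)
      current = entry
      grouped[current] # create the key so a meal with no dishes still shows up
    else
      grouped[current] << entry
    end
  end
  grouped
end

# A shortened copy of the excerpt above, used as sample input.
fare = ["Brunch",
        "chef's table: custom omelet bar",
        "main plate: chicken sanchez",
        "Lunch",
        "main plate: steamed broccoli",
        "Dinner",
        "ovens: pizza bar",
        "baked ziti pasta"]

menu = group_by_meal(fare)
menu.each { |meal, items| puts "#{meal}: #{items.join(', ')}" }
```

In the script above you would call `group_by_meal(fare)` in place of `pp fare`. Dishes without a "station: " prefix (such as "baked ziti pasta") are kept as they are, since the excerpt does not say which station they belong to.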