I am currently trying to scrape data from a web page with Nokogiri. I want to extract the list of service centers from http://www.cardekho.com/Maruti/Noida/car-service-center.htm
The code I wrote for this is:
require 'open-uri'
require 'nokogiri'
doc = Nokogiri::HTML(open("http://www.cardekho.com/Maruti/Noida/car-service-center.htm"))
doc.css('.delrname').each do |node|
  puts node.text
end
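I have tried a number of CSS selector combinations, but none of them returns the desired result. Can anyone suggest the right selector for scraping the service center list from this link?
Thanks in advance.
PS: When I test the same code (with the appropriate CSS selector) on other websites it works as expected, but it does not work on this site.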
Answer 0 (score: 2)
Your code appears to work. I removed the space from the URL:
doc = Nokogiri::HTML(open("http://www.cardekho.com/Maruti/Noida/car-service-center.htm"))
Then I ran it, and this is the output:
$ ruby file.rb
Fast Track Auto Care India
Jkm Motors
Mangalam Motors
Motorcraft India
Motorcraft India
Rohan Motors
Rohan Motors
Rohan Motors
Vipul Motors
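The fix described above (removing whitespace from the URL before opening it) can be sketched as a small helper. This is only an illustration: `clean_url` is a hypothetical name, and the position of the stray space in the sample input is an assumption, since the thread does not show where it was.

```ruby
require 'uri'

# Hypothetical helper: remove all whitespace from a URL string and
# confirm that the result parses as a URI.
def clean_url(raw)
  cleaned = raw.gsub(/\s+/, '')
  URI.parse(cleaned) # raises URI::InvalidURIError if the URL is still malformed
  cleaned
end

# Assumed input: the location of the space is made up for illustration.
raw = "http://www.cardekho.com/Maruti/Noida/ car-service-center.htm"
url = clean_url(raw)
puts url
```

The cleaned string can then be passed to open-uri. On Ruby 3.0 and later that means `URI.open(url)`, because `Kernel#open` no longer accepts URLs there.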
Answer 1 (score: 0)
Alternatively, you can use a regular expression to get more detailed results, for example:
/(<div class="delrname">([^<]*)<\/div><p>([^<]*)<\/p><div><div class="delermobcol "><div class="clearfix"><span class="mobico sprite"><\/span><div class="mobno">([^<]*)<\/div><\/div><div class="clear"><\/div><div class="viewsercntr"><a href="([^"]*)" title="View Car Dealers for Maruti in Noida">View Car Dealers for Maruti in Noida<\/a><\/div><\/div><div class="delermoilcol"><!----><div class="clearfix"><span class="mailico sprite"><\/span><div class="mobno"><a href="mailto:([^"]*)" target="_top">[^<]*<\/a><\/div>)/
You can break the results down as follows:
# `scan` is a String method, so it is called on the serialized markup rather than on the Nokogiri document.
# The link text after the mailto: href is matched with [^<]* so the pattern is not tied to one dealer's address.
arrMatches = doc.to_html.scan(/(<div class="delrname">([^<]*)<\/div><p>([^<]*)<\/p><div><div class="delermobcol "><div class="clearfix"><span class="mobico sprite"><\/span><div class="mobno">([^<]*)<\/div><\/div><div class="clear"><\/div><div class="viewsercntr"><a href="([^"]*)" title="View Car Dealers for Maruti in Noida">View Car Dealers for Maruti in Noida<\/a><\/div><\/div><div class="delermoilcol"><!----><div class="clearfix"><span class="mailico sprite"><\/span><div class="mobno"><a href="mailto:([^"]*)" target="_top">[^<]*<\/a><\/div>)/)
arrMatches.each do |dealerInfo|
  thisEntireMatch = dealerInfo[0] # the outer group captures the whole block
  thisName = dealerInfo[1]
  thisAddress = dealerInfo[2]
  thisMobile = dealerInfo[3]
  thisLink = dealerInfo[4]
  thisEmail = dealerInfo[5]
end
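A pattern that spells out every wrapper div breaks as soon as the page markup changes slightly. As a rough sketch of a more tolerant variant, the example below uses named groups and lazy gaps, and runs on a made-up fragment instead of the live page. The sample markup, dealer names, phone numbers and e-mail addresses are all hypothetical; only the class names `delrname` and `mobno` come from the pattern above.

```ruby
# Hypothetical sample markup, loosely modeled on the structure the
# pattern above expects. None of the values are real.
html = <<~HTML
  <div class="delrname">Dealer A</div><p>Sample Address 1</p>
  <div class="mobno">0000000000</div>
  <div class="mobno"><a href="mailto:a@example.com" target="_top">a@example.com</a></div>
  <div class="delrname">Dealer B</div><p>Sample Address 2</p>
  <div class="mobno">1111111111</div>
  <div class="mobno"><a href="mailto:b@example.org" target="_top">b@example.org</a></div>
HTML

# x: whitespace in the pattern is ignored; m: "." also matches newlines.
dealer_pattern = %r{
  <div\s+class="delrname">(?<name>[^<]*)</div>\s*
  <p>(?<address>[^<]*)</p>
  .*?<div\s+class="mobno">(?<mobile>[^<]*)</div>
  .*?href="mailto:(?<email>[^"]*)"
}mx

# String#scan yields one array of captures per match, in group order.
dealers = html.scan(dealer_pattern).map do |name, address, mobile, email|
  { name: name.strip, address: address.strip, mobile: mobile.strip, email: email }
end

dealers.each do |d|
  puts "#{d[:name]} | #{d[:address]} | #{d[:mobile]} | #{d[:email]}"
end
```

One caveat with the lazy gaps: if a dealer entry has no e-mail link, the match can run on into the next entry, so for anything beyond a quick script the CSS-selector approach from the question is likely the more robust choice.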