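This is a simple Scrapy spider that crawls yelp.com and extracts data.
I have set Rule(LinkExtractor(allow=('.*')),follow=True,callback="parseBusiness") to follow links, with parseBusiness as the callback.
However, Scrapy does not follow any links here.
Here is the relevant output (full output: http://pastebin.com/BkuErvMq):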
2015-07-14 01:06:22 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023
2015-07-14 01:06:25 [scrapy] DEBUG: Crawled (200) <GET http://www.yelp.com/search?find_desc=Hotels&find_loc=San+Francisco%2C+CA&ns=1> (referer: None)
2015-07-14 01:06:26 [scrapy] DEBUG: Crawled (200) <GET http://www.yelp.com/biz/ucsf-medical-center-at-mount-zion-san-francisco> (referer: None)
2015-07-14 01:06:26 [scrapy] INFO: Closing spider (finished)
2015-07-14 01:06:26 [scrapy] INFO: Dumping Scrapy stats:
Here is my code:
import sys
import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor


class Business(scrapy.Item):
    name = scrapy.Field()
    contactNumber = scrapy.Field()
    address = scrapy.Field()


class YelpSpider(CrawlSpider):
    name = "yelp"
    allowed_domains = ["www.yelp.com"]
    start_urls = [
        "http://www.yelp.com/search?find_desc=Hotels&find_loc=San+Francisco%2C+CA&ns=1",
        "http://www.yelp.com/biz/ucsf-medical-center-at-mount-zion-san-francisco"
    ]

    Rule(LinkExtractor(allow=()),follow=True,callback="parseBusiness")

    def parseBusiness(self, response):
        business = Business()
        business['name'] = stripchars(response.xpath('//h1[@itemprop="name"]//text()').extract())
        business['contactNumber'] = stripchars(response.xpath('//span[@itemprop="telephone"]//text()').extract())
        business['address'] = stripchars(response.xpath('//li[@class="address"]//text()').extract())
        yield business
What am I missing here? How do I get Scrapy to follow all links?

Answer 0 (score: 3)
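You did not set the spider's rules attribute. The Rule(...) in your class body is a bare expression, so it is created and immediately discarded; CrawlSpider then has no rules to apply and only crawls the start_urls. Assign it to rules: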
class YelpSpider(CrawlSpider):
    name = "yelp"
    allowed_domains = ["www.yelp.com"]
    start_urls = [
        "http://www.yelp.com/search?find_desc=Hotels&find_loc=San+Francisco%2C+CA&ns=1",
        "http://www.yelp.com/biz/ucsf-medical-center-at-mount-zion-san-francisco"
    ]

    rules = [
        Rule(LinkExtractor(allow=('.*')),follow=True,callback="parseBusiness")
    ]
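
To see why the assignment matters, here is a minimal sketch in plain Python that runs without Scrapy installed. FakeRule, FakeCrawlSpider, and active_rules are hypothetical stand-ins invented for this illustration, not Scrapy's classes or API; the sketch only assumes that the base class reads a class-level rules attribute that is empty by default, which is how CrawlSpider behaves as far as I know.

```python
# Hypothetical stand-ins, NOT Scrapy's real classes. They only mimic a
# base class that looks up a class-level `rules` attribute.
class FakeRule:
    def __init__(self, pattern, callback=None, follow=False):
        self.pattern = pattern
        self.callback = callback
        self.follow = follow


class FakeCrawlSpider:
    rules = ()  # assumed default: no rules, so no links are followed

    def active_rules(self):
        # Return whatever the (sub)class has bound to the name `rules`.
        return list(self.rules)


class BrokenSpider(FakeCrawlSpider):
    # Bare expression statement: the object is built and then discarded,
    # because nothing binds it to a name in the class namespace.
    FakeRule(".*", callback="parseBusiness", follow=True)


class FixedSpider(FakeCrawlSpider):
    # Binding the list to `rules` overrides the empty default.
    rules = [FakeRule(".*", callback="parseBusiness", follow=True)]


print(len(BrokenSpider().active_rules()))  # 0
print(len(FixedSpider().active_rules()))   # 1
```

The broken variant is valid Python and raises no error, which is why the spider in the question ran cleanly and simply stopped after the two start URLs.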