I want to crawl an entire website and extract links conditionally.
Following the suggestion in this related question, I tried multiple rules, but it doesn't work: Scrapy doesn't crawl all pages
I tried the code below, but it doesn't extract any details.
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector

from businesslist.items import BusinesslistItem  # items module path assumed

class BusinesslistSpider(CrawlSpider):
    name = 'businesslist'
    allowed_domains = ['www.businesslist.ae']
    start_urls = ['http://www.businesslist.ae/']

    rules = (
        # Catch-all rule: follow every link on the site.
        Rule(SgmlLinkExtractor()),
        # Specific rule: parse company detail pages such as /company/123/.
        Rule(SgmlLinkExtractor(allow=r'company/(\d)+/'), callback='parse_item'),
    )

    def parse_item(self, response):
        self.log('Hi, this is an item page! %s' % response.url)
        hxs = HtmlXPathSelector(response)
        i = BusinesslistItem()
        company = hxs.select('//div[@class="text companyname"]/strong/text()').extract()[0]
        address = hxs.select('//div[@class="text location"]/text()').extract()[0]
        location = hxs.select('//div[@class="text location"]/a/text()').extract()[0]
        i['url'] = response.url
        i['company'] = company
        i['address'] = address
        i['location'] = location
        return i
In my case, the second rule is never applied, so the detail pages are never parsed.
Answer 0 (score: 1)
The first rule, Rule(SgmlLinkExtractor()), matches every link, so Scrapy simply ignores the second rule.
Try the following:
...
start_urls = ['http://www.businesslist.ae/sitemap.html']
...
# Rule(SgmlLinkExtractor()),
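An alternative, if you want to keep crawling the whole site instead of starting from the sitemap page: CrawlSpider evaluates its rules in order, and each extracted link is handled only by the first rule whose extractor matches it, so putting the specific company rule before the catch-all rule lets both apply. A minimal sketch of the reordered rules, using the same (long-deprecated) SgmlLinkExtractor API as the question:

    rules = (
        # Specific rule first: company detail pages like /company/123/ go to parse_item.
        # (follow defaults to False when a callback is set, so detail pages are not re-crawled.)
        Rule(SgmlLinkExtractor(allow=r'company/\d+/'), callback='parse_item'),
        # Catch-all rule last: follow every other link to keep discovering pages.
        Rule(SgmlLinkExtractor()),
    )

In current Scrapy versions the equivalent code would import LinkExtractor from scrapy.linkextractors and CrawlSpider/Rule from scrapy.spiders, but the rule-ordering behavior is the same.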