Scrapy: multiple SgmlLinkExtractor rules not working

Date: 2013-05-20 04:22:56

Tags: scrapy

I want to crawl an entire website and extract links conditionally.

Following the suggestion in this link, I tried multiple rules, but it doesn't work: Scrapy doesn't crawl all pages

I tried the following code, but it doesn't scrape any details.

from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor
from scrapy.selector import HtmlXPathSelector
from businesslist.items import BusinesslistItem  # assumed project items module

class BusinesslistSpider(CrawlSpider):
    name = 'businesslist'
    allowed_domains = ['www.businesslist.ae']
    start_urls = ['http://www.businesslist.ae/']

    rules = (
        Rule(SgmlLinkExtractor()),
        Rule(SgmlLinkExtractor(allow=r'company/(\d)+/'), callback='parse_item'),
    )

    def parse_item(self, response):
        self.log('Hi, this is an item page! %s' % response.url)
        hxs = HtmlXPathSelector(response)
        i = BusinesslistItem()
        company = hxs.select('//div[@class="text companyname"]/strong/text()').extract()[0]
        address = hxs.select('//div[@class="text location"]/text()').extract()[0]
        location = hxs.select('//div[@class="text location"]/a/text()').extract()[0]
        i['url'] = response.url
        i['company'] = company
        i['address'] = address
        i['location'] = location
        return i

In my case the second rule is never applied, so the detail pages are never parsed.

1 Answer:

Answer 0 (score: 1)

The first rule, Rule(SgmlLinkExtractor()), matches every link, so Scrapy simply ignores the second rule: each extracted link is claimed by the first rule that matches it and is never passed on to the later, more specific rule.
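This first-match behavior can be illustrated with a small standalone sketch (plain `re`, no Scrapy; the rule order and the `company/(\d)+/` pattern mirror the question, while `parse_default` is just a placeholder name for "no callback"):

```python
import re

# Each rule: (allow_pattern, callback_name). An empty pattern matches any URL,
# mimicking Rule(SgmlLinkExtractor()) with no `allow` argument.
rules = [
    (r'', 'parse_default'),              # catch-all rule, listed first
    (r'company/(\d)+/', 'parse_item'),   # specific rule, never reached
]

def match_rule(url):
    """Return the callback of the FIRST rule whose pattern matches,
    imitating how CrawlSpider assigns each extracted link to one rule."""
    for pattern, callback in rules:
        if re.search(pattern, url):
            return callback
    return None

# The detail page is claimed by the catch-all rule, so parse_item never runs:
print(match_rule('http://www.businesslist.ae/company/123/'))  # parse_default
```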

Try the following:

...
start_urls = ['http://www.businesslist.ae/sitemap.html']
...
# Rule(SgmlLinkExtractor()),
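An alternative to commenting out the catch-all rule (a sketch, not from the answer, using the same old SgmlLinkExtractor API as the question): keep both rules but list the specific one first, since CrawlSpider hands each link to the first rule that matches it. This preserves the full-site crawl without switching start_urls to the sitemap.

```python
from scrapy.contrib.spiders import CrawlSpider, Rule
from scrapy.contrib.linkextractors.sgml import SgmlLinkExtractor

class BusinesslistSpider(CrawlSpider):
    name = 'businesslist'
    allowed_domains = ['www.businesslist.ae']
    start_urls = ['http://www.businesslist.ae/']

    rules = (
        # Specific rule first: detail pages reach parse_item before the
        # catch-all rule can claim them; follow=True keeps crawling from them.
        Rule(SgmlLinkExtractor(allow=r'company/(\d)+/'), callback='parse_item',
             follow=True),
        # Catch-all rule second; with no callback, follow defaults to True.
        Rule(SgmlLinkExtractor()),
    )
```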