Scraping multiple URLs and pagination with Scrapy

Date: 2017-11-06 17:43:59

Tags: python web-scraping scrapy

I need to crawl a predefined list of category URLs. Within each category I have to extract data from the page, then follow the link to the next page and extract again.

I have this sample code, but something is missing:

import scrapy
import re

class YellowBot(scrapy.Spider):
    name = "yellow"
    allowed_domains = ["www.yellowpages.com"]
    start_urls = [
        'http://www.yellowpages.com/b/category1/',
        'http://www.yellowpages.com/b/category2/',
        'http://www.yellowpages.com/b/category3/',
        'http://www.yellowpages.com/b/category4/'
    ]
    headers = {
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_12_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/61.0.3163.91 Safari/537.36'
    }

    def start_requests(self):
        for url in self.start_urls:
            yield scrapy.Request(
                url=url,
                headers=self.headers
            )

    def parse(self, response):
        self.logger.info('- page %s', response.url)
        ITEM_SELECTOR = 'ul.businesses li'
        SOURCE_TYPE = 'pages'
        for ficha in response.css(ITEM_SELECTOR):
            ficha = {
                'id'  : ficha.xpath('normalize-space(.//@data-bid)').extract_first(),
                'name'   : ficha.css('.business-name ::text').extract_first(),
                'description': ficha.xpath('.//div[@itemprop="description"]/text()').extract_first()
            }

            if ficha['id'] is not None:
                yield ficha

            next_page = response.css('.m-results-pagination li.last a::attr(href)').extract_first()
            if next_page is not None:
                yield scrapy.Request(
                    response.urljoin(next_page),
                    headers=self.headers
                )
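The pagination step above relies on `response.urljoin`, which resolves the pager's (possibly relative) `href` against `response.url`, much like the stdlib `urljoin`. A quick sketch with hypothetical `href` values:

```python
from urllib.parse import urljoin

# response.urljoin(next_page) behaves like urljoin(response.url, next_page).
# The hrefs below are made up for illustration only.
base = 'http://www.yellowpages.com/b/category1/'
print(urljoin(base, '?page=2'))               # http://www.yellowpages.com/b/category1/?page=2
print(urljoin(base, '/b/category1/?page=3'))  # http://www.yellowpages.com/b/category1/?page=3
```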

It only extracts the first category (and, because of the pager, the following pages of that category too): http://www.yellowpages.com/b/category1/

But it never processes the next category page: http://www.yellowpages.com/b/category2/

1 answer:

Answer 0 (score: 0)

Please try the following change: allowed_domains = ['yellowpages.com']
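The reason for dropping the `www.` prefix: Scrapy's OffsiteMiddleware allows a listed domain and any of its subdomains, so `'yellowpages.com'` also covers `www.yellowpages.com`, while `'www.yellowpages.com'` would not cover other hosts under the same domain. A rough stdlib approximation of that matching (not Scrapy's actual code):

```python
from urllib.parse import urlparse

def is_offsite(url, allowed_domains):
    # Rough approximation of Scrapy's OffsiteMiddleware check:
    # a host is allowed if it equals an allowed domain or is a subdomain of it.
    host = urlparse(url).hostname or ''
    return not any(host == d or host.endswith('.' + d) for d in allowed_domains)

# 'yellowpages.com' covers the www. host as a subdomain...
print(is_offsite('http://www.yellowpages.com/b/category2/', ['yellowpages.com']))   # False (allowed)
# ...but 'www.yellowpages.com' does not cover the bare domain.
print(is_offsite('http://yellowpages.com/b/category2/', ['www.yellowpages.com']))   # True (filtered)
```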

yield scrapy.Request(url=url, callback=self.parse, headers=self.headers). You should also add callback=self.parse to the next-page request you already yield.

Also, I'm not sure about your start_urls. I tried checking the response in scrapy shell with your selector, and response.css('ul.businesses li') returned an empty list.