Question

I want to scrape a business directory using Scrapy and Python 3.

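You probably know the concept of "two-direction" crawling:

First direction => scrape the URLs of the detail pages for the items shown on a results page (detail page of business A, detail page of business B, etc.).
Second direction => scrape the pagination URLs of the results pages (page 1, page 2, page 3, etc.).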
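
The two directions can be illustrated without any network access. Below is a minimal sketch in plain Python, not Scrapy: `crawl`, `fetch`, and `FAKE_SITE` are hypothetical names introduced only for illustration, and the in-memory dictionary stands in for real result pages.

```python
from typing import Callable, Iterator, List, Optional, Tuple

# A fetched results page: (detail-page URLs found on it, next-page URL or None).
ResultPage = Tuple[List[str], Optional[str]]


def crawl(start_urls: List[str], fetch: Callable[[str], ResultPage]) -> Iterator[str]:
    """Yield every detail URL reachable from the given results pages."""
    seen = set()
    for url in start_urls:
        current: Optional[str] = url
        # Stop when there is no next page or the page was already visited.
        while current is not None and current not in seen:
            seen.add(current)
            detail_urls, next_page = fetch(current)
            # Direction 1: the detail pages linked from this results page.
            yield from detail_urls
            # Direction 2: move on to the next results page, if any.
            current = next_page


# Hypothetical in-memory "site" standing in for real HTTP responses.
FAKE_SITE = {
    "results-1": (["business-a", "business-b"], "results-2"),
    "results-2": (["business-c"], None),
}

found = list(crawl(["results-1"], FAKE_SITE.__getitem__))
```

In a Scrapy spider the same two steps would typically appear as two kinds of requests yielded from one callback: one per detail link, and one for the next-page link.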

I think you get the idea.

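The business directory I want to scrape has a third "direction": all the businesses I want to scrape on this site are organized by ALPHABET.

I need to click a letter to display thousands of businesses, paginated across several pages (see the attached picture for a better understanding).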

So I manually added the letter URLs to start_urls, but it did not work. Here is my code:

import scrapy
from scrapy import Request


class AnnuaireEntreprisesSpider(scrapy.Spider):
    name = 'example'
    allowed_domains = ['example.com']
    # List of the ALPHABET pages; each letter has several pages of results
    start_urls = ['http://www.example.com/entreprises-0-9.html',
                  'http://www.example.com/entreprises-a.html',
                  'http://www.example.com/entreprises-b.html',
                  'http://www.example.com/entreprises-c.html',
                  'http://www.example.com/entreprises-d.html',
                  'http://www.example.com/entreprises-e.html',
                  'http://www.example.com/entreprises-f.html',
                  'http://www.example.com/entreprises-g.html',
                  'http://www.example.com/entreprises-h.html',
                  'http://www.example.com/entreprises-i.html',
                  'http://www.example.com/entreprises-j.html',
                  'http://www.example.com/entreprises-k.html',
                  'http://www.example.com/entreprises-l.html',
                  'http://www.example.com/entreprises-m.html',
                  'http://www.example.com/entreprises-n.html',
                  'http://www.example.com/entreprises-o.html',
                  'http://www.example.com/entreprises-p.html',
                  'http://www.example.com/entreprises-q.html',
                  'http://www.example.com/entreprises-r.html',
                  'http://www.example.com/entreprises-s.html',
                  'http://www.example.com/entreprises-t.html',
                  'http://www.example.com/entreprises-u.html',
                  'http://www.example.com/entreprises-v.html',
                  'http://www.example.com/entreprises-w.html',
                  'http://www.example.com/entreprises-x.html',
                  'http://www.example.com/entreprises-y.html',
                  'http://www.example.com/entreprises-z.html'
                  ]

    def parse(self, response):
        # Direction 1: links to the business detail pages on this results page
        urls = response.xpath("//a[@class='btn-fiche dcL ']/@href").getall()

        for url in urls:
            absolute_url = response.urljoin(url)
            print('Voici absolute url :' + absolute_url)
            yield Request(absolute_url, callback=self.parse_startup)

        # Direction 2: follow the pagination link to the next results page
        next_page = response.xpath("//a[@class='nextPages']/@href").get()
        if next_page:
            absolute_next_page = response.urljoin(next_page)
            print('Voici absolute url NEXT PAGE :' + absolute_next_page)
            yield response.follow(next_page, callback=self.parse)

    def parse_startup(self, response):
        print("Parse_startup details!!!")
        # Scrape the details of the business here
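
The letter URLs typed out in start_urls all follow one pattern, so they could also be generated instead of listed by hand. This is only a sketch: `BASE_URL` and `build_start_urls` are hypothetical names, and it assumes every letter page really matches the pattern seen in the list above.

```python
import string

# Assumption: every letter page follows the pattern seen in start_urls.
BASE_URL = "http://www.example.com/entreprises-{}.html"


def build_start_urls():
    """Return the '0-9' page followed by one page per letter a..z."""
    suffixes = ["0-9"] + list(string.ascii_lowercase)
    return [BASE_URL.format(suffix) for suffix in suffixes]


start_urls = build_start_urls()
```

Inside the spider, `start_urls = build_start_urls()` could replace the hand-written list, provided the function is defined before the class body runs.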

I am a beginner; I started learning Scrapy a few weeks ago.

How can I do this "two-direction" scraping?

0 answers: