Question

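I'd like to know whether there is a better way to crawl multiple URLs within the same website using the same spider. I have several URLs that I want to access by index.

The code is: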

import scrapy
from random import shuffle

class MySpider(scrapy.Spider):
    limit = 5
    pages = list(range(1, limit))  # pages 1 to 4
    shuffle(pages)
    cat_a = 'http://example.com/a?page={}'
    cat_b = 'http://example.com/b?page={}'

    def parse(self, response):
        for i in self.pages:
            page_cat_a = self.cat_a.format(i)
            page_cat_b = self.cat_b.format(i)
            yield response.follow(page_cat_a, self.parse_page)
            yield response.follow(page_cat_b, self.parse_page)

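The parse_page function goes on to scrape further data from these pages.

In the output file I can see that the data is collected in a repeating order: 10 pages from category a, then 10 pages from category b, and so on. I'm wondering whether the web server I'm crawling will notice this sequential behavior and ban me.

Also, I have 8 URLs on the same website that I want to crawl, all of them using an index, so instead of the 2 categories given in the example there are 8. Thanks.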

Answer 1

You can use the spider's start_requests method instead of doing this inside the parse method.

import scrapy
from random import shuffle

class MySpider(scrapy.Spider):
    name = 'myspider'  # placeholder; Scrapy requires every spider to have a name
    categories = ('a', 'b')
    limit = 5
    pages = list(range(1, limit))
    base_url = 'http://example.com/{category}?page={page}'

    def start_requests(self):
        # Shuffle pages to try to avoid bans
        shuffle(self.pages)

        for category in self.categories:
            for page in self.pages:
                url = self.base_url.format(category=category, page=page)
                yield scrapy.Request(url)

    def parse(self, response):
        # Parse the page
        pass
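
The spider above shuffles the page numbers but still walks through the categories in a fixed order. For the eight categories mentioned in the question, one option is to shuffle the (category, page) pairs instead, so that consecutive requests are spread across categories. The following is a minimal sketch in plain Python; the helper name build_shuffled_urls, the category letters c to h, and the seed parameter are assumptions made for illustration and are not part of the original code.

```python
from itertools import product
from random import Random

BASE_URL = 'http://example.com/{category}?page={page}'

def build_shuffled_urls(categories, pages, seed=None):
    """Return every category/page URL exactly once, in random order."""
    rng = Random(seed)  # a seed is only useful for reproducible runs
    pairs = list(product(categories, pages))
    rng.shuffle(pairs)  # mixes categories as well as page numbers
    return [BASE_URL.format(category=c, page=p) for c, p in pairs]

# Eight hypothetical categories, pages 1 to 4 as in the question
categories = ('a', 'b', 'c', 'd', 'e', 'f', 'g', 'h')
urls = build_shuffled_urls(categories, range(1, 5), seed=42)
print(len(urls))  # 32
```

Inside start_requests you could then loop over the returned list and yield scrapy.Request(url) for each entry.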

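Another thing you can try is to look up the category URLs from within the site itself. Say you want to get information from the tags shown in the sidebar of http://quotes.toscrape.com/. You could copy the links manually and use them the way you are doing now, or you could do this: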

import scrapy

class MySpider(scrapy.Spider):
    name = 'myspider'  # placeholder; Scrapy requires every spider to have a name
    start_urls = ['http://quotes.toscrape.com/']

    def parse(self, response):
        for tag in response.css('div.col-md-4.tags-box a.tag::attr(href)').getall():
            yield response.follow(tag, callback=self.parse_tag)

    def parse_tag(self, response):
        # Print the url we are parsing
        print(response.url)
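
The href values extracted by the selector are typically relative paths. response.follow accepts them as they are because it resolves them against the URL of the current response. As a rough standard-library illustration of that resolution step (the two href values below are assumed examples, not output captured from the site):

```python
from urllib.parse import urljoin

page_url = 'http://quotes.toscrape.com/'
# Assumed example href values, chosen for illustration
hrefs = ['/tag/love/', '/tag/inspirational/']

# Resolve each relative href against the page it was found on
tag_urls = [urljoin(page_url, href) for href in hrefs]
print(tag_urls)
# ['http://quotes.toscrape.com/tag/love/', 'http://quotes.toscrape.com/tag/inspirational/']
```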

"I'm wondering whether the web server I'm crawling will notice this sequential behavior and ban me."

Yes, the site could notice. Just so you know, there is no guarantee that the requests will be made in the order in which you yield them.
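
Beyond shuffling, Scrapy has built-in settings that slow the crawl down and make it less regular, which may lower the chance of being noticed. The dictionary below is only a sketch: the setting names are standard Scrapy settings, but the name POLITE_SETTINGS and the values are assumptions that you would need to tune for the site you crawl.

```python
# Assumed values for illustration; adjust them for the target site.
POLITE_SETTINGS = {
    'DOWNLOAD_DELAY': 2,                  # base delay in seconds between requests to the same site
    'RANDOMIZE_DOWNLOAD_DELAY': True,     # wait a random 0.5x to 1.5x of DOWNLOAD_DELAY
    'CONCURRENT_REQUESTS_PER_DOMAIN': 1,  # at most one request in flight per domain
    'AUTOTHROTTLE_ENABLED': True,         # adapt the delay to the server's response times
    'ROBOTSTXT_OBEY': True,               # respect the site's robots.txt rules
}
```

You could assign this dictionary to the custom_settings class attribute of the spider, or put the same keys in the project's settings.py.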

Multiple URLs with the same spider
