How to crawl JavaScript pages with a Scrapy CrawlSpider and Splash

Date: 2019-02-19 11:52:14

Tags: python scrapy

I'm having trouble crawling a JavaScript website with a Scrapy CrawlSpider. Scrapy seems to ignore the rules and just carries on with a normal crawl.

Is it possible to instruct the spider to crawl with Splash?

Thanks.

import scrapy
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy_splash import SplashRequest


class MySpider(CrawlSpider):
    name = 'booki'
    start_urls = [
        'https://worldmap.com/listings/in/united-states/',
    ]
    rules = (
        # Extract links matching 'catalogue/category' (but not 'subsection.php')
        # and follow them (no callback means follow=True by default).
        Rule(LinkExtractor(allow=(r'catalogue/category', ), deny=(r'subsection\.php', ))),

        # Extract links matching 'catalogue' and parse them with the spider's first_tier method.
        Rule(LinkExtractor(allow=(r'catalogue', ), deny=(r'catalogue/category', )), callback='first_tier'),
    )
    custom_settings = {
        #'DOWNLOAD_DELAY' : '2',
        'SPLASH_URL': 'http://localhost:8050',
        'DOWNLOADER_MIDDLEWARES': {
            'scrapy_splash.SplashCookiesMiddleware': 723,
            'scrapy_splash.SplashMiddleware': 725,
            'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware': 810,
        },
        'SPIDER_MIDDLEWARES': {
            'scrapy_splash.SplashDeduplicateArgsMiddleware': 100,
        },
        'DUPEFILTER_CLASS': 'scrapy_splash.SplashAwareDupeFilter',
        'DOWNLOAD_DELAY': 8,
        'ITEM_PIPELINES' : {
            'bookstoscrap.pipelines.BookstoscrapPipeline': 300,
        }
    }

    def start_requests(self):
        # Issue the initial requests through Splash; note that the responses
        # go straight to first_tier rather than to the CrawlSpider's rule machinery.
        for url in self.start_urls:
            yield SplashRequest(url, self.first_tier,
                endpoint='render.html',
                args={'wait': 3.5},
            )

1 Answer:

Answer 0 (score: 1)

The rules only fire once you actually land on a matching page after start_requests. You also need to define a callback function for your rules, otherwise they will try to use the default parse (in case it looks like your rules aren't doing anything).

To turn the requests generated by the rules into SplashRequests, you have to return one from a process_request callback. For example:

class MySpider(CrawlSpider):
    # ...

    rules = (
        Rule(
            LinkExtractor(allow=(r'catalogue/category', ), deny=(r'subsection\.php', )),
            process_request='splash_request'
        ),
        Rule(
            LinkExtractor(allow=(r'catalogue', ), deny=(r'catalogue/category', )),
            callback='first_tier',
            process_request='splash_request'
        ),
    )

    # ...

    def splash_request(self, request):
        # Re-issue the request through Splash, keeping whatever callback
        # the CrawlSpider machinery attached to the original request.
        return SplashRequest(
            request.url,
            callback=request.callback,
            endpoint='render.html',
            args={'wait': 3.5},
        )
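
For completeness, here is a minimal sketch of what a first_tier callback could look like once the Splash-rendered HTML comes back. The div.listing selector and the item fields are assumptions for illustration, not taken from the question:

    def first_tier(self, response):
        # response.body is the HTML as rendered by Splash, so elements
        # generated by JavaScript are visible to the selectors below.
        # The selectors and field names are placeholders; adapt them to
        # the actual markup of the target site.
        for listing in response.css('div.listing'):
            yield {
                'title': listing.css('h2::text').get(),
                'url': response.urljoin(listing.css('a::attr(href)').get('')),
            }

Also keep in mind that scrapy-splash only talks to whatever is running at SPLASH_URL. If nothing is listening at http://localhost:8050 (for example a local docker run -p 8050:8050 scrapinghub/splash), every SplashRequest will fail and the crawl will again look as if the rules are being ignored.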