Question

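I'm working on a project that has two parts:

1. Retrieve a specific page.
2. After extracting the ID of that page, send a request to an API to get additional information about the page.

For the second point, and to follow Scrapy's asynchronous philosophy, where should that code go? (I'm hesitating between the spider and a pipeline.) Do we have to use a different library (e.g. asyncio and aiohttp) to do this asynchronously? (I love aiohttp, so using it is not a problem.)

Thanks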

Answer 1

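Since you are fetching additional information about an item, I would simply yield a request from the parse method and pass the already scraped information along in the request's meta attribute.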

You can see an example at https://doc.scrapy.org/en/latest/topics/request-response.html#topics-request-response-ref-request-callback-arguments

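This could also be done in a pipeline (using either Scrapy's engine API or a different library, e.g. treq).
I do think, however, that doing it "the normal way" from the spider makes more sense in this case.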

Answer 2

I recently ran into the same problem and found an elegant solution using the Twisted decorator t.i.d.inlineCallbacks (twisted.internet.defer.inlineCallbacks).

# -*- coding: utf-8 -*-
import scrapy
import re
from twisted.internet.defer import inlineCallbacks

# Project-specific imports (not part of Scrapy); unused in this snippet.
from sherlock import utils, items, regex


class PagesSpider(scrapy.spiders.SitemapSpider):
    name = 'pages'
    allowed_domains = ['thing.com']
    sitemap_follow = [r'sitemap_page']

    def __init__(self, site=None, *args, **kwargs):
        super(PagesSpider, self).__init__(*args, **kwargs)

    @inlineCallbacks
    def parse(self, response):
        # things
        request = scrapy.Request("https://google.com")
        response = yield self.crawler.engine.download(request, self)
        # Twisted executes the request and resumes the generator here with the response
        print(response.text)

Getting data from an API inside Scrapy
