Question

Code:

import scrapy
from scrapy.spiders import CrawlSpider
from scrapy import Request


class TestSpider(CrawlSpider):
  name = "test_spyder"
  allowed_domains = ["stackoverflow.com"]
  start_urls = ['https://stackoverflow.com/tags']

  def parse(self, response):
    title_1 = response.xpath('//h1/text()').extract_first()
    next_url = 'https://stackoverflow.com/users'
    title_2 = Request(url=next_url, callback=self.parse_some)
    yield {'title_1': title_1, 'title_2': title_2}

  def parse_some(self, response):
    return response.xpath('//h1/text()').extract_first()

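I don't understand why, instead of the second page's title (Users), I get a different value (<GET https://stackoverflow.com/users>). Scrapy should return the values Tags + Users, but what I actually get back appears to be Tags + <Request GET htt.... Where is the mistake and how do I fix it?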

Answer 1

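To crawl a URL, you need to yield a Request object. A Request stored inside a dictionary is never scheduled; it is just an object sitting in the item. So your parse callback should do one of two things:

1. Yield a dictionary/Item. This is the end of the crawl chain: the item is generated, sent through the pipelines, and finally saved somewhere (if you have set that up).
2. Yield a Request object. This continues the crawl chain on to another callback.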

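An example of this process (the number in parentheses says which of the two options above the callback uses):

crawl URL 1 (2)
crawl URL 2 (2)
yield item (1)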
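The chain above can be sketched without Scrapy at all. The following is a toy model, not Scrapy's real engine: FakeRequest, PAGES and run are hypothetical names invented for this illustration, and the page titles are made-up data. It only shows why a yielded request gets followed while a yielded dict is collected as a result.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable


# Hypothetical stand-in for a request; this is NOT scrapy.Request.
@dataclass
class FakeRequest:
    url: str
    callback: Callable


# Pretend "web": maps a URL to the <h1> text of that page (made-up data).
PAGES = {
    "https://stackoverflow.com/tags": "Tags",
    "https://stackoverflow.com/users": "Users",
}


def parse(h1_text):
    # Option 1: yielding a dict ends this branch of the chain.
    yield {"title": h1_text}
    # Option 2: yielding a request asks the engine to visit another page.
    yield FakeRequest("https://stackoverflow.com/users", callback=parse_some)


def parse_some(h1_text):
    yield {"title": h1_text}


def run(start_url, callback):
    """Toy engine: follow every yielded request, collect every yielded dict."""
    items = []
    queue = deque([FakeRequest(start_url, callback)])
    while queue:
        request = queue.popleft()
        h1_text = PAGES[request.url]  # stands in for downloading the page
        for result in request.callback(h1_text):
            if isinstance(result, FakeRequest):
                queue.append(result)  # the chain continues
            else:
                items.append(result)  # the chain ends with an item
    return items


items = run("https://stackoverflow.com/tags", parse)
print(items)  # [{'title': 'Tags'}, {'title': 'Users'}]
```

In this model a FakeRequest placed inside a dict would land in items untouched, because the engine only inspects the objects that are yielded directly. That mirrors what the question observed.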

For the spider in the question, the callbacks should look like this:

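def parse(self, response):
    title = response.xpath('//h1/text()').extract_first()
    yield {'title': title}  # item: ends this branch of the chain

    next_url = 'https://stackoverflow.com/users'
    yield Request(url=next_url, callback=self.parse_some)  # continues the chain

def parse_some(self, response):
    # A callback must yield items or requests, not a bare string.
    title = response.xpath('//h1/text()').extract_first()
    yield {'title': title}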

Running the crawl with scrapy crawl spider -o output.json, your final results will be:

# output.json
[
{"title": "title1"},
{"title": "title2"}
]

Why doesn't Scrapy return a value from a function?
