Question

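I want to do something special for each landing URL in start_urls, and then have the spider follow all the next pages and crawl deeper. So my code looks roughly like this: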

def init_parse(self, response):
    item = MyItem()

    # extract info from the landing url and populate item fields here...

    yield self.newly_parse(response)
    yield item
    return

parse_start_url = init_parse

def newly_parse(self, response):
    item = MyItem2()
    newly_soup = BeautifulSoup(response.body)

    # parse, return or yield items

    return item

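This code doesn't work, because a spider callback is only allowed to return items, requests, or None, but I am yielding self.newly_parse. How can I achieve this in Scrapy?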

My not-so-elegant solution:

Put the init_parse logic inside newly_parse and do an is_start_url check at the beginning: if response.url is in start_urls, go through the init_parse procedure.
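A minimal sketch of this first workaround, written with plain Python stand-ins rather than a real Scrapy spider. The FakeResponse class, the dict items, and the example.com URLs are assumptions made for illustration only.

```python
class FakeResponse:
    """Hypothetical stand-in for a Scrapy response; only `url` is modelled."""

    def __init__(self, url):
        self.url = url


class SketchSpider:
    # Assumed example URLs, not taken from the original post.
    start_urls = ["http://example.com/a", "http://example.com/b"]

    def newly_parse(self, response):
        # Landing pages get the extra init_parse-style handling first.
        if response.url in self.start_urls:
            yield {"kind": "landing", "url": response.url}
        # Every page, landing or not, then gets the regular parsing.
        yield {"kind": "page", "url": response.url}


spider = SketchSpider()
landing_items = list(spider.newly_parse(FakeResponse("http://example.com/a")))
deeper_items = list(spider.newly_parse(FakeResponse("http://example.com/a/page2")))
```

The check compares exact URL strings, so a start URL that gets redirected to a different address would not match it.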

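Another ugly solution:

Separate out the code where # parse, return or yield items happens, make it a class method or generator, and call that method or generator from both init_parse and newly_parse.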
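A rough sketch of this second workaround, again without depending on Scrapy. The helper name _extract_items, the FakeResponse class, and the dict items are hypothetical and stand in for whatever the real parsing code produces.

```python
class FakeResponse:
    """Hypothetical stand-in for a Scrapy response; only `url` is modelled."""

    def __init__(self, url):
        self.url = url


class SharedHelperSpider:
    def _extract_items(self, response):
        # Hypothetical helper holding the "parse, return or yield items" code.
        for n in (1, 2):
            yield {"kind": "page", "url": response.url, "n": n}

    def init_parse(self, response):
        # Landing-page-specific item first, then the shared parsing.
        yield {"kind": "landing", "url": response.url}
        for item in self._extract_items(response):
            yield item

    def newly_parse(self, response):
        # Non-landing pages only run the shared parsing.
        for item in self._extract_items(response):
            yield item


spider = SharedHelperSpider()
response = FakeResponse("http://example.com/a")
from_init = list(spider.init_parse(response))
from_newly = list(spider.newly_parse(response))
```

Both callbacks stay flat generators of items, which is the shape the question says Scrapy accepts.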

Answer 1

If you are going to yield multiple items in newly_parse, the line in init_parse should be:

for item in self.newly_parse(response):
    yield item

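This is because self.newly_parse returns a generator. You need to iterate over it yourself first, since Scrapy will not recognize a generator object as an item or request.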
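The difference can be checked in plain Python, without Scrapy. The sketch below assumes newly_parse is a generator that yields dict items (the dicts and the function names other than init_parse and newly_parse are made up for illustration). It also shows `yield from`, available since Python 3.3, as an equivalent to the explicit loop.

```python
import types


def newly_parse(response):
    # Assumed to be a generator producing several items.
    yield {"source": "newly_parse", "n": 1}
    yield {"source": "newly_parse", "n": 2}


def init_parse_broken(response):
    # Yields the generator object itself, not the items inside it.
    yield newly_parse(response)
    yield {"source": "init_parse"}


def init_parse_loop(response):
    # Iterating first re-yields each item individually.
    for item in newly_parse(response):
        yield item
    yield {"source": "init_parse"}


def init_parse_delegating(response):
    # Same behavior as the loop, written with `yield from`.
    yield from newly_parse(response)
    yield {"source": "init_parse"}


broken = list(init_parse_broken(None))
fixed = list(init_parse_loop(None))
delegated = list(init_parse_delegating(None))
```

Here broken[0] is a generator object, while fixed and delegated each contain the three plain items.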

Scrapy: how to parse a response through multiple parser functions?
