Question

Here is my code:

import scrapy
class PvSpider(scrapy.Spider):
   name = 'pv'
   allowed_domains = ['www.piaov.com']
   start_urls = ['http://www.piaov.com/']

   def start_requests(self):
       yield scrapy.Request(url='http://www.piaov.com/list/7.html')

   def parse(self, response):
       names = response.xpath("//ul[@class='mlist']//li/a/@title").extract()
       on = response.meta.get("names", [])
       cmp_names = on + names
       for p in range(2, 7):
           yield scrapy.Request(url='http://www.piaov.com/list/7_{}.html'.format(p),
                                meta={"names": cmp_names},
                                callback=self.parse)

       yield scrapy.Request("http://www.piaov.com", meta={"names": cmp_names}, callback=self.parse_item)

   def parse_item(self, response):
       pass

When I debug my code in the 'parse_item' function, 'response.meta["names"]' only contains the first page's data (12 titles in this case). How can I get a list with the data from all 6 pages?

Answer 1

This happens because you request the URL http://www.piaov.com more than once. Scrapy ignores duplicate URLs unless dont_filter=True is specified in the Request, e.g. Request(url_here, dont_filter=True).
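
To illustrate the duplicate filtering described above, here is a toy model in plain Python. ToyScheduler and enqueue are hypothetical names invented for this sketch and are not part of Scrapy. Scrapy's own filter compares request fingerprints rather than raw URL strings, so treat this only as an approximation of the behaviour.

```python
class ToyScheduler:
    """Simplified stand-in for a duplicate filter (hypothetical, not Scrapy code)."""

    def __init__(self):
        self.seen = set()      # URLs that have already been scheduled
        self.accepted = []     # requests that would actually be downloaded

    def enqueue(self, url, dont_filter=False):
        # A URL seen before is dropped unless the caller opts out of filtering.
        if not dont_filter and url in self.seen:
            return False
        self.seen.add(url)
        self.accepted.append(url)
        return True


scheduler = ToyScheduler()
home = "http://www.piaov.com"

first = scheduler.enqueue(home)                     # accepted: first time this URL is seen
second = scheduler.enqueue(home)                    # dropped: duplicate of the first
forced = scheduler.enqueue(home, dont_filter=True)  # accepted despite being a duplicate

print(first, second, forced)  # prints: True False True
```

In the question's spider, every listing page yields a request for the same home page URL, so under this model only the first of those requests reaches parse_item, and it carries only the titles from page 1.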

Also, I don't like your scraper logic. Why are you calling parse_item? There is no need for it. See the code below and follow that approach.

How do I recurse in Scrapy?