Using a lambda callback in start_requests for scraping

Time: 2018-07-18 10:03:00

Tags: scrapy python-3.5

Can anyone tell me why the value of the index variable in parse() is always 10013?

class GetsourcesSpider(scrapy.Spider):
    name = 'getSources'
    allowed_domains = ['bizhi.feihuo.com']
    base_url = 'http://bizhi.feihuo.com/wallpaper/share?rsid={index}/'

    def start_requests(self):
        for index in range(10010, 10014):#11886
            yield scrapy.Request(url=self.base_url.format(index=index), callback=lambda response:self.parse(response,index))

    def parse(self, response, index):
        video_label = response.xpath('//video')[0]
        item = DynamicdesktopItem()
        item['index'] = index # response.url[-6:-1]
        item['video'] = video_label.attrib['src']
        item['image'] = video_label.attrib['poster']
        yield item

2 answers:

Answer 0 (score: 2)

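That happens because the lambda holds a reference to the index variable rather than a copy of its value. The callbacks only run after the loop has finished, so every one of them sees the last value, 10013. Pass the value through the request's meta dict instead. See the updated code below: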

class GetsourcesSpider(scrapy.Spider):
    name = 'getSources'
    allowed_domains = ['bizhi.feihuo.com']
    base_url = 'http://bizhi.feihuo.com/wallpaper/share?rsid={index}/'

    def start_requests(self):
        for index in range(10010, 10014):#11886
            yield scrapy.Request(url=self.base_url.format(index=index), callback=self.parse, meta = {'index': index})

    def parse(self, response):
        index = response.meta['index']
        video_label = response.xpath('//video')[0]
        item = DynamicdesktopItem()
        item['index'] = index # response.url[-6:-1]
        item['video'] = video_label.attrib['src']
        item['image'] = video_label.attrib['poster']
        yield item

Answer 1 (score: 0)

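Because the index variable that all the lambdas refer to is not copied into their local scope. It is the same variable, and it is overwritten on every loop iteration. Consider the following snippet: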

lambdas = []
for i in range(3):
    lambdas.append(lambda: print(i))
for fn in lambdas:
    fn()

This prints 2 three times, because 2 is the last value of i.
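
For comparison, here is a minimal sketch of two common ways to bind the current loop value when each callable is created: a default argument on the lambda, and functools.partial. The helper names (make_late, make_default_arg, make_partial, show) are made up for illustration and are not part of Scrapy or the original question.

```python
from functools import partial


def make_late():
    # Each lambda looks up `i` only when it is called, after the loop ended.
    return [lambda: i for i in range(3)]


def make_default_arg():
    # A default argument is evaluated when the lambda is defined,
    # so each lambda keeps the value `i` had at that moment.
    return [lambda i=i: i for i in range(3)]


def show(i):
    # Hypothetical helper: simply returns the value it was bound to.
    return i


def make_partial():
    # functools.partial stores the argument value at creation time.
    return [partial(show, i) for i in range(3)]


late = [fn() for fn in make_late()]
bound = [fn() for fn in make_default_arg()]
partials = [fn() for fn in make_partial()]

print(late)      # [2, 2, 2]
print(bound)     # [0, 1, 2]
print(partials)  # [0, 1, 2]
```

Applied to the spider in the question, the same idea would presumably look like callback=lambda response, index=index: self.parse(response, index), although passing the value with the request itself is the more conventional approach in Scrapy.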

You should use the meta= keyword of the Request class instead of a lambda callback: https://doc.scrapy.org/en/latest/topics/request-response.html#request-meta-special-keys