Question

I have a spider that scrapes data from web pages and writes the title, text, and image URLs to MongoDB.

I have two functions:

def parse_news(self, response):
    item = NewsItem()
    item['_id'] = ..  # key for MongoDB, must be unique
    item['Title'] = ..
    item['URL'] = ..
    if len(..):  # check whether the article has a gallery
        for i in xrange(2, 5):  # if so, iterate over the gallery pages
            gallery_img_link = urlparse.urljoin(response.url, '%d/#gallery_photo' % i)
            # request the gallery page and call the function that extracts the img URL
            yield Request(gallery_img_link, meta={'item': item}, callback=self.parse_gallery)
    yield item

def parse_gallery(self, response):
    # extract_first() returns None when nothing matches, so test the value
    # itself instead of calling len() on it (len(None) raises TypeError)
    img_url = response.xpath('//*[@id="gallery_photo"]/div/img/@src').extract_first()
    if img_url:  # skip pages past the end of the gallery so no empty values are stored
        item = response.meta['item']
        item['Gallery'] = img_url
        yield item

I want item['Gallery'] to hold the extracted image URLs as an array, and I want the item written to MongoDB only once the loop has finished.

So I need to pass item['Gallery'] to the second function, append each image URL to it there, and yield the item (or write it to MongoDB) only after the loop has completed.
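One common way to get this behaviour is to chain the gallery requests one after another instead of yielding them all at once, so that the item is emitted exactly once, by the last callback in the chain. The sketch below only models that control flow in plain Python: `FakeRequest`, `FakeResponse`, `crawl`, the `pending` meta key, and the `img_src` field are hypothetical stand-ins (not Scrapy APIs) so that the example runs without Scrapy installed.

```python
from collections import namedtuple

# Hypothetical stand-ins for Scrapy's Request and Response objects.
FakeRequest = namedtuple("FakeRequest", ["url", "meta", "callback"])
# img_src models the result of the XPath lookup: a URL string or None.
FakeResponse = namedtuple("FakeResponse", ["url", "meta", "img_src"])


def parse_news(response, gallery_urls):
    """Start the chain: request only the first gallery page."""
    item = {"URL": response.url, "Gallery": []}  # placeholder fields
    if gallery_urls:
        meta = {"item": item, "pending": list(gallery_urls[1:])}
        yield FakeRequest(gallery_urls[0], meta, parse_gallery)
    else:
        yield item  # no gallery: the item is already complete


def parse_gallery(response):
    """Collect one image URL, then continue the chain or emit the item."""
    item = response.meta["item"]
    pending = response.meta["pending"]
    if response.img_src:
        item["Gallery"].append(response.img_src)
    if response.img_src and pending:
        # More pages to visit: pass the same item along to the next one.
        meta = {"item": item, "pending": pending[1:]}
        yield FakeRequest(pending[0], meta, parse_gallery)
    else:
        # Ran past the last image, or no pages left: emit the item once.
        yield item


def crawl(start_response, gallery_urls, pages):
    """Tiny driver standing in for the Scrapy engine; pages maps URL -> img src."""
    queue = list(parse_news(start_response, gallery_urls))
    items = []
    while queue:
        result = queue.pop(0)
        if isinstance(result, FakeRequest):
            response = FakeResponse(result.url, result.meta, pages.get(result.url))
            queue.extend(result.callback(response))
        else:
            items.append(result)
    return items
```

In a real spider, the `Request(url, meta=..., callback=...)` call already used in the code above would take the place of `FakeRequest`, and the `yield item` at the end of `parse_news` would be dropped whenever a gallery chain is started. The trade-off of this design is that gallery pages are fetched sequentially rather than in parallel.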

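Why I need this: the problem I'm facing is extracting the image URLs of a gallery. The gallery does not expose a list of all its images; you have to click "Next" to get the next image URL. Clicking through to the next image reloads the whole page and changes the page URL, like this: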

http://www.website.com/news-1-title/2/#gallery_photo for the second image, /3/#gallery_photo for the third image, and so on.
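Building these page URLs with `urljoin` has one pitfall worth checking: if the article URL does not end in a slash, the last path segment is replaced rather than extended. A minimal sketch, assuming Python 3 (`urllib.parse` instead of the Python 2 `urlparse` module used above); `gallery_page_url` is a hypothetical helper name, and it assumes the article URL carries no query string or fragment.

```python
from urllib.parse import urljoin


def gallery_page_url(article_url, index):
    """Return the URL of the index-th gallery page of an article."""
    # Ensure a trailing slash so urljoin appends instead of replacing
    # the last path segment of the article URL.
    base = article_url if article_url.endswith("/") else article_url + "/"
    return urljoin(base, "%d/#gallery_photo" % index)


# Without the trailing slash, plain urljoin drops "news-1-title":
plain = urljoin("http://www.website.com/news-1-title", "2/#gallery_photo")
fixed = gallery_page_url("http://www.website.com/news-1-title", 2)
```

Here `plain` loses the article segment, while `fixed` keeps it.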

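The function loops over the page numbers (xrange(2, 5), so pages 2 to 4), checks whether an img URL exists on each page, and extracts it.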

Thanks in advance.

Scrapy: operating on the same item multiple times in different functions before yielding it

0 answers: