Scrapy image download

Time: 2016-08-04 16:22:24

Tags: python image scrapy

My spider runs without showing any errors, but the images are not stored in the folder. Here are my Scrapy files:

Spider.py:

import scrapy
import re
import os
import urlparse
from scrapy.spiders import CrawlSpider, Rule
from scrapy.linkextractors import LinkExtractor
from scrapy.loader.processors import Join, MapCompose, TakeFirst
from scrapy.pipelines.images import ImagesPipeline
from production.items import ProductionItem, ListResidentialItem

class productionSpider(scrapy.Spider):
    name = "production"
    allowed_domains = ["someurl.com"]
    start_urls = [
        "someurl.com"
]

def parse(self, response):
    for sel in response.xpath('//html/body'):
        item = ProductionItem()
        img_url = sel.xpath('//a[@data-tealium-id="detail_nav_showphotos"]/@href').extract()[0]
        yield scrapy.Request(urlparse.urljoin(response.url, img_url),callback=self.parseBasicListingInfo,  meta={'item': item})

def parseBasicListingInfo(item, response):
    item = response.request.meta['item']
    item = ListResidentialItem()
    try:
        image_urls = map(unicode.strip,response.xpath('//a[@itemprop="contentUrl"]/@data-href').extract())
        item['image_urls'] = [ x for x in image_urls]
    except IndexError:
        item['image_urls'] = ''

    return item

settings.py:

from scrapy.settings.default_settings import ITEM_PIPELINES
from scrapy.pipelines.images import ImagesPipeline

BOT_NAME = 'production'

SPIDER_MODULES = ['production.spiders']
NEWSPIDER_MODULE = 'production.spiders'
DEFAULT_ITEM_CLASS = 'production.items'

ROBOTSTXT_OBEY = True
DEPTH_PRIORITY = 1
IMAGE_STORE = '/images'

CONCURRENT_REQUESTS = 250

DOWNLOAD_DELAY = 2

ITEM_PIPELINES = {
    'scrapy.contrib.pipeline.images.ImagesPipeline': 300,
}

items.py

# -*- coding: utf-8 -*-
import scrapy

class ProductionItem(scrapy.Item):
    img_url = scrapy.Field()

# ScrapingList Residential & Yield Estate for sale
class ListResidentialItem(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()

    pass

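My pipelines file is empty; I am not sure what I am supposed to add to the pipelines.py file.

Any help is greatly appreciated.

5 answers:

Answer 0 (score: 6)

Since you don't know what to put in the pipeline, I assume you can use the default images pipeline that Scrapy provides. In your settings.py file you only need to declare it like this: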

ITEM_PIPELINES = {
'scrapy.pipelines.images.ImagesPipeline':1
}

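Also, your image path is wrong: the leading / means you are pointing at the absolute root of your machine. Either give the absolute path of the directory where you want to save the images, or use a path relative to the directory you run your crawler from: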

IMAGES_STORE = '/home/user/Documents/scrapy_project/images'

IMAGES_STORE = 'images'
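
To see where each value would point, here is a minimal sketch using only the standard library. It assumes, as the answer describes, that a relative store path is resolved against the directory the crawl is launched from; the helper `resolve_store` and the launch directory are hypothetical, and POSIX-style paths are used for illustration.

```python
import posixpath


def resolve_store(images_store, cwd):
    """Return the absolute directory a store setting would refer to.

    Assumption: a relative value is resolved against the directory
    the crawler is launched from (cwd).
    """
    return posixpath.normpath(posixpath.join(cwd, images_store))


project_dir = "/home/user/Documents/scrapy_project"  # hypothetical launch directory

relative = resolve_store("images", project_dir)
absolute = resolve_store("/images", project_dir)

print(relative)  # /home/user/Documents/scrapy_project/images
print(absolute)  # /images
```

The second result shows why '/images' is a problem: it names a directory at the filesystem root, which an ordinary user usually cannot write to.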

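Now, in the spider you extract the URL, but you never save it to the item:

item['image_urls'] = sel.xpath('//a[@data-tealium-id="detail_nav_showphotos"]/@href').extract()

If you use the default pipeline, the field must be named image_urls, and it must hold a list of URLs (hence extract() rather than extract_first(), which returns a single string).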
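
The pipeline can only download absolute URLs, so relative links need to be resolved against the page URL first. A minimal sketch of that clean-up step, using only the standard library; the helper `to_image_urls`, the page URL, and the href values are hypothetical:

```python
from urllib.parse import urljoin


def to_image_urls(page_url, hrefs):
    """Build the list that would be stored in item['image_urls'].

    Strips whitespace, drops empty values, and resolves relative
    links against the page URL so every entry is absolute.
    """
    return [urljoin(page_url, h.strip()) for h in hrefs if h and h.strip()]


# Hypothetical page URL and extracted href values.
page = "http://someurl.com/listing/42"
hrefs = [" /photos/1.jpg ", "http://cdn.someurl.com/2.jpg", ""]

print(to_image_urls(page, hrefs))
# ['http://someurl.com/photos/1.jpg', 'http://cdn.someurl.com/2.jpg']
```

Inside a Scrapy callback, response.urljoin(href) does the same resolution against response.url.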

Now, in your items.py file, you need to add the following two fields (both must have exactly these names):

image_urls = scrapy.Field()
images = scrapy.Field()

That should work.

Answer 1 (score: 6)

My working result:

spider.py:

import scrapy

from production.items import ProductionItem
from production.items import ImageItem


class productionSpider(scrapy.Spider):
    name = "production"
    allowed_domains = ["url"]
    start_urls = [
        "startingurl.com"
    ]

    # The callbacks must be methods of the spider class, so they are indented.
    def parse(self, response):
        for sel in response.xpath('//html/body'):
            item = ProductionItem()
            # Link to the page that lists the photos; skip if it is missing.
            img_url = sel.xpath('//a[@idd="followclaslink"]/@href').extract_first()
            if img_url:
                yield scrapy.Request(response.urljoin(img_url), callback=self.parseImages, meta={'item': item})

    def parseImages(self, response):
        # Yield one item per <img>; image_urls must be a list of absolute URLs.
        for elem in response.xpath("//img"):
            img_url = elem.xpath("@src").extract_first()
            if img_url:
                yield ImageItem(image_urls=[response.urljoin(img_url)])

settings.py:

BOT_NAME = 'production'

SPIDER_MODULES = ['production.spiders']
NEWSPIDER_MODULE = 'production.spiders'
DEFAULT_ITEM_CLASS = 'production.items'
ROBOTSTXT_OBEY = True
IMAGES_STORE = '/Users/home/images'

DOWNLOAD_DELAY = 2

ITEM_PIPELINES = {'scrapy.pipelines.images.ImagesPipeline': 1}

items.py:

# -*- coding: utf-8 -*-
import scrapy

class ProductionItem(scrapy.Item):
    img_url = scrapy.Field()
# ScrapingList Residential & Yield Estate for sale
class ListResidentialItem(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()

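class ImageItem(scrapy.Item):
    image_urls = scrapy.Field()
    images = scrapy.Field()
    # Written by MyImagesPipeline.item_completed; an undeclared field would raise KeyError.
    image_paths = scrapy.Field()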

pipelines.py:

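import scrapy
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem

# Note: this class only runs if it is listed in ITEM_PIPELINES;
# the settings above enable the stock ImagesPipeline instead.
class MyImagesPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        # Schedule one download request per image URL.
        for image_url in item['image_urls']:
            yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        # results is a list of (success, info) tuples, one per request.
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        # The item class must declare an image_paths field for this assignment.
        item['image_paths'] = image_paths
        return item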
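
The list comprehension in item_completed can be tried outside Scrapy. The sketch below assumes the documented shape of results, a list of (success, info) tuples where info is a dict with url, path and checksum keys on success; the sample values are made up, and a plain Exception stands in for the failure object Scrapy would pass.

```python
def successful_paths(results):
    """Collect the stored path of every successfully downloaded image."""
    return [info["path"] for ok, info in results if ok]


# Hypothetical results for two requested images: one stored, one failed.
results = [
    (True, {"url": "http://someurl.com/a.jpg", "path": "full/abc.jpg", "checksum": "d41d8"}),
    (False, Exception("download failed")),  # stand-in for the real failure object
]

paths = successful_paths(results)
print(paths)  # ['full/abc.jpg']
```

An empty list here is the case in which the pipeline above raises DropItem.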

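Answer 2 (score: 1)

Just adding my own mistake here, which had me stuck for several hours. Maybe it helps someone.

From the Scrapy docs (https://doc.scrapy.org/en/latest/topics/media-pipeline.html#using-the-images-pipeline):

Then, configure the target storage setting to a valid value that will be used for storing the downloaded images. Otherwise the pipeline will remain disabled, even if you include it in the ITEM_PIPELINES setting.

For some reason, I had used a colon ":" instead of an equals sign "=".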

    # My mistake:
    IMAGES_STORE : '/Users/my_user/images'

    # Working code
    IMAGES_STORE = '/Users/my_user/images'

This does not raise an error; instead, the pipeline simply never loads, which made it very hard for me to track down.
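
The reason no error appears is that in Python 3.6 and later, NAME : value at module level is a valid variable annotation with no assignment, so the name is never bound. A minimal sketch that demonstrates this outside Scrapy; the helper `load_settings` is hypothetical and only mimics reading upper-case names from a settings module:

```python
def load_settings(source):
    """Execute settings source text and return the upper-case names it defines."""
    namespace = {}
    exec(source, namespace)
    return {k: v for k, v in namespace.items() if k.isupper()}


broken = load_settings("IMAGES_STORE : '/Users/my_user/images'\n")
working = load_settings("IMAGES_STORE = '/Users/my_user/images'\n")

print(broken)   # {}
print(working)  # {'IMAGES_STORE': '/Users/my_user/images'}
```

With the colon, IMAGES_STORE is simply missing from the settings, which matches the documented behaviour of the pipeline staying disabled.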

Answer 3 (score: 0)

In my case, it was the IMAGES_STORE path that caused the problem.

I set IMAGES_STORE = 'images' and it worked like a charm!

Here is the complete code:

Settings:

ITEM_PIPELINES = {
   'mutualartproject.pipelines.MyImagesPipeline': 1,
}

IMAGES_STORE = 'images' 

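Pipeline:

import scrapy
from scrapy.pipelines.images import ImagesPipeline
from scrapy.exceptions import DropItem

class MyImagesPipeline(ImagesPipeline):

    def get_media_requests(self, item, info):
        # Schedule one download request per image URL.
        for image_url in item['image_urls']:
            yield scrapy.Request(image_url)

    def item_completed(self, results, item, info):
        # Drop the item if none of its images was downloaded successfully.
        image_paths = [x['path'] for ok, x in results if ok]
        if not image_paths:
            raise DropItem("Item contains no images")
        return item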
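
When checking whether files actually arrived, it helps to know where to look. The sketch below assumes the default naming scheme of the images pipeline, which to my knowledge is full/<SHA-1 of the request URL>.jpg under IMAGES_STORE; treat that as an assumption to verify against your Scrapy version. The helper `expected_image_path` and the URL are hypothetical.

```python
import hashlib
import posixpath


def expected_image_path(images_store, url):
    """Guess where an image for `url` would be stored.

    Assumption: the default naming scheme is full/<SHA-1 of URL>.jpg.
    """
    image_guid = hashlib.sha1(url.encode("utf-8")).hexdigest()
    return posixpath.join(images_store, "full", image_guid + ".jpg")


path = expected_image_path("images", "http://someurl.com/photo.jpg")
print(path)
```

The path is relative to the directory the crawl was started from when IMAGES_STORE is 'images'.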

Answer 4 (score: 0)

You have to enable SPIDER_MIDDLEWARES and DOWNLOADER_MIDDLEWARES in your settings.py file.